feat: parse JSON from <script> of a page #103

gildesmarais · 2021-04-07T09:06:39Z

Many pages embed the interesting information as JSON in a script tag somewhere (e.g. to use it in an SPA).

Now, since every script tag looks the same and only the content matters, we'd first need to select the correct one of many. From there on, it should be the same process to generate the RSS.

select the correct script tag
handle global variable assignment in the script tag
support parsing JSON from <script>
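
The three tasks above could be sketched roughly as follows. This is a minimal, dependency-free illustration, not the project's implementation: the helper names (`scriptContents`, `stripAssignment`, `jsonFromScript`), the `__DATA__` marker, and the sample page are all hypothetical. The regex-based tag matching is an assumption made to keep the sketch short; a real HTML parser would be less brittle.

```javascript
// Hypothetical helper: return the trimmed content of every <script> element.
// A regex keeps the sketch dependency-free; it is not a full HTML parser.
function scriptContents(html) {
  const pattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  return Array.from(html.matchAll(pattern), (match) => match[1].trim());
}

// Hypothetical helper: drop a leading "name =" / "window.name =" assignment
// and a trailing semicolon, leaving only the assigned value.
function stripAssignment(source) {
  return source
    .replace(/^\s*(?:(?:var|let|const)\s+)?[\w$.]+\s*=\s*/, '')
    .replace(/;\s*$/, '');
}

// Select the first script whose content contains `marker`, then parse it.
// Returns null when no script matches; throws if the value is not strict JSON.
function jsonFromScript(html, marker) {
  const source = scriptContents(html).find((text) => text.includes(marker));
  if (source === undefined) return null;
  return JSON.parse(stripAssignment(source));
}

// Made-up sample page for illustration only.
const page = `
<html><body>
<script>console.log("unrelated");</script>
<script>window.__DATA__ = {"items":[{"title":"First post","url":"/posts/1"}]};</script>
</body></html>`;

const data = jsonFromScript(page, '__DATA__');
console.log(data.items[0].title); // First post
```

Selecting by a substring of the content is only one possible strategy; it works here because the script tags themselves carry no distinguishing attributes.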

Sometimes it's not JSON but JavaScript (e.g. to assign JS objects to a global variable; Nuxt does that).

JavaScript's JSON.stringify simply omits object properties whose values cannot be represented in JSON (such as functions) when serializing.

JSON.stringify({ "a": "Bbb", "b": function() { alert() }, "c": "d"})
=> '{"a":"Bbb","c":"d"}'

This behaviour is desirable for these cases.
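
One way to take advantage of this, sketched under assumptions: evaluate the JavaScript object literal, then round-trip it through JSON.stringify and JSON.parse so that only JSON-representable data remains. The helper name `jsonableFromJs` is hypothetical, and using the `Function` constructor is an illustrative choice, not something the issue prescribes. It executes the source, so it is only suitable for content you trust.

```javascript
// Hypothetical helper: evaluate a JavaScript object literal and keep only
// what survives a JSON round trip. The Function constructor runs the
// source as code, so this must not be used on untrusted input.
function jsonableFromJs(source) {
  const value = new Function(`return (${source});`)();
  return JSON.parse(JSON.stringify(value));
}

// Not valid JSON: unquoted keys and a function value.
const literal = '{ a: "Bbb", b: function () { alert() }, c: "d" }';

const result = jsonableFromJs(literal);
console.log(JSON.stringify(result)); // {"a":"Bbb","c":"d"}
```

The function stored under `b` is never called; it is simply dropped during serialization, which matches the behaviour described above.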


nm2k · 2021-05-25T13:59:05Z

+1

gildesmarais · 2024-08-21T14:22:13Z

With auto_source, parsing Schema Articles works.

For scraping SPAs, #204 seems to be the less brittle approach.

gildesmarais added the enhancement (New feature or request) label Apr 7, 2021
