feat: parse JSON from <script> of a page #103

Open
3 tasks
gildesmarais opened this issue Apr 7, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@gildesmarais
Member

gildesmarais commented Apr 7, 2021

Many pages include the interesting information in JSON in a script tag somewhere (e.g. to use it in a SPA).

Since every script tag looks the same and only its content matters, we'd first need to select the correct one of many. From there on, generating the RSS should be the same process.

  • select the correct script tag
  • handle global variable assignment in the script tag
  • support parsing JSON from <script>
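The three tasks above could be sketched roughly like this. It is only an illustration, not how html2rss does or would implement it: the helper names (`extractScriptContents`, `stripAssignment`, `parseJsonFromScript`) and the `__DATA__` variable are made up for the example, and a regex stands in for a real HTML parser, which a robust implementation would want instead.

```javascript
// Hypothetical helpers for illustration only.
// A regex is used instead of a real HTML parser to keep the sketch short.
function extractScriptContents(html) {
  const pattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  return Array.from(html.matchAll(pattern), (match) => match[1].trim());
}

// Remove a leading "window.NAME =" or "var NAME =" and a trailing semicolon,
// so that only the assigned value remains.
function stripAssignment(source) {
  return source
    .replace(/^(?:(?:var|let|const)\s+)?[\w$.]+\s*=\s*/, "")
    .replace(/;\s*$/, "");
}

// Select the first script whose content contains `marker` and parse it as JSON.
// Returns null when no script matches; throws if the content is not valid JSON.
function parseJsonFromScript(html, marker) {
  const script = extractScriptContents(html).find((content) =>
    content.includes(marker)
  );
  if (script === undefined) return null;
  return JSON.parse(stripAssignment(script));
}

const html = `
  <script>console.log("analytics");</script>
  <script>window.__DATA__ = {"items": [{"title": "Hello", "url": "/hello"}]};</script>
`;

const data = parseJsonFromScript(html, "__DATA__");
console.log(data.items[0].title); // prints "Hello"
```

Selecting by a content marker is just one option; selecting by position or by an attribute such as `type` or `id` would work along the same lines.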

Sometimes it's not JSON but JavaScript, e.g. to assign JS objects to a global variable (Nuxt does that).
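For that case, `JSON.parse` is not enough and the script would have to be evaluated. A minimal sketch, assuming the script assigns to `window.<name>`: the helper `readGlobalFromScript` is hypothetical, and the sample payload is only loosely modelled on what a Nuxt page emits. Note that this executes code from the page, so it should only ever run inside a proper sandbox.

```javascript
// Hypothetical helper: runs the script text with a stub `window` object and
// returns whatever the script assigned to `window[globalName]`.
// WARNING: this executes the page's code. Only do this in a real sandbox.
function readGlobalFromScript(scriptSource, globalName) {
  const fakeWindow = {};
  const run = new Function("window", scriptSource);
  run(fakeWindow);
  return fakeWindow[globalName];
}

// An immediately invoked function that returns an object: valid JavaScript,
// but not valid JSON.
const scriptSource = `
  window.__NUXT__ = (function (a, b) {
    return { data: [{ title: a, tags: [b] }], serverRendered: true };
  })("Hello", "news");
`;

const payload = readGlobalFromScript(scriptSource, "__NUXT__");
console.log(payload.data[0].title); // prints "Hello"
```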

JavaScript's JSON.stringify simply omits object properties whose values can't be represented in JSON (such as functions) when serializing.

JSON.stringify({ "a": "Bbb", "b": function() { alert() }, "c": "d"})
=> '{"a":"Bbb","c":"d"}'

This behaviour is desirable for these cases.
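Building on that, an evaluated JS object could be reduced to plain JSON data with a stringify/parse round trip. This is a sketch of the idea, with `toPlainJson` as a made-up name; note that inside arrays, functions are replaced by `null` rather than dropped.

```javascript
// Hypothetical helper: keep only what JSON can represent.
function toPlainJson(value) {
  return JSON.parse(JSON.stringify(value));
}

const evaluated = {
  title: "Bbb",
  onClick: function () { alert(); }, // dropped: functions are not JSON
  missing: undefined,                // dropped: undefined is not JSON
  nested: { keep: 1, drop: () => {} },
  list: [1, function () {}, "x"],    // function inside an array becomes null
};

console.log(JSON.stringify(toPlainJson(evaluated)));
// prints {"title":"Bbb","nested":{"keep":1},"list":[1,null,"x"]}
```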

@gildesmarais gildesmarais added the enhancement New feature or request label Apr 7, 2021
@nm2k

nm2k commented May 25, 2021

+1

@gildesmarais
Member Author

With auto_source, parsing Schema Articles works.

For scraping SPAs, #204 seems to be the less brittle approach.

Projects
Status: Backlog / No Prio

2 participants