Crawling pages where json-ld is inserted via Javascript #7

Open
justinccdev opened this issue Feb 14, 2018 · 6 comments

Comments

@justinccdev
Member

We can't do this with a simple HTTP fetch. This may be another case where we really have to use Nutch, though I haven't yet checked whether it has that kind of facility (I expect it does). Needs research.
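
To illustrate why a plain fetch falls short, here is a minimal sketch using only the Python standard library. It parses the HTML exactly as a fetch would return it and collects any `application/ld+json` script tags. The names `JsonLdExtractor`, `extract_jsonld`, `fetch_html`, and the two sample pages are made up for this example and are not taken from the crawler's code.

```python
import json
import urllib.request
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script contents may arrive in several pieces, so buffer them.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._chunks)))


def extract_jsonld(html):
    """Return the JSON-LD documents present in the raw (unrendered) HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


def fetch_html(url):
    """A 'simple HTTP fetch': returns the markup without running any JavaScript."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical page with JSON-LD embedded directly in the markup.
STATIC_PAGE = """<html><head>
<script type="application/ld+json">{"@type": "DataCatalog", "name": "Example"}</script>
</head><body></body></html>"""

# Hypothetical page that only creates the JSON-LD tag when its script runs.
DYNAMIC_PAGE = """<html><head>
<script>
  var s = document.createElement("script");
  s.type = "application/ld+json";
  s.text = JSON.stringify({"@type": "DataCatalog", "name": "Example"});
  document.head.appendChild(s);
</script>
</head><body></body></html>"""

print(extract_jsonld(STATIC_PAGE))   # one document found
print(extract_jsonld(DYNAMIC_PAGE))  # nothing found: the script never ran
```

The second page carries the same data, but it only exists after a browser executes the script, so nothing that works on the fetched markup alone will see it.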

@XiangpengHao
Contributor

XiangpengHao commented Feb 23, 2018

Can you give an example of a JS-rendered page? Maybe I can try Selenium first... or state-of-the-art headless Chrome may be a good idea.

@XiangpengHao
Contributor

If these pages get their JSON-LD from an Ajax call, we can probably figure out the Ajax API and save ourselves a lot of trouble by calling the API directly :)
The worst case is that they insert everything from JavaScript. Then we need to render the website on the local machine, which is messy and slow.
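
For the "render it locally" case, a rough sketch of what that could look like with Selenium driving headless Chrome is below. It assumes the `selenium` package and a Chrome install are available. The function names, the selector constant, and the polling timeout are illustrative choices, not anything from the branch discussed in this issue.

```python
import json
import time

JSONLD_SELECTOR = 'script[type="application/ld+json"]'


def make_headless_chrome():
    """Start headless Chrome via Selenium (assumes selenium and Chrome are installed)."""
    # Imported lazily so the rest of this sketch can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def extract_rendered_jsonld(driver, url, timeout=10.0, poll=0.5):
    """Load url in the browser and return the JSON-LD found in the rendered DOM.

    JSON-LD inserted by JavaScript may appear some time after the page loads,
    so poll until at least one tag shows up or the timeout expires.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
        elements = driver.find_elements("css selector", JSONLD_SELECTOR)
        if elements or time.monotonic() >= deadline:
            break
        time.sleep(poll)
    return [json.loads(e.get_attribute("innerHTML")) for e in elements]


# Example usage (needs a real browser, so it is left commented out):
#
# driver = make_headless_chrome()
# try:
#     documents = extract_rendered_jsonld(driver, "https://example.org/some-page")
# finally:
#     driver.quit()
```

Starting a browser per page is the slow part, so reusing one driver across many URLs is probably the first thing to try. The approach stays generic across sites, since it only looks at the rendered DOM rather than any site-specific API.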

@justinccdev
Member Author

Hi @HaoPatrick. I don't know of any examples of people doing this yet, though I might find out at a meeting in the middle of this month whether this is going to be an important near-term consideration. This is more of a placeholder issue.

However, regarding calling an API directly, I would not want to do this, since you would then need to adapt your logic to each website's own way of doing it. The important thing about Bioschemas is that it is a generic mechanism for getting structured information.

@XiangpengHao
Contributor

Oh, OK. I'll look into crawling packages over the next few days and may come up with a new crawler that handles most of these cases.

@XiangpengHao
Contributor

I refactored the extractors a little so that they can crawl JS-rendered websites now; here is my branch.
The next step is to move the crawling logic to async to improve performance.
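
As a rough idea of what the async step might look like, the sketch below runs many fetches concurrently with `asyncio` while capping how many are in flight at once. `crawl_all`, `fake_fetch`, and the `max_concurrent` parameter are hypothetical names for this example; the real fetch coroutine would be whatever the crawler ends up using.

```python
import asyncio


async def crawl_all(urls, fetch, max_concurrent=5):
    """Fetch every URL concurrently, with at most max_concurrent in flight.

    fetch is any coroutine function taking a URL and returning page content.
    Returns a dict mapping each URL to its content, or to the exception raised.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:  # keep one bad page from aborting the crawl
                return url, exc

    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(results)


async def fake_fetch(url):
    """Stand-in for a real page fetch; just simulates network latency."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    pages = asyncio.run(crawl_all(["a", "b", "c"], fake_fetch, max_concurrent=2))
    print(sorted(pages))
```

A blocking fetcher (a browser-driven one, for instance) could be wrapped with `asyncio.to_thread` on Python 3.9+ so that it fits the same interface.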

@justinccdev
Member Author

Maybe you could open this as a PR (one that we won't merge) so I can look at it? I'm too lazy to do this manually. Thanks!
