Why is Newspaper3k used for html scraping? #5

Closed
tilmanrpk opened this issue Mar 6, 2019 · 6 comments

Comments

tilmanrpk (Contributor) commented Mar 6, 2019

I noticed that newspaper is used for downloading articles when running the raw scraper. Why not use a simpler (and probably less performance-hungry) approach like requests? It just seems unnecessarily complicated. Is there a specific reason for that?

jcpeterson (Owner) commented Mar 7, 2019

Why was this closed? We could certainly do something simpler. Newspaper uses requests under the hood. Previously we used urllib to check from the headers that the content was HTML (not jpg files, etc.) before downloading, but that was much slower. Calling article.parse() is one way to avoid saving non-HTML files. Any other ideas that would be faster?
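
(For illustration only, not code from the repository: a minimal sketch of the simpler requests-based approach discussed here, keeping a page only when the Content-Type header reports HTML. The function name fetch_html and the timeout value are assumptions.)

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and keep it only if the server reports HTML content."""
    resp = requests.get(url, timeout=timeout)
    content_type = resp.headers.get("Content-Type", "")
    if "text/html" not in content_type:
        return None  # skip images, PDFs, and other non-HTML responses
    return resp.text
```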

jcpeterson reopened this Mar 7, 2019
tilmanrpk (Contributor, Author) commented

OK, I just made a little prototype based on pycurl (a C-based libcurl wrapper). I haven't run a real benchmark yet, but it seems a lot faster than the newspaper approach. I'm currently facing a small issue with timeouts: some websites seem to protect against scrapers by slowing them down, which triggers timeouts, and there also appears to be an issue with the timeout set from the command line. As a temporary fix I just used pycurl's timeout option.

It no longer parses pages through newspaper to check for text; instead it filters out all content types that are not 'text/html', so apart from some edge cases it should be fine.
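
(A rough sketch of the approach described above, not the actual code from the fork: pycurl with its native timeout option and a Content-Type filter. The function name and the specific option values are assumptions.)

```python
import pycurl
from io import BytesIO

def fetch_html(url, timeout=20):
    """Fetch a URL with pycurl and return the body only when the
    Content-Type header indicates text/html."""
    buffer = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEDATA, buffer)
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    curl.setopt(pycurl.TIMEOUT, timeout)       # libcurl's own hard timeout
    curl.setopt(pycurl.CONNECTTIMEOUT, 10)
    try:
        curl.perform()
        content_type = curl.getinfo(pycurl.CONTENT_TYPE) or ""
    finally:
        curl.close()
    if "text/html" not in content_type:
        return None  # drop images, PDFs, and other non-HTML responses
    return buffer.getvalue().decode("utf-8", errors="replace")
```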

It needs more testing, but I was able to fetch 10,000 links on an 8-thread Google Cloud instance in about 170 seconds; around 9,000 sites yielded content.

My modification replaces the raw scraper and is fully compatible with the rest of the code.

I forked your repository so you can look at the changes here: https://github.com/TGithubbr/openwebtext

Note that this is only a prototype and there might be some ugly things in there.

Thanks for your great work

jcpeterson (Owner) commented

Thanks for testing this. Any updates? Still getting timeouts?

tilmanrpk (Contributor, Author) commented

Yeah, see my pull request. multiprocessing is not able to stop processes after a timeout: all the timeout does in that case is make the next() method raise an exception after the given time, while the process itself keeps running forever. As you can see in my pull request, I replaced multiprocessing with pebble, which actually terminates processes after the timeout.
My current implementation of the pycurl scraper (see my repository) performs very well; only a small percentage of workers isn't stopped on time at very high process counts, and those are still stopped by pycurl's native timeout, which I set to 20 seconds. Some sites (goal.com and washingtonpost.com in particular) time out more often than others, which I believe is due to a built-in slowdown when they receive a lot of requests. There isn't much one can do about that, though.
I can submit just the pycurl scraper as a pull request if you want. In my fork I also added a fast multicore archiver.
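
(A minimal sketch of the pebble pattern described in this comment, not the code from the pull request: unlike plain multiprocessing, pebble's ProcessPool terminates a worker whose task exceeds its timeout. The fetch_html worker, worker count, and timeout value are hypothetical.)

```python
from concurrent.futures import TimeoutError
from pebble import ProcessPool

def scrape_all(urls, fetch_html, timeout=30, workers=8):
    """Fetch every URL in a worker process; pebble terminates any worker
    whose task runs longer than `timeout` seconds."""
    results = {}
    with ProcessPool(max_workers=workers) as pool:
        futures = {url: pool.schedule(fetch_html, args=(url,), timeout=timeout)
                   for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except TimeoutError:
                results[url] = None  # task exceeded the timeout; worker was killed
            except Exception:
                results[url] = None  # network or parsing error in the worker
    return results
```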

jcpeterson (Owner) commented Apr 22, 2019

@TGithubbr can you add it as an option?

jcpeterson (Owner) commented

merged
