Why is Newspaper3k used for html scraping? #5

I noticed newspaper is used for downloading articles while using the raw scraper. Why not use a simpler (and probably less performance-hungry) approach like requests, etc.? It just seems unnecessarily complicated. So, is there a specific reason for that?

Comments
Why was this closed? We could certainly do something simpler. Newspaper uses requests internally. Before, we were using urllib to check that the headers indicated HTML (not jpg files, etc.) before downloading, but it was much slower. Using article.parse() is one way to avoid saving non-HTML files. Any other ideas that would be faster?
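For context, a minimal sketch of the kind of requests-based content-type check being discussed, assuming a plain GET with a streamed response; the function name, URL, and timeout value are placeholders, not the project's actual code:

```python
# Sketch: skip non-HTML responses by inspecting Content-Type before
# downloading the body (stream=True defers the body until .text is read).
import requests

def fetch_html(url, timeout=10):
    """Return the page body only if the server reports HTML, else None."""
    try:
        resp = requests.get(url, timeout=timeout, stream=True)
        content_type = resp.headers.get("Content-Type", "")
        if "text/html" not in content_type:
            return None  # jpg, pdf, etc. are skipped without downloading the body
        return resp.text
    except requests.RequestException:
        return None
```

This trades newspaper's full article parsing for a single header check, which is the "simpler but still filters non-HTML" middle ground the thread is circling around.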
OK, I just made a little prototype based on pycurl (a C-based libcurl wrapper). I haven't run a real benchmark yet, but it seems to be a lot faster than the newspaper approach. I'm currently facing a small issue with timeouts: some websites seem to use protection against scrapers, slowing them down and triggering timeouts, and there also appears to be an issue with the timeout set from the command line, so as a temporary fix I just used pycurl's timeout option.

It no longer parses through newspaper to check for text; instead it filters out all content types that are not 'text/html', so apart from some edge cases it should be alright. It still needs more testing, but I was able to fetch 10,000 links on an 8-thread Google Cloud instance in about 170 seconds, and around 9,000 sites yielded content.

My modification replaces the raw scraper and is fully compatible with the rest of the code. I forked your repository so you can look at the changes here: https://github.com/TGithubbr/openwebtext. Note that this is only a prototype and there might be some ugly things in there. Thanks for your great work!
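A rough sketch of what a pycurl fetch with a libcurl-side timeout and a 'text/html' filter could look like; the option values and helper name are assumptions for illustration, not the code in the linked fork:

```python
# Sketch: fetch one URL with pycurl, let libcurl enforce the timeout,
# and drop anything whose Content-Type is not text/html.
from io import BytesIO

import pycurl

def fetch_with_pycurl(url, timeout=20):
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.setopt(pycurl.CONNECTTIMEOUT, 10)   # placeholder connect timeout
    c.setopt(pycurl.TIMEOUT, timeout)     # hard cap enforced by libcurl itself
    try:
        c.perform()
        content_type = c.getinfo(pycurl.CONTENT_TYPE) or ""
    except pycurl.error:
        return None
    finally:
        c.close()
    if "text/html" not in content_type:
        return None  # filter out non-HTML content types
    return buf.getvalue()
```

Because the timeout lives inside libcurl, a slow or scraper-hostile site aborts the transfer itself rather than relying on the calling process to be killed.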
Thanks for testing this. Any updates? Still getting timeouts?
Yeah, see my pull request. Multiprocessing is not able to stop processes after a timeout: the only thing the timeout does in this case is make the next() call throw an exception after the given time, while the worker process keeps running forever. As you can see in the pull request, I replaced multiprocessing with pebble, which actually stops processes after the timeout.
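A small sketch of the difference in behaviour described above, assuming one worker task per URL; the worker function, URLs, worker count, and timeout here are placeholders:

```python
# Sketch: pebble cancels and terminates a worker that exceeds its timeout,
# whereas multiprocessing.Pool's next(timeout=...) only raises in the parent
# and leaves the child process running.
from concurrent.futures import TimeoutError

from pebble import ProcessPool

def scrape(url):
    ...  # download and save one page

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b"]
    with ProcessPool(max_workers=8) as pool:
        futures = [pool.schedule(scrape, args=(u,), timeout=30) for u in urls]
        for future in futures:
            try:
                future.result()
            except TimeoutError:
                pass  # the hung worker has already been terminated by pebble
```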
@TGithubbr can you add it as an option? |
merged |