Why is Newspaper3k used for html scraping? #5

Closed
tilmanrpk opened this issue Mar 6, 2019 · 6 comments

Comments

tilmanrpk (Contributor) commented Mar 6, 2019

I noticed that newspaper is used for downloading articles when running the raw scraper. Why not use a simpler (and probably less performance-hungry) approach like requests? It just seems unnecessarily complicated. Is there a specific reason for that?

jcpeterson (Owner) commented Mar 7, 2019

Why was this closed? We could certainly do something simpler. Newspaper uses requests under the hood. Previously we used urllib to check from the headers that the content was HTML (not jpg files, etc.) before downloading, but that was much slower. Calling article.parse() is one way to avoid saving non-HTML files. Any other ideas that would be faster?
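
(For illustration only, not code from the repository: a minimal sketch of the simpler requests-based approach discussed here, keeping a page only when the Content-Type header reports HTML. The function name fetch_html and the timeout value are assumptions.)

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and keep it only if the server reports HTML content."""
    resp = requests.get(url, timeout=timeout)
    content_type = resp.headers.get("Content-Type", "")
    if "text/html" not in content_type:
        return None  # skip images, PDFs, and other non-HTML responses
    return resp.text
```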

jcpeterson reopened this Mar 7, 2019
tilmanrpk (Contributor, Author) commented

OK, I just made a little prototype based on pycurl (a C-based libcurl wrapper). I haven't run a real benchmark yet, but it seems a lot faster than the newspaper approach. I'm currently facing a small issue with timeouts: some websites seem to protect against scrapers by slowing them down, which triggers timeouts, and there also appears to be an issue with the timeout set from the command line. As a temporary fix I just used pycurl's timeout option.

It no longer parses pages through newspaper to check for text; instead it filters out all content types that are not 'text/html', so apart from some edge cases it should be fine.
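
(A rough sketch of the approach described above, not the actual code from the fork: pycurl with its native timeout option and a Content-Type filter. The function name and the specific option values are assumptions.)

```python
import pycurl
from io import BytesIO

def fetch_html(url, timeout=20):
    """Fetch a URL with pycurl and return the body only when the
    Content-Type header indicates text/html."""
    buffer = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEDATA, buffer)
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    curl.setopt(pycurl.TIMEOUT, timeout)       # libcurl's own hard timeout
    curl.setopt(pycurl.CONNECTTIMEOUT, 10)
    try:
        curl.perform()
        content_type = curl.getinfo(pycurl.CONTENT_TYPE) or ""
    finally:
        curl.close()
    if "text/html" not in content_type:
        return None  # drop images, PDFs, and other non-HTML responses
    return buffer.getvalue().decode("utf-8", errors="replace")
```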

It needs more testing, but I was able to fetch 10,000 links on an 8-thread Google Cloud instance in about 170 seconds; around 9,000 sites yielded content.

My modification replaces the raw scraper and is fully compatible with the rest of the code.

I forked your repository so you can look at the changes here: https://github.com/TGithubbr/openwebtext

Note that this is only a prototype and there might be some ugly things in there.

Thanks for your great work

jcpeterson (Owner) commented

Thanks for testing this. Any updates? Still getting timeouts?

tilmanrpk (Contributor, Author) commented

Yeah, see my pull request. multiprocessing is not able to stop processes after a timeout: all the timeout does in that case is make the next() method raise an exception after the given time, while the process itself keeps running forever. As you can see in my pull request, I replaced multiprocessing with pebble, which actually terminates processes after the timeout.
My current implementation of the pycurl scraper (see my repository) performs very well; only a small percentage of workers isn't stopped on time at very high process counts, and those are still stopped by pycurl's native timeout, which I set to 20 seconds. Some sites (goal.com and washingtonpost.com in particular) time out more often than others, which I believe is due to a built-in slowdown when they receive a lot of requests. There isn't much one can do about that, though.
I can submit just the pycurl scraper as a pull request if you want. In my fork I also added a fast multicore archiver.
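
(A minimal sketch of the pebble pattern described in this comment, not the code from the pull request: unlike plain multiprocessing, pebble's ProcessPool terminates a worker whose task exceeds its timeout. The fetch_html worker, worker count, and timeout value are hypothetical.)

```python
from concurrent.futures import TimeoutError
from pebble import ProcessPool

def scrape_all(urls, fetch_html, timeout=30, workers=8):
    """Fetch every URL in a worker process; pebble terminates any worker
    whose task runs longer than `timeout` seconds."""
    results = {}
    with ProcessPool(max_workers=workers) as pool:
        futures = {url: pool.schedule(fetch_html, args=(url,), timeout=timeout)
                   for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except TimeoutError:
                results[url] = None  # task exceeded the timeout; worker was killed
            except Exception:
                results[url] = None  # network or parsing error in the worker
    return results
```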

jcpeterson (Owner) commented Apr 22, 2019

@TGithubbr can you add it as an option?

jcpeterson (Owner) commented

merged
