Add DOWNLOAD_DELAY=0.5 to Scrapy config #109

paulproteus · 2015-01-04T18:42:40Z

At the time of writing, oh-bugimporters has difficulty downloading
all the bugs it wants to from github.com.

@ehashman discovered that GitHub throttles API requests after
5000 per hour.

The Scrapy DOWNLOAD_DELAY setting affects only "consecutive pages
from the same website", so we should still see a sizeable amount
of parallelism in our crawling after this change. However,
since this setting applies to all domains, we might still
see a general slowdown.
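
As a rough sanity check on the numbers above, the sketch below relates a per-site delay to an hourly request ceiling. The helper names and the `GITHUB_HOURLY_LIMIT` constant are hypothetical and only illustrate the arithmetic. `DOWNLOAD_DELAY` is the only real Scrapy setting here. The sketch assumes requests to a single site are spaced by the delay alone, ignoring response latency and Scrapy's default randomization of the delay, so real throughput would be lower than the ceiling it reports.

```python
# Illustrative arithmetic only; not part of the oh-bugimporters code base.
DOWNLOAD_DELAY = 0.5         # seconds between consecutive requests to one site
GITHUB_HOURLY_LIMIT = 5000   # hourly figure quoted in this pull request

SECONDS_PER_HOUR = 3600


def max_requests_per_hour(delay_seconds):
    """Upper bound on requests per hour to one site for a fixed delay."""
    return SECONDS_PER_HOUR / delay_seconds


def min_delay_for_limit(limit_per_hour):
    """Smallest fixed delay (seconds) that keeps one site under the limit."""
    return SECONDS_PER_HOUR / limit_per_hour


if __name__ == "__main__":
    print(max_requests_per_hour(DOWNLOAD_DELAY))
    print(min_delay_for_limit(GITHUB_HOURLY_LIMIT))
```

Under these assumptions, a 0.5 second delay caps a single site at 7200 requests per hour, and a delay of about 0.72 seconds would be needed for the delay alone to guarantee staying under 5000 per hour.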

@ehashman

ehashman · 2015-01-04T18:56:00Z

Looks good to me.

Add DOWNLOAD_DELAY=0.5 to Scrapy config

paulproteus mentioned this pull request Jan 4, 2015

GitHub download code needs to be nicer to github.com's servers #110

Closed

ehashman added a commit that referenced this pull request Jan 4, 2015

Merge pull request #109 from paulproteus/bugfix/github-slow-self-down

9cfaae9

Add DOWNLOAD_DELAY=0.5 to Scrapy config

ehashman merged commit 9cfaae9 into openhatch:master Jan 4, 2015
