Why Doesn't Wgit Crawl In Parallel?
Many web crawlers boast parallelism out of the box by sending multiple HTTP requests at once. Wgit doesn't work like this. Everything in Wgit is, by default, performed in sequence, mainly for simplicity and more predictable results.
Since Wgit is a library, however, you can call its functionality inside parallel constructs such as threads. To do so, you need to know the URLs ahead of time, e.g. urls.map { |url| Thread.new { crawler.crawl(url) } }.each(&:join). Wgit is thread safe for this very reason.
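The thread-per-URL pattern described above can be sketched as follows. This is a minimal illustration, not part of Wgit: `crawl_in_threads` is a hypothetical helper, and `fake_crawl` is a stand-in for a real crawl so that the snippet runs without the gem or network access.

```ruby
# Hypothetical helper (not part of Wgit): starts one thread per URL and
# returns each thread's result in the same order as the input URLs.
def crawl_in_threads(urls, &crawl)
  threads = urls.map do |url|
    Thread.new { crawl.call(url) }
  end

  # Thread#value waits for the thread to finish and returns its block's result.
  threads.map(&:value)
end

# Stand-in for a real crawl so the sketch runs offline.
fake_crawl = ->(url) { sleep(0.01); "crawled #{url}" }

urls = %w[http://example.com/a http://example.com/b]
results = crawl_in_threads(urls, &fake_crawl)
puts results

# With Wgit installed, the block might instead look like this
# (an assumption, not run here):
#   crawler = Wgit::Crawler.new
#   crawl_in_threads(urls) { |url| crawler.crawl(Wgit::Url.new(url)) }
```

Note that the full list of URLs has to exist before any thread starts, which is why this approach suits a known set of pages rather than a site whose links are discovered during the crawl.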
Wgit's Crawler#crawl_site and Indexer#index_site methods will crawl all internal links within the site's host in sequence - the same order that they're found and parsed from the HTML. This is deliberate, to ensure the crawls are easy to understand and track. It is also because, during benchmarking of parallelism using the async gem, the speed increase was found to be modest to non-existent.
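To illustrate why a sequential crawl is easy to track, here is a minimal sketch of crawling links in the order they are found. It is not Wgit's actual implementation: `SITE` is a hypothetical in-memory site and `crawl_in_sequence` is an illustrative helper.

```ruby
require 'set'

# Hypothetical in-memory "site": each page maps to the internal links
# found on it, in the order they appear in the HTML.
SITE = {
  '/'       => ['/about', '/blog'],
  '/about'  => ['/'],
  '/blog'   => ['/blog/1', '/about'],
  '/blog/1' => []
}.freeze

# Illustrative helper: visits pages one at a time, in discovery order.
def crawl_in_sequence(start)
  seen  = Set.new([start]) # pages already queued, so none is crawled twice
  queue = [start]          # pages waiting to be crawled, oldest first
  order = []               # the order in which pages were crawled

  until queue.empty?
    page = queue.shift
    order << page

    SITE.fetch(page, []).each do |link|
      # Set#add? returns nil when the link was already seen.
      queue << link if seen.add?(link)
    end
  end

  order
end

order = crawl_in_sequence('/')
puts order
```

Because only one page is processed at a time, the same site always yields the same crawl order, which is the predictability the paragraph above refers to.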
Parallel crawling with Wgit and the async gem was benchmarked at various levels:
- Wgit::Crawler#crawl_site
- Wgit::Crawler#crawl_urls
- Wgit::Indexer#index_site
In all experiments, parallelism was found to have minimal positive impact on crawl performance. The added downside is that crawling in parallel makes the results less deterministic overall. The price to pay is in no way worth the gain (since there is no gain).
The main reasons why a performance improvement wasn't noticed are:
- Servers typically rate limit requests, especially from crawlers, so crawling in parallel wasn't faster overall.
- Bottlenecks elsewhere, which aren't addressed by crawling in parallel.
What does make a difference to the overall crawl performance of sites is:
- DNS lookup caching
- TCP/TLS connection re-use (avoiding new handshakes that require additional round trips to the server)
Both of these factors are already being utilised by Wgit's networking requests, using libcurl under the hood. Overall, these optimisations are much more valuable than parallel crawling, and have no downsides to boot.
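The effect of connection re-use can be sketched with Ruby's standard library. This example is illustrative only: it uses Net::HTTP rather than libcurl, and a tiny local server written for the demonstration, so it shows the general idea rather than how Wgit performs its requests.

```ruby
require 'net/http'
require 'socket'

# Tiny local HTTP server (illustrative only) that counts accepted TCP connections.
server = TCPServer.new('127.0.0.1', 0) # port 0 lets the OS pick a free port
port = server.addr[1]
connections = 0

acceptor = Thread.new do
  loop do
    socket = server.accept
    connections += 1
    Thread.new(socket) do |sock|
      # Keep answering requests on this socket until the client closes it.
      while sock.gets # request line
        loop do # discard headers up to the blank line
          header = sock.gets
          break if header.nil? || header == "\r\n"
        end
        sock.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
      end
      sock.close
    end
  end
end

paths = %w[/a /b /c]

# 1) A fresh connection (and handshake) for every request.
#    The third argument (nil) disables any proxy taken from the environment.
paths.each do |path|
  Net::HTTP.start('127.0.0.1', port, nil) { |http| http.get(path) }
end
fresh = connections

# 2) One connection kept open and re-used for every request.
Net::HTTP.start('127.0.0.1', port, nil) do |http|
  paths.each { |path| http.get(path) }
end
reused = connections - fresh

acceptor.kill
server.close
puts "fresh: #{fresh} connections, re-used: #{reused} connection(s)"
```

Against a remote HTTPS server, every extra connection in the first approach would cost additional TCP and TLS round trips, which is the overhead that re-use avoids.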
Therefore, Wgit has no future intentions of crawling in parallel. It's great as is :-)