Why Doesn't Wgit Crawl In Parallel?

Many web crawlers boast parallelism out of the box, sending multiple HTTP requests at once. Wgit doesn't work like this. By default, everything in Wgit is performed in sequence, mainly for simplicity and more predictable results.

Since Wgit is a library, however, it is possible to call its functionality inside parallel constructs such as threads. To do so, you need to know the URLs (one per thread) ahead of time, e.g. urls.map { |url| Thread.new { crawler.crawl(url) } }.each(&:join). Wgit is thread safe for this very reason.
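A minimal sketch of that thread-per-URL pattern follows. It is only an illustration: `crawl_in_threads` is a hypothetical helper, not part of Wgit, and the block passed to it is a stand-in for the real work (with Wgit, that is where a call such as crawler.crawl(url) would go). No network access is needed to run it.

```ruby
# Hypothetical helper (not part of Wgit): runs the given block once per URL,
# each in its own thread, and returns the results in the same order as `urls`.
def crawl_in_threads(urls, &crawl)
  # The full list of URLs must be known up front, one thread per URL.
  threads = urls.map do |url|
    Thread.new { crawl.call(url) }
  end

  # Thread#value joins the thread and returns the block's result.
  threads.map(&:value)
end

urls = ["http://a.example", "http://b.example"]

# Stand-in block; replace its body with the real crawl call.
pages = crawl_in_threads(urls) { |url| "crawled #{url}" }
puts pages
```

Collecting results with Thread#value keeps the output order tied to the input order, even though the threads themselves may finish in any order.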

Wgit's Crawler#crawl_site and Indexer#index_site methods crawl all internal links within the site's host in sequence, in the same order that they're found and parsed from the HTML. This is deliberate, to ensure crawls are easy to understand and track. It is also because, during benchmarking of parallelism using the async gem, the speed increase was found to be modest to non-existent.
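To show why a sequential crawl is easy to track, here is a rough sketch of visiting links in the order they are found. It is not Wgit's actual implementation: `crawl_in_sequence` and the in-memory `site` hash are hypothetical, with the hash standing in for pages and the internal links parsed from each one.

```ruby
# Hypothetical sketch (not Wgit's code): visit pages one at a time, queueing
# newly found internal links in the order they appear on each page.
def crawl_in_sequence(site, start)
  queue   = [start]          # links waiting to be crawled, first in first out
  seen    = { start => true } # links already queued, to avoid repeats
  visited = []                # the order in which pages were crawled

  until queue.empty?
    page = queue.shift
    visited << page

    site.fetch(page, []).each do |link|
      next if seen[link]

      seen[link] = true
      queue << link
    end
  end

  visited
end

# Each page maps to the internal links found on it, in document order.
site = {
  "/"       => ["/about", "/blog"],
  "/about"  => ["/", "/team"],
  "/blog"   => ["/blog/1", "/about"],
  "/team"   => [],
  "/blog/1" => ["/"]
}

puts crawl_in_sequence(site, "/").inspect
# => ["/", "/about", "/blog", "/team", "/blog/1"]
```

Given the same pages, this always produces the same visit order. Once requests run in parallel, the order depends on which responses happen to arrive first.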

Wgit was benchmarked with the async gem at several levels:

Wgit::Crawler#crawl_site
Wgit::Crawler#crawl_urls
Wgit::Indexer#index_site

In all experiments, parallelism was found to have minimal positive impact on crawl performance. The added downside was that crawling in parallel makes the crawl less deterministic overall. The price to pay is in no way worth the gain (since there is no gain).

The main reasons why a performance improvement wasn't observed are:

Servers typically rate limit requests, especially from crawlers, so crawling in parallel wasn't faster overall.
Bottlenecks exist elsewhere, which crawling in parallel doesn't address.

What does make a difference to the overall crawl performance of sites is:

DNS lookup caching
TCP/TLS connection re-use (avoiding new handshakes that require additional round trips to the server)

Both of these are already used by Wgit's network requests, which rely on libcurl under the hood. Overall, these optimisations are much more valuable than parallel crawling, and have no downsides to boot.
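To put a rough number on the connection re-use point, the sketch below counts network round trips for a batch of requests to one host. It is a simplified model, not a measurement: `round_trips` is a hypothetical helper, and it assumes one round trip for the TCP handshake, one for a full TLS 1.3 handshake (two for TLS 1.2), and one per HTTP request. DNS lookups are ignored.

```ruby
# Hypothetical back-of-the-envelope model (not a benchmark): counts round
# trips to a single host for a number of HTTPS requests.
#
# Assumptions: TCP handshake = 1 round trip, TLS handshake = `tls_rtts`
# round trips (1 for TLS 1.3, 2 for TLS 1.2), each request = 1 round trip.
def round_trips(requests:, reuse:, tls_rtts: 1)
  handshake   = 1 + tls_rtts            # cost of opening one connection
  connections = reuse ? 1 : requests    # re-use opens a single connection

  (connections * handshake) + requests
end

puts round_trips(requests: 10, reuse: false) # => 30
puts round_trips(requests: 10, reuse: true)  # => 12
```

Under these assumptions, re-using one connection for ten requests cuts the round trips from 30 to 12, with no change to the order in which pages are crawled.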

Therefore, Wgit has no future intentions of crawling in parallel. It's great as is :-)
