
Developing Alternative Frontier Implementations


Motivation

The current BdbFrontier is the only widely-used Frontier implementation. It works, but it has some significant limitations:

  1. Checkpoint-only crawl state management: a crawl can only be resumed from a checkpoint; it is not possible to restart directly from the state of a crawl that could not be stopped 'neatly' (e.g. VM death, system failure, etc.)
  2. Opaque: the state is stored as key-value pairs where the values are Kryo-serialised CrawlURI instance blobs. This means only the right versions of the H3 codebase are able to look inside the frontier, and upgrades to H3 can render the state unusable. The Frontier also cannot be analysed from other languages.
  3. Single-process locked: the state can only be inspected or modified from the running H3 instance itself. We can't use external tools to manage the frontier, or run multiple crawlers over the same Frontier (even if they are in Java, as per point (2)).
  4. Bloated: the state can take up a LOT of space, as can the regular checkpoints you need to take and manage because of (1). The space used is partly down to BDB not being especially efficient, but also to (2), which makes it possible to end up with large Kryo blobs (because of all the ways additional data and objects can be embedded inside the H3 CrawlURI object).
  5. Complex: the BdbFrontier does a lot of different-but-related things and is very difficult to understand and reason about. It also implements a sophisticated caching system, in which the in-memory Frontier is updated 'live' and occasionally flushed to disk.
  6. Concurrency: the BdbFrontier is intended to be run with many ToeThreads, but the complexity of the state management means a number of locks and synchronisations have been put in place to try to keep that state valid and consistent. These appear to have made the codebase overly cautious in places, so that as the number of ToeThreads is increased, BdbFrontier lock contention becomes a significant bottleneck. This gets worse for larger crawls, due to interactions with the in-memory caching/flushing system. (ANJ: I haven't got to the bottom of all this, but I have observed that crawlers with ample memory and CPU available do not manage to use those resources, despite there being no observable I/O wait. This is consistent with some kind of lock contention, but I've not proven exactly where the problem arises.)

Responsibilities

The problem with the Heritrix3 frontier is that it does a lot of related things. The base Frontier interface looks simple enough: it accepts CrawlURIs, each with a QueueKey and a priority, stores them, and then releases them at crawl time. That is, as well as storing the URIs in queues, it is aware of the crawl delays and thus of the time at which each queue is next due. However, as hinted at by the AbstractFrontier, there are other things going on: there are four layers of inheritance (Frontier, AbstractFrontier, WorkQueueFrontier, BdbFrontier), each bringing additional functionality.
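To make that core responsibility concrete, the following is a minimal sketch of the structure the Frontier implies: per-key URI queues, plus a schedule of when each queue is next due. All names here are illustrative, not the H3 API, and a real implementation would also have to handle in-flight URIs, retries, and duplicate due-entries:

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Queue;

/** Illustrative sketch only; all names are assumptions, not the H3 API. */
public class SketchFrontier {
    /** A queue key (e.g. a host) plus the instant it is next due for a fetch. */
    record DueQueue(String queueKey, long dueAtMillis) {}

    // One FIFO of pending URIs per queue key.
    private final Map<String, Queue<String>> queues = new HashMap<>();
    // Queue keys ordered by when they next become eligible (politeness delay).
    private final PriorityQueue<DueQueue> due =
            new PriorityQueue<>((a, b) -> Long.compare(a.dueAtMillis(), b.dueAtMillis()));

    /** Accept a URI into its queue; a newly non-empty queue is due immediately. */
    public synchronized void schedule(String queueKey, String uri) {
        Queue<String> q = queues.computeIfAbsent(queueKey, k -> new ArrayDeque<>());
        if (q.isEmpty()) {
            due.add(new DueQueue(queueKey, System.currentTimeMillis()));
        }
        q.add(uri);
    }

    /** Release the next URI whose queue is due, or null if nothing is ready yet. */
    public synchronized String next() {
        DueQueue head = due.peek();
        if (head == null || head.dueAtMillis() > System.currentTimeMillis()) {
            return null; // no queue is due yet
        }
        due.poll();
        return queues.get(head.queueKey()).poll();
    }

    /** After a fetch completes, re-queue the key with the configured crawl delay. */
    public synchronized void finished(String queueKey, long crawlDelayMillis) {
        if (!queues.get(queueKey).isEmpty()) {
            due.add(new DueQueue(queueKey, System.currentTimeMillis() + crawlDelayMillis));
        }
    }
}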

In general, much of the complexity arises from implementation details of the BdbFrontier design, so it probably makes more sense to implement any new frontier at the highest level (implements Frontier) and only implement the functionality that is needed. However, the BdbFrontier has been around a long time, and it's possible other parts of H3 have become somewhat hard-coded against its capabilities. If this is true, it will likely only become clear during testing, because using frameworks like Spring means lots of dependencies are pulled in dynamically at runtime rather than being declared at compile time.

Looking at the AbstractFrontier we see:

public abstract class AbstractFrontier 
    implements Frontier,
               SeedListener, 
               HasKeyedProperties,
               ExtractorParameters,
               CrawlUriReceiver,
               ApplicationListener<ApplicationEvent> {

For basic crawl functionality, it is likely that any implementation will need to implement SeedListener (if H3 is being used to manage seeds) and CrawlUriReceiver (if URI uniqueness is not part of the Frontier implementation itself). However, for an externally-managed Frontier, it would be possible to handle seeds and uniqueness entirely outside of H3 and just use H3 as a crawl pool that gets what it's told to get and writes the WARCs/logs, as sketched below. This depends on how crawl scope is managed (which is often handled via additional SeedListeners that are part of the Scope DecideRules).
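As a hypothetical sketch of that 'crawl pool' arrangement (every type and name below is invented for illustration; none of this is the H3 API), the frontier side could be as thin as:

public class ExternalFrontierSketch {

    /** Invented stand-in for the external system that owns the crawl state. */
    interface ExternalQueueClient {
        String popDueUri();                          // next due URI, or null
        void reportFinished(String uri, int status); // outcome goes back out
    }

    private final ExternalQueueClient external;

    public ExternalFrontierSketch(ExternalQueueClient external) {
        this.external = external;
    }

    // The crawler just fetches whatever the external system hands over;
    // seeds, scope and uniqueness are all decided on the other side.
    public String next() {
        return external.popDueUri();
    }

    public void finished(String uri, int status) {
        external.reportFinished(uri, status);
    }
}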

The other features would likely only be implemented if necessary. For example, ApplicationListener could be used to track crawl events, and this could be used to drop and reconnect to an external database on pause/unpause (rather than the connection having to be held open all the time). HasKeyedProperties refers to H3's 'sheets' configuration system, which allows some Frontier-level configuration to be controlled, e.g. the ExtractorParameters, which cover settings like the maximum number of outlinks to extract.
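As a sketch of the pause/unpause idea (CrawlStateEvent and the State values are taken from the H3 codebase, but should be verified against the version in use; the connection handling is a placeholder):

import org.archive.crawler.event.CrawlStateEvent;
import org.archive.crawler.framework.CrawlController;
import org.springframework.context.ApplicationEvent;
import org.springframework.context.ApplicationListener;

public class PauseAwareConnectionManager
        implements ApplicationListener<ApplicationEvent> {

    // Hypothetical handle on the external frontier store (database, queue, etc.).
    private AutoCloseable connection;

    @Override
    public void onApplicationEvent(ApplicationEvent event) {
        // CrawlStateEvent and its State values come from the H3 codebase;
        // check the exact names against the H3 version in use.
        if (!(event instanceof CrawlStateEvent)) {
            return;
        }
        CrawlController.State state = ((CrawlStateEvent) event).getState();
        try {
            if (state == CrawlController.State.PAUSED && connection != null) {
                connection.close(); // drop the connection while paused
                connection = null;
            } else if (state == CrawlController.State.RUNNING && connection == null) {
                connection = openConnection(); // reconnect on unpause
            }
        } catch (Exception e) {
            throw new IllegalStateException("External store reconnect failed", e);
        }
    }

    // Placeholder: open whatever external connection the frontier needs.
    private AutoCloseable openConnection() {
        throw new UnsupportedOperationException("sketch only");
    }
}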

Testing

Assuming we have a Frontier implementation that covers seed injection and URI uniqueness, it should be possible to design a set of Crawler Beans that uses it and run a test crawl.
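For example, the frontier bean in the standard crawler-beans.cxml (normally org.archive.crawler.frontier.BdbFrontier) could be swapped for the new implementation; the class name and property below are hypothetical, and all other beans are left as-is:

<bean id="frontier" class="org.example.frontier.ExternalFrontier">
  <!-- hypothetical property, e.g. where the external frontier state lives -->
  <property name="connectionUrl" value="redis://localhost:6379" />
</bean>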
