Andy Jackson edited this page Sep 14, 2018 · 1 revision

A Heritrix crawl job has three main pipelines, known as Processor Chains (sequential applications of swappable Processor modules -- see Processor Settings), with the Frontier acting as a buffer between the first two:

  • The Candidates Chain:
    • This processes incoming Crawl URIs, deciding whether to keep them (according to the Scope) and priming them to be deposited in the Frontier.
    • See Candidate Chain Processors
  • The Frontier:
    • Crawl URIs accepted into this crawl are stored here in priority order, in a set of distinct queues.
    • Usually, there is one queue per 'authority' (e.g. example.com:80), and the queue management ensures the desired crawl delay is honoured for each queue.
    • See Frontier
  • The Fetch Chain:
    • As Crawl URIs are emitted by the Frontier, the fetch chain processes each one and decides what to do with it, how to download it, etc.
    • This chain also performs operations like link extraction.
    • See Fetch Chain Processors
  • The Disposition Chain:
    • Once the Fetch Chain has finished, any required post-processing is handled here.
    • For example, this is where the downloaded resources are written into WARC files.
    • See Disposition Chain Processors
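In the crawler-beans.cxml configuration, each chain is a Spring bean whose `processors` list defines the pipeline. The fragment below is an abridged sketch based on the bean names used in a typical default profile; the exact set of processor beans varies by crawl profile:

```xml
<!-- Abridged sketch: the three chains as Spring beans.
     Processor bean ids shown here follow the default profile's
     naming conventions and may differ in your configuration. -->
<bean id="candidateProcessors" class="org.archive.modules.CandidateChain">
  <property name="processors">
    <list>
      <ref bean="candidateScoper"/>  <!-- apply Scope decide rules -->
      <ref bean="preparer"/>         <!-- prime URI for the Frontier -->
    </list>
  </property>
</bean>

<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <ref bean="fetchDns"/>         <!-- resolve the host -->
      <ref bean="fetchHttp"/>        <!-- download the resource -->
      <ref bean="extractorHtml"/>    <!-- link extraction -->
    </list>
  </property>
</bean>

<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
    <list>
      <ref bean="warcWriter"/>       <!-- write content to WARC files -->
      <ref bean="candidates"/>       <!-- feed discovered URIs back in -->
      <ref bean="disposition"/>      <!-- update Frontier, politeness -->
    </list>
  </property>
</bean>
```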

Each URI taken off the Frontier queue runs through the processing chains.  URIs are always processed in the order shown in the diagram below, unless a particular processor throws a fatal error or decides to stop the processing of the current URI. 

Each processing chain is made up of zero or more individual processors.  For example, the FetchChain might comprise the extractorCss and extractorJs processors.  Within a processing chain, the order in which the processors are run is the order in which they are listed in the crawler-beans.cxml file.
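Concretely, changing processor order means reordering the `<ref>` elements inside the chain's list. A hypothetical fetch chain fragment in which extractorCss runs before extractorJs might look like this (elided entries shown as a comment; bean ids follow the default profile's conventions):

```xml
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- ...earlier fetch/extract processors... -->
      <ref bean="extractorCss"/>  <!-- runs first: listed first -->
      <ref bean="extractorJs"/>   <!-- runs second -->
    </list>
  </property>
</bean>
```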

Attachments:

HeritrixProcessorChains.png (image/png)
