Web archiving using Google Chrome



Preservation for the modern web, powered by headless Google Chrome.


Quick start

These dependencies must be present to run crocoite: Python 3 and Google Chrome.

The following commands clone the repository from GitHub, set up a virtual environment and install crocoite:

git clone https://github.com/PromyLOPh/crocoite.git
cd crocoite
virtualenv -p python3 sandbox
source sandbox/bin/activate
pip install .

One-shot command-line interface and pywb playback:

pip install pywb
crocoite-grab http://example.com/ example.com.warc.gz
rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
wayback &
$BROWSER http://localhost:8080


Most modern websites depend heavily on executing code, usually JavaScript, on the user’s machine. They also make use of new and emerging Web technologies like HTML5, WebSockets, service workers and more. Even worse from a preservation point of view, they require some form of user interaction to dynamically load more content (infinite scrolling, dynamic comment loading, etc.).

The naive approach of fetching an HTML page, parsing it and extracting links to referenced resources is therefore not sufficient to create a faithful snapshot of these web applications. A full browser, capable of running scripts and providing modern Web APIs, is required for this task. Thankfully, Google Chrome runs without a display (headless mode) and can be controlled by external programs, allowing them to navigate and extract or inject data. This section describes the solutions crocoite offers and explains the design decisions taken.

crocoite captures resources by listening to Chrome’s network events and requesting the response body using Network.getResponseBody. This approach has caveats: the original HTTP requests and responses, as sent over the wire, are not available; they are reconstructed from parsed data. The character encoding of text documents is converted to UTF-8, and the content body of HTTP redirects cannot be retrieved due to a race condition.
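The DevTools exchange this relies on can be pictured as plain JSON messages sent over Chrome’s WebSocket debugging connection. The following is a minimal sketch, not crocoite’s actual code; the helper name and the requestId value are made up for illustration:

```python
import json

def cdp_command(msg_id, method, params=None):
    """Build a Chrome DevTools protocol command as it is framed on the
    browser's WebSocket debugging connection (JSON with id/method/params)."""
    return json.dumps({"id": msg_id, "method": method, "params": params or {}})

# Enable network tracking so Chrome starts emitting network events.
enable = cdp_command(1, "Network.enable")

# After a response event arrives carrying a requestId, the body is fetched
# in a separate call; the requestId below is a placeholder.
fetch_body = cdp_command(2, "Network.getResponseBody", {"requestId": "1000.1"})
```

Each command is answered asynchronously with a message carrying the same id, which is why every command needs a distinct msg_id.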

At the same time this allows crocoite to rely on Chrome’s well-tested network stack and HTTP parser. Thus it supports HTTP versions 1 and 2 as well as transport protocols like TLS and QUIC. Depending on Chrome also eliminates the need for a man-in-the-middle proxy like warcprox, which has to decrypt TLS traffic and present a fake certificate to the browser in order to store the transmitted content.

WARC records generated by crocoite are therefore an abstract view of the resource they represent and not necessarily the data sent over the wire. A URL fetched with HTTP/2, for example, will still result in an HTTP/1.1 request/response pair in the WARC file. This may be undesirable from an archivist’s point of view (“save the data exactly like we received it”), but this level of abstraction is inevitable when dealing with more than one protocol.
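The re-serialization step can be sketched as follows: whatever wire protocol Chrome used, the record is rebuilt from parsed fields as HTTP/1.1 text. This is a simplified illustration, not crocoite’s actual serializer:

```python
def serialize_response(status, reason, headers, body):
    """Re-serialize a parsed response as HTTP/1.1, the form in which it
    is written to the WARC file regardless of the wire protocol used."""
    lines = [f"HTTP/1.1 {status} {reason}"]
    lines += [f"{name}: {value}" for name, value in headers]
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body

# A response originally fetched over HTTP/2 still comes out as HTTP/1.1.
record = serialize_response(200, "OK",
                            [("Content-Type", "text/html; charset=utf-8")],
                            b"<html></html>")
```

Note that headers a browser never saw in this form (HTTP/2 sends them as compressed binary frames) are flattened back into the familiar textual layout.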

crocoite also interacts with, and therefore alters, the grabbed websites. It does so by injecting behavior scripts into the site. Typically these are written in JavaScript, because interacting with a page is easier that way. These scripts then perform different tasks: extracting targets from visible hyperlinks, clicking buttons or scrolling the website to load more content, as well as taking a static screenshot of <canvas> elements for the DOM snapshot (see below).
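The injection mechanism itself can be illustrated with the DevTools Runtime.evaluate method, which runs a JavaScript expression in the page. The scroll snippet below is a stand-in for a real behavior script, and the wrapper function is hypothetical:

```python
import json

# JavaScript snippet that scrolls to the bottom of the page, the kind of
# interaction a behavior script performs to trigger lazily loaded content.
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight);"

def inject_script(msg_id, expression):
    """Wrap a JavaScript expression in a Runtime.evaluate DevTools command."""
    return json.dumps({"id": msg_id, "method": "Runtime.evaluate",
                       "params": {"expression": expression}})

command = inject_script(1, SCROLL_JS)
```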

Replaying archived WARCs can be quite challenging and might not be possible with current technology (or even at all):

  • Some sites request assets based on screen resolution, pixel ratio and supported image formats (WebP). Replaying those with different parameters won’t work, since the matching assets are missing. Example: missguided.com.
  • Some fetch different scripts based on the user agent. Example: youtube.com.
  • Requests containing randomly generated JavaScript callback function names won’t work. Example: weather.com.
  • Range requests (Range: bytes=1-100) are captured as-is, making playback difficult.

crocoite offers two methods to work around these issues. Firstly, it can save a DOM snapshot to the WARC file. It contains the entire DOM in HTML format, minus <script> tags, after the site has been fully loaded, and can thus be displayed without executing scripts. Obviously JavaScript-based navigation does not work any more. Secondly, it also saves a screenshot of the full page, so even if future browsers cannot render and display the stored HTML, a fully rendered version of the website can be replayed instead.
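The effect of the script-free snapshot can be approximated in a few lines. crocoite serializes the browser’s own DOM rather than rewriting markup with a regex, so this is only a rough sketch of the idea:

```python
import re

def strip_scripts(html):
    """Remove <script> elements so the document can be rendered without
    executing any code, approximating the DOM snapshot's effect."""
    return re.sub(r"<script\b[^>]*>.*?</script>", "", html,
                  flags=re.IGNORECASE | re.DOTALL)

snapshot = strip_scripts(
    "<html><head><script>loadComments();</script></head>"
    "<body><p>Archived text</p></body></html>")
```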

Advanced usage

crocoite is built with the Unix philosophy (“do one thing and do it well”) in mind. Thus crocoite-grab can only save a single page. If you want recursion, use crocoite-recursive, which follows hyperlinks according to --policy. It can either recurse to a maximum number of levels or grab all pages with the same prefix as the start URL:

crocoite-recursive --policy prefix http://www.example.com/dir/ output

will save all pages in /dir/ and below to individual files in the output directory output. You can customize the command used to grab individual pages by appending it after output. This way distributed grabs (ssh to a different machine and execute the job there, queue the command with Slurm, …) are possible.

IRC bot

A simple IRC bot (“chromebot”) is provided with the command crocoite-irc. It reads its configuration from a config file like the example provided in contrib/chromebot.ini and supports the following commands:

a <url> -j <concurrency> -r <policy>
Archive <url> with <concurrency> processes according to recursion <policy>
s <uuid>
Get job status for <uuid>
r <uuid>
Revoke or abort running job with <uuid>
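A hypothetical invocation might look like this; the concurrency value is made up, and the policy shown is the prefix policy described above:

```
a http://www.example.com/dir/ -j 4 -r prefix
```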

Browser configuration

Generally, crocoite provides reasonable defaults for Google Chrome via its devtools module. When debugging this software it might be necessary to open a non-headless instance of the browser by running

google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs

and then passing the option --browser=http://localhost:9222 to crocoite-grab. This allows human intervention through the browser’s built-in console.

Another issue that might arise is related to fonts. Headless servers usually don’t have them installed by default, and rendered screenshots may thus contain replacement characters (□) instead of the actual text. This mostly affects non-Latin character sets. It is therefore recommended to install at least Microsoft’s Corefonts as well as DejaVu, Liberation or a similar font family covering a wide range of character sets.

Related projects

Brozzler uses Google Chrome as well, but intercepts traffic using a proxy. It supports distributed crawling and immediate playback.
Squidwarc communicates with headless Google Chrome and uses the Network API to retrieve requests, like crocoite. It supports recursive crawls and page scrolling, but neither custom JavaScript nor distributed crawling.