Skip to content

Commit

Permalink
docs: add info on index formats
Browse files Browse the repository at this point in the history
  • Loading branch information
ikreymer committed Oct 17, 2017
1 parent bfbada1 commit f01fa50
Show file tree
Hide file tree
Showing 3 changed files with 82 additions and 12 deletions.
2 changes: 1 addition & 1 deletion docs/manual/cdxserver_api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -203,7 +203,7 @@ Pagination API
^^^^^^^^^^^^^^

The cdx server supports an optional pagination api, but it is currently
only available when using `ZipNum Compressed Index`_ instead of a plain
only available when using :ref:`zipnum` instead of a plain
text cdx files. (Additional pagination support may be added for CDXJ
files as well).

Expand Down
74 changes: 72 additions & 2 deletions docs/manual/indexing.rst
Original file line number Diff line number Diff line change
@@ -1,2 +1,72 @@
pywb Indexing
=============
Indexing
========

To provide access to the web archival data (local and remote), pywb uses indexes to represent each "capture" or "memento" in the archive. The WARC format itself does not provide a specific index, so an external index is needed.

Creating an Index
-----------------

When adding a WARC using ``wb-manager``, pywb automatically generates a :ref:`cdxj-index`

The index can also be created explicitly using ``cdx-indexer`` command line tool::

cdx-indexer -j example2.warc.gz
com,example)/ 20160225042329 {"offset":"363","status":"200","length":"1286","mime":"text/html","filename":"example2.warc.gz","url":"http://example.com/","digest":"37cf167c2672a4a64af901d9484e75eee0e2c98a"}
Note: the cdx-indexer tool is deprecated and will be replaced by the standalone `cdxj-indexer <https://github.com/webrecorder/cdxj-indexer>`_ package.


Index Formats
-------------

Classic CDX
^^^^^^^^^^^

Traditionally, an index for a web archive (WARC or ARC) file has been called a CDX file, probably from Capture/Crawl inDeX (CDX).

The CDX format originates with the Internet Archive and represents a plain-text space-delimited format, each line representing the information about a single capture. The CDX format could contain many different fields, and unfortunately, no standardized format existed.
The order of the fields typically includes a searchable url key and timestamp, to allow for binary sorting and search.
The 'url search key' is typically reversed and to allow for easier searching of subdomains, eg. ``example.com`` -> ``com,example,)/``

A classic CDX file might look like this::

CDX N b a m s k r M S V g
com,example)/ 20160225042329 http://example.com/ text/html 200 37cf167c2672a4a64af901d9484e75eee0e2c98a - - 1286 363 example2.warc.gz

A header is used to index the fields in the file, though typically a standard variation is used.

.. _cdxj-index:

CDXJ Format
^^^^^^^^^^^

The pywb system uses a more flexible version of the CDX, called CDXJ, which stores most of the fields in a JSON dictionary::

com,example)/ 20160225042329 {"offset":"363","status":"200","length":"1286","mime":"text/html","filename":"example2.warc.gz","url":"http://example.com/","digest":"37cf167c2672a4a64af901d9484e75eee0e2c98a"}

The CDXJ format allows for more flexibility by allowing the index to contain a varying number of fields, while still allow the index to be sortable by a common key (url key + timestamp). This allows CDXJ indexes from different sources and different number of fields to be merged and sorted.

Using CDXJ indexes is recommended and pywb provides the ``wb-manager migrate-cdx`` tool for converting classic CDX to CDXJ.

In general, most discussions of CDX also apply to CDXJ indexes.

.. _zipnum:

ZipNum Sharded Index
^^^^^^^^^^^^^^^^^^^^

A CDX(J) file is generally accessed by doing a simple binary search through the file. This scales well to very large (GB+) CDXJ files. However, for very large archives (TB+ or PB+), binary search across a single file has its limits.

A more scalable alternative to a single CDX(J) file is gzip compressed chunked cluster of CDXJ, with a binary searchable index.
In this format, sometimes called the *ZipNum* or *Ziplines cluster* (for some X number of cdx lines zipped together), all actual CDXJ lines are gzipped compressed an concatenated together. To allow for random access, the lines are gzipped in groups of X lines (often 3000, but can be anything). This allows for the full index to be spread over N number of gzipped files, but has the overhead of requiring N lines to be read for each lookup. Generally, this overhead is negligible when looking up large indexes, and non-existent when doing a range query across many CDX lines.

The index can be split into an arbitrary number of shards, each containing a certain range of the url space. This allows the index to be created in parallel using MapReduce with a reduce task per shard. For each shard, there is an index file and a secondary index file. At the end, the secondary index is concatenated to form the final, binary searchable index.

The `webarchive-indexing <https://github.com/ikreymer/webarchive-indexing>`_ project provides tools for creating such an index, both locally and via MapReduce.

Single-Shard Index
""""""""""""""""""

A ZipNum index need not have multiple shards, and provides advantages even for smaller datasets. For example, in addition to less disk space from using compressed index, using the ZipNum index allows for the :ref:`pagination-api` to be available when using the cdx server for bulk querying.


18 changes: 9 additions & 9 deletions docs/manual/warcserver.rst
Original file line number Diff line number Diff line change
Expand Up @@ -90,9 +90,9 @@ index data was found), a 404 is returned.
WARC Record HTTP Response
"""""""""""""""""""""""""

When using Warcserver, the entire **WARC Record** is included in the HTTP response. This may seem confusing as the WARC record itself contains an HTTP response! Warcserver also includes additional metadata as custom HTTP headers.
When using Warcserver, the entire *WARC record* is included in the HTTP response. This may seem confusing as the WARC record itself contains an HTTP response! Warcserver also includes additional metadata as custom HTTP headers.

The following example illustrates what is transmitted when retrieving ``curl http://localhost:8070/pywb/index?url=iana.org``::
The following example illustrates what is transmitted when retrieving ``curl``-ing ``http://localhost:8070/pywb/index?url=iana.org``::

> GET /pywb/resource?url=iana.org HTTP/1.1
> Host: localhost:8070
Expand All @@ -110,11 +110,11 @@ The following example illustrates what is transmitted when retrieving ``curl htt
< Warcserver-Type: warc
< Date: Tue, 17 Oct 2017 00:32:12 GMT

WARC/1.0
WARC-Type: response
WARC-Date: 2014-01-26T20:06:24Z
WARC-Target-URI: http://www.iana.org/
WARC-Record-ID: <urn:uuid:4eec4942-a541-410a-99f4-50de39b62118>
< WARC/1.0
< WARC-Type: response
< WARC-Date: 2014-01-26T20:06:24Z
< WARC-Target-URI: http://www.iana.org/
< WARC-Record-ID: <urn:uuid:4eec4942-a541-410a-99f4-50de39b62118>
...

The HTTP payload is the WARC record itself but HTTP headers returned "surface" additional information
Expand Down Expand Up @@ -156,7 +156,7 @@ The sources include:
* CDX Server API Endpoint


The index types can be defined using either shorthand **sourcename+<url>** notation or a long-form full property declaration
The index types can be defined using either shorthand *sourcename+<url>* notation or a long-form full property declaration

The following is an example of defining different special collections::

Expand Down Expand Up @@ -281,7 +281,7 @@ It should be easy to add a custom index source, by extending :class:`pywb.warcse
You can then use the index in a ``config.yaml``::

collections:
my-coll: my-cool-index
my-coll: my-index-src

For more information and definition of existing indexes, see :mod:`pywb.warcserver.index.indexsource`
Expand Down

0 comments on commit f01fa50

Please sign in to comment.