Common Index File Format
The Common Index File Format (CIFF) represents an attempt to build a binary data exchange format for open-source search engines to interoperate by sharing index structures. For more details, check out:
- Jimmy Lin, Joel Mackenzie, Chris Kamphuis, Craig Macdonald, Antonio Mallia, Michał Siedlaczek, Andrew Trotman, Arjen de Vries. Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format. arXiv:2003.08276.
All data are contained in a single file, with the extension
The file comprises a sequence of delimited protobuf messages defined here, exactly as follows:
- Exactly the number of PostingsList messages specified in the num_postings_lists field of the header message
- Exactly the number of DocRecord messages specified in the num_docs field of the header message
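As a rough illustration of this layout, the sketch below partitions a stream of already-unframed message payloads into its three parts. It assumes the header message comes first, since its counts are needed to read everything after it. The names `split_ciff_frames` and `read_counts` are hypothetical, and `read_counts` stands in for whatever protobuf parsing your language binding provides.

```python
from typing import Callable, Iterable, List, Tuple


def split_ciff_frames(
    frames: Iterable[bytes],
    read_counts: Callable[[bytes], Tuple[int, int]],
) -> Tuple[bytes, List[bytes], List[bytes]]:
    """Partition raw payloads into (header, postings lists, doc records).

    `read_counts` is a caller-supplied (hypothetical) function that parses
    the header payload and returns (num_postings_lists, num_docs).
    """
    it = iter(frames)

    def take(what: str) -> bytes:
        # Turn a premature end of input into a descriptive error.
        try:
            return next(it)
        except StopIteration:
            raise ValueError(f"file ended while reading {what}") from None

    header = take("the header")
    num_postings_lists, num_docs = read_counts(header)
    postings = [take("a PostingsList") for _ in range(num_postings_lists)]
    docs = [take("a DocRecord") for _ in range(num_docs)]
    # The format allows exactly the announced number of messages, no more.
    if next(it, None) is not None:
        raise ValueError("unexpected trailing message")
    return header, postings, docs
```

For example, `split_ciff_frames([b"H", b"p1", b"p2", b"d1"], lambda h: (2, 1))` treats the first payload as the header, the next two as postings lists, and the last as a document record.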
See our design rationale for additional discussion.
Explained in terms of xkcd, we're trying to avoid this. Instead, CIFF aims to be this.
After cloning this repo, build CIFF with Maven:
mvn clean package appassembler:assemble
Reference Lucene Indexes
Currently, this repo provides a utility to export CIFF from Lucene, via Anserini. For reference, we provide exports from the Robust04 and ClueWeb12-B13 collections:
| Collection | Contents | Size |
|---|---|---|
| Robust04 | CIFF export, complete | 162M |
| Robust04 | CIFF export, queries only | 16M |
| Robust04 | Source Lucene index | 135M |
| ClueWeb12-B13 | CIFF export, complete | 25G |
| ClueWeb12-B13 | CIFF export, queries only | 1.2G |
| ClueWeb12-B13 | Source Lucene index | 21G |
The following invocation can be used to examine an export:
target/appassembler/bin/ReadCIFF -input robust04-complete-20200306.ciff.gz
We provide a full guide on how to replicate the above results here.
A CIFF export can be ingested into a number of different search systems.
- JASSv2 via the tool ciff_to_JASS.
- PISA via the PISA CIFF Binaries.
- OldDog by creating CSV files through CIFF.
- Terrier via the Terrier-CIFF plugin.
Tips for writing your own CIFF Importer / Exporter
The systems above all provide concrete examples of taking an existing CIFF structure and converting it into a different (internal) index format. Most of the data structures in CIFF are straightforward and self-documenting, but a few important details deserve attention.
The default CIFF exports come from Anserini. Those exports are engineered to encode document identifiers as deltas (d-gaps). Hence, when decoding a CIFF structure, care needs to be taken to recover the original identifiers by computing a prefix sum across each postings list. See the discussion here.
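A minimal sketch of that prefix sum, assuming the first value in each postings list is the gap from zero (that is, the first document identifier itself). The function names are hypothetical, and the input is taken to be a plain list of gaps already extracted from a postings list.

```python
from itertools import accumulate
from typing import List


def decode_docids(gaps: List[int]) -> List[int]:
    """Recover absolute document identifiers from d-gaps via a prefix sum."""
    return list(accumulate(gaps))


def encode_docids(docids: List[int]) -> List[int]:
    """Inverse operation: turn sorted identifiers back into d-gaps."""
    previous = 0
    gaps = []
    for docid in docids:
        gaps.append(docid - previous)
        previous = docid
    return gaps


print(decode_docids([3, 2, 5, 1]))   # [3, 5, 10, 11]
print(encode_docids([3, 5, 10, 11]))  # [3, 2, 5, 1]
```

The running total must restart for every postings list; carrying it over from one list to the next would produce identifiers that are too large.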
Since Anserini is based on Lucene, it is important to note that document lengths are encoded in a lossy manner. This means that the document lengths recorded in the DocRecord structure are approximate - see the discussion here.
Multiple records are stored in a single file using Java protobuf's parseDelimitedFrom() and writeDelimitedTo() methods. Unfortunately, these methods are not available in the bindings for other languages. They can be trivially reimplemented by reading/writing the byte size of each record as a varint before the record itself - see the discussion here.
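The following is one possible Python sketch of that framing, not the code used by any of the importers above. It assumes each record is preceded by its length as a base-128 varint (7 payload bits per byte, least significant group first, high bit set on every byte except the last). The function names are hypothetical; each yielded payload would then be handed to the generated protobuf class for parsing.

```python
import io
from typing import BinaryIO, Iterator, Optional


def write_delimited(stream: BinaryIO, payload: bytes) -> None:
    """Write a varint length prefix followed by the serialized record."""
    size = len(payload)
    while True:
        byte = size & 0x7F
        size >>= 7
        if size:
            stream.write(bytes([byte | 0x80]))  # more length bytes follow
        else:
            stream.write(bytes([byte]))         # last length byte
            break
    stream.write(payload)


def read_varint(stream: BinaryIO) -> Optional[int]:
    """Read one varint; return None on a clean end of stream."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("stream ended inside a varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7


def read_delimited(stream: BinaryIO) -> Iterator[bytes]:
    """Yield the raw bytes of each length-prefixed record in order."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a record")
        yield payload


buffer = io.BytesIO()
write_delimited(buffer, b"a" * 300)
write_delimited(buffer, b"xyz")
buffer.seek(0)
print([len(record) for record in read_delimited(buffer)])  # [300, 3]
```

Since the reference exports are distributed as .ciff.gz files, the stream passed to `read_delimited` would typically need to be opened through a gzip reader (for example Python's `gzip.open` in binary mode) rather than as a plain file.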