Common Index File Format
The Common Index File Format (CIFF) represents an attempt to build a binary data exchange format for open-source search engines to interoperate by sharing index structures. For more details, check out:
- Jimmy Lin, Joel Mackenzie, Chris Kamphuis, Craig Macdonald, Antonio Mallia, Michał Siedlaczek, Andrew Trotman, Arjen de Vries. Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format. arXiv:2003.08276.
A CIFF export consists of a single file with the following protobuf messages:
- A `Header`, followed by
- Exactly the number of `PostingsList` messages specified in the `num_postings_lists` field of the `Header`, followed by
- Exactly the number of `DocRecord` messages specified in the `num_docs` field of the `Header`.
See our design rationale for additional discussion.
After cloning this repo, build CIFF with Maven:

```
mvn clean package appassembler:assemble
```
Reference Lucene Indexes
| Collection | Description | Size |
|:---|:---|---:|
| Robust04 | CIFF export, complete | 162M |
| Robust04 | CIFF export, queries only | 16M |
| Robust04 | Source Lucene index | 135M |
| ClueWeb12-B13 | CIFF export, complete | 25G |
| ClueWeb12-B13 | CIFF export, queries only | 1.2G |
| ClueWeb12-B13 | Source Lucene index | 21G |
The following invocation can be used to examine an export:

```
target/appassembler/bin/ReadCIFF -input robust04-complete-20200306.ciff.gz
```
We provide a full guide on how to replicate the above results here.
A CIFF export can be ingested into a number of different search systems.
- JASSv2 via the tool ciff_to_JASS.
- PISA via the PISA CIFF Binaries.
- OldDog, by creating CSV files through CIFF.
- Terrier via the Terrier-CIFF plugin.
Tips for writing your own CIFF Importer / Exporter
The systems above all provide concrete examples of taking an existing CIFF structure and converting it into a different (internal) index format. Most of the data/structures within the CIFF are quite straightforward and self-documenting. However, there are a few important details which should be noted.
The default CIFF exports come from Anserini. These exports encode document identifiers as deltas (d-gaps). Hence, when decoding a CIFF structure, care needs to be taken to recover the original identifiers by computing a prefix sum across each postings list. See the discussion here.
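As a sketch of this recovery step, assuming the d-gaps of one postings list have already been parsed into a list of integers (the first entry is the gap from zero, i.e., the first absolute docid):

```python
def decode_dgaps(gaps):
    """Recover absolute document identifiers from a d-gapped postings list.

    Each entry stores the difference from the previous posting's docid,
    so a running prefix sum restores the original identifiers.
    """
    docids = []
    prev = 0
    for gap in gaps:
        prev += gap
        docids.append(prev)
    return docids

# Gaps [3, 2, 7] decode to absolute docids [3, 5, 12].
```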
Since Anserini is based on Lucene, it is important to note that document lengths are encoded in a lossy manner. This means that the document lengths recorded in the `DocRecord` structure are approximate - see the discussion here.
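To illustrate the effect, here is a hypothetical quantizer in the same spirit (this is *not* Lucene's actual norm encoding, just a sketch of a similar kind of lossy compression that keeps only a few significant bits of each length):

```python
def lossy_length(length: int) -> int:
    """Illustrative lossy length encoding: keep ~4 significant bits.

    Small lengths round-trip exactly; large lengths come back rounded
    down, which is the kind of approximation an importer should expect.
    """
    dropped = max(length.bit_length() - 4, 0)
    return (length >> dropped) << dropped
```

For example, a document of length 1000 comes back as 960, while a length of 10 survives exactly. The practical consequence is that downstream ranking code should not assume the recorded lengths are exact token counts.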
Multiple records are stored in a single file using Java protobuf's `parseDelimitedFrom()` and `writeDelimitedTo()` methods. Unfortunately, these methods are not available in the protobuf bindings for other languages, but they can be trivially reimplemented by reading/writing the byte size of each record as a varint - see the discussion here.
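A minimal reimplementation of this framing might look as follows (`write_delimited` and `read_delimited` are illustrative names; `payload` stands in for the bytes returned by `message.SerializeToString()` in whichever protobuf binding you use):

```python
import io

def write_delimited(out, payload: bytes) -> None:
    """Write one length-prefixed record: a varint byte size, then the bytes.

    Mirrors what Java protobuf's writeDelimitedTo() produces on the wire.
    """
    n = len(payload)
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.write(bytes([byte | 0x80]))  # more varint bytes follow
        else:
            out.write(bytes([byte]))
            break
    out.write(payload)

def read_delimited(inp) -> bytes:
    """Read one length-prefixed record, the inverse of write_delimited()."""
    size, shift = 0, 0
    while True:
        b = inp.read(1)
        if not b:
            raise EOFError("truncated varint")
        size |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:  # high bit clear: last varint byte
            break
        shift += 7
    return inp.read(size)
```

For example, round-tripping two records through an in-memory buffer:

```python
buf = io.BytesIO()
write_delimited(buf, b"first record")
write_delimited(buf, b"second")
buf.seek(0)
assert read_delimited(buf) == b"first record"
assert read_delimited(buf) == b"second"
```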