# Indexing notebook

### Two things to do before we start:
1. Point the environment variable `COLLECT_ROOT` to the collection root.
2. Change the working directory to the location of the installed scripts/binaries.

In [1]:
%env COLLECT_ROOT=/home/leo/flexneuart_collections

env: COLLECT_ROOT=/home/leo/flexneuart_collections


In [6]:
cd /home/leo/flexneuart_scripts/

/home/leo/flexneuart_scripts
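
Both prerequisites can be checked from a Python cell before running any indexing script. The following is only a minimal sketch: the helper name `check_setup` is hypothetical, and the only assumption about the installation is that `index/create_lucene_index.sh` (the script invoked later in this notebook) is reachable from the current directory.

```python
import os
from pathlib import Path


def check_setup(env=os.environ, cwd=None):
    """Return a list of setup problems (empty when everything looks fine)."""
    problems = []

    # 1. COLLECT_ROOT must be set and must point to an existing directory.
    root = env.get("COLLECT_ROOT")
    if not root:
        problems.append("COLLECT_ROOT is not set")
    elif not Path(root).is_dir():
        problems.append(f"COLLECT_ROOT does not point to a directory: {root}")

    # 2. The indexing scripts are invoked with relative paths, so they
    #    must be reachable from the current working directory.
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    if not (cwd / "index" / "create_lucene_index.sh").is_file():
        problems.append(f"index/create_lucene_index.sh not found under {cwd}")

    return problems


print(check_setup() or "setup looks fine")
```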


### Lucene indexer options

In [3]:
!./index/create_lucene_index.sh

Using collection root: /home/leo/flexneuart_collections
Specify collection sub-directory, e.g., msmarco_pass (1st arg)
Usage: <collection> [additional options]
Additional options:
-h print help
-exact_match create index for exact match
-index_field indexing field name (default text)
-input_subdir input data sub-directory (default input_data)
-index_subdir index subdirectory (default lucene_index)


By default, Lucene uses the content of the field `text` to create the full-text index, which is stored in the sub-directory `lucene_index`, but it is possible to create an index for an exact match, use the content of a different field, or store the inde
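
As an illustration of the non-default options, the sketch below assembles a command line from the flags listed in the usage message. It rests on several assumptions: that each option takes its value as the next argument, that the `title` field is available to the Lucene indexer for this collection, and that `lucene_index_title` is an acceptable (hypothetical) sub-directory name. The helper `lucene_index_cmd` is not part of the toolkit.

```python
import shlex


def lucene_index_cmd(collection, index_field=None, input_subdir=None, index_subdir=None):
    """Build the argument list for create_lucene_index.sh (hypothetical helper)."""
    cmd = ["./index/create_lucene_index.sh", collection]
    options = (
        ("-index_field", index_field),
        ("-input_subdir", input_subdir),
        ("-index_subdir", index_subdir),
    )
    for flag, value in options:
        # Options left as None fall back to the script's defaults.
        if value is not None:
            cmd += [flag, value]
    return cmd


cmd = lucene_index_cmd(
    "wikipedia_dpr_nq_sample",
    index_field="title",                # assumed to be indexable
    index_subdir="lucene_index_title",  # hypothetical directory name
)
print(shlex.join(cmd))
```

The printed line can be pasted after `!` in a notebook cell, or the list can be passed to `subprocess.run`.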

### Lucene index

In [4]:
!./index/create_lucene_index.sh wikipedia_dpr_nq_sample

Using collection root: /home/leo/flexneuart_collections
Input data directory: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/input_data
Index directory:      /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/lucene_index
Index field name:     text
Exact match param:    
Checking input sub-directory: bitext
Checking input sub-directory: dev
Checking input sub-directory: dev_official
Checking input sub-directory: pass_sample
Found indexable data file: pass_sample/AnswerFields.jsonl.gz
Checking input sub-directory: train_fusion
Found query file: bitext/QuestionFields.jsonl
Found query file: dev/QuestionFields.jsonl
Found query file: dev_official/QuestionFields.jsonl
Found query file: train_fusion/QuestionFields.jsonl
Using the data input file: AnswerFields.jsonl.gz
JAVA_OPTS=-Xms4117329k -Xmx28821303k -server
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.LuceneIndexer - Creating a new Lucene index, maximum # of docs to process: 2147483647 index field name: text exact match

### Forward indices

A forward index allows one to retrieve the original or parsed content of a field using a document identifier. The forward index can be stored in two formats (`dataDict` and `offsetDict`) and with two storage engines (`mapdb` and `lucene`). In most cases, the default settings (which use `mapdb` directly) work well: they give the fastest re-ranking when there is enough memory. When there is not enough memory, one can build the index with the option `-fwd_index_type offsetDict`, possibly combined with `-fwd_index_store_type lucene`. This variant stores data records in a separate file and uses a `mapdb` or `lucene` key-value index to store only offsets and lengths.
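
The difference between the two layouts can be pictured with a toy example. This is a conceptual sketch only, not the toolkit's implementation: plain Python dictionaries stand in for the key-value store, and an in-memory `io.BytesIO` buffer stands in for the separate data file.

```python
import io

docs = {
    "doc1": b"first document",
    "doc2": b"second, longer document",
}

# dataDict-like layout: the key-value store holds the full records.
data_dict = dict(docs)

# offsetDict-like layout: records are appended to a separate file and
# the key-value store keeps only (offset, length) for each document.
data_file = io.BytesIO()
offset_dict = {}
for doc_id, record in docs.items():
    offset_dict[doc_id] = (data_file.tell(), len(record))
    data_file.write(record)


def fetch(doc_id):
    """Read one record from the data file using its stored offset and length."""
    offset, length = offset_dict[doc_id]
    data_file.seek(offset)
    return data_file.read(length)


print(offset_dict)
print(fetch("doc2"))
```

In the second layout the store itself stays small because it holds only two integers per document, at the cost of an extra read from the data file on every lookup.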

There are four field types:
1. Two parsed textual field formats:
    - Parsed text without positional information (bag-of-words): **parsedBOW**
    - Parsed text with positional information: **parsedText**
2. Original/unparsed/raw text: **textRaw**
3. Binary (can be anything): **binary**
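
To make the distinction between the two parsed formats concrete, here is a toy sketch of the information each one keeps. It does not reproduce the toolkit's internal encoding: a `Counter` stands in for a bag-of-words, and a plain token list stands in for text with positions.

```python
from collections import Counter

tokens = "to be or not to be".split()

# parsedBOW-like view: only which terms occur and how often; order is lost.
bag_of_words = Counter(tokens)

# parsedText-like view: the token sequence is kept, so every term
# can be mapped back to the positions where it occurs.
positions = {}
for pos, term in enumerate(tokens):
    positions.setdefault(term, []).append(pos)

print(dict(bag_of_words))
print(positions)
```

A bag-of-words is enough for models that only need term frequencies, while models that depend on word order or proximity need the positional variant.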

The options are printed by the indexing script:

In [5]:
!./index/create_fwd_index.sh 

Using collection root: /home/leo/flexneuart_collections
collection sub-directory, e.g., msmarco_pass (1st arg)
Usage: <collection> <field definition: examples: text:parsedBOW, text_unlemm:parsedText, text_raw:textRaw, dense_embed:binary> [additional options]
Additional options:
-h print help
-clean remove the previous index
-input_subdir input data sub-directory (default input_data)
-index_subdir index subdirectory (default forward_index)
-fwd_index_type forward index type: dataDict, offsetDict
-fwd_index_store_type a forward backend storage type: lucene, mapdb
-expect_doc_qty expected # of documents in the index


Here we create indices sequentially, but they can also be created **in parallel** (independently for each field):

In [None]:
!for field_def in dense:binary text:parsedText \
                  text_unlemm:parsedText \
                  title:parsedBOW \
                  text_bert_tok:parsedText \
                  text_raw:textRaw ; do \
    ./index/create_fwd_index.sh wikipedia_dpr_nq_sample $field_def ; \
done
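
Because each field is indexed independently, the sequential loop above could be replaced by a small parallel driver. The sketch below is one possible way to do it from Python, not something provided by the toolkit: the helpers `fwd_index_cmd` and `build_all` are hypothetical, and passing extra options as additional arguments after the field definition is an assumption based on the usage message. Each run is a separate process with its own memory footprint, so `max_workers` should be chosen conservatively.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

COLLECTION = "wikipedia_dpr_nq_sample"
FIELD_DEFS = [
    "dense:binary",
    "text:parsedText",
    "text_unlemm:parsedText",
    "title:parsedBOW",
    "text_bert_tok:parsedText",
    "text_raw:textRaw",
]


def fwd_index_cmd(collection, field_def, extra_args=()):
    """Build the argument list for one create_fwd_index.sh run."""
    return ["./index/create_fwd_index.sh", collection, field_def, *extra_args]


def build_all(collection, field_defs, max_workers=2, extra_args=()):
    """Run one indexing process per field, at most max_workers at a time.

    Returns a dict mapping each field definition to the exit code of its run.
    """
    def run(field_def):
        cmd = fwd_index_cmd(collection, field_def, extra_args)
        return field_def, subprocess.run(cmd).returncode

    # Threads are sufficient here: each one only waits for a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run, field_defs))


# Show the commands that would be launched; uncomment the last line to run them.
for field_def in FIELD_DEFS:
    print(" ".join(fwd_index_cmd(COLLECTION, field_def)))
# exit_codes = build_all(COLLECTION, FIELD_DEFS, max_workers=2)
```

For the low-memory configuration described earlier, the same driver could be called with `extra_args=["-fwd_index_type", "offsetDict", "-fwd_index_store_type", "lucene"]`.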