## Preparation

### Installation

We assume that the repo is cloned and all necessary packages are installed, which includes running the script:

```./install_packages.sh```

and the code is compiled:

```./build.sh```

### Changing directory to the repo root

In [1]:
cd ../..

/FlexNeuART


### Downloading demo data

1. Download [this file from our Google Drive](https://drive.google.com/file/d/1mDa6J4hNYPyqlS8hVi6bykSbAOMKsDwe/view?usp=sharing), copy it to the source root directory, and unpack it there. Afterwards, the source root directory should contain the sub-directory ``collections/msmarco_doc``.
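
A quick way to confirm that the archive was unpacked in the right place is to look for the expected sub-directories. The sketch below is only an illustration: the helper name `missing_subdirs` is hypothetical, the list of sub-directories is copied from the sanity-check output in this notebook, and the `input_data` location is the data directory reported by the indexing scripts.

```python
from pathlib import Path

# Sub-directories reported by the sanity-check script in this notebook.
EXPECTED_SUBDIRS = ["bitext", "dev", "dev_official", "docs",
                    "test2019", "test2020", "train_fusion"]

def missing_subdirs(data_root, expected=EXPECTED_SUBDIRS):
    """Return the expected sub-directories that are absent under data_root."""
    root = Path(data_root)
    return [name for name in expected if not (root / name).is_dir()]

if __name__ == "__main__":
    # Assumed location of the unpacked data, relative to the repo root.
    missing = missing_subdirs("collections/msmarco_doc/input_data")
    print("Missing sub-directories:", missing or "none")
```

An empty result suggests the layout matches what the scripts expect; it does not verify the file contents.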

### Sanity check: statistics on downloaded data should look like this

In [2]:
!scripts/report/get_basic_collect_stat.sh msmarco_doc

Using collection root: collections
Checking data sub-directory: bitext
Checking data sub-directory: dev
Checking data sub-directory: dev_official
Checking data sub-directory: docs
Found indexable data file: docs/AnswerFields.jsonl.gz
Checking data sub-directory: test2019
Checking data sub-directory: test2020
Checking data sub-directory: train_fusion
Found query file: bitext/QuestionFields.jsonl
Found query file: dev/QuestionFields.jsonl
Found query file: dev_official/QuestionFields.jsonl
Found query file: test2019/QuestionFields.jsonl
Found query file: test2020/QuestionFields.jsonl
Found query file: train_fusion/QuestionFields.jsonl
getIndexQueryDataInfo return value:  docs AnswerFields.jsonl.gz ,bitext,dev,dev_official,test2019,test2020,train_fusion QuestionFields.jsonl
Using the data input files: AnswerFields.jsonl.gz, QuestionFields.jsonl
Index dirs: docs
Query dirs:  bitext dev dev_official test2019 test2020 train_fusion
Queries/questions:
bitext 352013
dev 5000
dev_official 5193
t

## Indexing (each step takes a few hours)

### Lucene index

In [3]:
!scripts/index/create_lucene_index.sh msmarco_doc

Using collection root: collections
Data directory: collections/msmarco_doc/input_data
Index directory: collections/msmarco_doc/lucene_index
Removing previously created index (if exists)
Checking data sub-directory: bitext
Checking data sub-directory: dev
Checking data sub-directory: dev_official
Checking data sub-directory: docs
Found indexable data file: docs/AnswerFields.jsonl.gz
Checking data sub-directory: test2019
Checking data sub-directory: test2020
Checking data sub-directory: train_fusion
Found query file: bitext/QuestionFields.jsonl
Found query file: dev/QuestionFields.jsonl
Found query file: dev_official/QuestionFields.jsonl
Found query file: test2019/QuestionFields.jsonl
Found query file: test2020/QuestionFields.jsonl
Found query file: train_fusion/QuestionFields.jsonl
Using the data input file: AnswerFields.jsonl.gz
JAVA_OPTS=-Xms8388608k -Xmx14680064k -server
Creating a new Lucene index, maximum # of docs to process: 2147483647
Input file name: collections/msmarco_doc/inp

Indexed 3080000 docs
Indexed 3090000 docs
Indexed 3100000 docs
Committing
Indexed 3110000 docs
Indexed 3120000 docs
Indexed 3130000 docs
Indexed 3140000 docs
Indexed 3150000 docs
Committing
Indexed 3160000 docs
Indexed 3170000 docs
Indexed 3180000 docs
Indexed 3190000 docs
Indexed 3200000 docs
Committing
Indexed 3210000 docs
Indexed 3213802 docs
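
The final `Indexed ... docs` count can be cross-checked against the input file. The following is a minimal sketch, assuming the input is a gzipped JSONL file with one document per line; the helper name `count_jsonl_records` is hypothetical and not part of the toolkit.

```python
import gzip
import json

def count_jsonl_records(path):
    """Count the non-empty lines of a gzipped JSONL file, parsing each as JSON."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # raises ValueError on a malformed record
            count += 1
    return count
```

Calling it on `collections/msmarco_doc/input_data/docs/AnswerFields.jsonl.gz` should, under the one-document-per-line assumption, return the same number as the last `Indexed` line. Expect this to take a while on a collection of this size.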


### Forward indices (the ``text`` index is not strictly necessary for this notebook)

In [4]:
!scripts/index/create_fwd_index.sh msmarco_doc mapdb "text:parsedText text_raw:raw" 

Using collection root: collections
Data directory:            collections/msmarco_doc/input_data
Forward index directory:   collections/msmarco_doc/forward_index/
Clean old index?:          0
Field list definition:     text:parsedText text_raw:raw
Checking data sub-directory: bitext
Checking data sub-directory: dev
Checking data sub-directory: dev_official
Checking data sub-directory: docs
Found indexable data file: docs/AnswerFields.jsonl.gz
Checking data sub-directory: test2019
Checking data sub-directory: test2020
Checking data sub-directory: train_fusion
Found query file: bitext/QuestionFields.jsonl
Found query file: dev/QuestionFields.jsonl
Found query file: dev_official/QuestionFields.jsonl
Found query file: test2019/QuestionFields.jsonl
Found query file: test2020/QuestionFields.jsonl
Found query file: train_fusion/QuestionFields.jsonl
JAVA_OPTS=-Xms12582912k -Xmx14680064k -server
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.BuildFwdIndexApp - Processing field: 'text'
[main] INFO

[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 3200000 documents
...
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Finished processing file: collections/msmarco_doc/input_data/docs/AnswerFields.jsonl.gz
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Final statistics: 
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Number of documents 3213802, total number of words 1532042624, average reduction due to keeping only unique words 2.002092
JAVA_OPTS=-Xms12582912k -Xmx14680064k -server
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.BuildFwdIndexApp - Processing field: 'text_raw'
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.BuildFwdIndexApp - Forward index storage type: raw
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.BuildFwdIndexApp - Forward index storage type: mapdb


[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Creating a new forward index, maximum # of docs to process: 2147483647
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 10000 documents
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 20000 documents
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 30000 documents
...
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 880000 documents
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 890000 documents
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 900000 documents


[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 2720000 documents
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 2730000 documents
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 2740000 documents
...
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 3190000 documents
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 3200000 documents
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Processed 3210000 documents
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Finished processing file: collections/msmarco_doc/input_data/docs/AnswerFields.jsonl.gz
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Final statistics: 
[main] INFO edu.cmu.lti.oaqa.flexneuart.fwdindx.ForwardIndex - Number of documents 3213802, total number of words 0, average reduction due to keeping only unique words 0.000000


### Download and instantiate the model

In [5]:
!wget boytsov.info/models/msmarco_doc/2019/bert_vanilla/model.best

--2021-04-23 14:13:36--  http://boytsov.info/models/msmarco_doc/2019/bert_vanilla/model.best
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 438863972 (419M) [text/plain]
Saving to: ‘model.best’


2021-04-23 14:18:32 (1.42 MB/s) - ‘model.best’ saved [438863972/438863972]



### Here, we do inference on CPU, which is pretty slow. To use a GPU, change ``DEVICE_NAME``.

In [6]:
import torch
#DEVICE_NAME='cuda:0'
MAX_QUERY_LEN=32
MAX_DOC_LEN=512 - 32 - 3
BATCH_SIZE=16
DEVICE_NAME='cpu'
MODEL_FILE='model.best'
model=torch.load(MODEL_FILE, map_location='cpu')
model.to(DEVICE_NAME)



VanillaBertRanker(
  (bert): CustomBertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
         

## Model inference/API demo

In [7]:
COLLECTION='msmarco_doc'

### Execute a query

In [8]:
QUERY_JSON={"DOCNO": "961921",
            "text": "national park system establish",
            "text_raw": "when was the national park system established",
            "text_bert_tok": "when was the national park system established"}
QUERY_JSON

{'DOCNO': '961921',
 'text': 'national park system establish',
 'text_raw': 'when was the national park system established',
 'text_bert_tok': 'when was the national park system established'}

In [9]:
from scripts.config import DOCID_FIELD, TEXT_FIELD_NAME, TEXT_RAW_FIELD_NAME

In [10]:
from scripts.py_flexneuart.setup import *
# add Java JAR to the class path
configure_classpath('target')
# create a resource manager
resource_manager=create_featextr_resource_manager(f'collections/{COLLECTION}/forward_index')

In [11]:
from scripts.py_flexneuart.cand_provider import *
# create a candidate provider/generator
cand_prov = create_cand_provider(resource_manager, PROVIDER_TYPE_LUCENE, f'collections/{COLLECTION}/lucene_index')

In [12]:
query_text=QUERY_JSON[TEXT_FIELD_NAME]
query_id=QUERY_JSON[DOCID_FIELD]
query_res=run_text_query(cand_prov, 20, query_text)
query_id, query_res

('961921',
 (1152497,
  [CandidateEntry(doc_id='D2527574', score=17.821687698364258),
   CandidateEntry(doc_id='D1578783', score=17.796754837036133),
   CandidateEntry(doc_id='D2398015', score=17.749736785888672),
   CandidateEntry(doc_id='D2591882', score=17.310184478759766),
   CandidateEntry(doc_id='D2443070', score=17.17824935913086),
   CandidateEntry(doc_id='D2112934', score=17.16891860961914),
   CandidateEntry(doc_id='D2106902', score=16.984983444213867),
   CandidateEntry(doc_id='D1578782', score=16.844358444213867),
   CandidateEntry(doc_id='D1019833', score=16.784225463867188),
   CandidateEntry(doc_id='D3008908', score=16.633424758911133),
   CandidateEntry(doc_id='D2769926', score=16.605653762817383),
   CandidateEntry(doc_id='D797127', score=16.480308532714844),
   CandidateEntry(doc_id='D2443068', score=16.377676010131836),
   CandidateEntry(doc_id='D1578785', score=16.304927825927734),
   CandidateEntry(doc_id='D1462277', score=16.298961639404297),
   CandidateEntry(doc
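
The result is a pair whose second element is the ranked candidate list; the first element appears to be the total number of matching documents (an assumption based on the output above). The sketch below uses a hypothetical `namedtuple` stand-in for the library's `CandidateEntry` and a shortened copy of the output to show how such a result can be taken apart.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's candidate entry type.
CandidateEntry = namedtuple("CandidateEntry", ["doc_id", "score"])

# A shortened copy of the result shown above: (hit count, candidate list).
example_res = (1152497, [
    CandidateEntry(doc_id="D2527574", score=17.821687698364258),
    CandidateEntry(doc_id="D1578783", score=17.796754837036133),
    CandidateEntry(doc_id="D2398015", score=17.749736785888672),
])

num_hits, candidates = example_res
top = candidates[0]
print(f"{num_hits} hits, best candidate {top.doc_id} with score {top.score:.3f}")

# Each entry also unpacks like a pair of (document ID, score).
for doc_id, score in candidates:
    print(doc_id, round(score, 3))
```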

### Retrieve a document (D1578782 is marked as a relevant entry)

In [13]:
from scripts.py_flexneuart.fwd_index import get_forward_index
raw_indx = get_forward_index(resource_manager, 'text_raw')

In [14]:
DOC_ID='D1578782' # relevant
#DOC_ID='D1462277' # not marked as relevant
doc_text=raw_indx.get_doc_raw(DOC_ID)

In [15]:
print(query_text)
print()
print(doc_text)

national park system establish

national park mashups "national park service's 100 year birthday is in 2016. august 25, 2016 is the 100th birthday of the national park service. starting with yellowstone in 1872 there are over 400 units in the national park service today. how old is the system? the national park service was created by an act of congress and signed by president woodrow wilson on august 25, 1916. yellowstone national park was established by an act signed by president ulysses s. grant on march 1, 1872, as the nation's first national park. the mission of the national park service: the national park service preserves unimpaired the natural and cultural resources and values of the national park system for the enjoyment, education, and inspiration of this and future generations. the national park service cooperates with partners to extend the benefits of natural and cultural resource conservation and outdoor recreation throughout this country and the world. national park mashu

### Score candidate documents

In [16]:
doc_data = {}
bm25_scores = {}
for doc_id, bm25_score in query_res[1]:
    doc_text = raw_indx.get_doc_raw(doc_id)
    doc_data[doc_id] = doc_text
    bm25_scores[doc_id] = bm25_score

query_data = {query_id : query_text}

In [17]:
from scripts.cedr.data import iter_valid_records

data_set = query_data, doc_data
run = {query_id : doc_data.keys()}

for records in iter_valid_records(model, DEVICE_NAME, data_set, run,
                                  BATCH_SIZE,
                                  MAX_QUERY_LEN, MAX_DOC_LEN):
    scores = model(records['query_tok'],
                   records['query_mask'],
                   records['doc_tok'],
                   records['doc_mask'])
    scores = scores.tolist()

    for qid, doc_id, score in zip(records['query_id'], records['doc_id'], scores):
        print(f'{qid} {doc_id} BM25 score: {bm25_scores[doc_id]} model score: {score}')

961921 D2527574 BM25 score: 17.821687698364258 model score: 1.320546269416809
961921 D1578783 BM25 score: 17.796754837036133 model score: 0.8640079498291016
961921 D2398015 BM25 score: 17.749736785888672 model score: 0.9334412813186646
961921 D2591882 BM25 score: 17.310184478759766 model score: 1.2829281091690063
961921 D2443070 BM25 score: 17.17824935913086 model score: 0.7396844625473022
961921 D2112934 BM25 score: 17.16891860961914 model score: -1.9810694456100464
961921 D2106902 BM25 score: 16.984983444213867 model score: -2.150390625
961921 D1578782 BM25 score: 16.844358444213867 model score: 0.38077297806739807
961921 D1019833 BM25 score: 16.784225463867188 model score: 0.6051344275474548
961921 D3008908 BM25 score: 16.633424758911133 model score: -2.9435813426971436
961921 D2769926 BM25 score: 16.605653762817383 model score: 0.5723978281021118
961921 D797127 BM25 score: 16.480308532714844 model score: 0.06688592582941055
961921 D2443068 BM25 score: 16.377676010131836 model score
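
The loop above prints scores in BM25 order. One plausible next step, not shown in the original notebook, is to re-rank the candidates by the model score. The sketch below does this for a handful of rows copied from the output; the helper name `rerank` is hypothetical.

```python
# (doc_id, BM25 score, model score) for a few candidates, copied from the output above.
scored = [
    ("D2527574", 17.821687698364258, 1.320546269416809),
    ("D1578783", 17.796754837036133, 0.8640079498291016),
    ("D2398015", 17.749736785888672, 0.9334412813186646),
    ("D2591882", 17.310184478759766, 1.2829281091690063),
    ("D2112934", 17.16891860961914, -1.9810694456100464),
    ("D1578782", 16.844358444213867, 0.38077297806739807),
]

def rerank(entries):
    """Sort (doc_id, bm25, model_score) triples by model score, best first."""
    return sorted(entries, key=lambda e: e[2], reverse=True)

reranked = rerank(scored)
for rank, (doc_id, bm25, model_score) in enumerate(reranked, start=1):
    print(f"{rank} {doc_id} model: {model_score:.4f} BM25: {bm25:.4f}")
```

Only a subset of the 20 candidates is included here, so the ranks printed by this sketch are not the ranks in the full list.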

### Score the document against the query (under the hood)

In [18]:
query_bert_tok = model.tokenize(query_text)
query_bert_tok

[2120, 2380, 2291, 5323]

In [19]:
# Note: after the loop in cell 16, doc_text holds the text of the last candidate, not of D1578782.
doc_bert_tok = model.tokenize(doc_text)
print(doc_bert_tok, len(doc_bert_tok))

[2120, 2136, 8499, 6210, 1010, 2120, 2136, 8499, 3574, 1064, 2394, 9206, 2120, 4215, 3501, 2487, 1997, 1010, 5994, 1010, 2030, 8800, 2000, 1037, 3842, 2004, 1037, 2878, 2475, 1997, 1010, 8800, 2000, 1010, 2030, 8281, 1997, 1037, 3327, 3842, 10760, 2120, 4377, 1997, 3735, 2509, 4678, 8986, 2594, 2030, 14314, 2078, 2549, 1037, 6926, 2030, 3395, 2629, 1037, 2120, 3780, 1626, 9582, 4748, 2615, 3060, 2120, 3519, 2078, 1006, 1999, 2148, 3088, 1007, 1037, 2576, 2283, 1010, 2631, 1999, 4878, 2004, 2019, 3060, 8986, 2929, 1998, 7917, 2045, 2013, 3624, 2000, 2901, 2138, 1997, 2049, 3161, 4559, 2000, 17862, 1024, 1999, 2807, 2180, 2148, 3088, 1005, 1055, 2034, 4800, 22648, 4818, 3864, 1010, 1006, 11113, 13578, 2615, 1012, 1007, 2019, 27421, 14778, 4509, 2120, 2283, 2078, 1006, 1999, 3725, 1007, 1037, 9253, 1011, 6394, 2576, 2283, 1010, 1006, 11113, 13578, 2615, 1007, 24869, 26952, 13033, 2120, 2078, 1996, 1012, 2019, 3296, 9561, 2571, 26300, 2448, 2012, 7110, 13334, 1010, 6220, 1010, 2144, 10011,

### It is important to truncate queries and documents ...

In [20]:
query_bert_tok=query_bert_tok[0:MAX_QUERY_LEN]
doc_bert_tok=doc_bert_tok[0:MAX_DOC_LEN]
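
The value `MAX_DOC_LEN = 512 - 32 - 3` follows from the input length limit: the model printout shows a position embedding table with 512 entries, 32 positions are reserved for the query, and the remaining 3 are presumably special tokens. The sketch below spells out this arithmetic; the assumption that the three tokens are `[CLS]` and two `[SEP]` markers is ours, and the helper `truncate` is hypothetical.

```python
MAX_BERT_LEN = 512      # size of the position embedding table in the model printout
MAX_QUERY_LEN = 32
NUM_SPECIAL_TOKENS = 3  # assumed: [CLS], [SEP] after the query, [SEP] after the document

MAX_DOC_LEN = MAX_BERT_LEN - MAX_QUERY_LEN - NUM_SPECIAL_TOKENS

def truncate(tokens, max_len):
    """Keep at most max_len leading tokens."""
    return tokens[:max_len]

# Worst case: a full-length query plus a full-length document still fits.
query = truncate(list(range(100)), MAX_QUERY_LEN)
doc = truncate(list(range(1000)), MAX_DOC_LEN)
total = NUM_SPECIAL_TOKENS + len(query) + len(doc)
print(MAX_DOC_LEN, total)  # prints: 477 512
```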

### ... and pad queries

In [21]:
from scripts.cedr.data import PAD_CODE

query_bert_tok_pad = query_bert_tok + [PAD_CODE] * (MAX_QUERY_LEN - len(query_bert_tok))
print(query_bert_tok_pad)

[2120, 2380, 2291, 5323, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]


### Calling ``unsqueeze(0)`` is required to create a batch dimension (multiple queries and documents can be batched together)

In [22]:
query_tok_tensor_pad = torch.LongTensor(query_bert_tok_pad).unsqueeze(0).to(DEVICE_NAME)
doc_tok_tensor = torch.LongTensor(doc_bert_tok).unsqueeze(0).to(DEVICE_NAME)
len(query_tok_tensor_pad[0]), len(doc_tok_tensor[0])

(32, 477)

In [23]:
query_tok_tensor_pad.shape, doc_tok_tensor.shape

(torch.Size([1, 32]), torch.Size([1, 477]))

In [24]:
query_mask = torch.FloatTensor([1.0] * len(query_bert_tok) + 
                              [0.] * (MAX_QUERY_LEN - len(query_bert_tok))).unsqueeze(0).to(DEVICE_NAME)
doc_mask = torch.ones_like(doc_tok_tensor).float()

In [25]:
query_mask.shape, doc_mask.shape

(torch.Size([1, 32]), torch.Size([1, 477]))

In [26]:
query_mask, doc_mask

(tensor([[1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
 tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1.,
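
The padding and masking steps above can be sketched in plain Python, without tensors. This is an illustration only: the helper `pad_and_mask` is hypothetical, and the padding value `-1` is taken from the padded query printed earlier rather than imported from the library.

```python
EXAMPLE_PAD_CODE = -1  # value visible in the padded query printed earlier

def pad_and_mask(tokens, max_len, pad_code=EXAMPLE_PAD_CODE):
    """Truncate to max_len, pad on the right, and build a 1.0/0.0 mask."""
    tokens = tokens[:max_len]
    num_pad = max_len - len(tokens)
    padded = tokens + [pad_code] * num_pad
    mask = [1.0] * len(tokens) + [0.0] * num_pad
    return padded, mask

padded, mask = pad_and_mask([2120, 2380, 2291, 5323], 8)
print(padded)  # [2120, 2380, 2291, 5323, -1, -1, -1, -1]
print(mask)    # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

The mask marks real tokens with `1.0` and padding with `0.0`, which matches the query mask printed above. The document needs no padding here because it is the only one in the batch, so its mask is all ones.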

In [27]:
model(query_tok_tensor_pad, query_mask, doc_tok_tensor, doc_mask)

tensor([-3.5074], grad_fn=<SqueezeBackward1>)