## Preparation

### Installation

We assume that the repo has been cloned and that all necessary packages have been installed, which includes running the script:

```./install_packages.sh```

and the code is compiled:

```./build.sh```

### Changing directory to the repo root

In [None]:
cd ../..

### Downloading demo data

1. Download [this file from our Google Drive](https://drive.google.com/file/d/1p2H-tjdMe69oIJXX0xEIpLLNbHrkO4Xy/view?usp=sharing), copy it to the source root directory, and unpack it there. As a result, the source root directory should contain the sub-directory ``collections/msmarco_doc``.
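Before running the sanity check, it can help to confirm that the archive was unpacked in the right place. The following is a minimal sketch, not part of the repo: the helper names ``collection_dir`` and ``is_unpacked`` are hypothetical, and the only thing taken from the text above is the expected path ``collections/msmarco_doc``.

```python
from pathlib import Path


def collection_dir(repo_root, collection="msmarco_doc"):
    """Build the expected location of a collection under the repo root."""
    return Path(repo_root) / "collections" / collection


def is_unpacked(repo_root, collection="msmarco_doc"):
    """Return True if the collection directory exists under the repo root."""
    return collection_dir(repo_root, collection).is_dir()
```

Calling ``is_unpacked(".")`` from the repo root should return ``True`` once the download step is complete.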

### Sanity check: statistics on downloaded data should look like this

In [2]:
!scripts/report/get_basic_collect_stat.sh msmarco_doc

Using collection root: collections
Checking data sub-directory: bitext
Checking data sub-directory: bitext_msmarco_pass_mixed_qrels
Checking data sub-directory: bitext_msmarco_pass_part0
Checking data sub-directory: bitext_msmarco_pass_part1
Checking data sub-directory: bitext_msmarco_pass_part2
Checking data sub-directory: bitext_msmarco_pass_pseudo_qrels
Checking data sub-directory: bitext_orcas_minqty_5
Checking data sub-directory: dev
Checking data sub-directory: dev1
Checking data sub-directory: dev2
Checking data sub-directory: dev2.single_doc_query
Checking data sub-directory: docs
Found indexable data file: docs/AnswerFields.jsonl.gz
Checking data sub-directory: lb2020
Checking data sub-directory: test2019
Checking data sub-directory: test2020
Checking data sub-directory: train1
Checking data sub-directory: train.dont_use
Found query file: bitext/QuestionFields.jsonl
Found query file: bitext_msmarco_pass_mixed_qrels/QuestionFields.jsonl
Found query file: bitext_msmarco_pass_par

## Indexing (each step takes a few hours)

### Lucene index

In [None]:
!scripts/index/create_lucene_index.sh msmarco_doc

### Forward indices (the ``text`` field is not strictly necessary for this notebook)

In [None]:
!scripts/index/create_fwd_index.sh msmarco_doc mapdb "text:parsedText text_raw:raw"

### Download and instantiate the model

In [3]:
!wget boytsov.info/models/msmarco_doc/2019/bert_vanilla/model.best

--2021-02-07 13:44:14--  http://boytsov.info/models/msmarco_doc/2019/bert_vanilla/model.best
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 438863972 (419M) [text/plain]
Saving to: ‘model.best’


2021-02-07 13:46:41 (2.86 MB/s) - ‘model.best’ saved [438863972/438863972]



### Here, we do inference on a CPU, which is pretty slow. To use a GPU, change ``DEVICE_NAME``.

In [6]:
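import torch
# Token budgets: 512 positions minus the query budget and (presumably) 3 special tokens
MAX_QUERY_LEN=32
MAX_DOC_LEN=512 - 32 - 3
# Uncomment the first line (and remove the second) to run on a GPU
#DEVICE_NAME='cuda:0'
DEVICE_NAME='cpu'
MODEL_FILE='model.best'
# Load onto the CPU first, then move the model to the target device
model=torch.load(MODEL_FILE, map_location='cpu')
model.to(DEVICE_NAME)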

VanillaBertRanker(
  (bert): CustomBertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
         

## Model inference/API demo

In [7]:
COLLECTION='msmarco_doc'

### Execute a query

In [8]:
QUERY_JSON={"DOCNO": "961921",
            "text": "national park system establish",
            "text_raw": "when was the national park system established",
            "text_bert_tok": "when was the national park system established"}
QUERY_JSON

{'DOCNO': '961921',
 'text': 'national park system establish',
 'text_raw': 'when was the national park system established',
 'text_bert_tok': 'when was the national park system established'}

In [9]:
from scripts.config import DOCID_FIELD, TEXT_FIELD_NAME, TEXT_RAW_FIELD_NAME

In [10]:
from scripts.py_flexneuart.setup import *
# add Java JAR to the class path
configure_classpath('target')
# create a resource manager
resource_manager=create_featextr_resource_manager(f'collections/{COLLECTION}/forward_index')

In [11]:
from scripts.py_flexneuart.cand_provider import *
# create a candidate provider/generator
cand_prov = create_cand_provider(resource_manager, PROVIDER_TYPE_LUCENE, f'collections/{COLLECTION}/lucene_index')

In [12]:
query_text=QUERY_JSON[TEXT_FIELD_NAME]
query_toks=query_text.split()
run_query(cand_prov, 20, query_toks)

(1204206,
 [CandidateEntry(doc_id='D2527574', score=18.659997940063477),
  CandidateEntry(doc_id='D2398015', score=18.492298126220703),
  CandidateEntry(doc_id='D1578785', score=18.234092712402344),
  CandidateEntry(doc_id='D2189735', score=18.2298583984375),
  CandidateEntry(doc_id='D1578782', score=17.947647094726562),
  CandidateEntry(doc_id='D2527573', score=17.892498016357422),
  CandidateEntry(doc_id='D1578784', score=17.88416862487793),
  CandidateEntry(doc_id='D2106902', score=17.869140625),
  CandidateEntry(doc_id='D2591882', score=17.70314598083496),
  CandidateEntry(doc_id='D2443070', score=17.63814926147461),
  CandidateEntry(doc_id='D1578783', score=17.51651382446289),
  CandidateEntry(doc_id='D3525662', score=17.447235107421875),
  CandidateEntry(doc_id='D2769926', score=17.322866439819336),
  CandidateEntry(doc_id='D1737386', score=17.243505477905273),
  CandidateEntry(doc_id='D1514002', score=17.16539192199707),
  CandidateEntry(doc_id='D14552', score=17.148212432861328
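The candidate list is ordered by decreasing score, so the position of a known document can be looked up directly. The sketch below is an illustration only: ``CandidateEntry`` is re-declared here as a ``namedtuple`` stand-in (the real class comes from the library), ``rank_of`` is a hypothetical helper, and the five entries are copied from the output above.

```python
from collections import namedtuple

# Stand-in for the library's CandidateEntry (hypothetical re-definition).
CandidateEntry = namedtuple("CandidateEntry", ["doc_id", "score"])


def rank_of(doc_id, candidates):
    """Return the 1-based rank of doc_id in the list, or None if absent."""
    for rank, entry in enumerate(candidates, start=1):
        if entry.doc_id == doc_id:
            return rank
    return None


# The first five entries from the output above.
top5 = [
    CandidateEntry("D2527574", 18.659997940063477),
    CandidateEntry("D2398015", 18.492298126220703),
    CandidateEntry("D1578785", 18.234092712402344),
    CandidateEntry("D2189735", 18.2298583984375),
    CandidateEntry("D1578782", 17.947647094726562),
]

print(rank_of("D1578782", top5))  # 5
```

Document ``D1578782``, which this notebook treats as a relevant entry, therefore sits at rank 5 of the Lucene candidate list.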

### Retrieve a document (D1578782 is marked as a relevant entry)

In [13]:
from scripts.py_flexneuart.fwd_index import get_forward_index
raw_indx = get_forward_index(resource_manager, 'text_raw')

In [14]:
DOC_ID='D1578782' # relevant
#DOC_ID='D1462277' # not marked as relevant
doc_text=raw_indx.get_doc_raw(DOC_ID)

In [15]:
print(query_text)
print()
print(doc_text)

national park system establish

national park mashups "national park service's 100 year birthday is in 2016. august 25, 2016 is the 100th birthday of the national park service. starting with yellowstone in 1872 there are over 400 units in the national park service today. how old is the system? the national park service was created by an act of congress and signed by president woodrow wilson on august 25, 1916. yellowstone national park was established by an act signed by president ulysses s. grant on march 1, 1872, as the nation's first national park. the mission of the national park service: the national park service preserves unimpaired the natural and cultural resources and values of the national park system for the enjoyment, education, and inspiration of this and future generations. the national park service cooperates with partners to extend the benefits of natural and cultural resource conservation and outdoor recreation throughout this country and the world. national park mashu

### Score the document against the query

In [16]:
query_bert_tok = model.tokenize(query_text)
query_bert_tok

[2120, 2380, 2291, 5323]

In [17]:
doc_bert_tok = model.tokenize(doc_text)
print(doc_bert_tok, len(doc_bert_tok))

[2120, 2380, 16137, 6979, 4523, 1000, 2120, 2380, 2326, 1005, 1055, 2531, 2095, 5798, 2003, 1999, 2355, 1012, 2257, 2423, 1010, 2355, 2003, 1996, 16919, 5798, 1997, 1996, 2120, 2380, 2326, 1012, 3225, 2007, 29231, 1999, 7572, 2045, 2024, 2058, 4278, 3197, 1999, 1996, 2120, 2380, 2326, 2651, 1012, 2129, 2214, 2003, 1996, 2291, 1029, 1996, 2120, 2380, 2326, 2001, 2580, 2011, 2019, 2552, 1997, 3519, 1998, 2772, 2011, 2343, 23954, 4267, 2006, 2257, 2423, 1010, 4947, 1012, 29231, 2120, 2380, 2001, 2511, 2011, 2019, 2552, 2772, 2011, 2343, 22784, 1055, 1012, 3946, 2006, 2233, 1015, 1010, 7572, 1010, 2004, 1996, 3842, 1005, 1055, 2034, 2120, 2380, 1012, 1996, 3260, 1997, 1996, 2120, 2380, 2326, 1024, 1996, 2120, 2380, 2326, 18536, 4895, 5714, 4502, 27559, 1996, 3019, 1998, 3451, 4219, 1998, 5300, 1997, 1996, 2120, 2380, 2291, 2005, 1996, 20195, 1010, 2495, 1010, 1998, 7780, 1997, 2023, 1998, 2925, 8213, 1012, 1996, 2120, 2380, 2326, 17654, 2015, 2007, 5826, 2000, 7949, 1996, 6666, 1997, 3019,

### It is important to truncate queries and documents ...

In [18]:
query_bert_tok=query_bert_tok[0:MAX_QUERY_LEN]
doc_bert_tok=doc_bert_tok[0:MAX_DOC_LEN]
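The truncation limits follow from a fixed input budget: the model printout above shows 512 position embeddings, of which 32 are reserved for the query. The remaining constant 3 presumably covers special tokens (a typical BERT pair input uses one ``[CLS]`` and two ``[SEP]`` tokens); that reading is an assumption, as are the names ``MAX_SEQ_LEN``, ``NUM_SPECIAL_TOKENS``, and ``truncate`` in this sketch.

```python
MAX_SEQ_LEN = 512        # size of the position-embedding table printed above
MAX_QUERY_LEN = 32       # query budget used in this notebook
NUM_SPECIAL_TOKENS = 3   # assumption: [CLS] plus two [SEP] tokens
MAX_DOC_LEN = MAX_SEQ_LEN - MAX_QUERY_LEN - NUM_SPECIAL_TOKENS


def truncate(token_ids, max_len):
    """Keep at most max_len leading tokens; shorter lists are unchanged."""
    return token_ids[:max_len]


# The three parts together fill the available positions exactly.
assert MAX_QUERY_LEN + MAX_DOC_LEN + NUM_SPECIAL_TOKENS == MAX_SEQ_LEN

long_doc = list(range(600))  # toy token ids, longer than the budget
print(MAX_DOC_LEN, len(truncate(long_doc, MAX_DOC_LEN)))  # 477 477
```

The example document has 393 tokens, so truncation leaves it unchanged.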

### ... and pad queries

In [19]:
from scripts.cedr.data import PAD_CODE

query_bert_tok_pad = query_bert_tok + [PAD_CODE] * (MAX_QUERY_LEN - len(query_bert_tok))
print(query_bert_tok_pad)

[2120, 2380, 2291, 5323, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
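The padding step can be wrapped in a small helper. This is a sketch under stated assumptions: ``pad_query`` is a hypothetical name, and ``PAD_CODE = -1`` is inferred from the printed output above, whereas the notebook imports the real value from ``scripts.cedr.data``.

```python
PAD_CODE = -1  # assumption: inferred from the printed padded query above


def pad_query(token_ids, max_len, pad_code=PAD_CODE):
    """Right-pad token_ids with pad_code up to exactly max_len entries."""
    if len(token_ids) > max_len:
        raise ValueError("truncate the query before padding it")
    return token_ids + [pad_code] * (max_len - len(token_ids))


padded = pad_query([2120, 2380, 2291, 5323], 32)
print(len(padded), padded[:6])  # 32 [2120, 2380, 2291, 5323, -1, -1]
```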


### .unsqueeze(0) is required to create a batch dimension (we can have multiple queries & documents batched together)

In [20]:
query_tok_tensor_pad = torch.LongTensor(query_bert_tok_pad).unsqueeze(0).to(DEVICE_NAME)
doc_tok_tensor = torch.LongTensor(doc_bert_tok).unsqueeze(0).to(DEVICE_NAME)
len(query_tok_tensor_pad[0]), len(doc_tok_tensor[0])

(32, 393)

In [21]:
query_tok_tensor_pad.shape, doc_tok_tensor.shape

(torch.Size([1, 32]), torch.Size([1, 393]))

In [22]:
query_mask = torch.FloatTensor([1.0] * len(query_bert_tok) + 
                              [0.] * (MAX_QUERY_LEN - len(query_bert_tok))).unsqueeze(0).to(DEVICE_NAME)
doc_mask = torch.ones_like(doc_tok_tensor).float()

In [23]:
query_mask.shape, doc_mask.shape

(torch.Size([1, 32]), torch.Size([1, 393]))
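With a single query and document, ``.unsqueeze(0)`` is enough to add the batch dimension. To batch several documents of different lengths, they first need a common length and matching masks. The following sketch shows one way to do this in plain Python; ``pad_batch`` is a hypothetical helper, and using ``-1`` as the document padding value is an assumption carried over from the query padding above, not something this notebook confirms.

```python
def pad_batch(token_lists, pad_code=-1):
    """Pad token lists to the longest one; return (ids, masks).

    Masks hold 1.0 for real tokens and 0.0 for padding.
    """
    max_len = max(len(toks) for toks in token_lists)
    ids = [toks + [pad_code] * (max_len - len(toks)) for toks in token_lists]
    masks = [[1.0] * len(toks) + [0.0] * (max_len - len(toks))
             for toks in token_lists]
    return ids, masks


# Toy token ids, for illustration only.
ids, masks = pad_batch([[5, 6, 7], [8]])
print(ids)    # [[5, 6, 7], [8, -1, -1]]
print(masks)  # [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
```

The resulting nested lists are rectangular, so they can be converted with ``torch.LongTensor(ids)`` and ``torch.FloatTensor(masks)`` without an extra ``.unsqueeze(0)``.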

In [24]:
query_mask, doc_mask

(tensor([[1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
 tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1.,

In [25]:
model(query_tok_tensor_pad, query_mask, doc_tok_tensor, doc_mask)

tensor([0.5071], grad_fn=<SqueezeBackward1>)
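In a retrieve-then-re-rank setup, a score like the one above would be computed for every candidate and the list re-sorted by it. The sketch below illustrates only the sorting step: ``rerank`` is a hypothetical helper, and the document ids and scores are made-up placeholders rather than model outputs.

```python
def rerank(doc_ids, score_fn):
    """Score each document and return (doc_id, score) pairs, best first."""
    scored = [(doc_id, score_fn(doc_id)) for doc_id in doc_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up scores standing in for calls to the neural model.
toy_scores = {"docA": 0.1, "docB": 0.9, "docC": 0.4}

reranked = rerank(["docA", "docB", "docC"], toy_scores.get)
print([doc_id for doc_id, _ in reranked])  # ['docB', 'docC', 'docA']
```

In practice, ``score_fn`` would wrap the tokenization, truncation, padding, and model call shown in this section.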