## Notes & pre-requisites

1. This is a reproduction notebook that operates on preprocessed data in the FlexNeuART JSONL format.
2. It does not require running GIZA to generate IBM Model 1, because these models are already trained.
3. It assumes the user downloaded [this file from our Google Drive](https://drive.google.com/file/d/1p2H-tjdMe69oIJXX0xEIpLLNbHrkO4Xy/view?usp=sharing) and copied it to the source root directory.
4. The installation procedure is covered in a [separate document](INSTALL_LEGACY.md).
5. Use the following mini-release:
```
git checkout tags/repr2020-12-06
```
6. The performance of **your fusion model may differ slightly** from the numbers reported here, but we expect the difference to be small.

## Data unpacking/preparation

1. Download [this file from our Google Drive](https://drive.google.com/file/d/1p2H-tjdMe69oIJXX0xEIpLLNbHrkO4Xy/view?usp=sharing) and copy it to the source root directory.

### Go to the root source directory & unpack data

In [None]:
%cd ../../..

In [None]:
!tar xvf msmarco_docs_data_2020-12-06.tar

## Sanity check: dataset statistics

In [6]:
!scripts/report/get_basic_collect_stat.sh msmarco_doc

Checking data sub-directory: bitext
Checking data sub-directory: dev
Checking data sub-directory: dev_official
Checking data sub-directory: docs
Found indexable data file: docs/AnswerFields.jsonl.gz
Checking data sub-directory: test2019
Checking data sub-directory: test2020
Checking data sub-directory: train_fusion
Found query file: bitext/QuestionFields.jsonl
Found query file: dev/QuestionFields.jsonl
Found query file: dev_official/QuestionFields.jsonl
Found query file: test2019/QuestionFields.jsonl
Found query file: test2020/QuestionFields.jsonl
Found query file: train_fusion/QuestionFields.jsonl
getIndexQueryDataInfo return value:  docs AnswerFields.jsonl.gz ,bitext,dev,dev_official,test2019,test2020,train_fusion QuestionFields.jsonl
Using the data input files: AnswerFields.jsonl.gz, QuestionFields.jsonl
Index dirs: docs
Query dirs:  bitext dev dev_official test2019 test2020 train_fusion
Queries/questions:
bitext 352013
dev 5000
dev_official 5193
test2019 200
test2020 200
train_fusi
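
The per-part query counts reported above can be cross-checked independently. The sketch below assumes only that each non-empty line of a JSONL file holds one JSON entry. The helper names `count_entries` and `count_file_entries` are hypothetical and are not part of FlexNeuART.

```python
import gzip
import json


def count_entries(lines):
    """Count non-empty lines, checking that each one parses as JSON."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        json.loads(line)  # raises ValueError on a malformed entry
        count += 1
    return count


def count_file_entries(path):
    """Count entries in a JSONL file, transparently handling .gz files."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return count_entries(f)
```

Assuming one query per line, calling `count_file_entries` on the `dev_official` query file should agree with the 5193 queries listed in the statistics.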

## Indexing

### Lucene index

In [None]:
!scripts/index/create_lucene_index.sh msmarco_doc

### Forward indices (text_raw is not really necessary for this notebook)

In [None]:
!field_def="title_unlemm:parsedText url_unlemm:parsedText \
            text:parsedText body:parsedText \
            text_bert_tok:parsedText \
            text_raw:raw"   ;\
scripts/index/create_fwd_index.sh msmarco_doc mapdb "$field_def"

## Run experiments

### Optionally warm up the indices

In [None]:
!scripts/exper/warmup_indices.sh msmarco_doc

### Baseline: BM25 run on the "official" development set

In [None]:
!scripts/exper/run_experiments.sh   \
   msmarco_doc  \
   exper_desc.lb2020-12-04/bm25_test.json  \
   -test_part dev_official \
   -no_separate_shell   \
   -metric_type RR@100 \
   -test_cand_qty_list 100,1000

### In the end this script should output:

```
================================================================================
N=100
================================================================================
# of queries:    5193
NDCG@10:        0.313800
NDCG@20:        0.339600
NDCG@100:       0.372600
ERR@20:         0.016410
P20:            0.030200
MAP:            0.267100
MRR:            0.267100
Recall:         0.781822
GDEVAL NDCG@20: 0.339560
```
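
To compare your own run against these numbers programmatically, the report can be turned into a dictionary. This is a minimal sketch that assumes the report keeps the `name: value` layout shown above. `parse_metrics` is a hypothetical helper, not a FlexNeuART function, and the sample text is an abridged copy of the report.

```python
def parse_metrics(report):
    """Turn 'name: value' lines of an evaluation report into a dict of floats."""
    metrics = {}
    for line in report.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # separator lines and the 'N=100' header have no colon
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # ignore lines whose right-hand side is not a number
    return metrics


# Abridged copy of the BM25 report shown above.
report = """\
================================================================================
N=100
================================================================================
# of queries:    5193
NDCG@10:        0.313800
MRR:            0.267100
Recall:         0.781822
GDEVAL NDCG@20: 0.339560
"""

metrics = parse_metrics(report)
print(metrics["MRR"], metrics["Recall"])
```

Splitting on the last colon keeps multi-word names such as `GDEVAL NDCG@20` intact.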

### Train the LambdaMART model using train_fusion and test it on dev_official

In [None]:
!scripts/exper/run_experiments.sh   \
   msmarco_doc  \
   exper_desc.lb2020-12-04/best_classic_ir_expand_full_lmart_train.json  \
   -train_part train_fusion \
   -test_part dev_official \
   -no_separate_shell   \
   -metric_type RR@100 \
   -test_cand_qty_list 100,1000

### In the end this script should output:

```
================================================================================
N=100
================================================================================
# of queries:    5193
NDCG@10:        0.396600
NDCG@20:        0.421000
NDCG@100:       0.447700
ERR@20:         0.020940
P20:            0.035600
MAP:            0.338900
MRR:            0.338900
Recall:         0.851916
GDEVAL NDCG@20: 0.421030
```
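
To put the two `dev_official` reports side by side, the relative improvement of the fusion model over the BM25 baseline can be computed directly. The numbers below are copied from the two N=100 reports in this notebook. `relative_gain` is a hypothetical helper written only for this illustration.

```python
# Values copied from the BM25 and LambdaMART N=100 reports in this notebook.
bm25 = {"NDCG@10": 0.3138, "MRR": 0.2671, "Recall": 0.781822}
lmart = {"NDCG@10": 0.3966, "MRR": 0.3389, "Recall": 0.851916}


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


gains = {name: relative_gain(bm25[name], lmart[name]) for name in bm25}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```

With these values the MRR gain is about 27% relative, which gives a rough yardstick for judging whether your own fusion model landed close to ours.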

### Location of logs, trained models, and TREC-style runs

In [32]:
!ls collections/msmarco_doc/results/dev_official/feat_exper/best_classic_ir_full_lmart_expand

exper.log  letor  rep  trec_runs


### Copy the trained model to the location specified in the descriptors and test it on TREC NIST 2019 data.

In [33]:
!cp collections/msmarco_doc/results/dev_official/feat_exper/best_classic_ir_full_lmart_expand/letor/out_msmarco_doc_train_fusion_20.model collections/msmarco_doc/exper_desc.lb2020-12-04/models/lmart.model

In [None]:
!scripts/exper/run_experiments.sh   \
   msmarco_doc  \
   exper_desc.lb2020-12-04/best_classic_ir_expand_full_lmart_test.json  \
   -test_part test2019 \
   -no_separate_shell   \
   -metric_type RR@100 \
   -test_cand_qty_list 100,1000

### In the end the script should output:

```
================================================================================
N=100
================================================================================
# of queries:    43
NDCG@10:        0.589900
NDCG@20:        0.561800
NDCG@100:       0.544500
ERR@20:         0.394260
P20:            0.577900
MAP:            0.262600
MRR:            0.888400
Recall:         0.219494
GDEVAL NDCG@20: 0.520620
```
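
All runs in this notebook pass `-metric_type RR@100`. Assuming this denotes the reciprocal rank of the first relevant document within the top 100 candidates, averaged over queries, the metric can be sketched as follows. The functions and the toy data are hypothetical and are not the evaluation code that the scripts use.

```python
def reciprocal_rank(ranked_doc_ids, relevant_doc_ids, k=100):
    """Return 1/rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels, k=100):
    """Average reciprocal rank over all queries present in `runs`."""
    scores = [
        reciprocal_rank(docs, qrels.get(qid, set()), k)
        for qid, docs in runs.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: hypothetical query and document IDs, not taken from MS MARCO.
runs = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9"], "q3": ["d5"]}
qrels = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d8"}}

print(mean_reciprocal_rank(runs, qrels))  # (1/2 + 1 + 0) / 3 = 0.5
```

A query whose relevant documents all fall outside the cutoff contributes zero, so lowering `k` can only decrease the average.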