## Notes & pre-requisites

This is an **end-to-end** reproduction notebook that:
1. Downloads & unpacks data
2. Converts data to FlexNeuART JSONL format
3. Creates indices & trains IBM Model 1 using MGIZA

Notes:
1. The installation procedure is covered in a [separate document](https://github.com/oaqa/FlexNeuART/blob/master/INSTALL.md).
2. It is best to use a mini-release:
```
git checkout tags/repr2020-12-06
```
3. The performance of **your fusion model may vary somewhat** from the numbers reported here, but we expect the difference to be small.

## Data download

### Go to the root source directory

In [None]:
%cd ../../..

### Create raw-data directory and download data:

In [2]:
!mkdir -p collections/msmarco_doc/input_raw

In [3]:
!scripts/data_convert/msmarco/download_msmarco_doc.sh \
  collections/msmarco_doc/input_raw

Downloading https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-qrels.tsv.gz
--2020-12-09 15:09:49--  https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-qrels.tsv.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 40.112.152.16
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|40.112.152.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38553 (38K) [application/x-gzip]
Saving to: ‘msmarco-docdev-qrels.tsv.gz’


2020-12-09 15:09:50 (409 KB/s) - ‘msmarco-docdev-qrels.tsv.gz’ saved [38553/38553]

Downloading https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz
--2020-12-09 15:09:50--  https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 40.112.152.16
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|40.112.152.16|:443... connected.
HTTP re

## Preprocessing

### Create the directory to store pre-processed data and run the conversion:

In [4]:
!mkdir -p collections/msmarco_doc/input_data 

In [None]:
!scripts/data_convert/msmarco/convert_msmarco_doc.sh \
  collections/msmarco_doc/input_raw  \
  msmarco_doc
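
To confirm that the conversion produced well-formed JSONL, it can help to look at the first few records. The sketch below is a hypothetical helper (`peek_jsonl` is not part of FlexNeuART); it assumes one JSON object per line and takes the `docs/AnswerFields.jsonl.gz` file name from the statistics output later in this notebook.

```python
import gzip
import json
import os
from itertools import islice


def peek_jsonl(path, n=3):
    """Return the first n records of a JSONL file (optionally gzip-compressed)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n) if line.strip()]


# Assumed location of the converted documents for this collection.
docs_path = "collections/msmarco_doc/input_data/docs/AnswerFields.jsonl.gz"
if os.path.exists(docs_path):
    for rec in peek_jsonl(docs_path):
        # Show field names with truncated values to keep the output short.
        print({k: str(v)[:60] for k, v in rec.items()})
```

The same helper works for the uncompressed `QuestionFields.jsonl` query files, because the opener is chosen from the file extension.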

### Split the training queries to carve out separate development and fusion sets

In [6]:
!mv collections/msmarco_doc/input_data/dev/ collections/msmarco_doc/input_data/dev_official

In [12]:
!scripts/data_convert/split_queries.sh msmarco_doc train train_fusion tmp -part1_qty 10000

Namespace(data_dir='collections/msmarco_doc/input_data', input_subdir='train', out_subdir1='train_fusion', out_subdir2='tmp', part1_fract=None, part1_qty=10000, seed=0)
Read all the queries
Read all the QRELs                                      
# of QRELs with query IDs not present in any part 0
The first part will have 10000 documents
Part train_fusion # of queries: 10000 # of QRELs: 10000
Part tmp # of queries: 357013 # of QRELs: 357013


In [13]:
!scripts/check_utils/check_split_queries.sh     msmarco_doc train train_fusion tmp

Namespace(data_dir='collections/msmarco_doc/input_data', input_subdir='train', out_subdir1='train_fusion', out_subdir2='tmp')
Read all the queries from the main dir
Read all the QRELs from the main dir                    
Part train_fusion # of queries # 10000 of queries with at least one QREL: 10000
Part tmp # of queries # 357013 of queries with at least one QREL: 357013
# of queries in the original folder: 367013 # of queries in split folders: 367013 # of queries in the symmetric diff. 0
Check is successful!


In [14]:
!scripts/data_convert/split_queries.sh msmarco_doc tmp dev bitext -part1_qty 5000

Namespace(data_dir='collections/msmarco_doc/input_data', input_subdir='tmp', out_subdir1='dev', out_subdir2='bitext', part1_fract=None, part1_qty=5000, seed=0)
Read all the queries
Read all the QRELs                                      
# of QRELs with query IDs not present in any part 0
The first part will have 5000 documents
Part dev # of queries: 5000 # of QRELs: 5000
Part bitext # of queries: 352013 # of QRELs: 352013


In [15]:
!scripts/check_utils/check_split_queries.sh     msmarco_doc tmp dev bitext

Namespace(data_dir='collections/msmarco_doc/input_data', input_subdir='tmp', out_subdir1='dev', out_subdir2='bitext')
Read all the queries from the main dir
Read all the QRELs from the main dir                    
Part dev # of queries # 5000 of queries with at least one QREL: 5000
Part bitext # of queries # 352013 of queries with at least one QREL: 352013
# of queries in the original folder: 357013 # of queries in split folders: 357013 # of queries in the symmetric diff. 0
Check is successful!


In [16]:
!rm -rf collections/msmarco_doc/input_data/tmp/

In [19]:
!rm -rf collections/msmarco_doc/input_data/train/
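
The two-stage split above can be sketched as a seeded shuffle followed by a cut. This is only an illustration of the idea, not the code of `split_queries.sh`; the function `split_ids` and the stand-in query IDs are hypothetical, while the quantities and `seed=0` mirror the logged arguments.

```python
import random


def split_ids(query_ids, part1_qty, seed=0):
    """Shuffle query IDs with a fixed seed and cut off the first part1_qty."""
    ids = sorted(query_ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)
    return ids[:part1_qty], ids[part1_qty:]


# Stand-in IDs: same count as the original `train` part.
all_ids = [f"q{i}" for i in range(367013)]
train_fusion, tmp = split_ids(all_ids, 10000)  # train -> train_fusion + tmp
dev, bitext = split_ids(tmp, 5000)             # tmp   -> dev + bitext

# The parts must together cover the original set exactly
# (a symmetric difference of size 0, as the check script reports).
parts = set(train_fusion) | set(dev) | set(bitext)
assert len(parts ^ set(all_ids)) == 0
print(len(train_fusion), len(dev), len(bitext))  # prints: 10000 5000 352013
```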

## Sanity check: dataset statistics

In [20]:
!scripts/report/get_basic_collect_stat.sh msmarco_doc

Checking data sub-directory: bitext
Checking data sub-directory: dev
Checking data sub-directory: dev_official
Checking data sub-directory: docs
Found indexable data file: docs/AnswerFields.jsonl.gz
Checking data sub-directory: test2019
Checking data sub-directory: test2020
Checking data sub-directory: train_fusion
Found query file: bitext/QuestionFields.jsonl
Found query file: dev/QuestionFields.jsonl
Found query file: dev_official/QuestionFields.jsonl
Found query file: test2019/QuestionFields.jsonl
Found query file: test2020/QuestionFields.jsonl
Found query file: train_fusion/QuestionFields.jsonl
getIndexQueryDataInfo return value:  docs AnswerFields.jsonl.gz ,bitext,dev,dev_official,test2019,test2020,train_fusion QuestionFields.jsonl
Using the data input files: AnswerFields.jsonl.gz, QuestionFields.jsonl
Index dirs: docs
Query dirs:  bitext dev dev_official test2019 test2020 train_fusion
Queries/questions:
bitext 352013
dev 5000
dev_official 5193
test2019 200
test2020 200
train_fusi

## Indexing

### Lucene index

In [None]:
!scripts/index/create_lucene_index.sh msmarco_doc

### Forward indices (text_raw is not really necessary for this notebook)

In [None]:
!field_def="title_unlemm:parsedText url_unlemm:parsedText \
            text:parsedText body:parsedText \
            text_bert_tok:parsedText \
            text_raw:raw"   ;\
scripts/index/create_fwd_index.sh msmarco_doc mapdb "$field_def"
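
The `field_def` string is a whitespace-separated list of `name:type` pairs. A small sketch (the parser `parse_field_def` is hypothetical and only mirrors the string format used in the cell above) makes it easy to check which fields get which forward-index type before starting a long indexing run:

```python
def parse_field_def(field_def):
    """Split 'name:type name:type ...' into a {field_name: field_type} dict."""
    fields = {}
    for token in field_def.split():  # split() collapses runs of whitespace
        name, sep, ftype = token.partition(":")
        if not sep or not name or not ftype:
            raise ValueError(f"Malformed field definition: {token!r}")
        fields[name] = ftype
    return fields


field_def = """title_unlemm:parsedText url_unlemm:parsedText
               text:parsedText body:parsedText
               text_bert_tok:parsedText
               text_raw:raw"""
fields = parse_field_def(field_def)
for name, ftype in fields.items():
    print(f"{name:15s} {ftype}")
```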


## Training Model 1

### Generating parallel corpora (bitext) for fields: title_unlemm, url_unlemm, body, text_bert_tok

Note that bitext is generated for a pair of (index) and (query) fields. The query field may differ from the index field, but both should use a similar tokenization/lemmatization approach!

In [None]:
!scripts/giza/export_bitext_plain.sh msmarco_doc title_unlemm text_unlemm 1.5

In [None]:
!scripts/giza/export_bitext_plain.sh msmarco_doc url_unlemm text_unlemm 1.5

In [None]:
!scripts/giza/export_bitext_plain.sh msmarco_doc body text_unlemm 1.5

In [None]:
!scripts/giza/export_bitext_plain.sh msmarco_doc text_bert_tok text_bert_tok 1.5

### Training Model 1 (using MGIZA) for fields title_unlemm, url_unlemm, body, text_bert_tok

In [None]:
!time scripts/giza/create_tran.sh msmarco_doc title_unlemm

In [None]:
!time scripts/giza/create_tran.sh msmarco_doc url_unlemm

In [None]:
!time scripts/giza/create_tran.sh msmarco_doc body

In [None]:
!time scripts/giza/create_tran.sh msmarco_doc text_bert_tok
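
For intuition about what is being estimated, here is a minimal sketch of the standard EM procedure for IBM Model 1 on a toy bitext. It is not MGIZA's implementation (there is no NULL word, no smoothing, and no pruning), and the function name and toy data are made up for illustration.

```python
from collections import defaultdict


def train_model1(bitext, iterations=5):
    """Estimate t(q | d) with EM; bitext is a list of (doc_tokens, query_tokens)."""
    query_vocab = {q for _, qs in bitext for q in qs}
    t = defaultdict(lambda: 1.0 / len(query_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected (q, d) alignment counts
        total = defaultdict(float)  # expected counts per document token d
        # E-step: distribute each query token over the document tokens.
        for doc, query in bitext:
            for q in query:
                norm = sum(t[(q, d)] for d in doc)
                for d in doc:
                    frac = t[(q, d)] / norm
                    count[(q, d)] += frac
                    total[d] += frac
        # M-step: renormalize so that t(. | d) sums to one for every d.
        t = defaultdict(float, {(q, d): c / total[d] for (q, d), c in count.items()})
    return t


toy_bitext = [
    ("automobile price".split(), "car cost".split()),
    ("automobile repair".split(), "car fix".split()),
    ("bike repair".split(), "bicycle fix".split()),
]
t = train_model1(toy_bitext)
for q, d in [("cost", "price"), ("cost", "automobile"), ("car", "automobile")]:
    print(f"t({q} | {d}) = {t[(q, d)]:.3f}")
```

Even on three sentence pairs, the co-occurrence pattern is enough for EM to prefer `cost`/`price` over `cost`/`automobile`.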

### Output train/test perplexity (sanity check)

In [34]:
!cat /hdd2/BOL1PI/msrepro/FlexNeuART/collections/msmarco_doc/derived_data/giza/title_unlemm.orig/output.perp 

#trnsz	tstsz	iter	model	trn-pp		test-pp		trn-vit-pp		tst-vit-pp
779240	0	0	Model1	261200		N/A		1.06565e+06		N/A
779240	0	1	Model1	115.107		N/A		159.934		N/A
779240	0	2	Model1	73.5553		N/A		91.3005		N/A
779240	0	3	Model1	68.0878		N/A		80.6573		N/A
779240	0	4	Model1	66.5902		N/A		77.0477		N/A


In [35]:
!cat /hdd2/BOL1PI/msrepro/FlexNeuART/collections/msmarco_doc/derived_data/giza/url_unlemm.orig/output.perp 

#trnsz	tstsz	iter	model	trn-pp		test-pp		trn-vit-pp		tst-vit-pp
1378916	0	0	Model1	576745		N/A		2.50217e+06		N/A
1378916	0	1	Model1	195.208		N/A		303.526		N/A
1378916	0	2	Model1	135.157		N/A		187.91		N/A
1378916	0	3	Model1	124.333		N/A		160.974		N/A
1378916	0	4	Model1	120.828		N/A		150.774		N/A


In [36]:
!cat /hdd2/BOL1PI/msrepro/FlexNeuART/collections/msmarco_doc/derived_data/giza/body.orig/output.perp 

#trnsz	tstsz	iter	model	trn-pp		test-pp		trn-vit-pp		tst-vit-pp
86566782	0	0	Model1	5.45427e+06		N/A		2.49975e+07		N/A
86566782	0	1	Model1	2566.44		N/A		4654.73		N/A
86566782	0	2	Model1	2003.74		N/A		3260.64		N/A
86566782	0	3	Model1	1890.31		N/A		2886.05		N/A
86566782	0	4	Model1	1848.82		N/A		2719.57		N/A


In [37]:
!cat /hdd2/BOL1PI/msrepro/FlexNeuART/collections/msmarco_doc/derived_data/giza/text_bert_tok.orig/output.perp 

#trnsz	tstsz	iter	model	trn-pp		test-pp		trn-vit-pp		tst-vit-pp
98617546	0	0	Model1	60743.3		N/A		inf		N/A
98617546	0	1	Model1	1422.96		N/A		6025.25		N/A
98617546	0	2	Model1	1160.37		N/A		3929.42		N/A
98617546	0	3	Model1	1096.46		N/A		3276.15		N/A
98617546	0	4	Model1	1069.87		N/A		2959.66		N/A
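
The sanity check amounts to verifying that training perplexity drops at every iteration. As a rough sketch, the table can be parsed and checked programmatically; the helper name `parse_perp` is hypothetical, and the sample is the `title_unlemm` output shown above.

```python
def parse_perp(text):
    """Parse an output.perp table into a list of (iteration, train_perplexity)."""
    rows = []
    for line in text.strip().splitlines():
        if line.startswith("#"):
            continue  # skip the header line
        cols = line.split()  # columns are separated by one or more tabs
        rows.append((int(cols[2]), float(cols[4])))
    return rows


title_perp = """
#trnsz tstsz iter model trn-pp test-pp trn-vit-pp tst-vit-pp
779240 0 0 Model1 261200 N/A 1.06565e+06 N/A
779240 0 1 Model1 115.107 N/A 159.934 N/A
779240 0 2 Model1 73.5553 N/A 91.3005 N/A
779240 0 3 Model1 68.0878 N/A 80.6573 N/A
779240 0 4 Model1 66.5902 N/A 77.0477 N/A
"""
rows = parse_perp(title_perp)
pps = [pp for _, pp in rows]
decreasing = all(a > b for a, b in zip(pps, pps[1:]))
print("training perplexity decreases monotonically:", decreasing)
```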


### Convert MGIZA output to our format and filter out lower-frequency entries

In [None]:
!col=msmarco_doc ; \
 min_prob=0.001 ; \
 max_word_qty=1000000 ; \
for field in title_unlemm url_unlemm body text_bert_tok ; do \
  scripts/giza/filter_tran_table_and_voc.sh $col $field $min_prob $max_word_qty ; \
  if [ "$?" != "0" ] ; then echo "Failure for field: $field!!!" ; break ; fi \
done ; \
echo "All is done!"
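
As a rough mental model of this filtering step, one plausible reading of the two parameters is: drop translation entries whose probability is below `min_prob`, and keep only the `max_word_qty` most frequent words. The sketch below illustrates that reading with made-up data; it is an assumption for illustration, not a description of what `filter_tran_table_and_voc.sh` actually does internally.

```python
def filter_tran_table(tran, word_freq, min_prob, max_word_qty):
    """Keep entries with prob >= min_prob whose words are among the most frequent.

    tran:      {(query_word, doc_word): probability}
    word_freq: {word: corpus frequency}
    """
    by_freq = sorted(word_freq, key=word_freq.get, reverse=True)
    keep_words = set(by_freq[:max_word_qty])
    return {
        (q, d): p
        for (q, d), p in tran.items()
        if p >= min_prob and q in keep_words and d in keep_words
    }


# Toy data (hypothetical): one strong entry, one low-probability entry,
# and one entry involving a rare word.
tran = {("car", "automobile"): 0.7, ("car", "the"): 0.0005, ("rare", "automobile"): 0.2}
word_freq = {"the": 100, "car": 10, "automobile": 8, "rare": 1}
filtered = filter_tran_table(tran, word_freq, min_prob=0.001, max_word_qty=3)
print(filtered)
```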

## Run experiments

### Optionally warm up the indices

In [None]:
!scripts/exper/warmup_indices.sh msmarco_doc

### Copy the experiment descriptors from the GitHub repo to the respective collection sub-folder

In [64]:
!cp -r scripts/data_convert/msmarco/exper_desc.lb2020-12-04/ collections/msmarco_doc

### Baseline: BM25 run on the "official" development set

In [None]:
!scripts/exper/run_experiments.sh   \
   msmarco_doc  \
   exper_desc.lb2020-12-04/bm25_test.json  \
   -test_part dev_official \
   -no_separate_shell   \
   -metric_type RR@100 \
   -test_cand_qty_list 100,1000

### In the end this script should output:

```
================================================================================
N=100
================================================================================
# of queries:    5193
NDCG@10:        0.313800
NDCG@20:        0.339600
NDCG@100:       0.372600
ERR@20:         0.016410
P20:            0.030200
MAP:            0.267100
MRR:            0.267100
Recall:         0.781822
GDEVAL NDCG@20: 0.339560
```
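
The runs here are optimized for `RR@100`, the reciprocal rank of the first relevant document within the top 100 candidates. The following is a minimal sketch of that metric and of average precision, with hypothetical function names and document IDs; it is not the evaluation code used by the scripts. It also shows why MAP and MRR coincide when a query has exactly one relevant document, which is consistent with the identical MAP and MRR values reported above.

```python
def reciprocal_rank(ranked_ids, relevant, k=100):
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids, relevant):
    """Mean of precision values at the ranks of the relevant documents."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0


ranked = ["D3", "D7", "D1"]  # hypothetical ranking for one query
single = {"D7"}              # exactly one relevant document
rr = reciprocal_rank(ranked, single)
ap = average_precision(ranked, single)
print(rr, ap)  # prints: 0.5 0.5
```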

### Train the LambdaMART model using train_fusion and test it on dev_official

In [None]:
!scripts/exper/run_experiments.sh   \
   msmarco_doc  \
   exper_desc.lb2020-12-04/best_classic_ir_expand_full_lmart_train.json  \
   -train_part train_fusion \
   -test_part dev_official \
   -no_separate_shell   \
   -metric_type RR@100 \
   -test_cand_qty_list 100,1000

### In the end this script should output:

```
================================================================================
N=100
================================================================================
# of queries:    5193
NDCG@10:        0.396600
NDCG@20:        0.421000
NDCG@100:       0.447700
ERR@20:         0.020940
P20:            0.035600
MAP:            0.338900
MRR:            0.338900
Recall:         0.851916
GDEVAL NDCG@20: 0.421030
```
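
To put the two expected outputs side by side, the short calculation below compares the fusion model with the BM25 baseline on `dev_official` at N=100. The numbers are copied from the two expected outputs in this notebook; the variable names are only illustrative.

```python
# Expected values on dev_official (N=100), copied from the outputs in this notebook.
bm25 = {"MRR": 0.267100, "Recall": 0.781822}
lmart = {"MRR": 0.338900, "Recall": 0.851916}

# Relative gain in MRR and absolute gain in recall over the baseline.
rel_mrr_gain = (lmart["MRR"] - bm25["MRR"]) / bm25["MRR"]
abs_recall_gain = lmart["Recall"] - bm25["Recall"]
print(f"Relative MRR gain:    {rel_mrr_gain:.1%}")
print(f"Absolute recall gain: {abs_recall_gain:.3f}")
```

Since the performance of your fusion model may vary somewhat, treat these gains as approximate targets rather than exact values.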

### Location of logs, trained models, and TREC-style runs

In [74]:
!ls collections/msmarco_doc/results/dev_official/feat_exper/best_classic_ir_full_lmart_expand

exper.log  letor  rep  trec_runs


### Copy the trained model to the location specified in the descriptors and test it on TREC NIST 2019 data.

In [67]:
!cp collections/msmarco_doc/results/dev_official/feat_exper/best_classic_ir_full_lmart_expand/letor/out_msmarco_doc_train_fusion_20.model collections/msmarco_doc/exper_desc.lb2020-12-04/models/lmart.model

In [None]:
!scripts/exper/run_experiments.sh   \
   msmarco_doc  \
   exper_desc.lb2020-12-04/best_classic_ir_expand_full_lmart_test.json  \
   -test_part test2019 \
   -no_separate_shell   \
   -metric_type RR@100 \
   -test_cand_qty_list 100,1000

### In the end the script should output:

```
================================================================================
N=100
================================================================================
# of queries:    43
NDCG@10:        0.589900
NDCG@20:        0.561800
NDCG@100:       0.544500
ERR@20:         0.394260
P20:            0.577900
MAP:            0.262600
MRR:            0.888400
Recall:         0.219494
GDEVAL NDCG@20: 0.520620
```