# Data preparation/processing notebook

### First we need to move to the top-level directory.

In [1]:
cd ../../..

/home/leo/SourceTreeGit/FlexNeuART.refact2021


This example covers the Manner subset of the Yahoo Answers Comprehensive collection.
However, a similar procedure can be applied to a bigger collection. All
experiments assume that the variable `COLLECT_ROOT` in the script `scripts/config.sh`
is set to `collections` and that all collections are stored in the `collections`
sub-directory (relative to the source code root).
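
The directory layout used throughout this notebook can be summarized with a small helper. This is only an illustrative sketch: the function `collection_paths` is hypothetical (it is not part of FlexNeuART), and the sub-directory names are simply the ones that appear in the commands and logs of this notebook.

```python
from pathlib import PurePosixPath

# Sub-directory names that appear in the commands and logs of this notebook.
SUBDIRS = ["input_pre_raw", "input_raw", "input_data", "lucene_index",
           "forward_index", "derived_data", "exper_desc", "results"]

def collection_paths(collection, collect_root="collections"):
    """Hypothetical helper: map each sub-directory name to its path
    relative to the source code root."""
    root = PurePosixPath(collect_root) / collection
    return {name: str(root / name) for name in SUBDIRS}

paths = collection_paths("manner")
print(paths["input_pre_raw"])  # collections/manner/input_pre_raw
```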


Create a raw-data directory and store the raw data there:
```
mkdir -p collections/manner/input_pre_raw

cp <data path>/manner.xml.bz2 collections/manner/input_pre_raw/
```

In [4]:
!ls collections/manner/input_pre_raw

manner.xml.bz2


Now, we need to split the data. The following command creates several
training and testing subsets, including a ``bitext`` subset that can
be used to train either IBM Model 1 or a neural IR model.
We reserve a much smaller ``train`` subset to train
a fusion/LETOR model that can combine several signals:

In [6]:
!mkdir -p collections/manner/input_raw/

In [10]:
!scripts/data_convert/yahoo_answers/split_yahoo_answers_input.sh \
  -i collections/manner/input_pre_raw/manner.xml.bz2  \
  -o collections/manner/input_raw/manner-v2.0 \
  -n dev1,dev2,test,train,bitext \
  -p 0.05,0.05,0.1,0.1,0.7

Using probabilities:
dev1 : 0.05
dev2 : 0.05
test : 0.1
train : 0.1
bitext : 0.7
Processed 1000 documents
Processed 2000 documents
Processed 3000 documents
Processed 4000 documents
Processed 5000 documents
Processed 6000 documents
Processed 7000 documents
Processed 8000 documents
Processed 9000 documents
Processed 10000 documents
Processed 11000 documents
Processed 12000 documents
Processed 13000 documents
Processed 14000 documents
Processed 15000 documents
Processed 16000 documents
Processed 17000 documents
Processed 18000 documents
Processed 19000 documents
Processed 20000 documents
Processed 21000 documents
Processed 22000 documents
Processed 23000 documents
Processed 24000 documents
Processed 25000 documents
Processed 26000 documents
Processed 27000 documents
Processed 28000 documents
Processed 29000 documents
Processed 30000 documents
Processed 31000 documents
Processed 32000 documents
Processed 33000 documents
Processed 34000 documents
Processed 35000 documents
Processed 36000 do
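
The idea of a probabilistic split can be illustrated with a short Python sketch. It assumes that each document is assigned to exactly one subset by sampling from the listed probabilities; the actual logic of `split_yahoo_answers_input.sh` may differ, and the function `assign_split` is hypothetical.

```python
import random
from collections import Counter

# Subset names and probabilities from the command above.
names = ["dev1", "dev2", "test", "train", "bitext"]
probs = [0.05, 0.05, 0.1, 0.1, 0.7]
assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to one"

def assign_split(rng, names, probs):
    """Hypothetical helper: pick one subset name according to probs."""
    return rng.choices(names, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed for reproducibility
doc_qty = 100000        # an arbitrary number of simulated documents
counts = Counter(assign_split(rng, names, probs) for _ in range(doc_qty))
for name in names:
    print(name, counts[name] / doc_qty)
```

With enough documents, the observed fractions approach the requested probabilities, so ``bitext`` receives roughly 70% of the data.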

Finally, we can create input data in the JSONL format. Note that the last argument defines the
part of the collection that is used to create a parallel corpus (i.e.,
a bitext), which is generated in addition to the JSONL input files:

In [None]:
!scripts/data_convert/yahoo_answers/convert_yahoo_answers.sh \
  manner \
  dev1,dev2,test,train,bitext \
  bitext
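
To inspect the generated files, one can read a few JSONL records in Python. The following is a minimal sketch: the helper `read_jsonl` is hypothetical, and the demo records (including the field names `text` and `text_raw`) are made up for illustration, so the real files may contain different fields.

```python
import gzip
import json
import os
import tempfile

def read_jsonl(path, max_records=None):
    """Read JSON records (one per line) from a plain or gzipped file."""
    opener = gzip.open if path.endswith(".gz") else open
    records = []
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip empty lines
            records.append(json.loads(line))
            if max_records is not None and len(records) >= max_records:
                break
    return records

# Self-contained demo with made-up records written to a temporary file.
demo_records = [{"text": "tie bow tie", "text_raw": "How to tie a bow tie?"},
                {"text": "bake bread", "text_raw": "How to bake bread?"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        for rec in demo_records:
            f.write(json.dumps(rec) + "\n")
    loaded = read_jsonl(demo_path, max_records=1)
print(loaded)
```

On the real data, the same helper could be pointed at a file such as `collections/manner/input_data/dev1/AnswerFields.jsonl.gz`.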

## Sanity checks
As a basic sanity check, it is recommended to run the following script:

In [14]:
!scripts/report/get_basic_collect_stat.sh manner

Using collection root: collections
Checking data sub-directory: bitext
Found indexable data file: bitext/AnswerFields.jsonl.gz
Checking data sub-directory: dev1
Found indexable data file: dev1/AnswerFields.jsonl.gz
Checking data sub-directory: dev2
Found indexable data file: dev2/AnswerFields.jsonl.gz
Checking data sub-directory: test
Found indexable data file: test/AnswerFields.jsonl.gz
Checking data sub-directory: train
Found indexable data file: train/AnswerFields.jsonl.gz
Found query file: bitext/QuestionFields.jsonl
Found query file: dev1/QuestionFields.jsonl
Found query file: dev2/QuestionFields.jsonl
Found query file: test/QuestionFields.jsonl
Found query file: train/QuestionFields.jsonl
getIndexQueryDataInfo return value:  bitext,dev1,dev2,test,train AnswerFields.jsonl.gz ,bitext,dev1,dev2,test,train QuestionFields.jsonl
Using the data input files: AnswerFields.jsonl.gz, QuestionFields.jsonl
Index dirs: bitext dev1 dev2 test train
Query dirs:  bitext dev1 dev2 test train
Querie

As a more thorough check, we would like to ensure that the split collection
does not have data leaks, i.e., similar question-answer pairs shared among different splits.
It is most crucial to check for overlaps between each testing subset (``dev1``, ``dev2``, ``test``) and ``bitext``,
as well as between each testing subset and ``train``. For example:

In [15]:
!./scripts/check_utils/check_split_leak.py \
  --data_dir collections/manner/input_data/ \
  --input_subdir1 dev1 \
  --input_subdir2 bitext \
  -k 1  --min_jacc 0.75 

Your CPU supports instructions that this binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX
For maximum performance, you can install NMSLIB from sources 
pip install --no-binary :all: nmslib
Namespace(data_dir='collections/manner/input_data/', input_subdir1='dev1', input_subdir2='bitext', k=1, min_jacc=0.75, sample_prob1=1.0, sample_prob2=1.0, use_hnsw=False)
Read 7034 queries from collections/manner/input_data/dev1/QuestionFields.jsonl sampled 7034
Read 99950 queries from collections/manner/input_data/bitext/QuestionFields.jsonl sampled 99950
Read 7034 qrel sets from collections/manner/input_data/dev1/qrels.txt
Read 99950 qrel sets from collections/manner/input_data/bitext/qrels.txt
loading answers: 40497it [00:00, 42570.85it/s]
Read 40497 answers from collections/manner/input_data/dev1/AnswerFields.jsonl.gz
loading answers: 572790it [00:13, 42713.65it/s]
Read 572790 answers from collections/manner/input_data/bitext/AnswerFields.jsonl.gz
k-NN search method brute_force
reading quer
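
The principle behind this check can be sketched in a few lines of Python. The sketch assumes that similarity is the Jaccard index over sets of white-space separated tokens and that only the single nearest neighbor is inspected (mirroring `-k 1` and `--min_jacc 0.75`). The functions `jaccard` and `find_leaks` and the toy questions are hypothetical; the actual script may tokenize and compare the data differently.

```python
def jaccard(text1, text2):
    """Jaccard similarity between sets of white-space separated tokens."""
    set1, set2 = set(text1.split()), set(text2.split())
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)

def find_leaks(queries1, queries2, min_jacc=0.75):
    """For each query in queries1, find its nearest neighbor in queries2
    by brute force and report it if the similarity is at least min_jacc."""
    leaks = []
    if not queries2:
        return leaks
    for q1 in queries1:
        nearest = max(queries2, key=lambda q2: jaccard(q1, q2))
        sim = jaccard(q1, nearest)
        if sim >= min_jacc:
            leaks.append((q1, nearest, sim))
    return leaks

# Made-up toy questions standing in for two splits.
dev1_questions = ["how do i tie a tie", "how to bake bread"]
bitext_questions = ["how do i tie a bow tie", "how to fix a bike"]
leaks = find_leaks(dev1_questions, bitext_questions)
for q1, q2, sim in leaks:
    print(f"{sim:.3f}\t{q1}\t{q2}")
```

Only the first toy question is reported: its token set shares five of six distinct tokens with its nearest neighbor, which exceeds the threshold.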

## Indexing
First, create a Lucene index:

In [None]:
!scripts/index/create_lucene_index.sh manner

Then, create a forward index. There are several types of the index with different
space/efficiency trade-offs:
1. `mapdb` is the fastest, but it is neither the smallest nor the most memory-efficient index. Its size may also be limited by the number of `mmap` ranges permitted by your OS.
2. `flatdata` requires less memory at index time, but it is somewhat slower at re-ranking.

In [None]:
!scripts/index/create_fwd_index.sh \
  manner \
  mapdb \
  'text:parsedBOW text_unlemm:parsedText text_bert_tok:parsedText text_raw:raw'

The last argument defines the index type for each indexed field.
At a high level, there are two types of fields: a parsed text field and a raw field.
The raw text field keeps text "as is". A parsed field processor tokenizes the text by white space and compiles token statistics.
More specifically:
1. `parsedBOW` index keeps only a bag of words;
2. `parsedText` keeps the original word sequence;
3. `raw` is the index that stores text "as is" without any changes.
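
The difference between the three field types can be illustrated with an in-memory analogue in Python. This is only a conceptual sketch; it says nothing about the actual on-disk format of the forward index.

```python
from collections import Counter

text = "to be or not to be"

# Analogue of `raw`: the text is stored without any changes.
raw = text

# Analogue of `parsedText`: white-space tokens in their original order.
parsed_text = text.split()

# Analogue of `parsedBOW`: only token counts, the word order is lost.
parsed_bow = Counter(parsed_text)

print(raw)
print(parsed_text)
print(sorted(parsed_bow.items()))
```

A bag of words is sufficient for models such as BM25, whereas models that depend on word order need the token sequence.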

# Generating & using optional (derived) data

## Training an IBM Model 1 model

Here we create a model for the field `text_bert_tok`. This script requires MGIZA to be compiled (make sure you ran the script `install_packages.sh`):

In [None]:
!scripts/giza/create_tran.sh manner text_bert_tok

The resulting translation table further needs to be cleaned up and converted to a binary format (infrequent tokens need to be filtered out as well).
Note that for BERT-tokenized text, which has fewer than
100K unique tokens, the maximum number of most frequent words
is too high to have an effect. However, it makes sense for, e.g.,
unlemmatized text fields with large vocabularies.

In [None]:
!min_tran_prob=0.001 ; top_word_qty=1000000 ; echo $min_tran_prob ; top_word_qty=100000 ; \
scripts/giza/filter_tran_table_and_voc.sh \
    manner \
    text_bert_tok \
    $min_tran_prob \
    $top_word_qty
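
The effect of the two filtering parameters can be sketched as follows. The sketch assumes that an entry is kept when its translation probability is at least `min_tran_prob` and both words belong to the `top_word_qty` most frequent words. The function `filter_tran_table` and the toy data are hypothetical, and the actual script may apply these thresholds differently.

```python
from collections import Counter

def filter_tran_table(tran_probs, word_freq, min_tran_prob, top_word_qty):
    """Keep translation entries with a sufficiently high probability
    whose source and target words are both among the most frequent ones."""
    top_words = {w for w, _ in Counter(word_freq).most_common(top_word_qty)}
    return {
        (src, dst): prob
        for (src, dst), prob in tran_probs.items()
        if prob >= min_tran_prob and src in top_words and dst in top_words
    }

# Made-up toy vocabulary frequencies and translation probabilities.
word_freq = {"bike": 50, "repair": 30, "fix": 20, "zzyzx": 1}
tran_probs = {("fix", "repair"): 0.4,
              ("fix", "bike"): 0.0005,   # below the probability threshold
              ("fix", "zzyzx"): 0.2}     # target word is too infrequent
filtered = filter_tran_table(tran_probs, word_freq,
                             min_tran_prob=0.001, top_word_qty=3)
print(filtered)
```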

## Training CEDR neural ranking models

Training requires exporting data in the format of the
CEDR framework ([MacAvaney et al., 2019](https://github.com/Georgetown-IR-Lab/cedr)).
The following command
generates training data in the CEDR format for the collection `manner`
and the field `text_raw`. The training data is generated from the split `bitext`,
whereas the split `dev1` is used to generate validation data:

In [18]:
!scripts/export_train/export_cedr.sh \
  manner \
  text_raw \
  bitext \
  dev1 \
  -thread_qty 4 \
  -hard_neg_qty 0 \
  -sample_easy_neg_qty 0 \
  -sample_med_neg_qty 20

Using collection root: collections
Train split: bitext
Eval split: dev1
Random seed: 0
Output directory: collections/manner/derived_data/cedr_train/text_raw
# of threads: 4
Index field: text_raw
Query field: text_raw
Candidate provider parameters:  -cand_prov "lucene" -u "collections/manner/lucene_index" 
Resource parameters: -fwd_index_dir "collections/manner/forward_index/" -embed_dir "collections/manner/derived_data/embeddings/" -giza_root_dir "collections/manner/derived_data/giza" 
A # of hard/medium/easy samples per query: 0/20/0
A max. # of candidate records to generate training data: 500
A max. # of candidate records to generate test data: 10
Max train query # param.: 
Max test/dev query # param.: 
Case handling param: 
JAVA_OPTS=-Xms16469316k -Xmx28821303k -server
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportTrainPairs - Candidate provider type: lucene URI: collections/manner/lucene_index config: null
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.ExportTrainPairs - Number 

To train the model, we can use a convenience wrapper script that reads most parameters from a configuration file. First, the configuration file needs to be copied to the collection directory:

In [23]:
!cp scripts/exper/sample_model_conf/manner/bert_vanilla_manner.json collections/manner/exper_desc

Note that the following ``train_model.sh`` script assumes that the training data path is **relative** to the ``derived_data`` subdirectory, while other paths are **relative** to the collection root. The training script has a number of options (check them out by running it with the option ``-h``). Here is how one can run the training script (remember that this requires a GPU and PyTorch with CUDA support):

In [None]:
!scripts/cedr/train_model.sh manner cedr_train/text_raw vanilla_bert -json_conf exper_desc/bert_vanilla_manner.json 

The script runs both training and evaluation. The respective statistics are stored in a JSON file:

In [2]:
!cat collections/manner/derived_data/ir_models/vanilla_bert/base/0/train_stat.json 

{
    "0": {
        "loss": 0.2031547787311188,
        "score": 0.1208771612093101,
        "lr": 0.0002,
        "bert_lr": 2e-05,
        "train_time": 7956.939112901688,
        "validation_time": 688.1996886730194
    }
}

This can also be readily compared to the BM25 score **using the same query subset** by running the following:

In [4]:
!trec_eval/trec_eval collections/manner/derived_data/cedr_train/text_raw/qrels.txt collections/manner/derived_data/cedr_train/text_raw/test_run.txt

runid                 	all	fake_run
num_q                 	all	7033
num_ret               	all	70312
num_rel               	all	40489
num_rel_ret           	all	4273
map                   	all	0.0982
gm_map                	all	0.0004
Rprec                 	all	0.1018
bpref                 	all	0.1512
recip_rank            	all	0.2535
iprec_at_recall_0.00  	all	0.2565
iprec_at_recall_0.10  	all	0.2278
iprec_at_recall_0.20  	all	0.1782
iprec_at_recall_0.30  	all	0.1320
iprec_at_recall_0.40  	all	0.0989
iprec_at_recall_0.50  	all	0.0896
iprec_at_recall_0.60  	all	0.0468
iprec_at_recall_0.70  	all	0.0439
iprec_at_recall_0.80  	all	0.0349
iprec_at_recall_0.90  	all	0.0340
iprec_at_recall_1.00  	all	0.0340
P_5                   	all	0.0924
P_10                  	all	0.0608
P_15                  	all	0.0405
P_20                  	all	0.0304
P_30                  	all	0.0203
P_100                 	all	0.0061
P_200                 	all	0.0030
P_500                 	a

We can see that BM25 has a mean average precision (MAP) of ``0.0982``, which is about 19% lower than the validation score of the BERT model (``0.1209``). In other words, **the BERT model is about 23% better than BM25**.
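
The relative difference can be verified with a few lines of Python. The sketch assumes that the `score` entry of `train_stat.json` is directly comparable to the MAP value reported by `trec_eval`; the JSON text is copied from the output above.

```python
import json

# Contents of train_stat.json as printed above.
train_stat_text = """
{
    "0": {
        "loss": 0.2031547787311188,
        "score": 0.1208771612093101,
        "lr": 0.0002,
        "bert_lr": 2e-05,
        "train_time": 7956.939112901688,
        "validation_time": 688.1996886730194
    }
}
"""
bert_score = json.loads(train_stat_text)["0"]["score"]
bm25_map = 0.0982  # the `map` line of the trec_eval output

# Improvement of BERT relative to BM25.
bert_gain = (bert_score - bm25_map) / bm25_map
# Shortfall of BM25 relative to BERT.
bm25_drop = (bert_score - bm25_map) / bert_score

print(f"BERT is {100 * bert_gain:.1f}% better than BM25")
print(f"BM25 is {100 * bm25_drop:.1f}% worse than BERT")
```

The two percentages differ because they use different baselines, which is why it matters to state which model the difference is relative to.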

# Running basic experiments

First let us create a directory to store experiment descriptors:

In [19]:
!mkdir -p collections/manner/exper_desc

## Tuning BM25
A tuning procedure simply executes a number of descriptor files
with various BM25 parameters. To create descriptors one runs:


In [24]:
!scripts/gen_exper_desc/gen_bm25_tune_json_desc.py \
  --index_field_name text \
  --query_field_name text \
  --outdir collections/manner/exper_desc/ \
  --exper_subdir tuning \
  --rel_desc_path exper_desc

Namespace(exper_subdir='tuning', index_field_name='text', outdir='collections/manner/exper_desc/', query_field_name='text', rel_desc_path='exper_desc')


The main experimental descriptor is stored in `collections/manner/exper_desc/bm25tune_text_text.json`,
whereas auxiliary descriptors are stored in `collections/manner/exper_desc/bm25tune_text_text/`.
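
Conceptually, the descriptors enumerate a grid of BM25 parameters. The sketch below assumes a full Cartesian grid over the `k1` and `b` values that are visible in the experiment logs of this notebook; the actual grid produced by `gen_bm25_tune_json_desc.py` may contain other values, and the helper `descriptor_name` is hypothetical.

```python
from itertools import product

# Values visible in the logs of this notebook (the real grid may differ).
k1_values = [0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6]
b_values = [0.3, 0.5, 0.7, 0.9, 1.0]

def descriptor_name(k1, b):
    """Hypothetical helper that mimics the naming seen in the logs,
    e.g., bm25tune_k1=0.4_b=0.3 (trailing zeros are dropped)."""
    return f"bm25tune_k1={k1:g}_b={b:g}"

grid = list(product(k1_values, b_values))
names = [descriptor_name(k1, b) for k1, b in grid]
print(len(names), "configurations, e.g.,", names[0])
```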

Now we can run tuning experiments where we train on `train` and test on `dev1`:

In [25]:
!scripts/exper/run_experiments.sh \
  manner \
  exper_desc/bm25tune_text_text.json \
  -test_part dev1

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 collections/manner/exper_desc/bm25tune_text_text.json
Default test set:                                           dev1
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:tuning/bm25tune_text_text/bm25tune_k1=0.4_b=0.3
extrType:exper_desc/bm25tune_text_text/bm25tune_k1=0.4_b=0.3.json
testOnly:1
modelFinal:exper_desc/models/one_feat.model
Started a process 20972, working dir: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=0.4_b=0.3
Process log file: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=0.4_b=0.3/exper.log
Waiting for 1 child processes
Process with pid=20972 finished successfully.
Parsed experiment parameters:
experSubdir:tuning/bm25

Process with pid=22177 finished successfully.
Parsed experiment parameters:
experSubdir:tuning/bm25tune_text_text/bm25tune_k1=0.6_b=0.5
extrType:exper_desc/bm25tune_text_text/bm25tune_k1=0.6_b=0.5.json
testOnly:1
modelFinal:exper_desc/models/one_feat.model
Started a process 22265, working dir: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=0.6_b=0.5
Process log file: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=0.6_b=0.5/exper.log
Waiting for 1 child processes
Process with pid=22265 finished successfully.
Parsed experiment parameters:
experSubdir:tuning/bm25tune_text_text/bm25tune_k1=0.8_b=0.5
extrType:exper_desc/bm25tune_text_text/bm25tune_k1=0.8_b=0.5.json
testOnly:1
modelFinal:exper_desc/models/one_feat.model
Started a process 22350, working dir: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=0.8_b=0.5
Process log file: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=0.8_b=0.5/exper.log
Wait

Parsed experiment parameters:
experSubdir:tuning/bm25tune_text_text/bm25tune_k1=1_b=0.7
extrType:exper_desc/bm25tune_text_text/bm25tune_k1=1_b=0.7.json
testOnly:1
modelFinal:exper_desc/models/one_feat.model
Started a process 23658, working dir: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=1_b=0.7
Process log file: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=1_b=0.7/exper.log
Waiting for 1 child processes
Process with pid=23658 finished successfully.
Parsed experiment parameters:
experSubdir:tuning/bm25tune_text_text/bm25tune_k1=1.2_b=0.7
extrType:exper_desc/bm25tune_text_text/bm25tune_k1=1.2_b=0.7.json
testOnly:1
modelFinal:exper_desc/models/one_feat.model
Started a process 23743, working dir: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=1.2_b=0.7
Process log file: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=1.2_b=0.7/exper.log
Waiting for 1 child processes
Process with pid=23743 finis

Started a process 25063, working dir: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=1.4_b=0.9
Process log file: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=1.4_b=0.9/exper.log
Waiting for 1 child processes
Process with pid=25063 finished successfully.
Parsed experiment parameters:
experSubdir:tuning/bm25tune_text_text/bm25tune_k1=1.6_b=0.9
extrType:exper_desc/bm25tune_text_text/bm25tune_k1=1.6_b=0.9.json
testOnly:1
modelFinal:exper_desc/models/one_feat.model
Started a process 25147, working dir: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=1.6_b=0.9
Process log file: collections/manner/results/dev1/tuning/bm25tune_text_text/bm25tune_k1=1.6_b=0.9/exper.log
Waiting for 1 child processes
Process with pid=25147 finished successfully.
Parsed experiment parameters:
experSubdir:tuning/bm25tune_text_text/bm25tune_k1=0.4_b=1
extrType:exper_desc/bm25tune_text_text/bm25tune_k1=0.4_b=1.json
testOnly:1
modelFinal:exper_desc/

By default, experiments are run in the background, and
more than one experiment can run at the same time. However, for debugging purposes,
one can run experiments in the foreground by specifying the
option `-no_separate_shell`.

Furthermore, the script `scripts/exper/run_experiments.sh` has a number of parameters,
which might be worth tweaking.
In particular, for "shallow" relevance pools, one
can use the default number of candidates (which is small).
However, for queries with a lot of relevance judgments,
it makes sense to slightly increase the number of top candidate
entries that are used to obtain a fusion model
(parameter ``-train_cand_qty``).

Now, let us obtain experimental results and find the best configuration 
with respect to the Mean Average Precision (MAP), which should be nearly equal to ``0.1165``:

In [26]:
!scripts/report/get_exper_results.sh \
  manner \
  exper_desc/bm25tune_text_text.json \
  bm25tune.tsv \
  -test_part dev1 \
  -flt_cand_qty 250 \
  -print_best_metr map

Using collection root: collections
Including only runs that generated 250 candidate records
Best results for metric map:
Value: 0.116500
Result sub-dir: tuning/bm25tune_text_text/bm25tune_k1=0.4_b=0.7
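
For reference, MAP is the mean of per-query average precision values. The following sketch shows the standard definition on made-up toy rankings; the function names are hypothetical, and the reported numbers in this notebook come from the evaluation scripts rather than from this code.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list: the sum of precision values
    at the ranks of relevant documents divided by the total number of
    relevant documents (retrieved or not)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    prec_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / len(relevant)

def mean_average_precision(runs):
    """runs: a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Made-up toy rankings for two queries.
toy_runs = [
    (["d1", "d2", "d3", "d4"], ["d1", "d3"]),  # AP = (1/1 + 2/3) / 2
    (["a", "b"], ["b", "c"]),                  # AP = (1/2) / 2
]
toy_map = mean_average_precision(toy_runs)
print(f"{toy_map:.4f}")
```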


## Tuning a fusion of IBM Model 1 and BM25

IBM Model 1 has quite a few parameters and can benefit from tuning as well.
Rather than tuning IBM Model 1 alone, we tune its fusion with the BM25 score for the field
`text`. Here we use optimal BM25 coefficients __obtained in the previous experiment__.
Model 1 descriptors are going to be created for the field `text_bert_tok`:

In [19]:
!scripts/gen_exper_desc/gen_model1_exper_json_desc.py \
  -k1 0.4 -b 0.7  \
  --query_field_name text_bert_tok \
  --index_field_name text_bert_tok \
  --outdir collections/manner/exper_desc/ \
  --rel_desc_path exper_desc

Namespace(b=0.7, exper_subdir='feat_exper', index_field_name='text_bert_tok', k1=0.4, outdir='collections/manner/exper_desc/', query_field_name='text_bert_tok', rel_desc_path='exper_desc')


Now we can run tuning experiments where we train on `train` and test on `dev1`:

In [None]:
!scripts/exper/run_experiments.sh \
  manner \
  exper_desc/model1tune_text_bert_tok_text_bert_tok.json \
  -test_part dev1

Obtaining the best configuration with respect to MAP:

In [21]:
!scripts/report/get_exper_results.sh \
  manner \
  exper_desc/model1tune_text_bert_tok_text_bert_tok.json \
  model1_text_bert_tok_tune.tsv \
  -test_part dev1 \
  -flt_cand_qty 250 \
  -print_best_metr map

Using collection root: collections
Including only runs that generated 250 candidate records
Best results for metric map:
Value: 0.137500
Result sub-dir: feat_exper/model1tune_text_bert_tok_text_bert_tok/bm25=text+model1=text_bert_tok+lambda=0.3+probSelfTran=0.15
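
To give an intuition for the tuned parameters, here is a sketch of a textbook-style IBM Model 1 query likelihood smoothed with collection probabilities. It assumes that `lam` plays the role of the smoothing weight `lambda` and that `prob_self_tran` is a fixed probability of translating a word into itself (`probSelfTran`). The function, the toy data, and the exact formula are assumptions made for illustration and may differ from the FlexNeuART implementation.

```python
import math
from collections import Counter

def model1_log_score(query_tokens, doc_tokens, tran_prob, coll_prob,
                     lam, prob_self_tran, min_prob=1e-9):
    """Log-likelihood of the query given the document: each query word is
    'translated' from document words and smoothed with its collection
    probability."""
    if not doc_tokens:
        return float("-inf")
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for q in query_tokens:
        tran_sum = 0.0
        for w, cnt in doc_counts.items():
            # Fixed self-translation probability, table lookup otherwise.
            t = prob_self_tran if w == q else tran_prob.get((q, w), 0.0)
            tran_sum += t * cnt / doc_len
        p = (1 - lam) * tran_sum + lam * coll_prob.get(q, min_prob)
        score += math.log(max(p, min_prob))
    return score

# Made-up toy translation table and collection probabilities.
tran_prob = {("fix", "repair"): 0.5}
coll_prob = {"fix": 0.01}
s_related = model1_log_score(["fix"], ["repair", "bike"], tran_prob,
                             coll_prob, lam=0.3, prob_self_tran=0.15)
s_unrelated = model1_log_score(["fix"], ["bake", "bread"], tran_prob,
                               coll_prob, lam=0.3, prob_self_tran=0.15)
print(s_related > s_unrelated)
```

The document that contains a likely translation of the query word receives a higher score even though it shares no words with the query, which is the lexical-gap effect that the fusion with BM25 is meant to exploit.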


## Tuning RM3

The RM3 component implements pseudo-relevance feedback via re-ranking.
The whole process is quite similar to tuning BM25. First, we create the descriptors:

In [33]:
!scripts/gen_exper_desc/gen_rm3_exper_json_desc.py \
  -k1 0.4 -b 0.7  \
  --index_field_name text \
  --query_field_name text \
  --outdir collections/manner/exper_desc/ \
  --exper_subdir tuning \
  --rel_desc_path exper_desc

Namespace(b=0.7, exper_subdir='tuning', index_field_name='text', k1=0.4, outdir='collections/manner/exper_desc/', query_field_name='text', rel_desc_path='exper_desc')


Now we can run tuning experiments where we train on `train` and test on `dev1`:

In [None]:
!scripts/exper/run_experiments.sh \
  manner \
  exper_desc/rm3tune_text_text.json \
  -test_part dev1

Obtaining the best configuration:

In [6]:
!scripts/report/get_exper_results.sh \
  manner \
  exper_desc/rm3tune_text_text.json \
  rm3tune.tsv \
  -test_part dev1 \
  -flt_cand_qty 250 \
  -print_best_metr map

Using collection root: collections
Including only runs that generated 250 candidate records
Best results for metric map:
Value: 0.115600
Result sub-dir: tuning/rm3tune_text_text/rm3=text+text_origWeight=0.9_topDocQty=1_topTermQty=2_k1=0.4_0.7
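
The parameters in the best configuration name can be related to a generic RM3 formulation, sketched below. The sketch assumes that `orig_weight` interpolates the original query model with a relevance model estimated from the top documents, and that only the `top_term_qty` most probable feedback terms are kept. The function and the toy data are hypothetical, and the FlexNeuART implementation may differ in details.

```python
from collections import Counter

def rm3_query_model(query_tokens, top_docs, orig_weight, top_term_qty):
    """top_docs: a non-empty list of (doc_tokens, doc_weight) pairs, where
    the weight is, e.g., a retrieval score of a top-ranked document.
    Returns a dictionary mapping terms to interpolated probabilities."""
    if not query_tokens or not top_docs:
        raise ValueError("query and feedback documents must be non-empty")
    # Maximum-likelihood model of the original query.
    q_model = {w: c / len(query_tokens)
               for w, c in Counter(query_tokens).items()}
    # Relevance model: weighted mixture of document language models.
    total_weight = sum(weight for _, weight in top_docs)
    rel_model = Counter()
    for doc_tokens, weight in top_docs:
        for w, c in Counter(doc_tokens).items():
            rel_model[w] += (weight / total_weight) * c / len(doc_tokens)
    # Keep only the most probable feedback terms and renormalize.
    top_terms = dict(rel_model.most_common(top_term_qty))
    norm = sum(top_terms.values())
    top_terms = {w: p / norm for w, p in top_terms.items()}
    # Interpolate the two models.
    vocab = set(q_model) | set(top_terms)
    return {w: orig_weight * q_model.get(w, 0.0)
               + (1 - orig_weight) * top_terms.get(w, 0.0)
            for w in vocab}

# Made-up toy query and a single feedback document.
toy_docs = [(["repair", "bike", "chain", "bike", "repair", "bike"], 1.0)]
expanded = rm3_query_model(["fix", "bike"], toy_docs,
                           orig_weight=0.9, top_term_qty=2)
for term, prob in sorted(expanded.items(), key=lambda x: -x[1]):
    print(f"{term}\t{prob:.3f}")
```

With a large original-query weight and very few feedback terms, the expanded query stays close to the original one, which is consistent with RM3 giving no gain over plain BM25 in the run above.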
