# Experimentation: Training & testing fusion models

### Two things to do before we start:
1. Point environment variable `COLLECT_ROOT` to the collection root.
2. Change directory to the location of the installed scripts/binaries.

In [1]:
%env COLLECT_ROOT=/home/leo/flexneuart_collections

env: COLLECT_ROOT=/home/leo/flexneuart_collections


In [2]:
cd /home/leo/flexneuart_scripts/

/home/leo/flexneuart_scripts


## Training a fusion of BM25 and Model1 using the optimal configuration obtained during the fine-tuning step

Training uses only the first 5000 queries from the fusion set:

In [3]:
!./exper/run_experiments.sh \
   wikipedia_dpr_nq_sample \
   exper_desc.best/bm25_model1.json \
   -max_num_query_train 5000 \
   -train_cand_qty 20 \
   -test_part dev

Using collection root: /home/leo/flexneuart_collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/exper_desc.best/bm25_model1.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:feat_exper/bm25_model1
extrTypeFinal:exper_desc.best/extractors/bm25=text+model1=text_bert_tok+lambda=0.3+probSelfTran=0.35.json
testOnly:0
Started a process 9538, working dir: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_model1
Process log file: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_model1/exper.log
Waiting for 1 child processes
Process with pid=9538 finished successfully.
Wait

Top-100 report:

In [4]:
!cat $COLLECT_ROOT/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_model1/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.499600
NDCG@20:  0.535100
NDCG@100: 0.598100
P20:      0.199400
MAP:      0.439000
MRR:      0.584700
Recall:   0.889732
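
The report is a plain list of `name: value` lines, so it is easy to load into a dictionary when you want to compare several runs. The following is a minimal sketch, not a part of the toolkit: the helper name `parse_report` is hypothetical, and the sample text simply repeats the report shown above.

```python
def parse_report(text):
    """Parse 'name: value' report lines into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        # Split on the last colon, so names such as '# of queries' stay intact.
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # blank line or a line without a colon
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # the value is not a number, skip the line
    return metrics


# The sample repeats the top-100 report for the BM25+Model1 fusion.
sample = """# of queries:    2500
NDCG@10:  0.499600
NDCG@20:  0.535100
NDCG@100: 0.598100
P20:      0.199400
MAP:      0.439000
MRR:      0.584700
Recall:   0.889732
"""

report = parse_report(sample)
print(report["NDCG@10"], report["MRR"], report["Recall"])
```

In a notebook, one would read the text from the `out_100.rep` file instead of using an inline string.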


## Training a fusion of (query-normalized) BM25 and BERT-model scores
One needs to start a query server that binds to port 8080, as shown below. This must be done in __a separate terminal__, because notebooks do not support background processes. Note that we have to specify __the same maximum query and document lengths__ as during the training process.

```
COLLECT_ROOT=/home/leo/flexneuart_collections

./featextr_server/nn_rank_server.py  \
   --init_model $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/model.best \
   --port 8080
```
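
Before launching the experiment, it can help to confirm that something is actually listening on the port. The following is a minimal sketch that uses only the Python standard library: the helper name `port_is_open` is hypothetical, and the host `localhost` is an assumption (the server may run on a different machine). A successful TCP connection only shows that the port is open, not that the process behind it is the ranking server.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or the host could not be resolved.
        return False


if __name__ == "__main__":
    # Port 8080 matches the --port value used to start the server.
    print("server reachable:", port_is_open("localhost", 8080))
```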

Now we can run an experiment by training on the `train_fusion` subset of queries and testing on the `dev` subset. Please note the following:
1. During training we use 20 candidates, but for testing on `dev` we re-rank 50 candidates. The ranking of candidates below the 50th position does not change.
2. We use two threads and output the log to the screen (i.e., the process is not started in a separate shell).
3. Training uses only the __first 5000__ queries from the fusion set.
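
The effect of limiting the number of re-ranked candidates (the first note above) can be illustrated as follows. This is only a conceptual sketch and not the toolkit's implementation: the function `rerank_top_k`, the document IDs, and the scores are made up for illustration.

```python
def rerank_top_k(candidates, rescore, k):
    """Re-rank the first k candidates by rescore(doc_id); keep the tail as is.

    candidates: document IDs ordered by the first-stage ranker.
    rescore:    function mapping a document ID to its new score.
    """
    head, tail = candidates[:k], candidates[k:]
    # sorted() is stable, so ties keep their first-stage order.
    head = sorted(head, key=rescore, reverse=True)
    return head + tail


# Hypothetical first-stage ranking and re-ranker scores.
first_stage = ["d1", "d2", "d3", "d4", "d5"]
new_scores = {"d1": 0.1, "d2": 0.9, "d3": 0.5, "d4": 5.0, "d5": 7.0}

# Only the top 3 are reordered: d4 and d5 stay in place despite high scores.
print(rerank_top_k(first_stage, new_scores.get, k=3))
```

A smaller `k` reduces the number of expensive model calls per query, but a relevant document ranked below position `k` by the first stage cannot move up.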

In [8]:
!./exper/run_experiments.sh \
   wikipedia_dpr_nq_sample \
   exper_desc.best/bm25_cedr8080.json \
   -max_num_query_train 5000 \
   -train_cand_qty 20 \
   -max_final_rerank_qty 50 \
   -test_part dev \
   -thread_qty 2 \
   -no_separate_shell

Using collection root: /home/leo/flexneuart_collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        2
Experiment descriptor file:                                 /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/exper_desc.best/bm25_cedr8080.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 2
Parsed experiment parameters:
experSubdir:feat_exper/bm25_cedr8080
extrTypeFinal:exper_desc.best/extractors/bm25_cedr8080.json
testOnly:0
Experimental directory already exists (ignoring): /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_cedr8080
Waiting for 0 child processes
0 experiments executed
0 experiments failed


All the results are available in the directory `$COLLECT_ROOT/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_cedr8080`.

The following is a summary report (top-100):

In [7]:
!cat $COLLECT_ROOT/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_cedr8080/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.579000
NDCG@20:  0.590100
NDCG@100: 0.630900
P20:      0.192400
MAP:      0.504800
MRR:      0.687700
Recall:   0.808185


## Training a fusion of BM25 and dense embeddings (ANCE)

In [9]:
!./exper/run_experiments.sh \
   wikipedia_dpr_nq_sample \
   exper_desc.best/bm25_ance.json \
   -max_num_query_train 5000 \
   -train_cand_qty 20 \
   -test_part dev


Using collection root: /home/leo/flexneuart_collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/exper_desc.best/bm25_ance.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:feat_exper/bm25_ance
candProvAddConf:exper_desc.best/lucene.json
extrTypeFinal:exper_desc.best/extractors/bm25_ance.json
testOnly:0
Started a process 10057, working dir: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_ance
Process log file: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_ance/exper.log
Waiting for 1 child processes
Process with pid=10057 finished successfully.
Waiting for 0 ch

Top-100 results:

In [10]:
!cat $COLLECT_ROOT/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_ance/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.655400
NDCG@20:  0.657200
NDCG@100: 0.698200
P20:      0.155500
MAP:      0.561800
MRR:      0.865900
Recall:   0.652528


## Training a fusion of BM25 and dense embeddings (averaged GloVe embeddings)

In [11]:
!./exper/run_experiments.sh \
   wikipedia_dpr_nq_sample \
   exper_desc.best/bm25_avgembed.json \
   -max_num_query_train 5000 \
   -train_cand_qty 20 \
   -test_part dev


Using collection root: /home/leo/flexneuart_collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/exper_desc.best/bm25_avgembed.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:feat_exper/bm25_avgembed
candProvAddConf:exper_desc.best/lucene.json
extrTypeFinal:exper_desc.best/extractors/bm25_avgembed.json
testOnly:0
Started a process 10252, working dir: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_avgembed
Process log file: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_avgembed/exper.log
Waiting for 1 child processes
Process with pid=10252 finished successful

Top-100 results:

In [13]:
!cat $COLLECT_ROOT/wikipedia_dpr_nq_sample/results/dev/feat_exper/bm25_avgembed/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.402400
NDCG@20:  0.434900
NDCG@100: 0.507700
P20:      0.165600
MAP:      0.345700
MRR:      0.486800
Recall:   0.819776


## Notes on efficient feature generation
1. Generating certain features, in particular scores from large models, is expensive. However, these features are not always useful. For example, if you train two similar models, combining them in an ensemble can bring no improvement and can even cause minor degradation.
2. Furthermore, the outcome of applying the coordinate ascent model is affected by randomness. Running the full training and testing pipeline with different sets of features can be extremely expensive.
3. This can be avoided with some effort. To this end, one needs to set the flag `trainOnly` in the experimental descriptor (see the documentation above) to `true` and run the experimental pipeline twice using **different** values of the parameters `-train_cand_qty` and `-train_part`. In the first run, you specify your actual training set. In the second run, the option `-train_part` should point to your test/validation set. Usually, I use a smaller number of candidates (`-train_cand_qty`) when I generate features for the training part.
4. Note that in the training-only mode, there is no need to specify the test part.
5. As a result, you will have two sets of features that can be used for training and validation. Use logs to locate these features (eventually they end up in the sub-directory `letor`).
6. Now one can use the RankLib library directly. Check out the options of this library by executing:
```
java -jar lib/RankLib.jar 
```
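
Once both feature files exist, a training run on precomputed features might be assembled as in the sketch below. This is an illustration under assumptions rather than a recipe from the toolkit: the helper `ranklib_train_cmd` and all file names are hypothetical, and the flags (`-train`, `-validate`, `-ranker`, `-metric2t`, `-save`, `-feature`) are standard RankLib options as I understand them, so please verify them against the help output of the command above. The sketch only builds and prints the command.

```python
import shlex


def ranklib_train_cmd(train_file, valid_file, model_file,
                      metric="NDCG@10", feature_file=None,
                      jar="lib/RankLib.jar"):
    """Build a RankLib command line for training on precomputed features."""
    cmd = [
        "java", "-jar", jar,
        "-train", train_file,
        "-validate", valid_file,
        "-ranker", "4",          # 4 selects Coordinate Ascent in RankLib
        "-metric2t", metric,     # metric optimized on the training data
        "-save", model_file,
    ]
    if feature_file is not None:
        # Optional file listing the feature IDs to use, one per line.
        cmd += ["-feature", feature_file]
    return cmd


# Hypothetical file names: take the real ones from the experiment logs.
cmd = ranklib_train_cmd("letor/train.feat", "letor/dev.feat", "fusion.model")
print(shlex.join(cmd))
```

Restricting training to a subset of features through a feature list makes it possible to compare different feature combinations without regenerating the expensive features.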