# Experimentation: Testing models that do not require training a fusion model (no learning to rank)

### First, we need to move to the top-level directory

In [None]:
cd ../..

## Testing BM25
We use the optimal BM25 parameters obtained during tuning:

In [2]:
!scripts/exper/run_experiments.sh \
  wikipedia_dpr_nq_sample \
  exper_desc.best/bm25.json \
  -test_part dev

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/bm25.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/bm25
candProvAddConfParam:exper_desc.best/lucene.json
extrType:exper_desc.best/extractors/bm25.json
modelFinal:exper_desc.best/models/one_feat.model
testOnly:1
Experimental directory already exists (ignoring): collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25
Waiting for 0 child processes
0 experiments executed
0 experiments failed


All the results are available in the directory `collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25`. 

The following is a summary report (top-100):

In [3]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/bm25/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.400400
NDCG@20:  0.433900
NDCG@100: 0.507400
P20:      0.164800
MAP:      0.346200
MRR:      0.487300
Recall:   0.817879
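
The report is a plain list of `name: value` lines, so it is easy to load into Python for further analysis. The following is a minimal sketch, not part of the toolkit: the helper name `parse_report` is hypothetical, and it assumes every report line has the `name: value` shape shown above.

```python
def parse_report(text):
    """Parse 'name: value' lines of a summary report into a dict."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value:
            continue  # skip blank lines and lines without a value
        # The query count is an integer; all other values are floats.
        metrics[name.strip()] = int(value) if value.isdigit() else float(value)
    return metrics


# The report text printed above, pasted verbatim.
sample = """# of queries:    2500
NDCG@10:  0.400400
NDCG@20:  0.433900
NDCG@100: 0.507400
P20:      0.164800
MAP:      0.346200
MRR:      0.487300
Recall:   0.817879
"""

bm25 = parse_report(sample)
print(bm25["NDCG@10"], bm25["Recall"])
```

In practice one would read the text from the `out_100.rep` file instead of pasting it; a value that is not numeric raises a `ValueError` in this sketch.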


## Testing dense retrieval (ANCE) in the re-ranking mode

In [4]:
!scripts/exper/run_experiments.sh \
  wikipedia_dpr_nq_sample \
  exper_desc.best/ance.json \
  -test_part dev

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/ance.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/ance
candProvAddConfParam:exper_desc.best/lucene.json
extrType:exper_desc.best/extractors/ance.json
modelFinal:exper_desc.best/models/one_feat.model
testOnly:1
Experimental directory already exists (ignoring): collections/wikipedia_dpr_nq_sample/results/dev/final_exper/ance
Waiting for 0 child processes
0 experiments executed
0 experiments failed


Top-100 report:

In [5]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/ance/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.649200
NDCG@20:  0.651800
NDCG@100: 0.692400
P20:      0.152300
MAP:      0.555200
MRR:      0.865100
Recall:   0.639348


## Testing dense retrieval (averaged GloVe embeddings) in the re-ranking mode

In [7]:
!scripts/exper/run_experiments.sh \
  wikipedia_dpr_nq_sample \
  exper_desc.best/avgembed.json \
  -test_part dev

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        8
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/avgembed.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 8
Parsed experiment parameters:
experSubdir:final_exper/avgembed
candProvAddConfParam:exper_desc.best/lucene.json
extrType:exper_desc.best/extractors/avgembed.json
modelFinal:exper_desc.best/models/one_feat.model
testOnly:1
Started a process 4502, working dir: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/avgembed
Process log file: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/avgembed/exper.log
Waiting for 1 child processes
Process with pid=4502 finished successfully.
Waiting for 0 child processes
1 experiments executed
0 

Top-100 report:

In [9]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/avgembed/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.144400
NDCG@20:  0.157400
NDCG@100: 0.216500
P20:      0.067500
MAP:      0.101900
MRR:      0.225200
Recall:   0.419530


## Testing a BERT ranking model
Start a query server that binds to port 8080, as shown below. This must be done in __a separate terminal__, because notebooks do not support background processes. Note that we have to specify __the same maximum query and document lengths__ as were used during training.

```
scripts/py_featextr_server/cedr_server.py  \
   --init_model collections/wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/model.best \
   --max_query_len 64 \
   --max_doc_len 445 \
   --port 8080
```
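
Before launching the experiment, it can help to confirm that something is actually listening on the port. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical, and it assumes the server runs on the same machine as the notebook. It only checks that a TCP connection can be opened, not that the model has finished loading.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Assumption: the query server was started locally on port 8080.
print(port_is_open("127.0.0.1", 8080))
```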

Note that we ask to re-rank only the top 50 candidates. The ranking of candidates below the 50th position does not change.
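
The effect of limiting the re-ranking depth can be illustrated with a toy example. This is a sketch of the general idea only, not the toolkit's implementation: `rerank_top_k`, the document IDs, and the scores are all hypothetical.

```python
def rerank_top_k(candidates, score_fn, k):
    """Re-order the first k candidates by score_fn (descending); keep the rest as is."""
    head = sorted(candidates[:k], key=score_fn, reverse=True)
    return head + candidates[k:]


# Candidates in the order produced by the first-stage retriever.
candidates = ["d1", "d2", "d3", "d4", "d5"]
# Hypothetical scores assigned by the re-ranking model.
scores = {"d1": 0.1, "d2": 0.9, "d3": 0.5, "d4": 1.0, "d5": 0.7}

reranked = rerank_top_k(candidates, scores.get, k=3)
print(reranked)
```

With `k=3`, only `d1`, `d2`, and `d3` are re-ordered. `d4` keeps its fourth position even though it has the highest score, because it lies below the cutoff.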

In [8]:
!scripts/exper/run_experiments.sh \
  wikipedia_dpr_nq_sample \
  exper_desc.best/cedr8080.json \
  -thread_qty 2 \
  -max_final_rerank_qty 50 \
  -test_part dev

Using collection root: collections
The number of CPU cores:      8
The number of || experiments: 1
The number of threads:        2
Experiment descriptor file:                                 collections/wikipedia_dpr_nq_sample/exper_desc.best/cedr8080.json
Default test set:                                           dev
Number of parallel experiments:                             1
Number of threads in feature extractors/query applications: 2
Parsed experiment parameters:
experSubdir:final_exper/cedr8080
extrType:exper_desc.best/extractors/cedr8080.json
modelFinal:exper_desc.best/models/one_feat.model
testOnly:1
Started a process 10654, working dir: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/cedr8080
Process log file: collections/wikipedia_dpr_nq_sample/results/dev/final_exper/cedr8080/exper.log
Waiting for 1 child processes
Process with pid=10654 finished successfully.
Waiting for 0 child processes
1 experiments executed
0 experiments failed


Top-100 report:

In [9]:
!cat collections/wikipedia_dpr_nq_sample/results/dev/final_exper/cedr8080/rep/out_100.rep

# of queries:    2500
NDCG@10:  0.560500
NDCG@20:  0.575100
NDCG@100: 0.619300
P20:      0.188900
MAP:      0.487700
MRR:      0.671100
Recall:   0.808288
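
To compare the four runs side by side, the reported numbers can be collected into a small table. The following sketch copies the values from the top-100 reports above; the dictionary layout and the choice to sort by NDCG@10 are illustrative assumptions.

```python
# Metric values copied from the top-100 reports in this notebook.
results = {
    "bm25":     {"NDCG@10": 0.4004, "MAP": 0.3462, "MRR": 0.4873, "Recall": 0.817879},
    "ance":     {"NDCG@10": 0.6492, "MAP": 0.5552, "MRR": 0.8651, "Recall": 0.639348},
    "avgembed": {"NDCG@10": 0.1444, "MAP": 0.1019, "MRR": 0.2252, "Recall": 0.419530},
    "cedr8080": {"NDCG@10": 0.5605, "MAP": 0.4877, "MRR": 0.6711, "Recall": 0.808288},
}
metrics = ["NDCG@10", "MAP", "MRR", "Recall"]

# Order the runs from best to worst NDCG@10.
ranked = sorted(results, key=lambda run: results[run]["NDCG@10"], reverse=True)

print(f"{'run':<10}" + "".join(f"{m:>10}" for m in metrics))
for run in ranked:
    print(f"{run:<10}" + "".join(f"{results[run][m]:>10.4f}" for m in metrics))
```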
