# Experimentation: Tuning BM25 & Model 1

Experiments are described using simple experimental descriptors, which we store in the directory `collections/wikipedia_dpr_nq_sample/exper_desc`. The provided tarball with data includes some key experimental descriptors. However, it does not include descriptors for tuning BM25 or the fusion of BM25 with other models.

### First we need to move to the top-level directory ...

In [1]:
cd ../..

/home/leo/SourceTreeGit/FlexNeuART.refact2021


### ... and create a directory to store experiment descriptors:

In [2]:
!mkdir -p collections/wikipedia_dpr_nq_sample/exper_desc

## Tuning BM25
The tuning procedure simply runs a number of descriptor files,
each with different BM25 parameters. To create the descriptors, run:

In [3]:
!scripts/gen_exper_desc/gen_bm25_tune_json_desc.py \
  --index_field_name text \
  --query_field_name text \
  --outdir collections/wikipedia_dpr_nq_sample/exper_desc/ \
  --exper_subdir tuning \
  --rel_desc_path exper_desc

Namespace(exper_subdir='tuning', index_field_name='text', outdir='collections/wikipedia_dpr_nq_sample/exper_desc/', query_field_name='text', rel_desc_path='exper_desc')
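
To make the idea of a parameter sweep concrete, here is a minimal Python sketch that enumerates a grid of BM25 settings, one configuration per `(k1, b)` pair. The function `bm25_grid` and the grid values are hypothetical and chosen only for illustration; the actual values are decided by `gen_bm25_tune_json_desc.py`. The naming pattern mirrors the result sub-directory names reported later in this notebook (e.g., `bm25tune_k1=0.6_b=0.3`).

```python
from itertools import product


def bm25_grid(k1_values, b_values):
    """Return one (name, params) pair per BM25 configuration in the grid."""
    configs = []
    for k1, b in product(k1_values, b_values):
        # The name pattern is an assumption modeled on the result sub-dir names.
        name = f"bm25tune_k1={k1}_b={b}"
        configs.append((name, {"k1": k1, "b": b}))
    return configs


# Hypothetical grid values, used only for illustration.
grid = bm25_grid([0.4, 0.6, 0.8], [0.3, 0.5, 0.75])
for name, params in grid:
    print(name, params)
```

Each entry corresponds to one experiment in the sweep, so the cost of tuning grows with the product of the two value lists.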


BM25 experiments need a dummy one-feature model, which can be created like this:

In [17]:
!mkdir -p collections/wikipedia_dpr_nq_sample/exper_desc/models

In [18]:
!cp collections/wikipedia_dpr_nq_sample/exper_desc/models/one_feat.model  collections/wikipedia_dpr_nq_sample/exper_desc/models

cp: collections/wikipedia_dpr_nq_sample/exper_desc/models/one_feat.model: No such file or directory


The main experimental descriptor is stored in `collections/wikipedia_dpr_nq_sample/exper_desc/bm25tune_text_text.json`,
whereas auxiliary descriptors are stored in `collections/wikipedia_dpr_nq_sample/exper_desc/bm25tune_text_text/`.

Now we can run tuning experiments where we train on `train` and test on `dev`:

In [None]:
!scripts/exper/run_experiments.sh \
  wikipedia_dpr_nq_sample \
  exper_desc/bm25tune_text_text.json \
  -test_part dev

By default, experiments run in the background, and more than one experiment can be running at once. For debugging purposes, however, one can run experiments in the foreground by specifying the option `-no_separate_shell`.

Furthermore, the script `scripts/exper/run_experiments.sh` has a number of parameters that may be worth tweaking. In particular, for "shallow" relevance pools, the default number of candidates (which is small) is sufficient. However, for queries with many relevance judgments, it makes sense to slightly increase the number of top candidate entries used to train a fusion model (parameter `-train_cand_qty`).

Now, let us collect the experimental results and find the best configuration with respect to the Mean Average Precision (MAP), which should be close to 0.3501:

In [8]:
!scripts/report/get_exper_results.sh \
  wikipedia_dpr_nq_sample \
  exper_desc/bm25tune_text_text.json \
  bm25tune.tsv \
  -test_part dev \
  -flt_cand_qty 250 \
  -print_best_metr map

Using collection root: collections
Including only runs that generated 250 candidate records
Best results for metric map:
Value: 0.350100
Result sub-dir: tuning/bm25tune_text_text/bm25tune_k1=0.6_b=0.3
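
For readers unfamiliar with the metric, the following is a minimal Python sketch of the textbook definition of MAP. It is not the evaluation code used by the reporting script, whose details may differ; the function names and the toy data are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of precision values at the ranks where relevant documents occur."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the total number of relevant documents for the query.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-query average precision over all queries in `runs`."""
    return sum(average_precision(runs[q], qrels.get(q, ())) for q in runs) / len(runs)


# Toy data (hypothetical): ranked lists and relevance judgments per query.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
map_value = mean_average_precision(runs, qrels)
print(round(map_value, 4))
```

Because each relevant document contributes the precision at its rank, moving relevant documents higher in the ranking raises MAP, which is why the metric is sensitive to the BM25 parameters.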


## Tuning a fusion of IBM Model 1 and BM25

IBM Model 1 has quite a few parameters and can benefit from tuning as well.
Rather than tuning IBM Model 1 alone, we tune its fusion with the BM25 score for the field
`text`. Here we use the optimal BM25 coefficients __obtained in the previous experiment__.
Model 1 descriptors are going to be created for the field `text_bert_tok`:

In [4]:
!scripts/gen_exper_desc/gen_model1_exper_json_desc.py \
  -k1 0.6 -b 0.3  \
  --exper_subdir tuning \
  --query_field_name text_bert_tok \
  --index_field_name text_bert_tok \
  --outdir collections/wikipedia_dpr_nq_sample/exper_desc/ \
  --rel_desc_path exper_desc

Namespace(b=0.3, exper_subdir='tuning', index_field_name='text_bert_tok', k1=0.6, outdir='collections/wikipedia_dpr_nq_sample/exper_desc/', query_field_name='text_bert_tok', rel_desc_path='exper_desc')
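
As a rough illustration of what fusing two scores means, the sketch below combines per-document BM25 and Model 1 scores with a weighted sum and re-ranks the candidates. It assumes a linear fusion; the function `fuse_scores`, the weights, and the scores are hypothetical. In an actual experiment the weights are learned from the training part rather than set by hand.

```python
def fuse_scores(bm25_scores, model1_scores, w_bm25, w_model1):
    """Combine per-document scores with a weighted sum and rank documents."""
    doc_ids = set(bm25_scores) | set(model1_scores)
    fused = {
        # A document missing from one score table contributes 0 for that feature.
        d: w_bm25 * bm25_scores.get(d, 0.0) + w_model1 * model1_scores.get(d, 0.0)
        for d in doc_ids
    }
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores for three candidate documents.
bm25 = {"d1": 12.0, "d2": 10.0, "d3": 8.0}
model1 = {"d1": -9.0, "d2": -4.0, "d3": -6.0}
ranked = fuse_scores(bm25, model1, w_bm25=1.0, w_model1=1.0)
print([doc_id for doc_id, _ in ranked])
```

In this toy example `d1` has the highest BM25 score, yet `d2` ends up first after fusion, which shows how the second feature can reorder the BM25 candidates.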


Now we can run tuning experiments where we train on `train_fusion` and test on `dev` (or `dev_official` if necessary):

In [None]:
!scripts/exper/run_experiments.sh \
  wikipedia_dpr_nq_sample \
  exper_desc/model1tune_text_bert_tok_text_bert_tok.json \
  -test_part dev -train_part train_fusion

Obtaining the best configuration with respect to MAP:

In [16]:
!scripts/report/get_exper_results.sh \
  wikipedia_dpr_nq_sample \
  exper_desc/model1tune_text_bert_tok_text_bert_tok.json \
  model1_text_bert_tok_tune.tsv \
  -test_part dev \
  -flt_cand_qty 250 \
  -print_best_metr map

Using collection root: collections
Including only runs that generated 250 candidate records
Best results for metric map:
Value: 0.442200
Result sub-dir: tuning/model1tune_text_bert_tok_text_bert_tok/bm25=text+model1=text_bert_tok+lambda=0.3+probSelfTran=0.35
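
The last component of each reported result sub-directory encodes the winning parameter values. The following Python sketch extracts them so that they can be passed on to later steps, such as the `-k1` and `-b` options used above. The helper `parse_config_name` is hypothetical, and it assumes that parameters appear as `key=value` pairs joined by a single separator, as in the two names printed in this notebook.

```python
def parse_config_name(name, sep):
    """Extract key=value pairs from a configuration name into a dict."""
    params = {}
    for part in name.split(sep):
        if "=" not in part:
            continue  # Skip prefixes such as "bm25tune".
        key, value = part.split("=", 1)
        try:
            params[key] = float(value)
        except ValueError:
            params[key] = value  # Keep non-numeric values (e.g., field names) as strings.
    return params


bm25_best = parse_config_name("bm25tune_k1=0.6_b=0.3", sep="_")
model1_best = parse_config_name(
    "bm25=text+model1=text_bert_tok+lambda=0.3+probSelfTran=0.35", sep="+"
)
print(bm25_best)
print(model1_best)
```

The separator is passed explicitly because the Model 1 name uses `+` between pairs while its field names contain underscores.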
