# Training a neural model: vanilla BERT ranker

### Two things to do before we start:
1. Point environment variable `COLLECT_ROOT` to the collection root.
2. Change directory to the location of the installed scripts/binaries.

In [1]:
%env COLLECT_ROOT=/home/leo/flexneuart_collections

env: COLLECT_ROOT=/home/leo/flexneuart_collections


In [2]:
cd /home/leo/flexneuart_scripts/

/home/leo/flexneuart_scripts


Training requires exporting data in the format (with a slight modification) of the 
CEDR framework ([MacAvaney et al., 2019](https://github.com/Georgetown-IR-Lab/cedr)).

The following command generates training data in the CEDR format for the collection `wikipedia_dpr_nq_sample`
and the field `text_raw`. The training data is generated from the split `bitext`, 
whereas the split `dev` is used to generate validation data. During export, one can generate negatives of three types: hard (top-K entries), medium (a sample from the top-K entries), and easy (sampled from the whole collection). Typically, hard and easy negatives are not particularly useful, so the command below requests only medium negatives:

In [5]:
!./export_train/export_cedr.sh \
  wikipedia_dpr_nq_sample \
  text_raw \
  bitext \
  dev \
  -out_subdir cedr_train/text_raw \
  -cand_train_qty 50 \
  -cand_test_qty 50 \
  -thread_qty 4 \
  -hard_neg_qty 0 \
  -sample_easy_neg_qty 0 \
  -sample_med_neg_qty 3 \
  -max_num_query_test 5000 \
  -cand_prov lucene \
  -cand_prov_add_conf exper_desc.best/lucene.json

Using collection root: /home/leo/flexneuart_collections
Collection directory:      /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample
Train split: bitext
Eval split: dev
Random seed: 0
Output directory: /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/derived_data/cedr_train/text_raw/
# of threads: 4
Index export field: text_raw
Query export field: text_raw
Candidate provider parameters:  -cand_prov "lucene" -u "lucene_index"  -cand_prov_add_conf "exper_desc.best/lucene.json"
Resource parameters: -collect_dir /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample  -fwd_index_dir forward_index/ -embed_dir derived_data/embeddings/ -model1_dir derived_data/giza 
A # of hard/medium/easy samples per query: 0/3/0
A max. # of candidate records to generate training data: 50
A max. # of candidate records to generate test data: 50
Max train query # param.: 
Max test/dev query # param.:  -max_num_query_test 5000 
Case handling param: 
JAVA_OPTS=-Xms4117329k -server
[main] INFO edu
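
The three kinds of negatives described above can be illustrated with a minimal Python sketch. This is not the exporter's actual code: the function `sample_negatives`, its arguments, and the choice to draw medium negatives only from top-K entries not already taken as hard negatives are all assumptions made for illustration. The default quantities mirror the flags used in the command above (0 hard, 3 medium, 0 easy, 50 candidates).

```python
import random


def sample_negatives(ranked_ids, all_ids, relevant_ids,
                     hard_qty=0, med_qty=3, easy_qty=0,
                     cand_qty=50, seed=0):
    """Illustrative only: pick hard/medium/easy negatives for one query."""
    rng = random.Random(seed)
    relevant = set(relevant_ids)

    # Non-relevant entries among the top-K candidates, in ranked order.
    top_k_neg = [d for d in ranked_ids[:cand_qty] if d not in relevant]

    # Hard: the highest-ranked non-relevant candidates.
    hard = top_k_neg[:hard_qty]

    # Medium: a random sample from the remaining top-K non-relevant candidates
    # (excluding the hard ones is an assumption of this sketch).
    med_pool = [d for d in top_k_neg if d not in hard]
    medium = rng.sample(med_pool, min(med_qty, len(med_pool)))

    # Easy: a random sample from the whole collection, skipping used entries.
    used = relevant | set(hard) | set(medium)
    easy_pool = [d for d in all_ids if d not in used]
    easy = rng.sample(easy_pool, min(easy_qty, len(easy_pool)))

    return hard, medium, easy


# Hypothetical toy data: 1000 documents, the first 50 are the retrieved candidates.
all_ids = [f"doc{i}" for i in range(1000)]
ranked = all_ids[:50]
hard, medium, easy = sample_negatives(ranked, all_ids, relevant_ids={"doc3"})
print(hard, medium, easy)
```

With the default quantities, only the medium list is non-empty, which matches the `0/3/0` line in the log above.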

To train the model, we can use a convenience wrapper script that reads most parameters from a configuration file. 

Note that the following ``train_model.sh`` script assumes that the training data path is **relative** to the ``derived_data`` subdirectory, while other paths are **relative** to the collection root. The training script has a number of options (list them by running the script with the option ``-h``). Here is how one can run the training script (remember that this requires a GPU and PyTorch with CUDA support). By default, the script validates after each epoch, but this behavior can be changed:

In [None]:
!./train_nn/train_model.sh \
    wikipedia_dpr_nq_sample \
    cedr_train/text_raw \
     vanilla_bert \
     -seed 0 \
     -add_exper_subdir todays_experiment \
     -json_conf  model_conf/vanilla_bert.json \
     -epoch_qty 3
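
To make the path convention for the training data concrete, here is a small Python sketch that resolves the second positional argument (`cedr_train/text_raw`) against the `derived_data` subdirectory of the collection. The helper `train_data_dir` is hypothetical and not part of the scripts; it only reproduces the output directory reported by the export step above.

```python
import os
from pathlib import PurePosixPath


def train_data_dir(collection, train_subdir, collect_root=None):
    """Hypothetical helper: resolve a training data path relative to derived_data."""
    # Fall back to the COLLECT_ROOT environment variable set at the top of the notebook.
    root = PurePosixPath(collect_root or os.environ["COLLECT_ROOT"])
    return root / collection / "derived_data" / train_subdir


resolved = train_data_dir("wikipedia_dpr_nq_sample",
                          "cedr_train/text_raw",
                          collect_root="/home/leo/flexneuart_collections")
print(resolved)
```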

The script runs both training and evaluation. The respective statistics are stored in a JSON file:

In [None]:
!cat $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/todays_experiment/base/0/train_stat.json 

## It is possible to train a neural model in a fusion mode.

Here, we optimize for the neural model score fused with the score of a candidate generator. This requires knowing a good weight for the candidate generator score. 
We assume that a weight of 1.0 is good enough and export data as shown in the next cell. Please note the parameter `cand_train_4pos_qty`, which controls the depth of the pool from which we select positive examples. We normally want this pool to be larger than the pool from which we select negative examples:

In [None]:
!./export_train/export_cedr_with_scores.sh \
  wikipedia_dpr_nq_sample \
  text_raw \
  bitext \
  dev \
  -out_subdir cedr_train_with_scores/text_raw \
  -cand_train_qty 50 \
  -cand_test_qty 50 \
  -cand_train_4pos_qty 1000 \
  -thread_qty 4 \
  -hard_neg_qty 0 \
  -sample_easy_neg_qty 0 \
  -sample_med_neg_qty 3 \
  -max_num_query_test 5000 \
  -cand_prov lucene \
  -cand_prov_add_conf exper_desc.best/lucene.json
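
The idea of fusing the two scores can be sketched in a few lines of Python. The sketch assumes that fusion is a simple weighted sum of the neural score and the candidate generator score, which is an assumption of this example rather than a statement about the framework's internals. The names `fused_score`, `rerank`, and `cand_weight`, as well as all score values, are hypothetical.

```python
def fused_score(neural_score, cand_score, cand_weight=1.0):
    """Assumed fusion rule: neural score plus weighted candidate generator score."""
    return neural_score + cand_weight * cand_score


def rerank(candidates, cand_weight=1.0):
    """Sort (doc_id, neural_score, cand_score) triples by fused score, best first."""
    ordered = sorted(candidates,
                     key=lambda c: fused_score(c[1], c[2], cand_weight),
                     reverse=True)
    return [doc_id for doc_id, _, _ in ordered]


# Made-up scores for three candidate documents.
candidates = [("d1", 0.25, 3.0), ("d2", 1.5, 1.0), ("d3", 0.5, 2.5)]

print(rerank(candidates, cand_weight=1.0))  # fused ranking
print(rerank(candidates, cand_weight=0.0))  # neural score only
```

With a zero weight the ranking depends on the neural score alone, which is why the fusion setup needs both a non-zero weight and training data that carries the candidate generator scores.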

__Importantly__, to train a model in this mode we:
1. Use a different configuration file (`model_conf/vanilla_bert_with_scores.json`) that sets candidate provider weights to be non-zero.
2. Use the newly generated training data, which includes the exported scores (`cedr_train_with_scores/text_raw`).

In [None]:
!./train_nn/train_model.sh \
    wikipedia_dpr_nq_sample \
    cedr_train_with_scores/text_raw \
     vanilla_bert \
     -seed 0 \
     -add_exper_subdir todays_experiment_with_scores \
     -json_conf  model_conf/vanilla_bert_with_scores.json \
     -epoch_qty 2

The training and testing statistics can be found in this JSON:

In [7]:
!cat $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/todays_experiment/0/train_stat.json 

{
    "0": {
        "loss": 0.35068941748553767,
        "score": 0.6587945500134988,
        "lr": 0.0002,
        "bert_lr": 2e-05,
        "train_time": 3100.7810344696045,
        "validation_time": 1031.9662160873413
    },
    "1": {
        "loss": 0.25856785646238056,
        "score": 0.6723500292783949,
        "lr": 0.00019,
        "bert_lr": 1.9e-05,
        "train_time": 3135.6423485279083,
        "validation_time": 1037.5510256290436
    }
}
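
A statistics file like the one above can be summarized with a short Python sketch. It assumes that the top-level keys are epoch numbers and that a higher `score` is better; neither is stated explicitly here, so treat both as assumptions. The helper `best_epoch` is hypothetical. The JSON is embedded as a string to keep the example self-contained; in practice one would read the `train_stat.json` file with `json.load`.

```python
import json

# Contents copied from the output above.
STATS_JSON = """
{
    "0": {"loss": 0.35068941748553767, "score": 0.6587945500134988,
          "lr": 0.0002, "bert_lr": 2e-05,
          "train_time": 3100.7810344696045, "validation_time": 1031.9662160873413},
    "1": {"loss": 0.25856785646238056, "score": 0.6723500292783949,
          "lr": 0.00019, "bert_lr": 1.9e-05,
          "train_time": 3135.6423485279083, "validation_time": 1037.5510256290436}
}
"""


def best_epoch(stats):
    """Return the epoch key with the highest validation score (assumed higher is better)."""
    return max(stats, key=lambda epoch: stats[epoch]["score"])


stats = json.loads(STATS_JSON)
best = best_epoch(stats)

# Total wall-clock time spent on training and validation, in hours.
total_hours = sum(s["train_time"] + s["validation_time"] for s in stats.values()) / 3600

print(f"best epoch: {best}, score: {stats[best]['score']:.4f}")
print(f"total time: {total_hours:.2f} h")
```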