# Training a neural model: vanilla BERT ranker

### First we need to move to the top-level directory.

In [None]:
cd ../..

Training requires exporting data in the format of the CEDR framework
([MacAvaney et al., 2019](https://github.com/Georgetown-IR-Lab/cedr)), with a slight modification.

The following command
generates training data in the CEDR format for the collection `wikipedia_dpr_nq_sample`
and the field `text_raw`. The training data is generated from the split `bitext`,
whereas the split `dev` is used to generate validation data. During export, one can generate negatives of three types: hard (top-K entries), medium (a sample from the top-K entries), and easy (sampled from the whole collection). Typically, hard and easy negatives are not particularly useful, so the command below requests only medium negatives:

In [9]:
!scripts/export_train/export_cedr.sh \
  wikipedia_dpr_nq_sample \
  text_raw \
  bitext \
  dev \
  -out_subdir cedr_train/text_raw \
  -cand_train_qty 50 \
  -cand_test_qty 50 \
  -thread_qty 4 \
  -hard_neg_qty 0 \
  -sample_easy_neg_qty 0 \
  -sample_med_neg_qty 3 \
  -max_num_query_test 5000 \
  -cand_prov lucene \
  -cand_prov_add_conf exper_desc.best/lucene.json

Using collection root: collections
Train split: bitext
Eval split: dev
Random seed: 0
Output directory: collections/wikipedia_dpr_nq_sample/derived_data/cedr_train/text_raw/
# of threads: 4
Index export field: text_raw
Query export field: text_raw
Candidate provider parameters:  -cand_prov "lucene" -u "collections/wikipedia_dpr_nq_sample/lucene_index"  -cand_prov_add_conf "collections/wikipedia_dpr_nq_sample/exper_desc.best/lucene.json"
Resource parameters: -fwd_index_dir "collections/wikipedia_dpr_nq_sample/forward_index/" -embed_dir "collections/wikipedia_dpr_nq_sample/derived_data/embeddings/" -giza_root_dir "collections/wikipedia_dpr_nq_sample/derived_data/giza" 
A # of hard/medium/easy samples per query: 0/3/0
A max. # of candidate records to generate training data: 50
A max. # of candidate records to generate test data: 50
Max train query # param.: 
Max test/dev query # param.:  -max_num_query_test 5000 
Case handling param: 
JAVA_OPTS=-Xms16469316k -Xmx28821303k -server
[main] I
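
The three types of negatives described above can be illustrated with a minimal Python sketch. It is not the export script's actual implementation: the function `sample_negatives`, its arguments, and the choice to draw medium negatives from candidates not already taken as hard ones are assumptions made for illustration only.

```python
import random

def sample_negatives(ranked_cands, relevant, collection_ids,
                     hard_qty, med_qty, easy_qty, rng):
    """Return (hard, medium, easy) lists of negative document IDs.

    Hypothetical helper: it only illustrates the three sampling modes.
    """
    # Non-relevant candidates, kept in rank order.
    non_rel = [d for d in ranked_cands if d not in relevant]
    # Hard: the highest-ranked non-relevant candidates.
    hard = non_rel[:hard_qty]
    # Medium: a uniform sample from the remaining top-K candidates.
    rest = non_rel[hard_qty:]
    medium = rng.sample(rest, min(med_qty, len(rest)))
    # Easy: a uniform sample from the whole collection.
    pool = [d for d in collection_ids if d not in relevant]
    easy = rng.sample(pool, min(easy_qty, len(pool)))
    return hard, medium, easy

rng = random.Random(0)
collection_ids = [f"doc{i}" for i in range(1000)]  # toy collection
ranked_cands = collection_ids[:50]                 # stand-in for a top-50 candidate list
relevant = {"doc3", "doc17"}                       # toy relevance judgments

# Same quantities as in the export command: 0 hard, 3 medium, 0 easy.
hard, medium, easy = sample_negatives(
    ranked_cands, relevant, collection_ids, 0, 3, 0, rng)
print(hard, medium, easy)
```

With these settings, every negative comes from the top-50 candidate list, which corresponds to `-cand_train_qty 50` and `-sample_med_neg_qty 3` in the command above.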

To train the model we can use a wrapper convenience script that reads most parameters from a configuration file. 

Note that the following ``train_model.sh`` script assumes that the training data path is **relative** to the ``derived_data`` subdirectory, while other paths are **relative** to the collection root. The training script has a number of options (check them out by running it with the option ``-h``). Here is how one can run the training script (remember that this requires a GPU and PyTorch with CUDA support). By default, the script validates after each epoch, but this behavior can be changed:

In [None]:
!scripts/cedr/train_model.sh \
    wikipedia_dpr_nq_sample \
    cedr_train/text_raw \
    vanilla_bert \
    -seed 0 \
    -add_exper_subdir todays_experiment \
    -json_conf model_conf/vanilla_bert.json \
    -epoch_qty 4
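
The path conventions mentioned above can be made concrete with a short sketch. The helper `resolve_paths` is hypothetical and is not part of the training scripts. It only mirrors the layout visible in the export log, where the collection root is `collections` and the export output lands under `derived_data`.

```python
from pathlib import PurePosixPath

# Matches "Using collection root: collections" in the export log.
COLLECTION_ROOT = PurePosixPath("collections")

def resolve_paths(collection, train_subdir, json_conf):
    """Hypothetical helper showing how the relative arguments expand."""
    coll_dir = COLLECTION_ROOT / collection
    return {
        # The training data path is relative to derived_data.
        "train_data": coll_dir / "derived_data" / train_subdir,
        # Other paths are relative to the collection directory.
        "json_conf": coll_dir / json_conf,
    }

paths = resolve_paths("wikipedia_dpr_nq_sample",
                      "cedr_train/text_raw",
                      "model_conf/vanilla_bert.json")
for name, path in paths.items():
    print(f"{name}: {path}")
```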

The script runs both training and evaluation. The respective statistics are stored in a JSON file:

In [11]:
!cat collections/wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/todays_experiment/base/0/train_stat.json 

{
    "0": {
        "loss": 0.3480607439222474,
        "score": 0.6714521193722517,
        "lr": 0.0002,
        "bert_lr": 2e-05,
        "train_time": 4923.056172847748,
        "validation_time": 1545.6733376979828
    },
    "1": {
        "loss": 0.25530940150896,
        "score": 0.672424202490772,
        "lr": 0.00019,
        "bert_lr": 1.9e-05,
        "train_time": 4912.885874032974,
        "validation_time": 1548.907298564911
    },
    "2": {
        "loss": 0.22298829897231495,
        "score": 0.6784664800502341,
        "lr": 0.0001805,
        "bert_lr": 1.805e-05,
        "train_time": 4920.855644226074,
        "validation_time": 1547.028527021408
    },
    "3": {
        "loss": 0.19422185843908618,
        "score": 0.6758941456344162,
        "lr": 0.00017147499999999998,
        "bert_lr": 1.71475e-05,
        "train_time": 4920.853113651276,
        "validation_time": 1546.0641107559204
    }
}
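
Statistics of this kind are easy to summarize programmatically. The sketch below uses values copied (and rounded) from the output above. It assumes that `score` is the validation metric and that the times are in seconds, neither of which is stated in the file itself. The helper names are illustrative.

```python
# Values copied from train_stat.json above, rounded for brevity.
stats = {
    "0": {"score": 0.6715, "lr": 0.0002,
          "train_time": 4923.06, "validation_time": 1545.67},
    "1": {"score": 0.6724, "lr": 0.00019,
          "train_time": 4912.89, "validation_time": 1548.91},
    "2": {"score": 0.6785, "lr": 0.0001805,
          "train_time": 4920.86, "validation_time": 1547.03},
    "3": {"score": 0.6759, "lr": 0.000171475,
          "train_time": 4920.85, "validation_time": 1546.06},
}

def best_epoch(stats):
    """Return (epoch, score) for the epoch with the highest score."""
    epoch = max(stats, key=lambda e: stats[e]["score"])
    return int(epoch), stats[epoch]["score"]

def total_hours(stats):
    """Total training plus validation time, assuming seconds."""
    secs = sum(s["train_time"] + s["validation_time"] for s in stats.values())
    return secs / 3600.0

# Ratio between learning rates of consecutive epochs.
lrs = [stats[str(e)]["lr"] for e in range(len(stats))]
decay = [later / earlier for earlier, later in zip(lrs, lrs[1:])]

hours = total_hours(stats)
print("best epoch and score:", best_epoch(stats))
print("per-epoch lr decay:", [round(r, 4) for r in decay])
print(f"total time: {hours:.2f} hours")
```

In this run the best score is reached at epoch 2 rather than at the last epoch, and the learning rate shrinks by a constant factor of 0.95 per epoch. To summarize a real run, load the file with `json.load` instead of using the hard-coded dictionary.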

## It is possible to train a neural model in a fusion mode.

Here, we optimize for the neural model score fused with the score of a candidate generator. This requires knowing a good weight for the candidate generator score.
Here, we assume that the weight 1.0 is good enough and export data as shown in the next cell. Please note the parameter `cand_train_4pos_qty`, which controls the depth of the pool from which we select positive examples. We normally want this pool to be larger than the pool from which we select negative examples:

In [None]:
!scripts/export_train/export_cedr_with_scores.sh \
  wikipedia_dpr_nq_sample \
  text_raw \
  bitext \
  dev \
  -out_subdir cedr_train_with_scores/text_raw \
  -cand_train_qty 50 \
  -cand_test_qty 50 \
  -cand_train_4pos_qty 1000 \
  -thread_qty 4 \
  -hard_neg_qty 0 \
  -sample_easy_neg_qty 0 \
  -sample_med_neg_qty 3 \
  -max_num_query_test 5000 \
  -cand_prov lucene \
  -cand_prov_add_conf exper_desc.best/lucene.json
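
To make the idea of score fusion concrete, here is a minimal sketch. It assumes a simple linear combination of the two scores, which is an illustrative assumption rather than a description of how the training code combines them. The names `fused_score` and `rerank` are hypothetical.

```python
def fused_score(neural_score, cand_score, cand_weight=1.0):
    """Assumed linear fusion of the neural and candidate generator scores."""
    return neural_score + cand_weight * cand_score

def rerank(candidates, cand_weight=1.0):
    """Sort (doc_id, neural_score, cand_score) tuples by fused score, best first."""
    return sorted(candidates,
                  key=lambda c: fused_score(c[1], c[2], cand_weight),
                  reverse=True)

# Toy candidates: (document ID, neural score, candidate generator score).
candidates = [("d1", 0.2, 3.0), ("d2", 1.5, 1.0), ("d3", 0.9, 2.5)]

fused_order = [c[0] for c in rerank(candidates, cand_weight=1.0)]
neural_order = [c[0] for c in rerank(candidates, cand_weight=0.0)]
print("fused: ", fused_order)
print("neural:", neural_order)
```

With a weight of zero the ranking depends on the neural score alone. With a non-zero weight, a document that the candidate generator scored highly can move up even if its neural score is modest.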

__Importantly__, to train a model in this mode we use:
1. A different configuration file (`model_conf/vanilla_bert_with_scores.json`) that sets the candidate provider weights to be non-zero.
2. The newly generated training data that includes the exported scores (`cedr_train_with_scores/text_raw`).

In [None]:
!scripts/cedr/train_model.sh \
    wikipedia_dpr_nq_sample \
    cedr_train_with_scores/text_raw \
    vanilla_bert \
    -seed 0 \
    -add_exper_subdir todays_experiment_with_scores \
    -json_conf model_conf/vanilla_bert_with_scores.json \
    -epoch_qty 4

The training and validation statistics can be found in the following JSON file:

In [15]:
!cat collections/wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/todays_experiment_with_scores/base/0/train_stat.json 

{
    "0": {
        "loss": 0.3928542813032928,
        "score": 0.672083845882564,
        "lr": 0.0002,
        "bert_lr": 2e-05,
        "train_time": 4668.770467042923,
        "validation_time": 1543.5844690799713
    },
    "1": {
        "loss": 0.2927950551674534,
        "score": 0.6839066897667362,
        "lr": 0.00019,
        "bert_lr": 1.9e-05,
        "train_time": 4698.543385028839,
        "validation_time": 1547.0706918239594
    },
    "2": {
        "loss": 0.2543279381992271,
        "score": 0.6847354303138471,
        "lr": 0.0001805,
        "bert_lr": 1.805e-05,
        "train_time": 4704.753187894821,
        "validation_time": 1544.000947713852
    },
    "3": {
        "loss": 0.2291002025931528,
        "score": 0.68495924843781,
        "lr": 0.00017147499999999998,
        "bert_lr": 1.71475e-05,
        "train_time": 4703.747788667679,
        "validation_time": 1542.5429928302765
    }
}
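
The two runs can be compared epoch by epoch. The sketch below uses the `score` values from the two `train_stat.json` files shown in this notebook, rounded to four digits. It assumes that both scores are the same validation metric computed on the same queries.

```python
# Scores per epoch, copied from the two train_stat.json outputs and rounded.
plain = [0.6715, 0.6724, 0.6785, 0.6759]   # todays_experiment
fused = [0.6721, 0.6839, 0.6847, 0.6850]   # todays_experiment_with_scores

for epoch, (p, f) in enumerate(zip(plain, fused)):
    print(f"epoch {epoch}: plain={p:.4f} fused={f:.4f} diff={f - p:+.4f}")

# Difference between the best epochs of the two runs.
gain = max(fused) - max(plain)
print(f"best plain={max(plain):.4f} best fused={max(fused):.4f} gain={gain:+.4f}")
```

In these two runs the fused model scores higher at every epoch. Both runs use a single seed (`-seed 0`), so the size of the gap may vary with a different seed.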