# Simple notebook for training and evaluating models

Two models are compared. Both start from a single Electra-small model trained on the SNLI train dataset with gold labels. That base model is then fine-tuned in two ways:
* One fine-tuned on the SNLI dev dataset using gold labels and evaluated on the SNLI test dataset modified to contain a probability distribution over the annotator labels, and
* One fine-tuned on the SNLI dev dataset modified to contain probability distributions over the annotator labels and evaluated on the same modified SNLI test dataset.
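
A "probability distribution of the annotator labels" can be read as the normalized vote count per class. The script that produced the `*_probs.jsonl` files is not shown in this notebook, so the following is only a minimal sketch: it relies on the `annotator_labels` field of the standard SNLI JSONL files, while the class order `LABELS` and the function name `annotator_distribution` are assumptions made for illustration.

```python
from collections import Counter

# Assumed class order; the order used in the prepared *_probs.jsonl files is not shown.
LABELS = ("entailment", "neutral", "contradiction")

def annotator_distribution(annotator_labels):
    """Convert a list of annotator votes into a probability distribution over LABELS."""
    # Ignore anything that is not one of the three NLI classes (e.g. empty strings).
    votes = Counter(label for label in annotator_labels if label in LABELS)
    total = sum(votes.values())
    if total == 0:
        raise ValueError("no valid annotator labels")
    return [votes[label] / total for label in LABELS]

# Hypothetical record with five annotator votes.
example = {
    "annotator_labels": ["neutral", "entailment", "neutral", "neutral", "contradiction"],
}
print(annotator_distribution(example["annotator_labels"]))  # [0.2, 0.6, 0.2]
```

With this representation a unanimous example becomes a one-hot vector, so gold-label training is the special case where every annotator agrees.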

In [6]:
ROOT_DATA_PATH = '../data/snli_1.0'
ROOT_MODEL_PATH = './output_final/trained_model_snli_baseline'

## Train the model using the `train` dataset with gold labels

In [12]:
!python run.py --do_train \
      --task nli \
      --per_device_train_batch_size 200 \
      --dataset ../data/snli_1.0/snli_1.0_train_prepared.jsonl \
      --output_dir $ROOT_MODEL_PATH/trained_train_gold

  0% 0/1 [00:00<?, ?it/s]100% 1/1 [00:00<00:00, 209.44it/s]
Some weights of the model checkpoint at google/electra-small-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/elect

## Train/fine-tune the model using the `dev` dataset with gold labels

In [13]:
!python run.py --do_train \
      --task nli \
      --per_device_train_batch_size 200 \
      --dataset ../data/snli_1.0/snli_1.0_dev_prepared.jsonl \
      --output_dir $ROOT_MODEL_PATH/trained_train_gold_dev_gold \
      --model $ROOT_MODEL_PATH/trained_train_gold

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-3a4e6c19b1b5fa0b/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab...
Downloading data files: 100% 1/1 [00:00<00:00, 2113.00it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 72.76it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-3a4e6c19b1b5fa0b/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab. Subsequent calls will reuse this data.
100% 1/1 [00:00<00:00, 754.10it/s]
Preprocessing data... (this takes a little bit, should only happen once per dataset)
#0:   0% 0/5 [00:00<?, ?ba/s]
#1:   0% 0/5 [00:00<?, ?ba/s]
#0:  20% 1/5 [00:00<00:01,  3.69ba/s]
#0:  40% 2/5 [00:00<00:00,  4.50ba/s]
#0:  60% 3/5 [00:00<00:00,  4.72ba/s]
#1:  80% 4/5 [00:00<00:00,  4.02ba/s]
#0:  80% 4/5 [00:01<00:00,  3.92ba/s]
***** Running training *****
  Num examples = 9842
  Num Epochs = 3
  Instantaneous batch size per device =

## Train/fine-tune the model using the `dev` dataset with probability labels
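
How `run.py` handles the `pnli` task is not shown in this notebook. One common way to train against probability labels is cross-entropy between the target distribution and the softmax of the model's logits. The sketch below illustrates that loss in plain Python under that assumption; `log_softmax` and `soft_cross_entropy` are hypothetical helper names, not functions from `run.py`.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy between a target distribution and softmax(logits)."""
    log_probs = log_softmax(logits)
    return -sum(p * lp for p, lp in zip(target_probs, log_probs))

# Uniform logits give a loss of log(3) for any target distribution over 3 classes.
print(round(soft_cross_entropy([0.0, 0.0, 0.0], [0.2, 0.6, 0.2]), 4))  # 1.0986
```

When the target is one-hot, this reduces to the ordinary cross-entropy used for gold labels, which is why the same base model can be fine-tuned either way.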

In [14]:
!python run.py --do_train \
      --task pnli \
      --per_device_train_batch_size 200 \
      --dataset ../data/snli_1.0/snli_1.0_dev_probs.jsonl \
      --output_dir $ROOT_MODEL_PATH/trained_train_gold_dev_probs \
      --model $ROOT_MODEL_PATH/trained_train_gold

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-67abd46323a82e7a/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab...
Downloading data files: 100% 1/1 [00:00<00:00, 2048.00it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 94.08it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-67abd46323a82e7a/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab. Subsequent calls will reuse this data.
100% 1/1 [00:00<00:00, 671.95it/s]
Preprocessing data... (this takes a little bit, should only happen once per dataset)
#0:   0% 0/5 [00:00<?, ?ba/s]
#0:  20% 1/5 [00:00<00:00,  4.47ba/s]
#0:  40% 2/5 [00:00<00:00,  4.66ba/s]
#0:  60% 3/5 [00:00<00:00,  4.69ba/s]
#0:  80% 4/5 [00:00<00:00,  4.89ba/s]
#0:  80% 4/5 [00:01<00:00,  3.93ba/s]
#1:  80% 4/5 [00:01<00:00,  3.91ba/s]
***** Running training *****
  Num examples = 9842
  Num Epochs = 3
  Instantaneous batch size per dev

## Evaluate the model trained on `train` gold and fine-tuned on `dev` gold, using the `test` dataset with probability labels

In [17]:
!python run.py --do_eval \
      --task pnli \
      --per_device_train_batch_size 200 \
      --dataset ../data/snli_1.0/snli_1.0_test_probs.jsonl \
      --output_dir $ROOT_MODEL_PATH/trained_train_gold_dev_gold_evaled_test_probs \
      --model $ROOT_MODEL_PATH/trained_train_gold_dev_gold

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-daaa4f8822524b32/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab...
Downloading data files: 100% 1/1 [00:00<00:00, 1943.61it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 107.83it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-daaa4f8822524b32/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab. Subsequent calls will reuse this data.
100% 1/1 [00:00<00:00, 687.48it/s]
Preprocessing data... (this takes a little bit, should only happen once per dataset)
#0:   0% 0/5 [00:00<?, ?ba/s]
#0:  20% 1/5 [00:00<00:00,  4.32ba/s]
#0:  40% 2/5 [00:00<00:00,  4.33ba/s]
#0:  60% 3/5 [00:00<00:00,  4.62ba/s]
#0:  80% 4/5 [00:00<00:00,  4.83ba/s]
#0:  80% 4/5 [00:01<00:00,  3.89ba/s]
#1:  80% 4/5 [00:01<00:00,  3.82ba/s]
***** Running Evaluation *****
  Num examples = 9824
  Batch size = 8
You're using a ElectraTokenizer

## Evaluate the model trained on `train` gold and fine-tuned on `dev` probabilities, using the `test` dataset with probability labels

In [21]:
!python run.py --do_eval \
      --task pnli \
      --per_device_train_batch_size 200 \
      --dataset ../data/snli_1.0/snli_1.0_test_probs.jsonl \
      --output_dir $ROOT_MODEL_PATH/trained_train_gold_dev_probs_evaled_test_probs \
      --model $ROOT_MODEL_PATH/trained_train_gold_dev_probs

  0% 0/1 [00:00<?, ?it/s]100% 1/1 [00:00<00:00, 638.99it/s]
Preprocessing data... (this takes a little bit, should only happen once per dataset)
#0:   0% 0/5 [00:00<?, ?ba/s]
#1:   0% 0/5 [00:00<?, ?ba/s]
#0:  20% 1/5 [00:00<00:01,  2.03ba/s]
#0:  40% 2/5 [00:00<00:01,  2.51ba/s]
#0:  60% 3/5 [00:01<00:00,  2.50ba/s]
#1:  80% 4/5 [00:01<00:00,  2.17ba/s]
#0:  80% 4/5 [00:01<00:00,  2.04ba/s]
***** Running Evaluation *****
  Num examples = 9824
  Batch size = 8
You're using a ElectraTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100% 1228/1228 [00:15<00:00, 77.95it/s]
Evaluation results:
{'eval_loss': 0.5234462022781372, 'eval_accuracy': 0.8907777070999146, 'eval_runtime': 16.4102, 'eval_samples_per_second': 598.652, 'eval_steps_per_second': 74.831}
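
The logged figures can be cross-checked with a little arithmetic: 9824 examples at an evaluation batch size of 8 should give the 1228 steps shown in the progress bar, and dividing by `eval_runtime` should roughly reproduce the reported throughput. The sketch below uses only values copied from the log above; small differences in the last digit are expected because the logged runtime is rounded.

```python
import math

# Values copied from the evaluation log.
num_examples = 9824
batch_size = 8
eval_runtime = 16.4102  # seconds, as logged (rounded)

steps = math.ceil(num_examples / batch_size)
samples_per_second = num_examples / eval_runtime
steps_per_second = steps / eval_runtime

print(steps)  # 1228
print(round(samples_per_second, 2))  # compare with eval_samples_per_second
print(round(steps_per_second, 2))    # compare with eval_steps_per_second
```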
