<a href="https://colab.research.google.com/github/patpizio/vennabers-for-nlu/blob/main/ivap_nlu_copa22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Calibration of NLU Models with Venn-ABERS Predictors

This notebook reproduces the steps used in [Giovannotti (2022)](https://arxiv.org/abs/2205.10586). 

In [None]:
!python --version

Python 3.7.13


Among other libraries, we install `reliabots`, in which I implemented some conformal prediction algorithms.

In [None]:
!pip install -U torch
!pip install -U transformers datasets sentencepiece
!pip install -U plotly kaleido statsmodels
!pip install -U reliabots
!pip install pyyaml==5.4.1

In [None]:
from reliabots.icp import ConformalPredictor
from reliabots.ivap import IVAP
import reliabots.calibrutils as cu

import csv, codecs
from tqdm.notebook import tqdm
from pprint import pprint
import json, pickle
import numpy as np
from scipy.special import softmax
import torch
from torch.utils.data import DataLoader
from datasets import Features, Value, Sequence, ClassLabel, DatasetDict, Dataset
from datasets import load_dataset, load_metric, set_caching_enabled, concatenate_datasets
from transformers import set_seed
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import AdamW
from sklearn.metrics import classification_report, matthews_corrcoef, accuracy_score, precision_recall_fscore_support, f1_score 
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.calibration import calibration_curve
import plotly.express as px
import pandas as pd



In [None]:
print(torch.cuda.get_device_name(0))

Tesla T4


In [None]:
device = torch.device('cuda')
!nvidia-smi

Wed Jul  6 12:20:40 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Here we choose which pretrained model to use among the four used in the paper. Many others could be added from [the ones provided by Hugging Face](https://huggingface.co/transformers/v3.3.1/pretrained_models.html).

In [None]:
model_checkpoint = 'bert-base-uncased'
# model_checkpoint = 'roberta-base'
# model_checkpoint = 'albert-base-v2'
# model_checkpoint = 'microsoft/deberta-v3-small'

Then we choose a dataset:

In [None]:
# dataset = 'qqp'
# dataset = 'boolq'
dataset = 'cola'
# dataset = 'sst'

In [None]:
if dataset in ['cola', 'qqp']:
    data = load_dataset('glue', dataset)
elif dataset == 'boolq':
    data = load_dataset('super_glue', dataset)
else:
    data = load_dataset(dataset)

Downloading builder script:   0%|          | 0.00/7.78k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/4.47k [00:00<?, ?B/s]

Downloading and preparing dataset glue/cola (download: 368.14 KiB, generated: 596.73 KiB, post-processed: Unknown size, total: 964.86 KiB) to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading data:   0%|          | 0.00/377k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8551 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1043 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1063 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

Our initial dataset is a dictionary with a `train`, a `test` and a `validation` section. 

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

Let's have a look at the test labels. These are all set to `-1` if the dataset is being used in ongoing competitions.

In [None]:
data['test'][1]

{'idx': 1, 'label': -1, 'sentence': 'The car honked its way down the road.'}

So we may need to create a new test set from the `train` and `validation` splits, shuffling the result so that our data points can be treated as i.i.d.

### Shuffle the dataset

In [None]:
test_set_size = {'qqp':40430, 'cola':1063, 'boolq':1635}  # chosen to match validation set sizes

In our work, SST is the only dataset that provides test labels.

In [None]:
if dataset not in ['sst']:
    data = concatenate_datasets([data['train'], data['validation']])
    data = data.train_test_split(test_size=test_set_size[dataset], seed=1986)
    aux_data = data['train'].train_test_split(test_size=test_set_size[dataset], seed=1986)
    data['train'] = aux_data['train']
    data['validation'] = aux_data['test']
data

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 7468
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
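The split sizes above can be checked by hand. A small sketch of the arithmetic for CoLA, using only the sizes printed in this notebook (the variable names are ours, for illustration):

```python
n_train, n_validation = 8551, 1043   # original CoLA split sizes shown earlier
test_size = 1063                     # test_set_size['cola']

pool = n_train + n_validation                # after concatenate_datasets
after_test = pool - test_size                # first train_test_split carves out the test set
after_validation = after_test - test_size    # second split carves out the validation set

print(pool, after_test, after_validation)    # 9594 8531 7468
```

The final value matches the `num_rows` of the new `train` split.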

However, SST comes with `float` labels that span the whole $[0,1]$ interval: so we turn them into binary labels by rounding to the closest integer:

In [None]:
if dataset == 'sst':
    data = data.map(lambda example: {'label': int(round(example['label']))})
    new_features = data['train'].features.copy()
    new_features['label'] = ClassLabel(num_classes=2)
    data = data.cast(new_features)
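One detail of the rounding step is worth knowing: Python's built-in `round` sends exact ties to the nearest even integer, so a score of exactly `0.5` becomes `0`. A minimal sketch (the helper name `binarize` is ours, for illustration only):

```python
def binarize(label: float) -> int:
    """Map a sentiment score in [0, 1] to a binary label, as in the cell above."""
    return int(round(label))

for score in (0.2, 0.5, 0.51, 0.9):
    print(score, '->', binarize(score))
```

Whether any SST example sits exactly on the tie is not something we checked here.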

In [None]:
data['train'].features

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['unacceptable', 'acceptable'], id=None),
 'sentence': Value(dtype='string', id=None)}

## Tokenization

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

We tokenize the whole dataset. Each dataset uses its own column naming, so we create a dictionary to handle each configuration.

In [None]:
col_names = {
    'boolq':{
        1: 'question',
        2: 'passage'
    },
    'qqp':{
        1: 'question1',
        2: 'question2'
    },
    'sst': 'sentence',
    'cola': 'sentence'
}
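As a side note, the mixed layout of `col_names` (a string for single-sentence datasets, a dictionary for sentence pairs) can be hidden behind a small accessor. This is only a sketch: `text_columns` and `tokenizer_inputs` are hypothetical helpers, not part of the notebook or of any library.

```python
col_names = {
    'boolq': {1: 'question', 2: 'passage'},
    'qqp': {1: 'question1', 2: 'question2'},
    'sst': 'sentence',
    'cola': 'sentence',
}

def text_columns(dataset_name):
    """Return the text column names of a dataset as a tuple (hypothetical helper)."""
    entry = col_names[dataset_name]
    if isinstance(entry, str):
        return (entry,)            # one sentence per example
    return (entry[1], entry[2])    # a pair of sentences per example

def tokenizer_inputs(example, dataset_name):
    """Collect the text fields in the order a tokenizer would receive them."""
    return [example[col] for col in text_columns(dataset_name)]

print(text_columns('cola'))
print(tokenizer_inputs({'question': 'q?', 'passage': 'p.'}, 'boolq'))
```

With such a helper, a single call of the form `tokenizer(*tokenizer_inputs(x, dataset), ...)` would cover both cases.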

In [None]:
if dataset == 'boolq':  # boolq's examples are usually longer, we need more space
    max_length = 200
else:
    max_length = 100

In [None]:
if dataset in ['cola', 'sst']:
    print('One sentence per example.')
    data = data.map(lambda x: tokenizer(
                                        x[col_names[dataset]],
                                        padding=True,
                                        truncation=True,
                                        max_length=max_length,
                                       ), 
                    batched=True, 
                    load_from_cache_file=True)
else:
    print('A pair of sentences per example.')
    data = data.map(lambda x: tokenizer(
                                        x[col_names[dataset][1]],
                                        x[col_names[dataset][2]],
                                        padding=True,
                                        truncation=True,
                                        max_length=max_length,
                                       ), 
                    batched=True, 
                    load_from_cache_file=True)    



One sentence per example.


  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

Convert every column used by the transformer into a `torch.Tensor`:

In [None]:
data.set_format(type='torch', columns=['attention_mask', 'input_ids', 'label'], output_all_columns=True)

In [None]:
data['train'][0]['input_ids']

tensor([  101,  2045,  2020,  3174,  2493,  2012,  1996,  8835,  1998,  2151,
         3076,  2040,  2001,  2045,  2056,  2009,  2001, 18988,  1012,   102,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0])
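The trailing zeros above are padding. Here is a toy sketch of what truncation and padding do to a batch of token ids. It is a simplification, not the tokenizer's actual code: in particular, a real tokenizer keeps the closing special token when truncating, while this version simply cuts the list. `pad_batch` is a hypothetical helper.

```python
PAD_ID = 0  # BERT's padding id, as in the model config printed further down

def pad_batch(sequences, max_length):
    """Truncate to max_length, then right-pad to the longest remaining sequence."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 2045, 102], [101, 2045, 2020, 3174, 102]], max_length=4)
print(ids)   # [[101, 2045, 102, 0], [101, 2045, 2020, 3174]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```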

### Add "proper training" and calibration sets

Now for more slicing action. We need to partition our training set into a *proper training set* (used to fine-tune the model, which in turn gives us our nonconformity measure) and a *calibration set* (used to compute the p-values).

In [None]:
cal_set_size = 0.15  # 15% of the training set, we could play with this value and see how calibration is affected

proper_train_and_cal = data['train'].train_test_split(test_size=cal_set_size, shuffle=True, seed=2020)
data['proper_train'] = proper_train_and_cal['train']
data['cal'] = proper_train_and_cal['test']
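The resulting sizes can be anticipated with a line of arithmetic. The sketch below assumes the fractional calibration size is rounded up, which is consistent with the sizes printed further down; we did not check the library's source.

```python
import math

n_train = 7468         # size of data['train'] for CoLA
cal_set_size = 0.15

n_cal = math.ceil(cal_set_size * n_train)   # assumption: the fraction is rounded up
n_proper_train = n_train - n_cal

print(n_proper_train, n_cal)   # 6347 1121
```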

In [None]:
# data['validation'][0]

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7468
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1063
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1063
    })
    proper_train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 6347
    })
    cal: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1121
    })
})

### Training setup

In [None]:
num_labels = 2

In [None]:
batch_size = 16

We can compute a bunch of metrics, even though we will ultimately focus on the macro F1 score.

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=1)
    precision, recall, f1_macro, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    f1_weighted = f1_score(labels, preds, average='weighted')
    return {
        'accuracy': acc,
        'f1_macro': f1_macro,
        'f1_weighted': f1_weighted,
        'precision': precision,
        'recall': recall
    }
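To make the difference between the two F1 averages concrete, here is a plain-Python sketch on a made-up example. It is a simplification of what `scikit-learn` does (for instance, it only considers classes that occur in `labels`), and the function names are ours.

```python
def per_class_f1(labels, preds, cls):
    """F1 of one class, treating that class as the positive one."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_and_weighted_f1(labels, preds):
    classes = sorted(set(labels))
    scores = {c: per_class_f1(labels, preds, c) for c in classes}
    support = {c: labels.count(c) for c in classes}
    macro = sum(scores.values()) / len(classes)                       # every class counts equally
    weighted = sum(scores[c] * support[c] for c in classes) / len(labels)  # weighted by class frequency
    return macro, weighted

labels = [1, 1, 1, 0]   # made-up, imbalanced toy data
preds = [1, 1, 0, 0]
macro, weighted = macro_and_weighted_f1(labels, preds)
print(round(macro, 4), round(weighted, 4))
```

On imbalanced data the macro average is pulled down more by a poorly predicted minority class, which is one reason to prefer it as the headline metric.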

In [None]:
model_checkpoint_safe = model_checkpoint.replace('/', '-')  # pesky slash on the deberta checkpoint name...
folder = dataset + '-' + model_checkpoint_safe
folder

'cola-bert-base-uncased'

The following function, used within the `Trainer` instantiation, should ensure reproducibility with respect to a given seed.

In [None]:
def model_init():
    m = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
                                                           num_labels=num_labels,
                                                           output_attentions = False,
                                                           output_hidden_states = False,
                                                           return_dict=True 
                                                           )
    return m
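The idea behind passing a `model_init` function is that the model is created afresh in every trial, after the seed has been set. A toy sketch of that pattern using only the standard library (the names `toy_model_init` and `run_trial` are hypothetical, and the real seeding is handled by the `Trainer`, as far as I understand):

```python
import random

def toy_model_init(n_weights=3):
    """Stand-in for model_init: draws 'weights' from the global random generator."""
    return [random.gauss(0.0, 0.02) for _ in range(n_weights)]

def run_trial(seed):
    random.seed(seed)          # seed first, then build the model
    return toy_model_init()

assert run_trial(1986) == run_trial(1986)   # same seed, same initial weights
assert run_trial(1986) != run_trial(1985)   # different seed, different weights
```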

Our nonconformity measure is $-z$ where $z$ is the **logit** output by the model. We also tried a $-\text{softmax}(z)$ measure but scrapped it as the difference in performance was negligible.
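One intuition for why the two measures behave similarly: with two classes, the softmax probability of class 1 is a sigmoid of the logit difference $z_1 - z_0$, hence a monotone transformation of it. The sketch below only verifies this identity on made-up logits; it makes no claim about how `reliabots` derives its scores from the model output.

```python
import math

def softmax(logits):
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logit_pairs = [(-1.2, 0.3), (0.5, 0.4), (-3.0, 2.0)]   # made-up (z_0, z_1) pairs
for z0, z1 in logit_pairs:
    p1 = softmax([z0, z1])[1]
    assert abs(p1 - sigmoid(z1 - z0)) < 1e-12          # softmax == sigmoid of the difference
```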

In [None]:
# will contain stuff like the model or the predictions, useful for plotting later on
info = {'logit':{}, 'default':{}}

This is the main training and evaluation loop. Our pretrained model is fine-tuned on the original `train` set for the default case and on the smaller `proper_train` set for the IVAP version. We then compute our evaluation metrics: macro F1 score and Expected Calibration Error (ECE), the latter being handled by the `reliabots` library.
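For readers unfamiliar with calibration errors, here is a plain-Python sketch of ECE and MCE (Maximum Calibration Error) with equal-width bins. It is an illustration under our own assumptions about the binning, not the `reliabots` implementation, and `toy_calibration_errors` is a hypothetical name.

```python
def toy_calibration_errors(probs, labels, num_bins=10):
    """ECE and MCE for positive-class probabilities, using equal-width bins."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * num_bins), num_bins - 1)   # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)   # average predicted probability
        frac_pos = sum(y for _, y in members) / len(members)    # observed frequency of class 1
        gap = abs(mean_prob - frac_pos)
        ece += len(members) / len(probs) * gap                  # gap weighted by bin size
        mce = max(mce, gap)                                     # worst bin
    return {'ECE': ece, 'MCE': mce}

errors = toy_calibration_errors([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
print(errors)
```

In this made-up example both bins are off by 0.1, so ECE and MCE are both about 0.1.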

In [None]:
def train_and_evaluate(seeds, num_epochs, metric_name, conformal):
    results = {}
    for t, seed in enumerate(seeds):
        
        trial = t + 1
        print(f'\nRunning trial n.{trial} / {len(seeds)}\n')
        
        if conformal:
            model_name = model_checkpoint + '-vap'
            train_dataset = data['proper_train']
        else:
            model_name = model_checkpoint
            train_dataset = data['train']
            
        folder = dataset + '-' + model_name.replace('/', '-')
            
        args = TrainingArguments(
          './checkpoints/' + folder,
          evaluation_strategy='epoch',
          save_strategy='epoch',
          learning_rate=2e-5,
          per_device_train_batch_size=batch_size,
          per_device_eval_batch_size=batch_size,
          num_train_epochs=num_epochs,
          weight_decay=0.01,
          save_total_limit=1,  # save some space
          load_best_model_at_end=True,
          metric_for_best_model=metric_name,
          seed=seed
        )

        trainer = Trainer(
        #     model,
            args=args,
            train_dataset=train_dataset,
            eval_dataset=data['validation'],
            tokenizer=tokenizer,
            model_init = model_init,
            compute_metrics=compute_metrics
        )

        trainer.train()  

        y_test = data['test']['label'].numpy()
        y_test_proba = trainer.predict(data['test']).predictions
        
        if conformal:
            y_cal = data['cal']['label'].numpy()
            y_cal_proba = trainer.predict(data['cal']).predictions
            
            names = {'logit':False}
               
            for ncm in ['logit']:  # 'softmax' is not included anymore, so we could really avoid this list and loop. But what if I change my mind later on?
                icp = IVAP(y_cal, y_test, y_cal_proba, y_test_proba, apply_softmax=names[ncm])
                info[ncm][trial] = {'model': icp}                
                
                df_line = [model_name + '-' + ncm, trial]
                ivap_f1 = f1_score(y_test, np.round(info[ncm][trial]['model'].p_single).astype(int), average='macro')
                df_line.append(np.round(ivap_f1, 5))
                cal_perf = info[ncm][trial]['model'].compute_calibration_errors(num_bins=10)
                for metric in cal_perf:
                    df_line.append(np.round(cal_perf[metric], 5))
                
                results[f'{model_name}-{ncm}-{trial}'] = df_line
                
        else:
            df_line = [model_name, trial]
            f1_macro = f1_score(y_test, np.argmax(y_test_proba, axis=1), average='macro')
            df_line.append(np.round(f1_macro, 5))

            smx = torch.nn.Softmax(dim=1)
            softmaxed = smx(torch.Tensor(y_test_proba))
            scorez = softmaxed[:, 1]
            info['default'][trial] = {'scorez': scorez, 'y_test': y_test}
            cal_perf = cu.calibration_errors(scorez, y_test, num_bins=10)
            for metric in cal_perf:
                df_line.append(np.round(cal_perf[metric], 5))

            results[model_name + '-' + str(trial)] = df_line
    
    return results
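The `IVAP` class hides the calibration step, so here is a minimal sketch of an inductive Venn-ABERS predictor for a single test point, following its textbook definition: fit isotonic regression to the calibration scores twice, once with the test point labelled 0 and once labelled 1, and read off the two fitted values. The code is ours and unoptimised; `reliabots` may well work differently, and the merged probability $p = p_1 / (1 - p_0 + p_1)$ is one common choice that we assume here.

```python
from collections import defaultdict

def isotonic_fit(scores, labels):
    """Non-decreasing least-squares fit (pool adjacent violators); returns {score: value}."""
    totals = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, labels):           # tied scores are pooled up front
        totals[s][0] += y
        totals[s][1] += 1
    blocks = []                                # each block: [label_sum, count, scores]
    for s in sorted(totals):
        blocks.append([totals[s][0], totals[s][1], [s]])
        # merge backwards while the block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            label_sum, count, members = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] += members
    return {s: label_sum / count for label_sum, count, members in blocks for s in members}

def ivap_predict(cal_scores, cal_labels, test_score):
    """Return (p0, p1, merged probability) for one test score."""
    scores = list(cal_scores) + [test_score]
    p0 = isotonic_fit(scores, list(cal_labels) + [0])[test_score]
    p1 = isotonic_fit(scores, list(cal_labels) + [1])[test_score]
    return p0, p1, p1 / (1 - p0 + p1)

p0, p1, p = ivap_predict([-2.0, -1.0, 1.0, 2.0], [0, 1, 0, 1], test_score=0.0)
print(p0, p1, p)
```

The width of the interval $[p_0, p_1]$ gives a rough sense of how uncertain the calibrated probability is for that test point.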

More training parameters ahead. I recommend at least 2 epochs for the smaller datasets. The number of trials is instead given by the length of `seeds`.

In [None]:
n_epochs = 3
seeds = [1986, 1985, 2020, 1955, 1954]
# seeds = [1986]  # for QQP only

In [None]:
vanilla_results = train_and_evaluate(seeds=seeds, num_epochs=n_epochs, metric_name='f1_macro', conformal=False)  # vanilla as in 'unmodified', default


Running trial n.1 / 2



loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

https://huggingface.co/bert-base-uncased/resolve/main/pytorch_mo

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
creating metadata file for /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
loading weights file https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predict

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | F1 Weighted | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 1 | No log | 0.408289 | 0.829727 | 0.7656 | 0.818309 | 0.81802 | 0.742003 |
| 2 | 0.473800 | 0.441485 | 0.840075 | 0.784172 | 0.831395 | 0.82659 | 0.76214 |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1063
  Batch size = 16
Saving model checkpoint to ./checkpoints/cola-bert-base-uncased/checkpoint-467
Configuration saved in ./checkpoints/cola-bert-base-uncased/checkpoint-467/config.json
Model weights saved in ./checkpoints/cola-bert-base-uncased/checkpoint-467/pytorch_model.bin
tokenizer config file saved in ./checkpoints/cola-bert-base-uncased/checkpoint-467/tokenizer_config.json
Special tokens file saved in ./checkpoints/cola-bert-base-uncased/checkpoint-467/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are no

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).



Running trial n.2 / 2




loading weights file https://huggingface.co/bert-base-uncased/re

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | F1 Weighted | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 1 | No log | 0.445266 | 0.812794 | 0.726118 | 0.792357 | 0.816154 | 0.700393 |
| 2 | 0.501600 | 0.437413 | 0.830668 | 0.765298 | 0.81855 | 0.822347 | 0.740677 |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1063
  Batch size = 16
Saving model checkpoint to ./checkpoints/cola-bert-base-uncased/checkpoint-467
Configuration saved in ./checkpoints/cola-bert-base-uncased/checkpoint-467/config.json
Model weights saved in ./checkpoints/cola-bert-base-uncased/checkpoint-467/pytorch_model.bin
tokenizer config file saved in ./checkpoints/cola-bert-base-uncased/checkpoint-467/tokenizer_config.json
Special tokens file saved in ./checkpoints/cola-bert-base-uncased/checkpoint-467/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are no

In [None]:
ivap_results = train_and_evaluate(seeds=seeds, num_epochs=n_epochs, metric_name='f1_macro', conformal=True)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).



Running trial n.1 / 2




loading weights file https://huggingface.co/bert-base-uncased/re

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | F1 Weighted | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 1 | No log | 0.438819 | 0.814675 | 0.731354 | 0.795675 | 0.815001 | 0.705678 |
| 2 | 0.459800 | 0.447912 | 0.831609 | 0.769435 | 0.820909 | 0.818922 | 0.746296 |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1063
  Batch size = 16
Saving model checkpoint to ./checkpoints/cola-bert-base-uncased-vap/checkpoint-397
Configuration saved in ./checkpoints/cola-bert-base-uncased-vap/checkpoint-397/config.json
Model weights saved in ./checkpoints/cola-bert-base-uncased-vap/checkpoint-397/pytorch_model.bin
tokenizer config file saved in ./checkpoints/cola-bert-base-uncased-vap/checkpoint-397/tokenizer_config.json
Special tokens file saved in ./checkpoints/cola-bert-base-uncased-vap/checkpoint-397/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If 

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1121
  Batch size = 16
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).



Running trial n.2 / 2




loading weights file https://huggingface.co/bert-base-uncased/re

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | F1 Weighted | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 1 | No log | 0.399577 | 0.822201 | 0.768494 | 0.816432 | 0.790073 | 0.754601 |
| 2 | 0.475400 | 0.445612 | 0.829727 | 0.768706 | 0.81978 | 0.813018 | 0.746965 |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1063
  Batch size = 16
Saving model checkpoint to ./checkpoints/cola-bert-base-uncased-vap/checkpoint-397
Configuration saved in ./checkpoints/cola-bert-base-uncased-vap/checkpoint-397/config.json
Model weights saved in ./checkpoints/cola-bert-base-uncased-vap/checkpoint-397/pytorch_model.bin
tokenizer config file saved in ./checkpoints/cola-bert-base-uncased-vap/checkpoint-397/tokenizer_config.json
Special tokens file saved in ./checkpoints/cola-bert-base-uncased-vap/checkpoint-397/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If 

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1121
  Batch size = 16


In [None]:
all_df = pd.DataFrame.from_dict({**vanilla_results, **ivap_results}, orient='index',
                       columns=['model', 'trial', 'f1', 'ECE', 'MCE', 'logloss', 'brier']).sort_index()
all_df

| | model | trial | f1 | ECE | MCE | logloss | brier |
|---|---|---|---|---|---|---|---|
| bert-base-uncased-1 | bert-base-uncased | 1 | 0.78361 | 0.09493 | 0.22568 | 0.47342 | 0.13081 |
| bert-base-uncased-2 | bert-base-uncased | 2 | 0.76651 | 0.08829 | 0.27771 | 0.44593 | 0.1335 |
| bert-base-uncased-vap-logit-1 | bert-base-uncased-vap-logit | 1 | 0.76935 | 0.02544 | 0.11376 | 0.39802 | 0.12469 |
| bert-base-uncased-vap-logit-2 | bert-base-uncased-vap-logit | 2 | 0.78126 | 0.05527 | 0.16469 | 0.41342 | 0.1278 |


In [None]:
# this is just to take a peek - the complete dataframe above will be saved
all_df_mean = all_df.groupby('model').aggregate('mean')
all_df_mean

| model | trial | f1 | ECE | MCE | logloss | brier |
|---|---|---|---|---|---|---|
| bert-base-uncased | 1.5 | 0.77506 | 0.09161 | 0.251695 | 0.459675 | 0.132155 |
| bert-base-uncased-vap-logit | 1.5 | 0.775305 | 0.040355 | 0.139225 | 0.40572 | 0.126245 |
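The averages are easy to reproduce by hand, and with so few trials it can be useful to look at the spread as well. A small sketch using the per-trial ECE values from the dataframe above and only the standard library:

```python
from statistics import mean, stdev

# per-trial ECE values copied from the results table
ece = {
    'bert-base-uncased': [0.09493, 0.08829],
    'bert-base-uncased-vap-logit': [0.02544, 0.05527],
}

summary = {model: (mean(values), stdev(values)) for model, values in ece.items()}
for model, (m, s) in summary.items():
    print(f'{model}: mean ECE = {m:.5f}, std = {s:.5f}')
```

The same could be done in pandas by aggregating with both `'mean'` and `'std'`.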


### Save to disk

Will this work in Google Drive? I surely hope so.

In [None]:
filename = f'{dataset}-{model_checkpoint_safe}'
filename

'cola-bert-base-uncased'

In [None]:
all_df.to_csv(f'./{filename}.csv')

In [None]:
with open(f'./{filename}', 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(info, f)
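To read the results back later, the pickle can be loaded with the mirror-image call. The sketch below does a round trip on a toy stand-in for `info`, written to a temporary folder. Loading the real file would additionally require `torch` and `reliabots` to be importable, since `info` holds objects from both.

```python
import os
import pickle
import tempfile

toy_info = {'default': {1: {'y_test': [0, 1, 1]}}}   # toy stand-in for the real `info`

path = os.path.join(tempfile.mkdtemp(), 'cola-bert-base-uncased')
with open(path, 'wb') as f:
    pickle.dump(toy_info, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == toy_info)
```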

### Generate reliability bubble charts

Charts are generated by the methods `plot_reliabubble()` (for the IVAP) and `calibrutils.reliabubble()` (for the default) included in `reliabots`.

In [None]:
for ncm in ['logit']:
    # fig = info['logit'][1]['model'].plot_reliabubble(num_bins=10, size_max=30, font_size=18)
    for t in range(len(seeds)):
        fig = info[ncm][t+1]['model'].plot_reliabubble(num_bins=10, size_max=30, font_size=18)
        fig.write_image(f'./{folder}-vap-{ncm}-{t+1}.pdf')
        
for t in range(len(seeds)):
    fig = cu.reliabubble(info['default'][t+1]['scorez'], info['default'][t+1]['y_test'], num_bins=10, size_max=30, font_size=18)
    fig.write_image(f'./{folder}-{t+1}.pdf')

We have just saved all our charts, which means 1 chart per model per trial. Let's have a look at the reliabubble chart for the first trial of our IVAP model:

In [None]:
info['logit'][1]['model'].plot_reliabubble()