<a href="https://colab.research.google.com/github/mdeevan/LightweightFineTuning/blob/main/LightweightFineTuning_bert_large_uncased_qlora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: <br>
**qLoRA** (quantized Low Rank Adaptation). It quantized the base model parameters to allow large model to fit in smaller memory. While compute time is increased for quantization and de-quantization, fitting a large model in small memory makes is memory efficient. LoRA on other hand decomposes a large matrix into small matrices, reducing number of parameters. It requires less memory and speeds up fine-tuning.
<br>https://huggingface.co/docs/peft/developer_guides/lora
<br>https://huggingface.co/docs/peft/main/en/developer_guides/quantization
<br>

* Model: <br>
**google-bert/bert-large-uncased** :  
<br>https://huggingface.co/google-bert/bert-large-uncased
<br>

* Evaluation approach: <br>
**seqeval** framework for sequence labeling evaluation. It evaluates the precision, recall and f1 score.
<br>https://huggingface.co/spaces/evaluate-metric/seqeval
<br>

* Fine-tuning dataset: <br>
**financial_phrasebank** based on the financial news, a multi-class-classification with three sentiments (positive, negative and neutral)
<br>https://huggingface.co/datasets/financial_phrasebank


## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
!pip install accelerate


Collecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m290.1/290.1 kB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m23.7/23.7 MB[0m [31m23.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m823.6/823.6 kB[0m [31m43.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
!pip install transformers --upgrade
!pip install evaluate seqeval
!pip install peft
!pip install bitsandbytes

Collecting transformers
  Downloading transformers-4.39.1-py3-none-any.whl (8.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.8/8.8 MB[0m [31m18.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.2
    Uninstalling transformers-4.38.2:
      Successfully uninstalled transformers-4.38.2
Successfully installed transformers-4.39.1
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-2.18.0-py3-none-any.whl 

#### IMPORTS

In [3]:
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorWithPadding,
                          TrainingArguments, Trainer,
                          BitsAndBytesConfig)

from datasets     import load_dataset

from peft import (LoraConfig, get_peft_model, TaskType,
                  LoftQConfig, prepare_model_for_kbit_training)


import torch
import evaluate
import numpy as np
import pandas as pd


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Define Variables & Load dataset

In [5]:
# https://www.evidentlyai.com/classification-metrics/multi-class-metrics

accuracy  = evaluate.load('accuracy')
f1        = evaluate.load('f1')
precision = evaluate.load('precision')
recall    = evaluate.load('recall')

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.55k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

In [6]:
# checkpoint = "distilbert/distilroberta-base"
checkpoint = 'google-bert/bert-large-uncased'
data_file = "financial_phrasebank"
data_file_subset = "sentences_66agree"

In [7]:
# import numpy as np
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    return accuracy.compute(predictions=predictions, references=labels)

## Load Datasets

In [8]:
def loadDataset(dataFile=data_file, dataFileSubset=data_file_subset, seed=42):
  raw_dataset = load_dataset(path=data_file,
                            name=data_file_subset,
                            split="train").train_test_split(test_size=0.2,
                                                            shuffle=True,
                                                            seed=42)

  raw_train = raw_dataset.pop('train')
  raw_train_valid = raw_train.train_test_split(test_size=.1, shuffle=True, seed=42)
  raw_dataset['train'] = raw_train_valid.pop('train')
  raw_dataset['eval'] = raw_train_valid.pop('test')

  labels = raw_dataset["train"].features['label'].names
  label2id = {l:i for i, l in enumerate(labels)}
  id2label = {i:l for i, l in enumerate(labels)}

  return raw_dataset, label2id, id2label


# Load Model

In [9]:
# Quantize the model

def loadModel(checkpoint=checkpoint, label2id={}, id2label={}):

  quant_config=BitsAndBytesConfig(load_in_4bit = True,
                                  bnb_4bit_quant_type="nf4",
                                  bnb_4bit_use_double_quant=True,
                                  bnb_4bit_compute_dtype=torch.bfloat16
                                  )

  model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                            num_labels = len(label2id),
                                                            id2label=id2label,
                                                            label2id=label2id,
                                                            quantization_config=quant_config,
                                                            device_map={"":0}
                                                            )

  return prepare_model_for_kbit_training(model)


In [10]:
def loadTokenizer(checkpoint=checkpoint):
  return AutoTokenizer.from_pretrained(checkpoint)




In [11]:

def tokenize_function(data, tokenizer):
    return tokenizer(data['sentence'],
#                      max_length=input_max_length,
                     truncation=True,
#                      padding='max_length'
                    )


def get_tokenized_dataset(raw_dataset, tokenizer):
  return raw_dataset.map(tokenize_function, fn_kwargs={"tokenizer": tokenizer}, batched=True)

In [12]:
def get_data_collator(tokenizer):
 return DataCollatorWithPadding(tokenizer=tokenizer,
                                        padding=True,
#                                         padding='max_length',
#                                         max_length=input_max_length)
                                       )


In [13]:
raw_dataset, label2id, id2label = loadDataset(data_file, data_file_subset)
tokenizer = loadTokenizer(checkpoint)
tokenized_dataset = get_tokenized_dataset(raw_dataset, tokenizer)
model= loadModel(checkpoint, label2id, id2label)
data_collator = get_data_collator(tokenizer)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/339k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4217 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/844 [00:00<?, ? examples/s]

Map:   0%|          | 0/3035 [00:00<?, ? examples/s]

Map:   0%|          | 0/338 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
print("model :\n",model)
print("tokenizer : \n", tokenizer)
print("tokenized_dataset :\n", tokenized_dataset)
print("data_collator : \n", data_collator)

model :
 BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (key): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (value): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (Lay

In [15]:
[len(x) for x in tokenized_dataset['train'][:10]['sentence']]

[120, 75, 152, 46, 152, 90, 50, 167, 160, 49]

In [16]:
[len(x) for x in tokenized_dataset['train'][:10]['input_ids']]

[22, 19, 38, 12, 35, 20, 14, 57, 42, 16]

In [17]:

def evaluate_samples(model=model, ds=tokenized_dataset['train'], sample_start=0, sample_count=10):
  samples = ds[sample_start : sample_start+sample_count]

  samples = {k: v for k, v in samples.items() if k not in ['sentence', 'label']}

  batch = data_collator(samples ).to(device)

  output = model(**batch).logits

  predictions=torch.argmax(output, dim=1).cpu().numpy()

  return predictions


In [19]:
sample_start = 20
sample_count = 10
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')


predictions=evaluate_samples(model, tokenized_dataset['train'], sample_start, sample_count)
references =tokenized_dataset['train']['label'][sample_start:sample_start+sample_count]
print("predictions = {}".format(predictions.tolist()))
print("references  = {}".format(references))

predictions = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]
references  = [1, 2, 2, 1, 2, 0, 1, 1, 1, 2]


In [20]:


clf =  evaluate.combine(["accuracy",'f1','precision','recall'])
accuracy_metric = accuracy.compute (predictions = predictions, references  = references )
f1_metric       = f1.compute       (predictions = predictions, references  = references,  average = "macro")
precision_metric= precision.compute(predictions = predictions, references  = references,   average = "macro", zero_division=0)
recall_metric   = recall.compute   (predictions = predictions, references  = references,  average = "macro")


print(accuracy_metric)
print(f1_metric)
print(precision_metric)
print(recall_metric)

{'accuracy': 0.6}
{'f1': 0.37142857142857144}
{'precision': 0.5185185185185185}
{'recall': 0.4166666666666667}


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [21]:
tokenized_dataset['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'neutral', 'positive'], id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [22]:
tokenized_dataset['train'].rename_column('label','labels')

Dataset({
    features: ['sentence', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3035
})

In [25]:
# config = LoraConfig()

def get_qlora_model():
  loftq_config = LoftQConfig(loftq_bits=4)

  qlora_config = LoraConfig(task_type=TaskType.SEQ_CLS ,
                          inference_mode    = False,
                          #  init_lora_weights = "loftq",
                          #  loftq_config      = loftq_config,
                          r                 = 16,
                          lora_alpha        = 32,
                          lora_dropout      = 0.05,
                          bias              = 'none',
                          target_modules    = ['query','value', 'key',"all-linear"], # 'out_proj'],
                          modules_to_save   = ['classifier']
                          )
  qlora_model = get_peft_model(model, qlora_config )

  return qlora_model


In [26]:
qlora_model = get_qlora_model()
qlora_model.print_trainable_parameters()

trainable params: 2,362,371 || all params: 337,507,334 || trainable%: 0.6999465676796226


In [27]:
qlora_model

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): BertForSequenceClassification(
      (bert): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(30522, 1024, padding_idx=0)
          (position_embeddings): Embedding(512, 1024)
          (token_type_embeddings): Embedding(2, 1024)
          (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0-23): 24 x BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): lora.Linear4bit(
                    (base_layer): Linear4bit(in_features=1024, out_features=1024, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                   

In [28]:
Training_Arguments = TrainingArguments(
    per_device_train_batch_size = 4,
    per_device_eval_batch_size  = 4,
    output_dir                  = "bert_large_qlora_classifier",
    learning_rate               = 2e-5,
    num_train_epochs            = 5,
    weight_decay                = 0.005,
    save_strategy               = 'epoch',
    evaluation_strategy         = 'epoch',
    deepspeed                   = False,
    load_best_model_at_end      = True)

In [29]:
trainer = Trainer(
                  model=qlora_model,
                  args=Training_Arguments,
                  train_dataset=tokenized_dataset['train'],
                  eval_dataset =tokenized_dataset['eval'],
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer,
                  data_collator=data_collator
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [30]:
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.9072,0.759535,0.677515
2,0.6573,0.584663,0.736686
3,0.5918,0.452147,0.846154
4,0.4552,0.395976,0.85503
5,0.4112,0.394344,0.85503




TrainOutput(global_step=3795, training_loss=0.6018982858368844, metrics={'train_runtime': 1129.3179, 'train_samples_per_second': 13.437, 'train_steps_per_second': 3.36, 'total_flos': 1274829852211860.0, 'train_loss': 0.6018982858368844, 'epoch': 5.0})

In [31]:
saved_checkpoint = '/content/drive/MyDrive/ftMLC-large-bert-uncase-QLora-Mar-25-00'

In [32]:
trainer.save_model(saved_checkpoint)

In [33]:
print(trainer.evaluate())

{'eval_loss': 0.3943438231945038, 'eval_accuracy': 0.8550295857988166, 'eval_runtime': 16.0297, 'eval_samples_per_second': 21.086, 'eval_steps_per_second': 5.303, 'epoch': 5.0}


## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [None]:
trainer2 = Trainer(
                  model=qlora_model,
                  args=Training_Arguments,
                  train_dataset=tokenized_dataset['train'],
                  eval_dataset =tokenized_dataset['test'],
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer,
                  data_collator=data_collator
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [None]:
print(trainer2.evaluate())

{'eval_loss': 0.35780006647109985, 'eval_accuracy': 0.8779620853080569, 'eval_runtime': 5.6903, 'eval_samples_per_second': 148.322, 'eval_steps_per_second': 18.628}


In [None]:
print(evaluate_samples(qlora_model, tokenized_dataset['test'], 10, 50))

[2 2 1 1 0 1 1 1 1 0 1 1 2 1 1 0 1 1 0 0 0 1 1 1 0 1 1 1 1 2 2 1 1 1 2 1 1
 1 2 1 2 1 0 0 2 1 0 1 1 1]


In [34]:

sample_start=20
sample_count=50

inferences = evaluate_samples(qlora_model, tokenized_dataset['test'], sample_start, sample_count).tolist()
print(inferences)

references=tokenized_dataset['test']['label'][sample_start:sample_start+sample_count]
print(references)

[1, 1, 1, 1, 1, 0, 1, 1, 0, 2, 0, 1, 1, 1, 0, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 0, 0, 2, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 0, 1, 1, 0]
[2, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 0, 0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 0, 1, 1, 0]


In [36]:

sample_start=0
sample_count=4

inferences = []
references = []
for i in range(sample_start, 844,  sample_count):
  inferences += evaluate_samples(qlora_model, tokenized_dataset['test'], sample_start, sample_count).tolist()
  references += tokenized_dataset['test']['label'][sample_start:sample_start+sample_count]



In [37]:
mismatches = []
for n, (i, r) in enumerate(zip(inferences, references)):
  if i!=r:
    # mismatches.append(n)
    txt = "prediction: {}, reference: {}, sentence:{}".format(id2label[i], id2label[r], raw_dataset['test'][n]['sentence'])
    mismatches.append(txt)

In [38]:
print(len(mismatches))

422


In [39]:
print(len(mismatches)/len(inferences))

0.5


## Accuracy on the test set is 50% after training for 5 epochs. Training for additional 10 epochs

In [41]:
Training_Arguments = TrainingArguments(
    per_device_train_batch_size = 4,
    per_device_eval_batch_size  = 4,
    output_dir                  = "bert_large_qlora_classifier2",
    learning_rate               = 2e-5,
    num_train_epochs            = 10,
    weight_decay                = 0.005,
    save_strategy               = 'epoch',
    evaluation_strategy         = 'epoch',
    deepspeed                   = False,
    load_best_model_at_end      = True)
trainer = Trainer(
                  model=qlora_model,
                  args=Training_Arguments,
                  train_dataset=tokenized_dataset['train'],
                  eval_dataset =tokenized_dataset['eval'],
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer,
                  data_collator=data_collator
)

In [42]:
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.4178,0.420329,0.893491
2,0.4242,0.36117,0.905325
3,0.3572,0.381627,0.902367
4,0.3229,0.397611,0.899408
5,0.2995,0.348891,0.91716
6,0.3121,0.354088,0.920118
7,0.3084,0.322848,0.928994
8,0.2739,0.341263,0.923077
9,0.275,0.324152,0.931953
10,0.2683,0.330553,0.926036




TrainOutput(global_step=7590, training_loss=0.32756006092579154, metrics={'train_runtime': 2250.7859, 'train_samples_per_second': 13.484, 'train_steps_per_second': 3.372, 'total_flos': 2545257243255720.0, 'train_loss': 0.32756006092579154, 'epoch': 10.0})

In [43]:
saved_checkpoint = '/content/drive/MyDrive/ftMLC-large-bert-uncase-QLora-Mar-25-01'

In [44]:
trainer.save_model(saved_checkpoint)

In [46]:
print(trainer.evaluate())

{'eval_loss': 0.32284772396087646, 'eval_accuracy': 0.9289940828402367, 'eval_runtime': 13.843, 'eval_samples_per_second': 24.417, 'eval_steps_per_second': 6.14, 'epoch': 10.0}


In [47]:
trainer2 = Trainer(
                  model=qlora_model,
                  args=Training_Arguments,
                  train_dataset=tokenized_dataset['train'],
                  eval_dataset =tokenized_dataset['test'],
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer,
                  data_collator=data_collator
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [48]:

sample_start=20
sample_count=50

inferences = evaluate_samples(qlora_model, tokenized_dataset['test'], sample_start, sample_count).tolist()
print(inferences)

references=tokenized_dataset['test']['label'][sample_start:sample_start+sample_count]
print(references)

[1, 1, 2, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 0, 0, 2, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 0, 1, 1, 0]
[2, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 0, 0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 0, 1, 1, 0]


In [49]:

sample_start=0
sample_count=4

inferences = []
references = []
for i in range(sample_start, 844,  sample_count):
  inferences += evaluate_samples(qlora_model, tokenized_dataset['test'], sample_start, sample_count).tolist()
  references += tokenized_dataset['test']['label'][sample_start:sample_start+sample_count]



In [50]:
mismatches = []
for n, (i, r) in enumerate(zip(inferences, references)):
  if i!=r:
    # mismatches.append(n)
    txt = "prediction: {}, reference: {}, sentence:{}".format(id2label[i], id2label[r], raw_dataset['test'][n]['sentence'])
    mismatches.append(txt)

In [51]:
print(len(mismatches))


0


In [52]:
print(len(mismatches)/len(inferences))

0.0


In [53]:
mismatches

[]

## Test set prediction is at 100% after 15 epochs