# HW11.1 Fine-tuning BERT LLM using Huggingface Transformers library

In this homework, we will step away from TensorFlow Keras for a moment and instead use the Transformers library from HuggingFace (https://huggingface.co/). HuggingFace is a community that hosts pre-trained models, from LLMs to computer vision and audio ML models. Their `transformers` library gives you easy access to SOTA LLMs and to tools for fine-tuning them, and their `datasets` library provides standard benchmark datasets (the name is generic, but the library really is called `datasets`).

Specifically, in this homework you will:
1. Walk through an example of loading the `sst2` dataset (the Stanford Sentiment Treebank, a dataset for sentiment analysis) from the `GLUE` benchmark we talked about in class. GLUE covers a range of NLP tasks and is used to benchmark LLMs. After you load the dataset, a few examples show how to inspect it.
2. From the `transformers` library, load the pretrained LLM called DistilBERT, a smaller variant of the famous BERT LLM.
3. Fine-tune (train further) the DistilBERT model on the `sst2` dataset to achieve better performance.
4. Evaluate your fine-tuned model on `sst2` and compare it with: (1) the model before fine-tuning; (2) the default model in the HuggingFace library, which was fine-tuned by experts.

Please complete all tasks/code and answer all questions.

## Requirements

You will need the following libraries at the minimum:

```
!pip install datasets
!pip install transformers
!pip install accelerate -U
!pip install torchinfo
```

In [None]:
!pip install datasets
!pip install transformers
!pip install accelerate -U
!pip install torchinfo


# 1. Load SST2 data

In [None]:
from datasets import load_dataset
import numpy as np

# to view the GLUE - SST2 data set and what it is about, see: https://huggingface.co/datasets/nyu-mll/glue
# essentially this is the Stanford Sentiment Treebank dataset for sentiment analysis
datasets = load_dataset("glue", "sst2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# you can inspect this dataset and see what it contains
# you will see it has been divided into three parts: train, val, and test
datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

## Task 1: inspect data text and labels

What are the labels? What do labels 0 and 1 represent? Take note of the keys in this dictionary and their values.

In [None]:
# inspect the first three examples in the datasets
for name in ['train', 'validation', 'test']:
    print(f"Dataset: {name}")
    for i in range(3):
        print(f"Sentence: {datasets[name][i]['sentence']}")
        print(f"Label: {datasets[name][i]['label']}")
    print("\n")

Dataset: train
Sentence: hide new secretions from the parental units 
Label: 0
Sentence: contains no wit , only labored gags 
Label: 0
Sentence: that loves its characters and communicates something rather beautiful about human nature 
Label: 1


Dataset: validation
Sentence: it 's a charming and often affecting journey . 
Label: 1
Sentence: unflinchingly bleak and desperate 
Label: 0
Sentence: allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . 
Label: 1


Dataset: test
Sentence: uneasy mishmash of styles and genres .
Label: -1
Sentence: this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .
Label: -1
Sentence: by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .
Label: -1




Based on the information online about the dataset, the labels encode the sentiment of each sentence: 0 means negative and 1 means positive. That matches the examples above: "contains no wit , only labored gags" is labelled 0, while "it 's a charming and often affecting journey ." and "allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker ." are labelled 1. Every test example carries the label -1, which is a placeholder meaning that the true label is not provided. Each example is a dictionary with three keys: `sentence` (the text), `label` (the integer class) and `idx` (an integer index for the example).
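To make the label convention concrete, here is a minimal sketch that attaches readable names to a few rows copied from the output above. It assumes the sentiment reading of the labels (0 as negative, 1 as positive, the same mapping the saved model's config is given later) and treats -1 as "no label provided"; `LABEL_NAMES` and `describe` are hypothetical helpers, not part of the `datasets` library.

```python
# Assumed mapping for SST-2: 0 = negative, 1 = positive, -1 = label withheld.
LABEL_NAMES = {0: "negative", 1: "positive", -1: "unlabeled"}

# A few (sentence, label) rows copied from the printed examples above.
rows = [
    ("contains no wit , only labored gags", 0),
    ("it 's a charming and often affecting journey .", 1),
    ("uneasy mishmash of styles and genres .", -1),
]


def describe(label):
    """Translate an integer label into a readable name."""
    return LABEL_NAMES[label]


for sentence, label in rows:
    print(f"{describe(label):>9}: {sentence}")
```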

# 2. Load pre-trained model DistilBERT and preprocess text

We've talked about how each LLM comes with its own (subword, learned) tokenizer. Here, when we load the pre-trained LLM, we also load its tokenizer.

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences = tokenizer(datasets['train'][:3]['sentence'])


## Task 2: understand what tokenizer is doing
Now we've used the tokenizer to tokenize the first three sentences in the train dataset. Inspect the tokenized sentences. Let's take the first sentence. It is now represented by a sequence of integer indices. Can you map them back to the actual sub-word units to see how the tokenizer breaks up the words?

Hint: you can do `dir(tokenizer)` to find out how to convert ids to tokens. This applies to any object in python.

In [None]:
# dir(tokenizer)

# remind ourselves of the sentence we are working with
print(f"Sentence: {datasets['train'][0]['sentence']}")
print(f"Label: {datasets['train'][0]['label']}")

# analyze the tokenization
# named `ids` rather than `id` to avoid shadowing Python's built-in id()
ids = tokenized_sentences['input_ids'][0]
print(f"Id: {ids}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(ids)}")

Sentence: hide new secretions from the parental units 
Label: 0
Id: [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102]
Tokens: ['[CLS]', 'hide', 'new', 'secret', '##ions', 'from', 'the', 'parental', 'units', '[SEP]']


The tokenizer first adds two special tokens, `[CLS]` at the start and `[SEP]` at the end. My understanding is that it then keeps a word whole when the word is in its learned vocabulary, and otherwise breaks it into sub-word pieces that are in the vocabulary; the `##` prefix marks a piece that continues the previous one. For example, "secretions" is not a vocabulary entry, so it is broken into "secret" and "##ions", while "parental" and "units" are vocabulary entries and therefore stay whole instead of being split into "parent" + "##al" or "unit" + "##s". The split is driven by which pieces earned a vocabulary entry when the tokenizer was trained, not by whether each piece is a word on its own. This lets the model represent rare or unseen words with a fixed-size vocabulary.
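One way to picture how a subword tokenizer of this kind splits a word is a greedy longest-match lookup against a vocabulary. The sketch below is a simplified illustration under that assumption, not the library's implementation: `wordpiece` and `toy_vocab` are hypothetical, and the real vocabulary is far larger than these nine entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to mirror the sentence inspected above.
toy_vocab = {"hide", "secret", "##ions", "parental", "parent", "##al",
             "units", "unit", "##s"}

for w in ["secretions", "parental", "units"]:
    print(w, "->", wordpiece(w, toy_vocab))
```

Because the longest match wins, "parental" stays whole even though "parent" and "##al" are also in the toy vocabulary, while "secretions" has no whole-word entry and falls back to two pieces.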

The following function applies the tokenizer to all data.

In [None]:
def tokenize_fn(batch):
  return tokenizer(batch['sentence'], truncation=True)

In [None]:
tokenized_datasets = datasets.map(tokenize_fn, batched=True)


# 3. Fine-tune the pre-trained DistilBERT model

In [None]:
from transformers import TrainingArguments
from transformers import AutoModelForSequenceClassification

In [None]:
training_args = TrainingArguments(
  'my_trainer',
  evaluation_strategy='epoch',
  save_strategy='epoch',
  num_train_epochs=1,
)



In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# this warning above tells you that this pretrained model was topped with a
# newly initialized classifier that needs to be trained/fine-tuned
# let's inspect this model and understand its internal structure

model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [None]:
from torchinfo import summary
# another way to inspect the model
summary(model)

Layer (type:depth-idx)                                  Param #
DistilBertForSequenceClassification                     --
├─DistilBertModel: 1-1                                  --
│    └─Embeddings: 2-1                                  --
│    │    └─Embedding: 3-1                              23,440,896
│    │    └─Embedding: 3-2                              393,216
│    │    └─LayerNorm: 3-3                              1,536
│    │    └─Dropout: 3-4                                --
│    └─Transformer: 2-2                                 --
│    │    └─ModuleList: 3-5                             42,527,232
├─Linear: 1-2                                           590,592
├─Linear: 1-3                                           1,538
├─Dropout: 1-4                                          --
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
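The parameter counts in this summary can be reproduced by hand from the layer shapes printed earlier. The sketch below is a worked check, assuming each transformer block holds four 768x768 attention projections, two layer norms, and a feed-forward pair of linear layers; the feed-forward width of 3072 is an assumption, since the model printout above is cut off before it shows that layer.

```python
vocab_size, max_positions, hidden = 30522, 512, 768
ffn_hidden = 3072  # assumption: not visible in the truncated model printout
num_blocks, num_labels = 6, 2


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale and one shift value per feature

word_emb = vocab_size * hidden
pos_emb = max_positions * hidden
block = (
    4 * linear(hidden, hidden)      # q_lin, k_lin, v_lin, out_lin
    + layer_norm                    # sa_layer_norm
    + linear(hidden, ffn_hidden)    # first feed-forward layer (assumed width)
    + linear(ffn_hidden, hidden)    # second feed-forward layer
    + layer_norm                    # layer norm after the feed-forward part
)
transformer = num_blocks * block
pre_classifier = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total = word_emb + pos_emb + layer_norm + transformer + pre_classifier + classifier
print(f"word embeddings: {word_emb:,}")
print(f"transformer:     {transformer:,}")
print(f"classifier head: {pre_classifier + classifier:,}")
print(f"total:           {total:,}")
```

Under these assumptions the totals agree with the summary, and the newly initialized classifier head turns out to be under 1% of all parameters.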

In [None]:
!pip install evaluate



In [None]:
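from transformers import Trainer
from evaluate import load

# load the GLUE SST-2 metric (accuracy) once, instead of on every evaluation call
metric = load("glue", "sst2")

# define function to compute metrics
def compute_metrics(logits_and_labels):
  # the Trainer passes a (logits, labels) pair for the evaluation set
  logits, labels = logits_and_labels
  # the predicted class is the one with the highest logit
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)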
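To see what the argmax step and the accuracy metric amount to, here is a small dependency-free sketch of the same computation. The logits and labels are made up for illustration, and `accuracy_from_logits` is a hypothetical helper rather than anything from `evaluate`.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four sentences; columns are class 0 and class 1.
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
labels = [0, 1, 1, 1]

print(accuracy_from_logits(logits, labels))
```

The third row is the only miss: its class-0 logit is slightly larger, so the prediction is 0 while the label is 1.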

In [None]:
# set up trainer to fine-tune the model
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



## Task 3: fine-tune the model for 1 epoch!
Note that this might take some time.

Note that the number of epochs was set above in the training arguments.

After fine-tuning for 1 epoch, report the final accuracy.

In [None]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.1976        | 0.347372        | 0.904817 |



TrainOutput(global_step=8419, training_loss=0.2668579514794939, metrics={'train_runtime': 583.0063, 'train_samples_per_second': 115.52, 'train_steps_per_second': 14.441, 'total_flos': 517212489917652.0, 'train_loss': 0.2668579514794939, 'epoch': 1.0})
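The numbers in this `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes a batch size of 8 on a single device, which is my understanding of the `TrainingArguments` default since no batch size was set above; the other values are copied from the outputs.

```python
import math

train_examples = 67349      # size of the train split
batch_size = 8              # assumption: default per-device batch size, one device
runtime_seconds = 583.0063  # 'train_runtime' from the output above

# One optimizer step per batch; the last batch is smaller than the rest.
steps = math.ceil(train_examples / batch_size)
samples_per_second = train_examples / runtime_seconds
steps_per_second = steps / runtime_seconds

print(f"steps per epoch:    {steps}")
print(f"samples per second: {samples_per_second:.2f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

Under that assumption the step count matches `global_step`, and the two throughput figures agree with the reported ones after rounding.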

In [None]:
# save the model to disk so that you can load it back later
trainer.save_model('my_saved_model')

# use this code to massage the labels into something interpretable, NEGATIVE, POSITIVE
import json
config_path = 'my_saved_model/config.json'
with open(config_path) as f:
  j = json.load(f)

j['id2label'] = {0: 'NEGATIVE', 1: 'POSITIVE'}

with open(config_path, 'w') as f:
  json.dump(j, f, indent=2)
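One detail worth knowing about this config patch is that JSON object keys are always strings, so the integer keys 0 and 1 are written to disk as "0" and "1". The sketch below shows the round trip with the standard library only; as far as I know, the config loader in `transformers` converts the keys back to integers, but treat that as an assumption.

```python
import json

id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Serializing turns the integer keys into strings.
serialized = json.dumps({"id2label": id2label})
print(serialized)

# Reading the JSON back therefore yields string keys.
restored = json.loads(serialized)["id2label"]
print(restored)

# Converting the keys back to integers recovers the original mapping.
as_int_keys = {int(key): value for key, value in restored.items()}
print(as_int_keys == id2label)
```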

## Use the saved model for inference on new sentences

Now you can use this newly fine-tuned model to build a `pipeline`, an object in the transformers library. The pipeline can be used to run inference on an input sentence.

In [67]:
from transformers import pipeline
new_model = pipeline('text-classification', model='my_saved_model')

# test your new pipeline
new_model('This movie is great!')

# test with more examples
examples = ['ooga booga', "big wash in the holiday", "nothing was green", "how is that you could today", "which in the pot are we", "seven is seven as can be", "how would if the dream came true"]
for x in examples:
    print()
    print(f"Sentence: {x}")
    print(new_model(x))

Device set to use cuda:0



Sentence: ooga booga
[{'label': 'NEGATIVE', 'score': 0.9553182125091553}]

Sentence: big wash in the holiday
[{'label': 'POSITIVE', 'score': 0.765505313873291}]

Sentence: nothing was green
[{'label': 'NEGATIVE', 'score': 0.9981617331504822}]

Sentence: how is that you could today
[{'label': 'POSITIVE', 'score': 0.9842858910560608}]

Sentence: which in the pot are we
[{'label': 'POSITIVE', 'score': 0.9501135945320129}]

Sentence: seven is seven as can be
[{'label': 'POSITIVE', 'score': 0.9800037741661072}]

Sentence: how would if the dream came true
[{'label': 'POSITIVE', 'score': 0.9757164716720581}]


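These results need to be read as sentiment predictions, not grammaticality judgements: the model only ever chooses between NEGATIVE and POSITIVE. My test sentences are mostly nonsensical and carry little sentiment, so neither label is really correct for them, yet the model still has to pick one and often does so with a high score. "nothing was green" is labelled NEGATIVE, possibly because of the word "nothing", while word salad such as "which in the pot are we" comes out POSITIVE. This suggests that the scores reflect surface patterns learned from the training sentences and should not be read as reliable confidence on inputs that look nothing like them.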
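The `score` printed by the pipeline can be understood as the softmax probability of the winning class. The sketch below reproduces that step under this assumption; the logits are made up, and `top_label` is a hypothetical helper that mimics the shape of the pipeline's output rather than calling it.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [-1.5, 1.5]  # made-up logits for one sentence

print(top_label(example_logits, id2label))
```

With only two classes the probabilities always sum to 1, so a high score only says that one class beat the other by a wide margin. It does not say that the input resembles anything the model was trained on.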

# 4. Evaluate the model: how was the result of the fine-tuning?

Once you have trained a model, it's always important to show through proper evaluation that the fine-tuned model is indeed better than it was before fine-tuning, or to compare it with models fine-tuned by other people.

To use HuggingFace's evaluator, install:
`!pip install evaluate`

In [88]:
from evaluate import evaluator

# first let's load the test portion of the sst2 data
test_datasets = load_dataset("glue", "sst2", split="test")

# let's compare three models and evaluate them against each other.

# Model 1: pre-trained model DistilBERT as is. Since some new, untrained
# classifier layers have been added on top, it is expected to have low performance.
# let's load this model again.
checkpoint = "distilbert-base-uncased"
from transformers import AutoModelForSequenceClassification
model_distillBERT = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Model 2: the model you fine tuned. For this one, we already have the pipeline
# called new_model, we can use this directly for evaluation.

In [None]:
# Model 3: the default model for the evaluator if you don't give it any model.
# i.e., you would not supply the argument for model_or_pipeline in the following.
# In this case, it defaults to a model that was fine-tuned by others.

## Task 4: evaluate the three models!
Report the results for Models 1, 2 and 3 above on the `test` portion of the `sst2` dataset. What results do you get? Can you think of why?

Now try testing the three models on the `validation` portion of the same dataset. Report the results. What do you observe?

Hint 1: if you are testing a certain model and get an error about the labels, you might want to use one of the lines that is commented out below and swap it in for another line.

Hint 2: if you can't figure out what's wrong with your accuracy, try going back to inspect the data!


In [84]:
# setting up the evaluator

from evaluate import evaluator
task_evaluator = evaluator("text-classification")

def eval_results(model, data, labels):
    return task_evaluator.compute(
        model_or_pipeline=model,
        data=data,
        input_column="sentence",
        tokenizer=tokenizer,
        metric='accuracy',
        label_mapping=labels
    )
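The `label_mapping` argument exists because a pipeline returns string labels while the dataset stores integers. The sketch below illustrates the idea with made-up pipeline outputs; `map_predictions` is a hypothetical helper, not the evaluator's own code.

```python
def map_predictions(pipeline_outputs, label_mapping):
    """Translate string labels from a pipeline into the dataset's integer labels."""
    return [label_mapping[output["label"]] for output in pipeline_outputs]


# Made-up outputs in the shape a text-classification pipeline returns.
outputs = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.95},
]

print(map_predictions(outputs, {"NEGATIVE": 0, "POSITIVE": 1}))

# A mapping written for a different model's label names does not fit.
try:
    map_predictions(outputs, {"LABEL_0": 0, "LABEL_1": 1})
except KeyError as err:
    print(f"unknown label: {err}")
```

This is why a model with default label names needs a `LABEL_0`/`LABEL_1` mapping, while a model whose config names its classes NEGATIVE and POSITIVE needs a mapping keyed on those names.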

In [85]:
print(eval_results(model_distillBERT, test_datasets, {"LABEL_0": 0.0, "LABEL_1": 1.0}))

Device set to use cuda:0


{'accuracy': 0.0, 'total_time_in_seconds': 9.77838408000025, 'samples_per_second': 186.22708876045226, 'latency_in_seconds': 0.005369788072487782}


In [81]:
print(eval_results(new_model, test_datasets, {"NEGATIVE": 0, "POSITIVE": 1}))

{'accuracy': 0.0, 'total_time_in_seconds': 16.14224120899962, 'samples_per_second': 112.8096140073013, 'latency_in_seconds': 0.008864492701262834}


In [83]:
print(eval_results(None, test_datasets, {"NEGATIVE": 0, "POSITIVE": 1}))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'accuracy': 0.0, 'total_time_in_seconds': 20.78539642099986, 'samples_per_second': 87.60958718883086, 'latency_in_seconds': 0.011414275903898879}


All the accuracies are zero! At first, this was concerning. I went back over my earlier code to see why and was reminded that every test example has a label of -1, which is a placeholder rather than a real label. The models only ever predict 0 or 1, so no prediction can match a reference of -1 and the accuracy is exactly 0 for every model. The true test labels are withheld, presumably so that the test set can serve as a fair benchmark that nobody can tune against. This is why we evaluate on the validation split instead.
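A tiny sketch makes the zero accuracy easy to verify: when every reference label is -1, any classifier that outputs only 0 or 1 scores exactly zero. The `accuracy` helper below is hypothetical and simply mirrors the usual definition of the metric.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal their reference label."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


test_size = 1821                     # number of rows in the test split
test_references = [-1] * test_size   # every test label is the placeholder -1

always_negative = [0] * test_size
always_positive = [1] * test_size

print(accuracy(always_negative, test_references))
print(accuracy(always_positive, test_references))

# Filtering out placeholder labels leaves nothing to evaluate on.
labelled = [r for r in test_references if r != -1]
print(len(labelled))
```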

In [87]:
val_dataset = load_dataset("glue", "sst2", split="validation")

In [89]:
print(eval_results(model_distillBERT, val_dataset, {"LABEL_0": 0.0, "LABEL_1": 1.0}))

Device set to use cuda:0


{'accuracy': 0.5470183486238532, 'total_time_in_seconds': 8.72125763799977, 'samples_per_second': 99.98557962564607, 'latency_in_seconds': 0.01000144224541258}


In [91]:
print(eval_results(new_model, val_dataset, {"NEGATIVE": 0, "POSITIVE": 1}))

{'accuracy': 0.9048165137614679, 'total_time_in_seconds': 4.976570111000001, 'samples_per_second': 175.22108210081637, 'latency_in_seconds': 0.005707075815366973}


In [93]:
print(eval_results(None, val_dataset, {"NEGATIVE": 0, "POSITIVE": 1}))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'accuracy': 0.9105504587155964, 'total_time_in_seconds': 9.83904523000001, 'samples_per_second': 88.62648555992044, 'latency_in_seconds': 0.011283308750000012}


Testing on the validation dataset gives the accuracies I was expecting. The DistilBERT model with its untrained classifier head reached 55% accuracy, hardly above random chance for binary classification. Our fine-tuned model reached a surprisingly good 90% given that we fine-tuned for only one epoch. This highlights that the pre-trained model already has a robust background, so a single pass over the training data goes a long way. As expected, the expertly fine-tuned model performs best with 91%, although I was surprised that the results were so close. There is clearly more we could do to fine-tune it, but for the purposes of this homework assignment, I am satisfied with our results.
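Converting the reported accuracies back into counts of correct sentences shows how small the gap between the two fine-tuned models really is. The sketch below uses only the accuracies printed above and the validation split size of 872; the dictionary keys are just descriptive names.

```python
val_size = 872  # number of validation examples

accuracies = {
    "DistilBERT before fine-tuning": 0.5470183486238532,
    "our fine-tuned model": 0.9048165137614679,
    "default fine-tuned model": 0.9105504587155964,
}

# accuracy = correct / total, so correct = accuracy * total
correct = {name: round(acc * val_size) for name, acc in accuracies.items()}
for name, count in correct.items():
    print(f"{name}: {count} of {val_size} correct")

gap = correct["default fine-tuned model"] - correct["our fine-tuned model"]
print(f"gap between the two fine-tuned models: {gap} sentences")
```

On a validation set this small, each sentence is worth about 0.11 percentage points, so the difference between 90.5% and 91.1% comes down to a handful of examples.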