<a href="https://colab.research.google.com/github/micheldc55/Deep-Learning/blob/main/fine_tunning_DL_networks_glue_benchmark_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install datasets
! pip install transformers
! pip install torchinfo

In [39]:
import datasets
import transformers
import numpy as np
import pprint
import torchinfo
import json

In [3]:
raw_dataset = datasets.load_dataset('glue', 'sst2')

Downloading and preparing dataset glue/sst2 to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.



## Exploring the new dataset

Below we explore what the dataset consists of. It is a dictionary-like object: a DatasetDict from the Hugging Face datasets library ([documentation here](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.DatasetDict)).

As per the documentation, a DatasetDict is: "Dictionary with split names as keys (‘train’, ‘test’ for example), and Dataset objects as values. It also has dataset transform methods like map or filter, to process all the splits at once."

For practical purposes, a DatasetDict behaves like a dictionary: we access each split (train, validation, test) by indexing raw_dataset with its key, just as we would with a regular Python dictionary.

In [4]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [5]:
raw_dataset['train']

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 67349
})

Below we call the "dir" function on the dataset instance. This shows us which attributes the DatasetDict has. Notice some familiar names: "get" and "keys" (from dictionaries), "map", "unique" and "values" (familiar from DataFrames), and "shuffle".

In [6]:
dir(raw_dataset)

['__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_values_features',
 '_check_values_type',
 'align_labels_with_mapping',
 'cache_files',
 'cast',
 'cast_column',
 'class_encode_column',
 'cleanup_cache_files',
 'clear',
 'column_names',
 'copy',
 'data',
 'filter',
 'flatten',
 'formatted_as',
 'from_csv',
 'from_json',
 'from_parquet',
 'from_text',
 'fromkeys',
 'get',
 'items',
 'keys',
 'load_from_disk',
 'map',
 'num_columns',
 'num_rows',
 'pop',
 'popitem',
 'prepare_for_task',
 'push_to_hub',
 'remove_columns',
 'rename_column',
 'rename_columns',
 '

We can check the same thing for the Dataset object. Notice that this has additional methods like "to_csv" or "to_json" that the previous one didn't have.

In [7]:
dir(raw_dataset['train'])

['_TF_DATASET_REFS',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_local_temp_path',
 '_check_index_is_initialized',
 '_data',
 '_estimate_nbytes',
 '_fingerprint',
 '_format_columns',
 '_format_kwargs',
 '_format_type',
 '_get_cache_file_path',
 '_get_output_signature',
 '_getitem',
 '_indexes',
 '_indices',
 '_info',
 '_map_single',
 '_new_dataset_with_indices',
 '_output_all_columns',
 '_push_parquet_shards_to_hub',
 '_save_to_disk_single',
 '_select_contiguous',
 '_select_with_indices_mapping',
 '_split',
 'add_column',
 'add_elasticsearch_index',
 'add_faiss_index',
 'add_fai

In [8]:
type(raw_dataset['train'])

datasets.arrow_dataset.Dataset

Let's look at the .data attribute of the Dataset object. It shows a few example sentences from the dataset, the label for each sentence, and the indices. Notice that each column is printed as a list of lists: each inner list is one batch of the underlying table, here holding 1000 entries.

In [9]:
raw_dataset['train'].data

MemoryMappedTable
sentence: string
label: int64
idx: int32
----
sentence: [["hide new secretions from the parental units ","contains no wit , only labored gags ","that loves its characters and communicates something rather beautiful about human nature ","remains utterly satisfied to remain the same throughout ","on the worst revenge-of-the-nerds clichés the filmmakers could dredge up ",...,"you wish you were at home watching that movie instead of in the theater watching this one ","'s no point in extracting the bare bones of byatt 's plot for purposes of bland hollywood romance ","underdeveloped ","the jokes are flat ","a heartening tale of small victories "],["suspense , intriguing characters and bizarre bank robberies , ","a gritty police thriller with all the dysfunctional family dynamics one could wish for ","with a wonderful ensemble cast of characters that bring the routine day to day struggles of the working class to life ","nonetheless appreciates the art and reveals a music sc

We can also access data by index (datasets are indexed the same way as Python lists). Running the command below returns the first sentence in the data, with its corresponding label and index, all packed in a Python dictionary. We can also access several rows at once by slicing, just like with a list.

In [10]:
raw_dataset['train'][0]

{'sentence': 'hide new secretions from the parental units ',
 'label': 0,
 'idx': 0}

In [11]:
raw_dataset['train'][1:4]

{'sentence': ['contains no wit , only labored gags ',
  'that loves its characters and communicates something rather beautiful about human nature ',
  'remains utterly satisfied to remain the same throughout '],
 'label': [0, 1, 0],
 'idx': [1, 2, 3]}
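
Note that a slice comes back as one dictionary of columns, not as a list of rows. A minimal plain-Python sketch of turning such a slice into per-example records follows; the `batch` literal copies the output above, and `to_records` is a hypothetical helper name, not part of the datasets library.

```python
# Column-oriented slice, copied from the output of raw_dataset['train'][1:4].
batch = {
    "sentence": [
        "contains no wit , only labored gags ",
        "that loves its characters and communicates something rather beautiful about human nature ",
        "remains utterly satisfied to remain the same throughout ",
    ],
    "label": [0, 1, 0],
    "idx": [1, 2, 3],
}


def to_records(columns):
    """Turn a dict of equal-length lists into a list of per-row dicts."""
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*(columns[k] for k in keys))]


records = to_records(batch)
print(records[0]["label"], records[0]["idx"])
```

The column-oriented layout is also what a batched map function receives, which is why such functions index the batch by column name.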

We can inspect the metadata of the columns (features) through the ".features" attribute of the Dataset.

In [12]:
raw_dataset['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}

## Importing the model checkpoint and the Tokenizer

Next, we import the pre-trained model for this dataset. We are going to use DistilBERT (uncased), a variant of BERT that can be loaded from the Hugging Face transformers library. [Check the documentation here](https://huggingface.co/distilbert-base-uncased).

We also import the DistilBERT tokenizer from the same library. The tokenizer handles the text preprocessing the model needs: BERT variants expect token IDs wrapped in special tokens, plus an attention mask, and the tokenizer converts raw text into exactly those inputs.

We are using **distilbert** because it is a lighter version of BERT that is faster to train.

In [13]:
checkpoint = "distilbert-base-uncased"
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)

Below we convert the first four sentences to their tokenized version.

**Note:** The "token_type_ids" key will not be present in the output: segment IDs are needed by BERT, but not by DistilBERT.

In [14]:
tokenized_examples = tokenizer(raw_dataset['train'][0:4]['sentence'])

pprint.pprint(tokenized_examples)

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'input_ids': [[101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
               [101,
                3397,
                2053,
                15966,
                1010,
                2069,
                4450,
                2098,
                18201,
                2015,
                102],
               [101,
                2008,
                7459,
                2049,
                3494,
                1998,
                10639,
                2015,
                2242,
                2738,
                3376,
                2055,
                2529,
                3267,
                102],
               [101, 3464, 12580, 8510, 2000, 3961, 1996, 2168, 2802, 102]]}
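
To make the output above easier to read, here is a small sanity check in plain Python. It assumes that 101 and 102 are the IDs of the [CLS] and [SEP] special tokens in the uncased BERT vocabulary; the lists are copied from the first and last examples of the output.

```python
# Token IDs copied from the tokenizer output above (first and last example).
input_ids = [
    [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
    [101, 3464, 12580, 8510, 2000, 3961, 1996, 2168, 2802, 102],
]
attention_mask = [[1] * 10, [1] * 10]

CLS_ID, SEP_ID = 101, 102  # assumed IDs of [CLS] and [SEP] in the uncased BERT vocabulary

for ids, mask in zip(input_ids, attention_mask):
    # Every sequence is wrapped in the two special tokens...
    assert ids[0] == CLS_ID and ids[-1] == SEP_ID
    # ...and, with no padding applied yet, every position is attended to.
    assert len(mask) == len(ids) and all(mask)

lengths = [len(ids) for ids in input_ids]
print(lengths)
```

Because no padding has been applied, the sequences keep their own lengths and the attention masks contain only ones.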


### Building the tokenize auxiliary function

This function applies the tokenizer to each element in the dataset, one batch at a time. We only truncate here and do not pad: the Hugging Face Trainer pads each batch to its longest sequence at training time, so padding the whole dataset up front is unnecessary.

We use the map method to apply the function to every split of the dataset.
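
The idea of truncating now and padding later can be sketched in plain Python. This is a simplified illustration of per-batch padding, roughly what a padding collator such as transformers.DataCollatorWithPadding does; `pad_batch` is a hypothetical helper, and the pad ID of 0 is assumed from the "pad_token_id" entry of the DistilBERT config.

```python
PAD_ID = 0  # assumed: matches "pad_token_id": 0 in the DistilBERT config


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-ID lists to the longest sequence in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two tokenized sentences of different lengths (IDs taken from the output above).
batch = pad_batch([
    [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
    [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102],
])
print([len(ids) for ids in batch["input_ids"]])
```

Padding per batch means a batch of short sentences is never padded to the length of the longest sentence in the whole dataset, which saves computation.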

In [15]:
def tokenize_function(batch, tokenizer):
  return tokenizer(batch['sentence'], truncation=True)

In [16]:
tokenized_dataset = raw_dataset.map(lambda x: tokenize_function(x, tokenizer=tokenizer), batched=True)

Next, we create the TrainingArguments object. These arguments are passed to the Trainer and give us an easy way to set a saving and evaluation strategy.

Selecting an "epoch" evaluation strategy means the metric is evaluated each time all the elements in the training data have passed through the model once (one epoch).

Selecting an "epoch" saving strategy means a checkpoint is saved at the end of every epoch. By default, a checkpoint is saved every 500 optimization steps. This is too much for this run!
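
To get a feel for the difference between the two saving strategies, here is a back-of-the-envelope calculation. It assumes a per-device batch size of 8 on a single device and a step-based default that saves every 500 optimization steps; check both values against your installed transformers version.

```python
import math

num_train_examples = 67349   # size of the SST-2 training split
batch_size = 8               # assumed: default per-device batch size, single device
num_epochs = 1
default_save_steps = 500     # assumed: interval used by the step-based default

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

checkpoints_with_steps = total_steps // default_save_steps
checkpoints_with_epoch = num_epochs

print(total_steps, checkpoints_with_steps, checkpoints_with_epoch)
```

Under these assumptions the step-based strategy would write 16 checkpoints for a single epoch, while the "epoch" strategy writes one.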

In [17]:
training_arguments = transformers.TrainingArguments("distilbert_on_glue_benchmark", evaluation_strategy='epoch', save_strategy="epoch", num_train_epochs=1)

In [18]:
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

In [19]:
type(model)

transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification

In [20]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [21]:
torchinfo.summary(model)

Layer (type:depth-idx)                                  Param #
DistilBertForSequenceClassification                     --
├─DistilBertModel: 1-1                                  --
│    └─Embeddings: 2-1                                  --
│    │    └─Embedding: 3-1                              23,440,896
│    │    └─Embedding: 3-2                              393,216
│    │    └─LayerNorm: 3-3                              1,536
│    │    └─Dropout: 3-4                                --
│    └─Transformer: 2-2                                 --
│    │    └─ModuleList: 3-5                             42,527,232
├─Linear: 1-2                                           590,592
├─Linear: 1-3                                           1,538
├─Dropout: 1-4                                          --
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
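
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual DistilBERT block layout (four attention projections, a two-layer feed-forward network with hidden size 3072, and two layer norms per block, six blocks in total); the sizes come from the model printout and config, and `linear` is a hypothetical helper.

```python
vocab_size, max_positions, dim = 30522, 512, 768
hidden_dim, n_layers, num_labels = 3072, 6, 2


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * dim  # one scale and one shift per feature

embeddings = vocab_size * dim + max_positions * dim + layer_norm

# Assumed block layout: four attention projections, a two-layer
# feed-forward network, and two layer norms.
block = (
    4 * linear(dim, dim)
    + linear(dim, hidden_dim)
    + linear(hidden_dim, dim)
    + 2 * layer_norm
)
transformer = n_layers * block

pre_classifier = linear(dim, dim)
classifier = linear(dim, num_labels)

total = embeddings + transformer + pre_classifier + classifier
print(f"{transformer:,} {pre_classifier:,} {classifier:,} {total:,}")
```

Only the last two layers (pre_classifier and classifier, about 0.6M parameters) are new; the rest comes from the pre-trained checkpoint.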

When fine-tuning neural networks this way, we train all the parameters in the network. To verify this, we save a snapshot of the model parameters now, so we can check later whether they have changed.

We do this by iterating over the model's named_parameters and converting each one to a NumPy array that is appended to the params_before list.

In [22]:
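params_before = []

for name, p in model.named_parameters():
  # .copy() matters: for a CPU tensor, .numpy() shares memory with the tensor,
  # so without a copy the snapshot could change along with the weights.
  params_before.append(p.detach().cpu().numpy().copy())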

Let's load the metric with the datasets.load_metric function and store it in the "metric" object. We can then call metric.compute() on a small list to check that it calculates the accuracy correctly. Remember that metric.compute expects the model outputs as "predictions" and the true labels as "references".

Below we pass three fake entries. We pretend that our model classified the 3 data points as 1, 0, 1 while the actual labels are 0, 0, 1, so we expect an accuracy of 66.67%.

In [23]:
metric = datasets.load_metric("glue", "sst2")

Notice that the output of the metric computation is a dictionary. This is important because we may need to build our own metrics, and when we do, our function must return a dictionary as well.

In [24]:
metric.compute(predictions=[1, 0, 1], references=[0, 0, 1])

{'accuracy': 0.6666666666666666}
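
As an illustration of the dictionary contract, a hand-written metric could look like the sketch below. `accuracy_metric` is a hypothetical function written for this example, not something provided by the datasets library.

```python
def accuracy_metric(predictions, references):
    """Return the fraction of matching labels, wrapped in a dictionary."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


result = accuracy_metric(predictions=[1, 0, 1], references=[0, 0, 1])
print(result)
```

Returning a dictionary lets one function report several metrics at once, each under its own key.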

Below, we build a function that computes the metric. The model outputs logits (one raw score per class) rather than class labels, so we first convert them to classes with the argmax function. We cannot pass the metric directly because it compares predicted classes against the true labels, which are class indices.

In [25]:
def compute_metrics(logits_and_labels, hf_metric):
  logits, labels = logits_and_labels
  predictions = np.argmax(logits, axis=-1)

  return hf_metric.compute(predictions=predictions, references=labels)
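
One detail worth spelling out: taking the argmax of the logits gives the same class as taking the argmax of the softmax probabilities, so no softmax is needed for accuracy. A small plain-Python sketch with made-up logits (the `softmax` and `argmax` helpers are written here for illustration only):

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # shift for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for three examples and two classes (negative, positive).
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 3.5]]

classes_from_logits = [argmax(row) for row in logits]
classes_from_probs = [argmax(softmax(row)) for row in logits]
print(classes_from_logits, classes_from_probs)
```

Softmax preserves the ordering of the scores, which is why both lists agree.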

## Training the model

Now we will train the model using the transformers.Trainer object, which lets us train easily with the training arguments, model, and dataset that we have instantiated.

The Trainer needs the tokenizer as well, even though we have already tokenized the data: it uses it to pad each batch and saves it alongside the model checkpoints.

In [28]:
trainer = transformers.Trainer(
    model, training_arguments, 
    train_dataset=tokenized_dataset['train'], eval_dataset=tokenized_dataset['validation'],
    tokenizer=tokenizer,
    compute_metrics=lambda x: compute_metrics(x, metric)
)

Now, we begin the training by calling the trainer.train() method!

In [29]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 67349
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 8419
  Number of trainable parameters = 66955010


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.1785        | 0.392807        | 0.894495 |


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 872
  Batch size = 8
Saving model checkpoint to distilbert_on_glue_benchmark/checkpoint-8419
Configuration saved in distilbert_on_glue_benchmark/checkpoint-8419/config.json
Model weights saved in distilbert_on_glue_benchmark/checkpoint-8419/pytorch_model.bin
tokenizer config file saved in distilbert_on_glue_benchmark/checkpoint-8419/tokenizer_config.json
Special tokens file saved in distilbert_on_glue_benchmark/checkpoint-8419/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=8419, training_loss=0.15086521112886286, metrics={'train_runtime': 444.5607, 'train_samples_per_second': 151.496, 'train_steps_per_second': 18.938, 'total_flos': 518400815624736.0, 'train_loss': 0.15086521112886286, 'epoch': 1.0})
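
The numbers in the training log can be cross-checked with a little arithmetic. The sketch below only reuses values printed above; reading the accuracy as a count of correct predictions is my own interpretation.

```python
num_examples = 67349              # training examples, from the log
global_step = 8419                # optimization steps, from TrainOutput
train_runtime = 444.5607          # seconds, from TrainOutput
eval_examples, eval_accuracy = 872, 0.894495

samples_per_second = num_examples / train_runtime
steps_per_second = global_step / train_runtime

# Accuracy times the number of validation examples gives the number of hits.
correct_predictions = round(eval_accuracy * eval_examples)

print(round(samples_per_second, 3), round(steps_per_second, 3), correct_predictions)
```

The results are consistent with the reported throughput figures, and the accuracy corresponds to 780 correct predictions out of 872 validation sentences.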

Running the cell below saves the trained model to the glue_distilbert_model folder in the current directory.

In [30]:
trainer.save_model("glue_distilbert_model")

Saving model checkpoint to glue_distilbert_model
Configuration saved in glue_distilbert_model/config.json
Model weights saved in glue_distilbert_model/pytorch_model.bin
tokenizer config file saved in glue_distilbert_model/tokenizer_config.json
Special tokens file saved in glue_distilbert_model/special_tokens_map.json


Now we run ls on the current directory, and then on the model directory, to see which files were written.

In [31]:
!ls

distilbert_on_glue_benchmark  glue_distilbert_model  sample_data


In [32]:
!ls glue_distilbert_model

config.json	   special_tokens_map.json  tokenizer.json     vocab.txt
pytorch_model.bin  tokenizer_config.json    training_args.bin


## Testing the model on novel data:

Now that we have built and trained the model (fine-tuning all the layers), we can test it on data that it has never seen. We write a few reviews and see how the model performs on them. For that, we can use the "pipeline" function from the transformers library and load the model directly from the directory.

The "pipeline" function accepts a path and fetches the saved model from it.

In [None]:
trained_model = transformers.pipeline("text-classification", model="glue_distilbert_model", device=0)

In [36]:
test_list_reviews = ['What a bad movie', 'That was not bad', 'Awesome display of character building!']

In [37]:
trained_model(test_list_reviews)

[{'label': 'LABEL_0', 'score': 0.9987512826919556},
 {'label': 'LABEL_1', 'score': 0.9974563717842102},
 {'label': 'LABEL_1', 'score': 0.9995433688163757}]

The model seems to be working correctly, but the labels are not easily interpretable. The Trainer does not take label names as an argument; they are stored in the model's configuration. Since our model is already saved, what we can do is edit its config.json file, so that when it's loaded, it carries the labels with it.

Let's check what the config.json file of the model contains. The labels should be under a key called "id2label".

In [38]:
!cat glue_distilbert_model/config.json

{
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "vocab_size": 30522
}


There's no sign of the "id2label" key, so we can use the json library (part of the Python standard library) to load config.json, add our custom labels as a dictionary, and write the file back.

**Notice that the second with statement opens the file in write mode ("w")!**

In [40]:
config_path = "glue_distilbert_model/config.json"

with open(config_path) as f:
  j = json.load(f)

j['id2label'] = {"0": "negative", "1": "positive"}

with open(config_path, "w") as f:
  json.dump(j, f, indent=2)
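
The same edit can be wrapped in a small reusable function. The sketch below is a variation, not the cell above: `set_label_names` is a hypothetical helper, it also writes the inverse "label2id" mapping (my addition, on the assumption that keeping both mappings in sync is useful), and it runs on a throwaway file rather than the real model directory.

```python
import json
import os
import tempfile


def set_label_names(config_path, names):
    """Write id2label and label2id mappings into a config.json file."""
    with open(config_path) as f:
        config = json.load(f)

    # JSON object keys are always strings, hence str(i).
    config["id2label"] = {str(i): name for i, name in enumerate(names)}
    config["label2id"] = {name: i for i, name in enumerate(names)}

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config


# Demonstrate on a throwaway file instead of the real model directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w") as f:
        json.dump({"model_type": "distilbert"}, f)

    set_label_names(demo_path, ["negative", "positive"])

    with open(demo_path) as f:
        updated = json.load(f)

print(updated["id2label"], updated["label2id"])
```

Reading the whole file, changing the dictionary, and writing it back keeps every other key in the config untouched.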

In [41]:
!cat glue_distilbert_model/config.json

{
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "vocab_size": 30522,
  "id2label": {
    "0": "negative",
    "1": "positive"
  }
}

There it is!! Now let's re-import the model with the updated config file. This should make it so that when our model is imported, the custom labels will be correctly assigned. Notice that we are calling the exact same pipeline with the exact same parameters!

In [None]:
trained_model = transformers.pipeline("text-classification", model="glue_distilbert_model", device=0)

In [44]:
test_list_reviews

['What a bad movie',
 'That was not bad',
 'Awesome display of character building!']

In [43]:
trained_model(test_list_reviews)

[{'label': 'negative', 'score': 0.9987512826919556},
 {'label': 'positive', 'score': 0.9974563717842102},
 {'label': 'positive', 'score': 0.9995433688163757}]

**BOOM**

## Validating that all parameters have been modified

As a sanity check, it's worth verifying that training modified all the parameters. We said that fine-tuning updates every parameter, so we expect each of them to have changed at least slightly.

In [57]:
params_after = []

for name, p in model.named_parameters():
  params_after.append(p.detach().cpu().numpy())

diff = []
for p1, p2 in zip(params_before, params_after):
  diff.append(np.sum(np.abs(p1 - p2)))

Finally, we check that all the differences are non-zero.

In [62]:
np.all(diff)

True
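
A single True/False answer does not say which parameter stayed the same if the check ever fails. A minimal plain-Python sketch of a more informative report follows; `unchanged_parameters` is a hypothetical helper, and the toy values stand in for flattened parameter tensors.

```python
def unchanged_parameters(before, after, tolerance=0.0):
    """Return the names whose summed absolute change is <= tolerance."""
    unchanged = []
    for name, old_values in before.items():
        new_values = after[name]
        change = sum(abs(o - n) for o, n in zip(old_values, new_values))
        if change <= tolerance:
            unchanged.append(name)
    return unchanged


# Toy snapshots standing in for flattened parameter tensors (made-up values).
before = {"classifier.weight": [0.10, -0.20], "classifier.bias": [0.0, 0.0]}
after = {"classifier.weight": [0.12, -0.25], "classifier.bias": [0.0, 0.0]}

print(unchanged_parameters(before, after))
```

Keeping the parameter names next to the snapshots (for example in a dictionary instead of a list) is what makes this kind of report possible.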