<a href="https://colab.research.google.com/github/micheldc55/Deep-Learning/blob/main/multi_input_text_classification_with_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Training an ML model with two inputs instead of one

We can use the same model with two inputs instead of one. We give the model two sentences and it outputs whether the second sentence follows from the first one or not.

In [1]:
! pip install -U transformers datasets

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0


In [3]:
import datasets
import transformers
import numpy as np
import sklearn.metrics

## Loading & Exploring the dataset:

The dataset I'm going to use is "Recognizing Textual Entailment" (RTE) from the GLUE benchmark. It consists of pairs of sentences and a label that indicates whether the second sentence follows from the first one, which is called "entailment". The GLUE version has 2 labels, "entailment" and "not_entailment". Some of the original RTE challenge sets used 3 classes (entailment, contradiction and neutral); GLUE collapses the last two into "not_entailment".

In [4]:
raw_datasets = datasets.load_dataset('glue', 'rte')




In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 2490
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 277
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3000
    })
})

In [6]:
raw_datasets['train'].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['entailment', 'not_entailment'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [7]:
raw_datasets['train']['sentence1'][:3]

['No Weapons of Mass Destruction Found in Iraq Yet.',
 'A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI.',
 'Herceptin was already approved to treat the sickest breast cancer patients, and the company said, Monday, it will discuss with federal regulators the possibility of prescribing the drug for more breast cancer patients.']

In [8]:
raw_datasets['train']['sentence2'][:3]

['Weapons of Mass Destruction Found in Iraq.',
 'Pope Benedict XVI is the new leader of the Roman Catholic Church.',
 'Herceptin can be used to treat breast cancer.']

In [9]:
raw_datasets['train']['label'][:3]

[1, 0, 0]

In [10]:
raw_datasets['test']['label'][:3]

[-1, -1, -1]

## Model Checkpoint

We are going to use a pre-trained model for this task. I'm going to show two options: BERT, and DistilBERT, which is a lighter, distilled version of BERT. The cell below selects DistilBERT.

In [11]:
model_name = 'distilbert'

if model_name == 'distilbert':
  checkpoint = 'distilbert-base-cased'
elif model_name == 'bert':
  checkpoint = 'bert-base-cased'
else:
  raise ValueError(f"Unknown model_name {model_name!r}; expected 'distilbert' or 'bert'")

We are going to use the tokenizer that belongs to the same model, so it is good practice to store the model name in a checkpoint string and reuse it in every call in the notebook. This way, if we change the checkpoint, the change propagates to the entire notebook the next time we run it.

HuggingFace gives us a single API (the Auto* classes) for interacting with many different models and tokenizers. It is good practice to leverage that!

This consistency matters. The DistilBERT tokenizer returns no token_type_ids while the BERT tokenizer does, so the tokenizer must come from the same checkpoint as the model we are going to train.

In [12]:
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)

Let's create an example using the first pair of sentences. Below we pass the two sentences to the tokenizer, check how the tokenizer treats them and how it stores the data in a dictionary, and finally decode the tokenized data to see exactly what the model receives.

In [13]:
sentence1 = raw_datasets['train']['sentence1'][0]
sentence2 = raw_datasets['train']['sentence2'][0]

tokenized_sentences = tokenizer(
    sentence1, 
    sentence2
)
tokenized_sentences

{'input_ids': [101, 1302, 20263, 1104, 8718, 14177, 17993, 17107, 1107, 5008, 6355, 119, 102, 20263, 1104, 8718, 14177, 17993, 17107, 1107, 5008, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [14]:
tokenized_sentences.keys()

dict_keys(['input_ids', 'attention_mask'])

In [15]:
tokenizer.decode(tokenized_sentences['input_ids'])

'[CLS] No Weapons of Mass Destruction Found in Iraq Yet. [SEP] Weapons of Mass Destruction Found in Iraq. [SEP]'

Note that in previous uses of the model, the [SEP] and [CLS] tokens seemed irrelevant, but they matter in a multi-input case like this one. [CLS] marks the start of the whole sequence, and each [SEP] marks the end of a sentence, so the first [SEP] is the boundary between the two inputs. The model can now tell that the first sentence is "No Weapons of..." and the second sentence is "Weapons of Mass...".
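To make the layout concrete, here is a minimal pure-Python sketch of how a sentence pair is packed into one sequence. It is not the tokenizer's real implementation: the helper `pack_pair` and the pre-split word tokens are assumptions for illustration only. The `token_type_ids` list is what a BERT tokenizer would return in addition (0 for the first segment, 1 for the second); the DistilBERT tokenizer omits it, as the output above shows.

```python
def pack_pair(tokens_a, tokens_b):
    """Hypothetical helper: pack two token lists the way BERT-style tokenizers do."""
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # Segment ids: 0 covers [CLS], sentence A and the first [SEP]; 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding yet, so every position is a real token.
    attention_mask = [1] * len(tokens)
    return {
        'tokens': tokens,
        'token_type_ids': token_type_ids,
        'attention_mask': attention_mask,
    }


packed = pack_pair(['No', 'weapons', 'found', '.'], ['Weapons', 'found', '.'])
print(packed['tokens'])
print(packed['token_type_ids'])
```

The point of the sketch is that the "two inputs" never reach the model separately: it only ever sees one sequence, with the separator (and, for BERT, the segment ids) telling it where the first sentence ends.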

### Importing the model for normal Classification

Once we have tokenized the data, it's easy to see that we have taken a multi-input problem and converted it into a single-input problem. We have taken the two sentences and concatenated them into a single sequence (with the corresponding special tokens marking the beginning of the sequence and the end of each sentence).

Now the process is the same as before. We have to import the pre-trained model and train it further on the task that we want to perform. For the import, we are going to use HuggingFace's transformers.AutoModelForSequenceClassification, in particular its .from_pretrained method, as this is a model that the library has saved for us.

In [16]:
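import torch

model = transformers.AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# Use the GPU when one is available; otherwise fall back to the CPU.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model.to(device)

model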

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [17]:
training_arguments = transformers.TrainingArguments(
    output_dir = 'whatever_dir_path', 
    evaluation_strategy = 'epoch', 
    save_strategy = 'epoch', 
    num_train_epochs = 5, 
    per_device_train_batch_size = 16, 
    per_device_eval_batch_size = 64, 
    logging_steps = 150  # Number of optimization steps between two training-loss logs. With the default (500) we would see
                         # a single log for the whole run, because the dataset is so small.
)
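As a quick sanity check on these arguments, the arithmetic below works out how many optimization steps the run takes and when the training loss gets logged. It is only a sketch: it assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

num_train_examples = 2490  # rows in the train split shown earlier
batch_size = 16            # per_device_train_batch_size
num_epochs = 5             # num_train_epochs
logging_steps = 150

# Assumes one device, no gradient accumulation, and the last partial batch kept.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps, logged_at)
# → 156 780 [150, 300, 450, 600, 750]
```

The 780 total agrees with the "Total optimization steps" line in the training log, and 156 steps per epoch is why the checkpoint saved after the first epoch is named checkpoint-156.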

In [18]:
metric = datasets.load_metric('glue', 'rte')



In [19]:
metric.compute(predictions=[1, 0, 1], references=[1, 0, 0])

{'accuracy': 0.6666666666666666}

In [20]:
def compute_metrics(logits_and_labels):
  """Function to compute the metrics that we want during the training process. 
  The official metric for this benchmark is accuracy; we also report the F1 
  score (computed for label 1, "not_entailment", which is the sklearn default).

  :param logits_and_labels: Pair of (logits, labels) that the Trainer passes in 
  after each evaluation: the model's logits and the correct label for each sample.
  :type logits_and_labels: transformers.EvalPrediction
  :return: Dictionary containing the accuracy result and f1 score for the 
  predictions
  :rtype: dict
  """
  logits, labels = logits_and_labels
  predictions = np.argmax(logits, axis=-1)
  accuracy = np.mean(predictions == labels)
  f1_score = sklearn.metrics.f1_score(labels, predictions)
  return {'accuracy': accuracy, 'f1': f1_score}
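To see what compute_metrics does step by step, here is the same computation written out in plain Python on four made-up examples. The logits and labels are invented for illustration; the F1 is computed for class 1, which is what sklearn.metrics.f1_score does by default for binary labels.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


logits = [[2.0, -1.0], [-0.5, 0.7], [1.2, 0.3], [-2.0, 2.5]]  # made-up values
labels = [0, 1, 1, 1]

predictions = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Binary F1 for the positive class (label 1).
tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(predictions, accuracy, round(f1, 3))
# → [0, 1, 0, 1] 0.75 0.8
```

The third example is the only mistake: it costs one point of recall for class 1, which is why F1 (0.8) and accuracy (0.75) differ.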

In [21]:
def tokenize_function(batch):
  """Tokenize each (sentence1, sentence2) pair in the batch, truncating pairs 
  that exceed the model's maximum length.
  """
  # No padding here: the Trainer pads each batch when it builds it.
  return tokenizer(batch['sentence1'], batch['sentence2'], truncation=True)

Now we convert the dataset to a tokenized dataset by mapping "tokenize_function" over it. We pass batched=True so that the function receives a batch of examples per call instead of a single example, which lets the tokenizer process many sentence pairs at once.

In [22]:
tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
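Because the tokenize function truncates but does not pad, the tokenized examples have different lengths, and padding has to happen later, one batch at a time. The sketch below shows the idea in plain Python. The helper `pad_batch` is hypothetical and is not the library's collator; the pad id of 0 is taken from the padding_idx=0 in the model printout above.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Hypothetical helper: pad every sequence to the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


padded = pad_batch([[101, 1302, 119, 102], [101, 102]])
print(padded['input_ids'])       # → [[101, 1302, 119, 102], [101, 102, 0, 0]]
print(padded['attention_mask'])  # → [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Padding per batch rather than to a global maximum keeps short batches short, which matters here because the sentence pairs vary a lot in length.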



### Freezing the base model for faster training:

Both BERT and DistilBERT have a lot of parameters, so training all of them may be too heavy for the environment I'm training in, especially if there's no GPU. A good alternative is to freeze the pre-trained base model and train only the final dense, fully connected layers of the classification head, instead of fine-tuning every weight. This trains much faster, usually at some cost in accuracy.

With the HuggingFace interface (and with PyTorch in general) this is really simple to do: we set the requires_grad attribute of the base model's parameters to False and leave the rest intact. Here the base model refers to the BERT/DistilBERT encoder, which is exposed as model.base_model.

In [None]:
freeze_BERT = False

if freeze_BERT:
  for parameter in model.base_model.parameters():
    parameter.requires_grad = False
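To get a feel for how much freezing saves, the arithmetic below estimates how many parameters would remain trainable with DistilBERT. It is a back-of-the-envelope sketch that assumes the classification head consists of the two layers reported as newly initialized (pre_classifier mapping 768 to 768, classifier mapping 768 to 2) and takes the total from the "Number of trainable parameters" line of the training log.

```python
hidden_size = 768  # width of the Linear layers in the model printout above
num_labels = 2


def linear_params(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


# Assumed head layout: pre_classifier (hidden -> hidden), classifier (hidden -> labels).
head_params = linear_params(hidden_size, hidden_size) + linear_params(hidden_size, num_labels)
total_params = 65_783_042  # reported by the training log with nothing frozen
frozen_params = total_params - head_params

print(head_params, frozen_params, f"{head_params / total_params:.2%}")
# → 592130 65190912 0.90%
```

Under these assumptions, setting freeze_BERT = True would leave under 1% of the weights trainable.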

## Building the model Trainer

Below we are calling the Trainer class so that we can train our model. We pass the general parameters to the trainer, like the model, the training arguments, the training and validation data, and the tokenizer/metrics.

In [23]:
trainer = transformers.Trainer(
    model, 
    training_arguments,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2490
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 780
  Number of trainable parameters = 65783042
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.




## Computing metrics

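Below we run the trained model on all the splits (Train, Validation and Test). The Validation metrics should match the ones reported at the end of training. Note that trainer.predict prefixes every metric with "test_" regardless of the split, so the test_accuracy shown for the Train split below is the training accuracy. The GLUE Test split only has placeholder labels (-1, as we saw earlier), so accuracy and F1 cannot be measured on it locally; for that split only the predictions themselves are meaningful.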

In [None]:
predictions_train      = trainer.predict(tokenized_dataset['train'])
predictions_validation = trainer.predict(tokenized_dataset['validation'])

In [None]:
predictions_test = trainer.predict(tokenized_dataset['test'])

In [28]:
predictions_train

PredictionOutput(predictions=array([[-2.728595 ,  2.7723818],
       [ 3.1665783, -3.6411455],
       [ 2.714723 , -3.2658825],
       ...,
       [-3.4910307,  3.5919921],
       [ 3.1877217, -3.7353778],
       [-3.4024603,  3.6397007]], dtype=float32), label_ids=array([1, 0, 0, ..., 1, 0, 1]), metrics={'test_loss': 0.01936996355652809, 'test_accuracy': 0.9963855421686747, 'test_f1': 0.9963753523962948, 'test_runtime': 16.2505, 'test_samples_per_second': 153.226, 'test_steps_per_second': 2.4})
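The predictions array holds raw logits, one row per example and one column per class. A minimal sketch of turning them into class names and probabilities with a softmax follows; it uses the first three rows printed above and the label order from the dataset features (index 0 is entailment, index 1 is not_entailment).

```python
import math

label_names = ['entailment', 'not_entailment']  # order from the dataset features


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# First three rows of the logits printed above.
logits = [[-2.728595, 2.7723818], [3.1665783, -3.6411455], [2.714723, -3.2658825]]

predicted = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=lambda i: probs[i])
    predicted.append(best)
    print(label_names[best], round(probs[best], 4))

print(predicted)  # → [1, 0, 0]
```

The predicted classes agree with the first three entries of label_ids in the output above.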

In [54]:
batch

{'input_ids': tensor([[  101,  2268,  1403,  ...,     0,     0,     0],
        [  101, 14593,  4233,  ...,     0,     0,     0],
        [  101,   138, 26826,  ...,     0,     0,     0],
        ...,
        [  101,  1109, 16289,  ...,     0,     0,     0],
        [  101,  1109,  4301,  ...,     0,     0,     0],
        [  101,  3985,   155,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])}

In [42]:
# AutoModel would load only the DistilBERT encoder and drop the fine-tuned classification head,
# so the checkpoint is loaded with the sequence-classification class instead.
model = transformers.AutoModelForSequenceClassification.from_pretrained("/content/whatever_dir_path/checkpoint-156")


In [None]:
# Grab a single batch from the test dataloader.
batch = next(iter(trainer.get_test_dataloader(tokenized_dataset['test'])))

output = trainer.model(**batch)

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence2, sentence1. If idx, sentence2, sentence1 are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.


In [37]:
output

SequenceClassifierOutput(loss=tensor(1.7241, grad_fn=<NllLossBackward0>), logits=tensor([[ 2.4714, -2.6001],
        [-3.2649,  3.3750],
        [-1.5217,  1.6294],
        [-2.5806,  2.9048],
        [ 0.6705, -0.9930],
        [ 2.7955, -3.3395],
        [-2.9828,  3.4178],
        [ 2.7133, -3.2335],
        [ 1.9685, -2.0542],
        [ 3.1729, -3.8667],
        [ 0.3388, -0.8330],
        [ 2.1602, -2.5382],
        [-2.3998,  2.5379],
        [ 2.6425, -3.1924],
        [-3.5329,  3.7577],
        [-1.2877,  1.4007],
        [ 2.7111, -3.1779],
        [-3.4030,  3.4304],
        [ 2.1358, -2.6324],
        [-3.2666,  3.4903],
        [ 3.0442, -3.5383],
        [ 2.0039, -2.0741],
        [-1.2839,  1.3047],
        [ 3.2541, -3.8351],
        [-2.9732,  3.2110],
        [-3.2275,  3.3397],
        [-3.4710,  3.7037],
        [-0.5190,  0.4237],
        [-3.4174,  3.6735],
        [-3.1052,  3.3682],
        [-1.1436,  1.2350],
        [-2.6938,  2.9729],
        [ 2.0424, -2.22