# Exercise: Fine-tuning transformers

In this exercise we will fine-tune the BERT transformer model on several text classification tasks.

**Note:** It is recommended to run this notebook in Google Colab using GPU acceleration (Runtime > Change runtime type > GPU).

We will use the `transformers` library from [HuggingFace](https://huggingface.co/), which provides a flexible, high-level API for applying state-of-the-art transformer models with minimal boilerplate code. Run the following cells to configure the GPU and install the required packages:

In [1]:
# Run at the beginning of the notebook to enable GPU memory growth
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:  # skip when no GPU is attached to the runtime
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
physical_devices

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [2]:
!pip install tensorflow_datasets==4.4.0



In [3]:
!pip install transformers



# Part 1: Microsoft Research Paraphrase Corpus

We will begin by fine-tuning BERT to classify sentence pairs from the Microsoft Research Paraphrase Corpus (MRPC) dataset, one of the standard [GLUE Benchmark](https://gluebenchmark.com/) tasks used to evaluate NLP models.

We can load this dataset conveniently using the TensorFlow Datasets API:

In [4]:
import tensorflow_datasets as tfds
mrpc_data, mrpc_info = tfds.load('glue/mrpc', with_info=True)

## Questions:
1. Examine `mrpc_info`. What is the size of the MRPC dataset?

In [5]:
mrpc_info

tfds.core.DatasetInfo(
    name='glue',
    full_name='glue/mrpc/2.0.0',
    description="""
    GLUE, the General Language Understanding Evaluation benchmark
    (https://gluebenchmark.com/) is a collection of resources for training,
    evaluating, and analyzing natural language understanding systems.
    """,
    config_description="""
    The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of
    sentence pairs automatically extracted from online news sources, with human annotations
    for whether the sentences in the pair are semantically equivalent.
    """,
    homepage='https://www.microsoft.com/en-us/download/details.aspx?id=52398',
    data_path='/root/tensorflow_datasets/glue/mrpc/2.0.0',
    download_size=1.43 MiB,
    dataset_size=1.74 MiB,
    features=FeaturesDict({
        'idx': tf.int32,
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'sentence1': Text(shape=(), dtype=tf.string),
        'sentence2': Text(shape=

In [6]:
print(f"The size of the MRPC dataset is {mrpc_info.dataset_size}")

The size of the MRPC dataset is 1.74 MiB


2. `mrpc_data['train']` and `mrpc_data['test']` are TensorFlow Dataset objects containing the train and test sets for MRPC. Use `mrpc_data['train'].take(6)` to view six samples from the train set. Hint: You can convert the object to a list with `list(...)`

In [7]:
list(mrpc_data['train'].take(6))

[{'idx': <tf.Tensor: shape=(), dtype=int32, numpy=1680>,
  'label': <tf.Tensor: shape=(), dtype=int64, numpy=0>,
  'sentence1': <tf.Tensor: shape=(), dtype=string, numpy=b'The identical rovers will act as robotic geologists , searching for evidence of past water .'>,
  'sentence2': <tf.Tensor: shape=(), dtype=string, numpy=b'The rovers act as robotic geologists , moving on six wheels .'>},
 {'idx': <tf.Tensor: shape=(), dtype=int32, numpy=1456>,
  'label': <tf.Tensor: shape=(), dtype=int64, numpy=0>,
  'sentence1': <tf.Tensor: shape=(), dtype=string, numpy=b"Less than 20 percent of Boise 's sales would come from making lumber and paper after the OfficeMax purchase is completed .">,
  'sentence2': <tf.Tensor: shape=(), dtype=string, numpy=b"Less than 20 percent of Boise 's sales would come from making lumber and paper after the OfficeMax purchase is complete , assuming those businesses aren 't sold .">},
 {'idx': <tf.Tensor: shape=(), dtype=int32, numpy=3017>,
  'label': <tf.Tensor: sha

3. What are the input features and target variable in MRPC? What does the target variable represent?

The input features are a pair of sentences, `sentence1` and `sentence2`.<br>
The target variable (`label`) represents whether the two sentences in the pair are semantically equivalent (1) or not (0).

We will now *tokenize* the input texts using WordPiece tokenization, so they can be used as input for BERT. We use the `BertTokenizer` provided by the `transformers` library:

In [8]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Let's examine the WordPiece tokenization of an arbitrary sentence:

In [9]:
x = tokenizer("Transformers are astoundingly useful.")
x

{'input_ids': [101, 19081, 2024, 2004, 24826, 15683, 2135, 6179, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## Questions:
4. What do `input_ids` represent? Use `tokenizer.convert_ids_to_tokens(...)` to confirm your answer.

It seems that `input_ids` are the vocabulary indices of the tokens in the sentence. Let's confirm that:

In [10]:
tokenizer.convert_ids_to_tokens(x['input_ids'])

['[CLS]',
 'transformers',
 'are',
 'as',
 '##tou',
 '##nding',
 '##ly',
 'useful',
 '.',
 '[SEP]']

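Indeed, `input_ids` are the vocabulary indices of the tokens in the sentence. Note that these are tokens rather than whole words: the list includes the special `[CLS]` and `[SEP]` tokens as well as subword pieces.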

5. What word from the input string was split into four "subword tokens" by the tokenizer?

The word 'astoundingly' was split into four "subword tokens":<br>
'as',<br>
'##tou',<br>
'##nding',<br>
'##ly',
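
To make the split above more concrete, here is a minimal sketch of greedy longest-match-first subword splitting, which is the general idea behind WordPiece tokenization. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and chosen only for illustration; the real `bert-base-uncased` vocabulary has roughly 30k entries, and the real tokenizer also handles lowercasing and punctuation.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of a single lowercase word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, not the real bert-base-uncased vocabulary.
toy_vocab = {"as", "##tou", "##nding", "##ly", "useful", "are", "transformers"}

print(wordpiece_tokenize("astoundingly", toy_vocab))  # ['as', '##tou', '##nding', '##ly']
print(wordpiece_tokenize("useful", toy_vocab))        # ['useful']
```

A word that is in the vocabulary stays whole, while a rarer word is broken into the longest pieces the vocabulary contains.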

For GLUE tasks like MRPC, `transformers` contains a useful function `glue_convert_examples_to_features` that uses a tokenizer to convert samples into numeric features that can be used to train deep learning models. Run the following to create this numeric dataset:

In [11]:
from transformers import glue_convert_examples_to_features

mrpc_data_train = glue_convert_examples_to_features(
    mrpc_data['train'], tokenizer, max_length=128, task='mrpc')
mrpc_data_test = glue_convert_examples_to_features(
    mrpc_data['test'], tokenizer, max_length=128, task='mrpc')
mrpc_data_train = mrpc_data_train.shuffle(100).batch(16).repeat(4)
mrpc_data_test = mrpc_data_test.shuffle(100).batch(16).repeat(4)
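
The `.batch(16).repeat(4)` calls determine how many batches the dataset can supply before it runs out. The following sketch works through that arithmetic in plain Python. The helper `count_batches` and the dataset size of 1000 are hypothetical, and the sketch assumes the final partial batch is kept, which is the default behaviour of `batch` in `tf.data`.

```python
import math

def count_batches(num_examples, batch_size, repeats):
    """Number of batches produced by .batch(batch_size).repeat(repeats).

    Assumes the final partial batch is kept (drop_remainder=False).
    """
    return math.ceil(num_examples / batch_size) * repeats

# Hypothetical dataset size, chosen only for illustration.
available = count_batches(num_examples=1000, batch_size=16, repeats=4)

# Batches consumed by a run with epochs=3 and steps_per_epoch=64.
needed = 3 * 64

print(available, needed, available >= needed)  # 252 192 True
```

If `needed` exceeded `available`, training would exhaust the dataset before the last epoch finished, which is one reason to repeat the data.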



## Questions:
6. Use the same method as in question 2 to examine one element from `mrpc_data_train`. What are the dimensions of the input and output features for each batch?

In [12]:
list(mrpc_data_train.take(1))

[({'attention_mask': <tf.Tensor: shape=(16, 128), dtype=int32, numpy=
   array([[1, 1, 1, ..., 0, 0, 0],
          [1, 1, 1, ..., 0, 0, 0],
          [1, 1, 1, ..., 0, 0, 0],
          ...,
          [1, 1, 1, ..., 0, 0, 0],
          [1, 1, 1, ..., 0, 0, 0],
          [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>,
   'input_ids': <tf.Tensor: shape=(16, 128), dtype=int32, numpy=
   array([[  101,  4319,  9722, ...,     0,     0,     0],
          [  101,  1037,  2976, ...,     0,     0,     0],
          [  101,  2023,  2095, ...,     0,     0,     0],
          ...,
          [  101,  1999,  1037, ...,     0,     0,     0],
          [  101,  2021,  2049, ...,     0,     0,     0],
          [  101,  1047, 15909, ...,     0,     0,     0]], dtype=int32)>,
   'token_type_ids': <tf.Tensor: shape=(16, 128), dtype=int32, numpy=
   array([[0, 0, 0, ..., 0, 0, 0],
          [0, 0, 0, ..., 0, 0, 0],
          [0, 0, 0, ..., 0, 0, 0],
          ...,
          [0, 0, 0, ..., 0, 0, 0],
          [0, 

Each input feature (`input_ids`, `attention_mask`, `token_type_ids`) has shape (16, 128) per batch: 16 samples, each with 128 tokens. The labels have shape (16,), one label per sample.
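
The `token_type_ids` feature marks which sentence of the pair each token belongs to. As a rough sketch of how BERT lays out a sentence pair, the hypothetical helper `encode_pair` below builds the `[CLS] A [SEP] B [SEP]` sequence and its segment ids from two already-tokenized sentences. It works on token strings rather than vocabulary ids and ignores padding, so it is an illustration rather than the library's implementation.

```python
def encode_pair(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = encode_pair(["the", "rovers", "act"], ["the", "rovers", "move"])
print(tokens)
print(token_type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```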

7. What did `max_length=128` do above?

`max_length=128` sets the length of every encoded sample to 128 tokens: inputs longer than 128 tokens are truncated, and shorter inputs are padded with zeros up to 128 tokens (with `attention_mask` set to 0 at the padded positions).
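
The effect of a fixed length can be sketched in a few lines of plain Python. The helper `pad_or_truncate` is hypothetical and simplified: it cuts the id list directly, whereas the real tokenizer truncates the text tokens first and still closes the sequence with `[SEP]`.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = input_ids[:max_length]      # drop tokens beyond max_length
    mask = [1] * len(ids)             # 1 marks real tokens
    padding = max_length - len(ids)   # number of pad positions to append
    return ids + [pad_id] * padding, mask + [0] * padding


# The ids produced earlier for "Transformers are astoundingly useful."
example_ids = [101, 19081, 2024, 2004, 24826, 15683, 2135, 6179, 1012, 102]

padded_ids, padded_mask = pad_or_truncate(example_ids, max_length=12)
print(padded_ids)   # original 10 ids followed by two 0s
print(padded_mask)  # ten 1s followed by two 0s

short_ids, short_mask = pad_or_truncate(example_ids, max_length=4)
print(short_ids)    # [101, 19081, 2024, 2004]
```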

We can now build our BERT-based model by using the `TFBertForSequenceClassification` model from `transformers` (TF stands for "TensorFlow"): 

In [13]:
from transformers import TFBertForSequenceClassification
mrpc_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Questions
8. Compile `mrpc_model` with the following arguments:
  * Adam optimizer with learning rate `3e-5`
  * Sparse Categorical Crossentropy loss function with `from_logits=True`
  * `metrics='accuracy'`

In [14]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

mrpc_model.compile(
    optimizer=Adam(learning_rate=5e-5),  # note: the exercise asks for 3e-5; the run reported below used 5e-5
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics='accuracy',
)
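
To illustrate what `from_logits=True` means, here is a minimal plain-Python sketch of the loss for a single example: the model outputs raw, unnormalized scores (logits), and the loss applies softmax internally before taking the negative log-probability of the true class. The function name and the example logits are hypothetical, and Keras additionally averages this value over the batch.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical logits for the two classes of one sentence pair.
loss_if_label_0 = sparse_categorical_crossentropy_from_logits([2.0, -1.0], label=0)
loss_if_label_1 = sparse_categorical_crossentropy_from_logits([2.0, -1.0], label=1)
print(loss_if_label_0)  # small: the model favours class 0
print(loss_if_label_1)  # larger: the model is confidently wrong
```

"Sparse" refers to the labels being integer class indices (0 or 1) rather than one-hot vectors.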

9. Train `mrpc_model` on the MRPC training data, using `mrpc_data_test` as validation data. What is the best validation accuracy you managed to achieve? **Hint:** Recommended training settings are `epochs=3, steps_per_epoch=64, validation_steps=16`

In [15]:
mrpc_model.fit(x=mrpc_data_train, epochs=3, steps_per_epoch=64, validation_steps=16, validation_data=mrpc_data_test)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f1e90e3f510>

The best validation accuracy was achieved at epoch 1, and it is 0.9102.



# Part 2: Movie Reviews

We will now fine-tune BERT on a new task that is not in the GLUE benchmark. The [IMDB Sentiment Analysis Dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) contains 50k texts of movie reviews classified as positive or negative.

Load the file `IMDB Dataset.csv` as a Pandas DataFrame and split into train and test sets. **Note:** If you are running this notebook in Google Colab you must first upload the file to the runtime by using the "Files" tab on the left-hand side panel.

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split

imdb_df = pd.read_csv('IMDB Dataset.csv.gz')
imdb_df_train, imdb_df_test = train_test_split(imdb_df, test_size=0.1, random_state=42)

## Questions:
10. How many samples are in the train and test sets? Is the dataset balanced?

In [17]:
print(f"There are {imdb_df_train.shape[0]} samples in the train set.")
print(f"There are {imdb_df_test.shape[0]} samples in the test set.")

There are 45000 samples in the train set.
There are 5000 samples in the test set.


In [18]:
# Let's check the balance of the dataset:
imdb_df_train.sentiment.value_counts()/imdb_df_train.shape[0]

negative    0.500422
positive    0.499578
Name: sentiment, dtype: float64

We can see that the dataset is balanced.

11. Print out one movie review and its label from the train set. Do you agree with the label?

In [19]:
print(imdb_df_train['review'].iloc[77])

I agree that this is ONE of the very best episodes of the entire series--my only detraction would be the somewhat jarring appearance of Mark Lenard as the Romulan Commander. My reasoning is this--if you were not around for the first run of this episode, then you know Mr. Lenard as Sarek, Spock's father. And for the 2nd generation Trekkie (or Trekker--your preference) it takes you out of the scene at first. Yet he's an excellent commander as well as opposite for our captain and this episode is strongly written and well-acted by all. There are excellent points made by both sides about the cost of war vs.the price of peace and certainly does remind one of some of the best of the WWII and later era movies. Those are not my favorite genre but I certainly would recommend a fan of such to view this episode through that filter. You'll see it holds up. I'll never understand why Sci-Fi gets so little respect--the best drama comes out of placing ordinary people in extraordinary circumstances.


In [20]:
print(imdb_df_train['sentiment'].iloc[77])

positive


I agree with the label. The review is clearly positive, as the writer states: "I agree that this is ONE of the very best episodes of the entire series".

We must now convert the dataset to numerical features so that we can use it to train our transformer model. The code below will convert the data to a TensorFlow Dataset object, but some lines are missing.

## Questions:
12. Fill in the marked line in the code below to set `label` equal to `1` if the review in the current row of the DataFrame is positive, and `0` if it is negative.

13. Fill in the marked line in the code below to set `tokenized` equal to the WordPiece tokenization of the movie review in `text`. Use parameters `max_length=128, padding='max_length', truncation=True`.

In [21]:
import tensorflow as tf
import numpy as np

def imdb_gen(df):
    def g():
        for row in df.itertuples():
            text = row.review
            label = (1 if row.sentiment=='positive' else 0) ## ANSWER TO QUESTION 12 HERE
            tokenized = tokenizer(text, max_length=128, padding='max_length', truncation=True) ## ANSWER TO QUESTION 13 HERE
            yield {k: np.array(tokenized[k]) for k in tokenized}, label
    return g

input_names = ['input_ids', 'token_type_ids', 'attention_mask']
data_types = ({k: tf.int32 for k in input_names}, tf.int64)
data_shapes = ({k: tf.TensorShape([None]) for k in input_names}, tf.TensorShape([]))

imdb_data_train = tf.data.Dataset.from_generator(
    imdb_gen(imdb_df_train),
    data_types, data_shapes
).shuffle(100).batch(32).repeat(4)

imdb_data_test = tf.data.Dataset.from_generator(
    imdb_gen(imdb_df_test),
    data_types, data_shapes
).shuffle(100).batch(32).repeat(4)
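
`tf.data.Dataset.from_generator` expects a callable that returns a fresh iterator each time it is called, which is why `imdb_gen` returns the inner function `g` rather than a generator object. The sketch below shows the difference in plain Python. The names `make_gen` and `Row` and the two sample rows are hypothetical stand-ins for `imdb_gen` and the DataFrame rows, and tokenization is left out.

```python
from collections import namedtuple

# Stand-in for the rows yielded by df.itertuples().
Row = namedtuple("Row", ["review", "sentiment"])

def make_gen(rows):
    def g():
        for row in rows:
            yield row.review, (1 if row.sentiment == "positive" else 0)
    return g  # return the function itself, not g()


rows = [Row("great film", "positive"), Row("dull plot", "negative")]
gen_fn = make_gen(rows)

first_pass = list(gen_fn())
second_pass = list(gen_fn())   # calling gen_fn again restarts from the first row
print(first_pass == second_pass)  # True

exhausted = gen_fn()
list(exhausted)                # consume the generator object once
print(list(exhausted))         # [] because a consumed generator yields nothing more
```

Being able to restart the iteration is what lets the dataset be read more than once, for example by `.repeat(4)`.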

14. Create a BERT-based classification model as in Part 1 (using the same optimizer, loss, and metric).

In [22]:
mrpc_model2 = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
mrpc_model2.compile(
    optimizer=Adam(learning_rate=5e-5),
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics='accuracy'
)

15. Train your model on `imdb_data_train`, using `imdb_data_test` as validation data. What is the best validation accuracy you managed to achieve? **Hint:** Recommended training settings are `epochs=10, steps_per_epoch=64, validation_steps=16`

In [24]:
mrpc_model2.fit(x=imdb_data_train, epochs=10, steps_per_epoch=64, validation_steps=16, validation_data=imdb_data_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f1db22f1790>

The best validation accuracy was achieved at epoch 7, and it is 0.8848.



# Part 3: Kaggle

We are now ready to apply the skills we have learned to the "real world". For this problem you may pick any [text classification dataset from Kaggle](https://www.kaggle.com/datasets?search=text+classification).

If you are unsure of what dataset to use, you may start with the [spam text classification dataset](https://www.kaggle.com/team-ai/spam-text-message-classification), given in the accompanying file `SPAM text message 20170820 - Data.csv`.

**Note:** If you are using Google Colab for this exercise, you may need to use a small dataset (up to roughly 20k samples) to avoid memory limitations.

## Questions:
16. Load your dataset into a Pandas DataFrame `kaggle_df`, and split into train and test sets with `train_test_split`. Print out the number of samples in the train and test sets, check if the dataset is balanced, and examine a few examples of samples to understand your dataset.

In [25]:
spam_df = pd.read_csv('SPAM text message 20170820 - Data.csv')
spam_df_train, spam_df_test = train_test_split(spam_df, test_size=0.1, random_state=42)
spam_df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |

In [26]:
print(f"There are {spam_df_train.shape[0]} samples in the train set.")
print(f"There are {spam_df_test.shape[0]} samples in the test set.")

There are 5014 samples in the train set.
There are 558 samples in the test set.


In [27]:
# Let's check the balance of the dataset:
spam_df_train.Category.value_counts()/spam_df_train.shape[0]

ham     0.865576
spam    0.134424
Name: Category, dtype: float64

We can see that the dataset is imbalanced.

Let's look at two messages, one labeled ham and the other labeled spam:

In [28]:
print(spam_df_train['Message'].iloc[7])

K sure am in my relatives home. Sms me de. Pls:-)


In [29]:
print(spam_df_train['Category'].iloc[7])

ham


In [30]:
print(spam_df_train['Message'].iloc[94])

HMV BONUS SPECIAL 500 pounds of genuine HMV vouchers to be won. Just answer 4 easy questions. Play Now! Send HMV to 86688 More info:www.100percent-real.com


In [31]:
print(spam_df_train['Category'].iloc[94])

spam


17. Convert your dataframes into TensorFlow datasets as we did above for the IMDB dataset (including shuffling, grouping into batches of size 32, etc. as above).

In [32]:
import tensorflow as tf
import numpy as np

def spam_gen(df):
    def g():
        for row in df.itertuples():
            text = row.Message
            label = (1 if row.Category=='spam' else 0) 
            tokenized = tokenizer(text, max_length=128, padding='max_length', truncation=True) 
            yield {k: np.array(tokenized[k]) for k in tokenized}, label
    return g

input_names = ['input_ids', 'token_type_ids', 'attention_mask']
data_types = ({k: tf.int32 for k in input_names}, tf.int64)
data_shapes = ({k: tf.TensorShape([None]) for k in input_names}, tf.TensorShape([]))

spam_data_train = tf.data.Dataset.from_generator(
    spam_gen(spam_df_train),
    data_types, data_shapes
).shuffle(100).batch(32).repeat(4)

spam_data_test = tf.data.Dataset.from_generator(
    spam_gen(spam_df_test),
    data_types, data_shapes
).shuffle(100).batch(32).repeat(4)

18. Train a BERT-based classification model on your data. What is the best result you can achieve? Note: If your data is imbalanced, keep in mind what baseline accuracy you would expect if the model was guessing randomly.

In [33]:
kaggle_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [34]:
kaggle_model.compile(
    optimizer=Adam(learning_rate=5e-5),
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics='accuracy'
)

In [35]:
kaggle_model.fit(x=spam_data_train, epochs=10, steps_per_epoch=64, validation_steps=16, validation_data=spam_data_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10






<keras.callbacks.History at 0x7f1dad1135d0>

I achieved a validation accuracy of 0.9941.
A baseline model that always predicts "not spam" would reach an accuracy of about 0.865, since that is the proportion of the majority class, but it would not detect a single spam message.
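
The point about the majority-class baseline can be checked with a small calculation. The sketch below uses hypothetical labels with roughly the class ratio reported above (1 = spam, 0 = ham) and a hypothetical helper `accuracy_and_spam_recall`. It is meant only to show why accuracy alone can be misleading on imbalanced data.

```python
def accuracy_and_spam_recall(labels, predictions):
    """Return (accuracy, recall on the spam class), where 1 = spam."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    spam_total = sum(1 for y in labels if y == 1)
    spam_found = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    recall = spam_found / spam_total if spam_total else 0.0
    return correct / len(labels), recall


# Hypothetical labels: 13 spam and 87 ham messages.
labels = [1] * 13 + [0] * 87
always_ham = [0] * len(labels)  # baseline that never predicts spam

accuracy, spam_recall = accuracy_and_spam_recall(labels, always_ham)
print(accuracy, spam_recall)  # 0.87 0.0
```

The baseline scores high accuracy while finding no spam at all, so a trained model should be judged against this number rather than against 0.5.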

19. Use your trained model to classify new text that you input yourself. See the commented code below for an example of how this could be done on the spam text classification dataset.

In [36]:
# Here is an example of classifying a spam message:
from scipy.special import softmax
tokenized = tokenizer("Make money while you are a sleep! Join our course and learn how to be a millionaire spending 5 minutes a day!")
logits = kaggle_model.predict({k: np.array(tokenized[k])[None] for k in input_names})[0]
scores = softmax(logits, axis=1)[:, 1]
# Labels were encoded as 1 = spam and 0 = ham, so column 1 of the softmax is the spam probability
print(1-scores[0], 'probability that input text is ham')
print(scores[0], 'probability that input text is spam')

0.9983913849573582 probability that input text is ham
0.001608615 probability that input text is spam
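
When reading the model's output, it helps to keep the label encoding explicit. In `spam_gen`, labels were encoded as 1 for spam and 0 for ham, so index 1 of the softmax output is the spam probability. Here is a minimal plain-Python sketch of that mapping. The `softmax` helper, the `ID_TO_LABEL` dictionary, and the example logits are hypothetical and not taken from the trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Matches the encoding label = 1 if Category == 'spam' else 0.
ID_TO_LABEL = {0: "ham", 1: "spam"}

logits = [3.2, -3.2]  # hypothetical logits for one message
probs = softmax(logits)
for class_id, p in enumerate(probs):
    print(f"{ID_TO_LABEL[class_id]}: {p:.4f}")

predicted = ID_TO_LABEL[probs.index(max(probs))]
print("predicted:", predicted)  # predicted: ham
```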
