<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti107/blob/main/session-8/bert-finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Fine-tuning BERT for Text Classification

One approach to using BERT for a downstream task such as text classification is to fine-tune the pretrained model.

In this lab, we will see how to take a pretrained DistilBERT model and fine-tune it on custom training data for a text classification task.

At the end of this session, you will be able to:
- prepare data and use the model-specific tokenizer to format it for use by the model
- configure the transformer model for fine-tuning 
- train the model for binary and multi-class text classification


In [3]:
!pip install transformers datasets

Collecting transformers
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting datasets
  Downloading datasets-1.16.1-py3-none-any.whl (298 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)

In [4]:
import numpy as np
import tensorflow as tf
import pandas as pd

from transformers import (
    AutoConfig,
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
    TFTrainer,
    TFTrainingArguments,
)
from transformers.utils import logging as hf_logging
from sklearn.model_selection import train_test_split

# Set the logging level to info and use the default log handler and log format
hf_logging.set_verbosity_info()
hf_logging.enable_default_handler()
hf_logging.enable_explicit_format()

## Data Preparation

In [1]:
# Run the following if you have not downloaded the datasets.

!wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv
!wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv

--2021-12-11 14:20:49--  https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv
Resolving nyp-aicourse.s3-ap-southeast-1.amazonaws.com (nyp-aicourse.s3-ap-southeast-1.amazonaws.com)... 52.219.40.187
Connecting to nyp-aicourse.s3-ap-southeast-1.amazonaws.com (nyp-aicourse.s3-ap-southeast-1.amazonaws.com)|52.219.40.187|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13308773 (13M) [text/csv]
Saving to: ‘imdb_test.csv’


2021-12-11 14:20:52 (5.94 MB/s) - ‘imdb_test.csv’ saved [13308773/13308773]

--2021-12-11 14:20:52--  https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv
Resolving nyp-aicourse.s3-ap-southeast-1.amazonaws.com (nyp-aicourse.s3-ap-southeast-1.amazonaws.com)... 52.219.40.187
Connecting to nyp-aicourse.s3-ap-southeast-1.amazonaws.com (nyp-aicourse.s3-ap-southeast-1.amazonaws.com)|52.219.40.187|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52953551 (50M) [text/csv]
Saving to: ‘i

In [7]:
from datasets import load_dataset

In [69]:
train_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv'
test_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv'
data_files = {"train": train_url, "test": test_url }

In [171]:
raw_dataset = load_dataset('csv', data_files=data_files)

Using custom data configuration default-94f6a67f06cce675
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-94f6a67f06cce675/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)


  0%|          | 0/2 [00:00<?, ?it/s]

In [172]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

[INFO|configuration_utils.py:604] 2021-12-11 16:37:50,254 >> loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
[INFO|configuration_utils.py:641] 2021-12-11 16:37:50,256 >> Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-cased",
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.13.0",
  "vocab_size": 28996
}

[INFO|tokenization_utils_base.py:1742] 2021-12-11 16:37:52,693 >> loading fil

In [173]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 40000
    })
    test: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 10000
    })
})

In [174]:
from datasets import ClassLabel

labels = ClassLabel(num_classes=2, names=['negative', 'positive'])

In [175]:
labels.str2int('positive')

1

In [176]:
def tokenize_function(sample):
    data_dict = tokenizer(sample['review'], truncation=True)
    data_dict['labels'] = labels.str2int(sample['sentiment'])
    return data_dict

In [177]:
tokenized_trainset  = raw_dataset['train'].map(tokenize_function, batched=True)
tokenized_testset  = raw_dataset['test'].map(tokenize_function, batched=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-94f6a67f06cce675/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-6c57e1926d2ee9e2.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-94f6a67f06cce675/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-6c57e1926d2ee9e2.arrow


In [178]:
split_tokenized_dataset = tokenized_trainset.train_test_split(train_size=0.8, seed=42)

Loading cached split indices for dataset at /root/.cache/huggingface/datasets/csv/default-94f6a67f06cce675/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-874161721109832e.arrow and /root/.cache/huggingface/datasets/csv/default-94f6a67f06cce675/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-e03ac1433c53611d.arrow


In [179]:
split_tokenized_dataset["validation"] = split_tokenized_dataset.pop("test")

In [180]:
split_tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'review', 'sentiment'],
        num_rows: 32000
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'review', 'sentiment'],
        num_rows: 8000
    })
})

In [181]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

In [182]:
batch_size = 16

tf_train_dataset = split_tokenized_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)

tf_validation_dataset = split_tokenized_dataset["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=batch_size,
)

tf_test_dataset = tokenized_testset.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=batch_size,
)

In [184]:
tf_train_dataset = tf_train_dataset.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
tf_validation_dataset = tf_validation_dataset.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
tf_test_dataset = tf_test_dataset.cache().prefetch(buffer_size=tf.data.AUTOTUNE)

In [185]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

[INFO|configuration_utils.py:604] 2021-12-11 16:40:21,465 >> loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
[INFO|configuration_utils.py:641] 2021-12-11 16:40:21,468 >> Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-cased",
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.13.0",
  "vocab_size": 28996
}

[INFO|modeling_tf_utils.py:1521] 2021-12-11 16:40:21,880 >> loading weights f

In [186]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay


num_epochs = 1
# tf_train_dataset is already batched, so its length is the number of batches per epoch
# (number of samples divided by the batch size). Multiplying by the number of epochs
# gives the total number of training steps.
num_train_steps = len(tf_train_dataset) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)
from tensorflow.keras.optimizers import Adam

opt = Adam(learning_rate=lr_scheduler)
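
To make the schedule above concrete, here is a minimal pure-Python sketch of the polynomial decay formula with its default power of 1 (i.e. linear decay). The helper name `linear_decay` is made up for illustration; the step count assumes the 32,000 training rows and batch size of 16 used in this lab.

```python
def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=2000, power=1.0):
    """Learning rate at `step` for a polynomial decay schedule (power=1 is linear)."""
    step = min(step, decay_steps)              # the rate stays at end_lr after decay_steps
    fraction_left = 1.0 - step / decay_steps   # share of the decay still to come
    return (initial_lr - end_lr) * fraction_left ** power + end_lr


# 32,000 training rows with a batch size of 16 gives 2,000 batches per epoch.
steps_per_epoch = 32000 // 16
num_train_steps = steps_per_epoch * 1          # one epoch

for step in (0, 500, 1000, 2000):
    print(step, linear_decay(step, decay_steps=num_train_steps))
```

The learning rate starts at 5e-5, is halved at the midpoint of training, and reaches 0 at the last step.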

In [187]:
import tensorflow as tf

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

[INFO|configuration_utils.py:604] 2021-12-11 16:40:24,863 >> loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
[INFO|configuration_utils.py:641] 2021-12-11 16:40:24,867 >> Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-cased",
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.13.0",
  "vocab_size": 28996
}

[INFO|modeling_tf_utils.py:1521] 2021-12-11 16:40:25,288 >> loading weights f

In [188]:
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=1)

  86/2000 [>.............................] - ETA: 55:30 - loss: 0.4972 - accuracy: 0.7347

KeyboardInterrupt: ignored

In [189]:
model.evaluate(tf_test_dataset)



KeyboardInterrupt: ignored

In [None]:
preds = model.predict(tf_validation_dataset)["logits"]

## Tokenization

We will now load the DistilBERT tokenizer for the pretrained model "distilbert-base-cased". The tokenizer produces inputs in the form the model expects: it automatically adds the \[CLS\] token at the start of the sentence and the \[SEP\] token at the end, and it generates the attention mask that marks the padded positions in the input sequence.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
#tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

The DistilBERT tokenizer (identical to the BERT tokenizer) uses a WordPiece vocabulary of close to 30,000 entries (28,996 for `distilbert-base-cased`). Each entry has its own ID, and the model holds a pretrained embedding for each ID, so we need to map the tokens to those IDs.

In [None]:
print(f"Tokenizer vocab size = {tokenizer.vocab_size}")
print(list(tokenizer.vocab.keys())[6000:6020])

Let us take a closer look at the output of the tokenization process. 

We notice that the tokenizer returns a dictionary with two items, 'input_ids' and 'attention_mask'. 'input_ids' contains the IDs of the tokens, while 'attention_mask' marks which positions are real tokens (1) and which are padding (0). If you are using the BERT tokenizer, there will be an additional item called 'token_type_ids'.

We also notice that for the example sentence, the word 'Transformer' is broken up into two tokens, 'Trans' and '##former'. Similarly, 'Processing' is tokenized as 'Process' and '##ing'. The '##' prefix means that the token is a continuation and should be attached to the previous one.

We also see that the tokenizer added \[CLS\] at the beginning of the token sequence and \[SEP\] at the end.

In [None]:
test_sentence = "Transformer is really good for Natural Language Processing."

encoding = tokenizer(test_sentence, padding=True, truncation=True)
print(f"Encoding keys:  {encoding.keys()}\n")

print(f"token ids: {encoding['input_ids']}\n")

print(f"tokens: {tokenizer.convert_ids_to_tokens(encoding['input_ids'])}")
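
To illustrate how a word ends up as pieces such as 'Trans' and '##former', here is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece tokenization. The function `wordpiece` and the tiny `toy_vocab` are made up for illustration; the real tokenizer uses the full pretrained vocabulary and does additional text normalization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece-style tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                       # no piece matched: the word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, just large enough for the example sentence.
toy_vocab = {"Trans", "##former", "Process", "##ing", "is", "good"}

print(wordpiece("Transformer", toy_vocab))
print(wordpiece("Processing", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```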


Now let's go ahead and tokenize our texts. But before we do so, we need to convert the pandas series to list first as the tokenizer cannot work with pandas series or dataframe directly. 
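
The conversion that follows assumes `train_texts`, `val_texts`, `test_texts` and the matching label variables already exist as pandas Series. One way they might be built is sketched here, using toy DataFrames in place of `imdb_train.csv` and `imdb_test.csv`. The 80/20 split, the `label_map` dictionary and the toy rows are assumptions for illustration, not necessarily how the lab data was prepared.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the CSV files, with the same column names as the lab data.
train_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive", "negative"] * 5,
})
test_df = pd.DataFrame({
    "review": ["great film", "dull plot"],
    "sentiment": ["positive", "negative"],
})

# Assumed mapping from the sentiment strings to integer class labels.
label_map = {"negative": 0, "positive": 1}

# Hold out 20% of the training data for validation, keeping the class balance.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_df["review"],
    train_df["sentiment"].map(label_map),
    test_size=0.2,
    random_state=42,
    stratify=train_df["sentiment"],
)
test_texts = test_df["review"]
test_labels = test_df["sentiment"].map(label_map)

print(len(train_texts), len(val_texts), len(test_texts))
```

With the real data, the two DataFrames would come from `pd.read_csv("imdb_train.csv")` and `pd.read_csv("imdb_test.csv")` instead.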

In [None]:
train_texts = train_texts.to_list()
train_labels = train_labels.to_list()
val_texts = val_texts.to_list()
val_labels = val_labels.to_list()
test_texts = test_texts.to_list()
test_labels = test_labels.to_list()

In [None]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)
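
With `padding=True`, each tokenizer call pads every sequence to the length of the longest sequence in that call and sets the attention mask to 0 at the padded positions. The sketch below mimics that behaviour in plain Python. The function `pad_batch` and the token IDs are made up for illustration; only the padding ID of 0 is taken from the model config shown earlier.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad lists of token IDs to a common length and build the attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)              # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two already-tokenized "sentences" of different lengths (illustrative IDs only).
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The model ignores positions where the mask is 0, so padding does not change the prediction, but long padded sequences still cost compute.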

We then create a tf dataset using the encodings and the labels.

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

## Fine-tuning the model

Now let us fine-tune our pre-trained model by training it with our custom dataset.  

We first instantiate a DistilBERT config object and customise it to suit our needs. In our case, we will just specify *num_labels* to tell the model how many labels to use in the last layer (classification layer). Since the default is 2, you only need to specify it if you are doing multi-class classification.

In [None]:
# Use the same checkpoint as the tokenizer ('distilbert-base-cased') so that
# the token IDs match the model's vocabulary
config = AutoConfig.from_pretrained("distilbert-base-cased", 
                                    num_labels=2)

We then instantiate a DistilBERT model using this config object. If no config object is passed, the model defaults to binary classification. The model is a `tf.keras.Model` subclass, so you can train it with the Keras API (e.g. `fit()`), or write a custom TensorFlow training loop if you want more control over the training. The Transformers library also provides a Trainer class that abstracts away the training loop and supports distributed training on multi-GPU systems. We will use it to train our model.

To use the Trainer class, we need to set up the training arguments, such as the number of epochs, batch sizes, warmup steps (commonly used when training Transformer models), weight decay (used by the Adam optimizer for regularization), learning rate, etc.

In [None]:
training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log every 10 training steps
    evaluation_strategy="steps"      # evaluate periodically during training (older releases used evaluate_during_training=True)
)

In [None]:
# The model is created inside the strategy scope so that the same code
# also works for distributed training on a multi-GPU system

with training_args.strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-cased",
        config=config)


We then define a function `compute_metrics()` that will be used to compute metrics at evaluation. It takes an `EvalPrediction` and returns a dictionary mapping metric names to metric values. In our case, we just return the accuracy.

In [None]:
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"acc": (preds == p.label_ids).mean()}
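
To see what `compute_metrics()` returns, it can be called by hand on a small stand-in object. In the sketch below, `FakeEvalPrediction` is a hypothetical namedtuple with the same two fields the function reads (`predictions` and `label_ids`), and the logits and labels are made up.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in exposing the two fields that compute_metrics reads.
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])


def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)        # predicted class = index of the largest logit
    return {"acc": (preds == p.label_ids).mean()}   # fraction of correct predictions


p = FakeEvalPrediction(
    predictions=np.array([[1.5, -0.2], [0.1, 0.9], [2.0, 0.3], [-1.0, 0.4]]),
    label_ids=np.array([0, 1, 1, 1]),
)
print(compute_metrics(p))
```

Three of the four made-up predictions match their labels, so the accuracy is 0.75.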

In [None]:
# We define a tensorboard writer 
writer = tf.summary.create_file_writer("tblogs")

trainer = TFTrainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    compute_metrics = compute_metrics,
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    tb_writer=writer
)

We start the training and then run the evaluation. On a single-GPU system, the training will take around 6-7 minutes to complete.

In [None]:
trainer.train()


In [None]:
trainer.evaluate()

Let's see how it performs on our test set. 

In [None]:
preds = trainer.predict(test_dataset)

The output from predict is logits, so we use a softmax to turn the values into probabilities and then use np.argmax to select the label with the largest probability.

In [None]:
tf_predictions = tf.nn.softmax(preds.predictions, axis=-1)

In [None]:
y_preds = np.argmax(tf_predictions, axis=-1)
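
As a small NumPy illustration of the two steps above, the sketch below applies a softmax to some made-up logits and then takes the argmax. Because softmax preserves the ordering of the values in each row, taking the argmax of the raw logits gives the same labels; the softmax is only needed when the probabilities themselves are of interest.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for three examples and two classes.
logits = np.array([[2.0, -1.0], [0.3, 1.2], [-0.5, -0.1]])
probs = softmax(logits)

print(probs.sum(axis=-1))            # each row of probabilities sums to 1
print(np.argmax(probs, axis=-1))     # predicted labels from the probabilities
print(np.argmax(logits, axis=-1))    # the same labels, taken directly from the logits
```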

In [None]:
from sklearn.metrics import classification_report

print(classification_report(preds.label_ids, y_preds))

In [None]:
#model.save_pretrained('./save_model/')

## Try out the model

Now let's try out our model with our own sentence. 

In [None]:
# test_sentence = "I don't see how people can sit through this hour-long movie!"
test_sentence = "The movie, though flawed, is still interesting enough."
inputs = tokenizer(test_sentence, return_tensors="tf")
out = model(inputs)
# The model returns an output object; the raw scores are in out.logits
print(np.argmax(tf.nn.softmax(out.logits, axis=-1)))

**Exercise:**

- Try to use BERT base-cased pretrained model and see if you get better or worse performance.
- Try to use BERT base-uncased pretrained model and see if you get better or worse performance.
- Try using a larger number of training samples. 
- Try multi-class classification using this [dataset](https://sdaai-bucket.s3-ap-southeast-1.amazonaws.com/datasets/news.csv), which groups news titles into 4 categories: e (entertainment), b (business), t (tech), m (medical/health). The original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/News+Aggregator).