<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti107/blob/master/session-8/bert-finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/></a>

# Fine-tuning BERT for Text Classification

One way to use BERT for a downstream task such as text classification is to fine-tune the pretrained model.

In this lab, we will take a pretrained DistilBERT model and fine-tune it on custom training data for a text classification task.

At the end of this session, you will be able to:
- prepare data and use the model-specific tokenizer to format it for the model
- configure the transformer model for fine-tuning
- train the model for binary and multi-class text classification


In [1]:
import numpy as np
import tensorflow as tf
import pandas as pd

from transformers import (
    AutoConfig,
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
    TFTrainer,
    TFTrainingArguments,
)
from transformers.utils import logging as hf_logging
from sklearn.model_selection import train_test_split

# Set the logging level to INFO and use the default log handler and log formatting
hf_logging.set_verbosity_info()
hf_logging.enable_default_handler()
hf_logging.enable_explicit_format()

## Data Preparation

In [2]:
# Run the following if you have not downloaded the datasets yet.

!wget https://sdaai-bucket.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv
!wget https://sdaai-bucket.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv

--2020-12-17 10:20:36--  https://sdaai-bucket.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv
Resolving sdaai-bucket.s3-ap-southeast-1.amazonaws.com... 52.219.36.219
Connecting to sdaai-bucket.s3-ap-southeast-1.amazonaws.com|52.219.36.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13308773 (13M) [text/csv]
Saving to: 'imdb_test.csv'


2020-12-17 10:20:37 (30.2 MB/s) - 'imdb_test.csv' saved [13308773/13308773]

--2020-12-17 10:20:37--  https://sdaai-bucket.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv
Resolving sdaai-bucket.s3-ap-southeast-1.amazonaws.com... 52.219.36.219
Connecting to sdaai-bucket.s3-ap-southeast-1.amazonaws.com|52.219.36.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52953551 (50M) [text/csv]
Saving to: 'imdb_train.csv'


2020-12-17 10:20:38 (41.5 MB/s) - 'imdb_train.csv' saved [52953551/52953551]



In [3]:
train_df = pd.read_csv('imdb_train.csv')
test_df = pd.read_csv('imdb_test.csv')

In [4]:
TRAIN_SIZE = 2500
TEST_SIZE = 200 

train_df = train_df.sample(n=TRAIN_SIZE)
test_df = test_df.sample(n=TEST_SIZE)

In [5]:
train_df['sentiment'] =  train_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
test_df['sentiment'] =  test_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)

In [6]:
train_df.sentiment.value_counts()

0    1278
1    1222
Name: sentiment, dtype: int64

In [7]:
train_texts = train_df['review']
train_labels = train_df['sentiment']
test_texts = test_df['review']
test_labels = test_df['sentiment']

In [8]:
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

## Tokenization

We will now load the DistilBERT tokenizer for the pretrained model "distilbert-base-cased". The tokenizer produces input in the form the model expects: for example, it automatically adds the \[CLS\] token at the front of the sentence and the \[SEP\] token at the end, and it builds the attention mask that marks the padded positions in the input sequence of tokens.

In [9]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
#tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

[INFO|configuration_utils.py:413] 2020-12-17 10:28:54,727 >> loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /Users/markk/.cache/torch/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
[INFO|configuration_utils.py:449] 2020-12-17 10:28:54,729 >> Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 28996
}

[INFO|tokenization_utils_base.py:1650] 2020-12-17 10:28:55,647 >> loading file https://huggingface.co/bert-base-cased/resolve/main/vocab.txt from cache at /

The DistilBERT tokenizer (identical to the BERT tokenizer) uses a WordPiece vocabulary of close to 30,000 tokens (whole words and subword pieces), each of which has a pretrained embedding. Every token has its own ID, so the tokens must be mapped to those IDs before they are passed to the model.

In [10]:
print(f"Tokenizer vocab size = {tokenizer.vocab_size}")
print(list(tokenizer.vocab.keys())[6000:6020])

Tokenizer vocab size = 28996
['voices', 'shopping', '1891', 'Neil', 'discovery', '##vo', '##ations', 'burst', 'Baby', 'peaked', 'Brooklyn', 'knocked', 'lift', '##try', 'false', 'nations', 'Hugh', 'Catherine', 'preserved', 'distinguished']


Let us take a closer look at the output of the tokenization process. 

The tokenizer returns a dictionary with two items, 'input_ids' and 'attention_mask'. 'input_ids' contains the IDs of the tokens, while 'attention_mask' marks which positions hold real tokens and which hold padding. If you use the BERT tokenizer, there will be an additional item called 'token_type_ids'.

For the example sentence, the word 'Transformer' is broken up into two tokens, 'Trans' and '##former'. Similarly, 'Processing' is tokenized as 'Process' and '##ing'. The '##' prefix means that the token continues the previous one, with no space in between.

We also see that the tokenizer added \[CLS\] at the beginning of the token sequence and \[SEP\] at the end.

In [11]:
test_sentence = "Transformer is really good for Natural Language Processing."

encoding = tokenizer(test_sentence, padding=True, truncation=True)
print(f"Encoding keys:  {encoding.keys()}\n")

print(f"token ids: {encoding['input_ids']}\n")

print(f"tokens: {tokenizer.convert_ids_to_tokens(encoding['input_ids'])}")


Encoding keys:  dict_keys(['input_ids', 'attention_mask'])

token ids: [101, 13809, 23763, 1110, 1541, 1363, 1111, 6240, 6828, 18821, 1158, 119, 102]

tokens: ['[CLS]', 'Trans', '##former', 'is', 'really', 'good', 'for', 'Natural', 'Language', 'Process', '##ing', '.', '[SEP]']
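
The 'Trans' / '##former' split in the output above can be reproduced with a greedy longest-match-first search over the vocabulary. The following is only a minimal sketch of that idea, not the library's implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for illustration, and the real tokenizer also handles casing, punctuation and other details.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into vocabulary pieces, always taking the longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the '##' prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than the real 28996-entry one.
toy_vocab = {"Trans", "##former", "Process", "##ing", "is", "good"}

print(wordpiece_tokenize("Transformer", toy_vocab))  # ['Trans', '##former']
print(wordpiece_tokenize("Processing", toy_vocab))   # ['Process', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```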


Now let's go ahead and tokenize our texts. Before we do so, we need to convert each pandas Series to a list, as the tokenizer cannot work with a pandas Series or DataFrame directly.

In [12]:
train_texts = train_texts.to_list()
train_labels = train_labels.to_list()
val_texts = val_texts.to_list()
val_labels = val_labels.to_list()
test_texts = test_texts.to_list()
test_labels = test_labels.to_list()

In [13]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)
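
To make the effect of `padding=True` and `truncation=True` more concrete, here is a rough pure-Python sketch of what happens to a batch of token-ID lists. The helper `pad_and_mask` is hypothetical and much simpler than the tokenizer itself. It assumes a pad ID of 0 and a maximum length of 512, matching `pad_token_id` and `max_position_embeddings` in the config printed earlier, and the IDs 101 and 102 are the \[CLS\] and \[SEP\] IDs seen in the earlier output.

```python
def pad_and_mask(batch_ids, pad_id=0, max_length=512):
    """Truncate and pad lists of token IDs, and build the matching attention mask."""
    truncated = []
    for ids in batch_ids:
        if len(ids) > max_length:
            # Cut the sequence but keep its final ID (the [SEP] token) in place.
            ids = ids[:max_length - 1] + [ids[-1]]
        truncated.append(ids)

    longest = max(len(ids) for ids in truncated)
    # Pad every sequence to the longest one in the batch.
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks a padded position.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_mask([[101, 1110, 1363, 102], [101, 1363, 102]])
print(batch["input_ids"])       # [[101, 1110, 1363, 102], [101, 1363, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because every review is padded to the longest one in its split, very long reviews make all sequences in that split long, which is one reason truncation matters for memory use.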

We then create `tf.data.Dataset` objects from the encodings and the labels.

In [14]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

In [22]:
type(train_encodings)

transformers.tokenization_utils_base.BatchEncoding

In [25]:
train_texts[0]
train_labels[0]

1

## Fine-tuning the model

Now let us fine-tune our pre-trained model by training it with our custom dataset.  

We first instantiate a DistilBERT config object and customise it to suit our needs. In our case, we only specify *num_labels*, which tells the model how many labels to use in the last (classification) layer. You only need to specify this if you are doing multi-class classification.

In [26]:
config = AutoConfig.from_pretrained("distilbert-base-cased", 
                                    num_labels=2)

[INFO|configuration_utils.py:413] 2020-12-17 10:44:28,802 >> loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /Users/markk/.cache/torch/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
[INFO|configuration_utils.py:449] 2020-12-17 10:44:28,803 >> Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 28996
}



We then instantiate a DistilBERT model using this config object. If the config object is not passed, the default is binary classification. The model is a `tf.keras.Model` subclass, so you can train it with Keras APIs such as `fit()`, or use TensorFlow custom training loops if you want more control over the training. The transformers library, however, provides a Trainer class that abstracts away the complex training loop and supports distributed training on multi-GPU systems. We will use this to train our model.

To use the Trainer class, we need to set up training arguments such as the number of epochs, batch sizes, warmup steps (commonly used when training Transformer models), weight decay (applied by the Adam optimizer for regularization), learning rate, etc.

In [27]:
training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    evaluate_during_training=True
)
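
To get a feel for what `warmup_steps` does, the sketch below models a plain linear warmup, where the learning rate grows from zero to its peak over the warmup period. This is an illustration only: the function `linear_warmup_lr` is hypothetical, the peak value of 5e-5 is an assumed learning rate, and the schedule actually used by the Trainer may differ in its details.

```python
def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=500):
    """Learning rate after `step` optimizer updates under a plain linear warmup."""
    return peak_lr * min(1.0, step / warmup_steps)


# 2500 sampled reviews minus the 20% validation split leaves 2000 training examples.
steps_per_epoch = 2000 // 16  # batch size 16
print(steps_per_epoch)  # 125

# Fraction of the (assumed) peak learning rate reached after one epoch.
fraction = linear_warmup_lr(steps_per_epoch) / 5e-5
print(f"{fraction:.0%}")  # 25%
```

Under these assumptions, a single epoch of 125 steps ends well before a 500-step warmup completes, so it may be worth experimenting with a smaller `warmup_steps` or more epochs.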



In [28]:
# Create the model inside the strategy scope so that it also works for distributed training on multi-GPU systems

with training_args.strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-cased",
        config=config)


[INFO|training_args_tf.py:125] 2020-12-17 10:46:34,179 >> Tensorflow: setting up strategy
[INFO|modeling_tf_utils.py:689] 2020-12-17 10:46:35,097 >> loading weights file https://huggingface.co/distilbert-base-cased/resolve/main/tf_model.h5 from cache at /Users/markk/.cache/torch/transformers/fe773335fbb46b412a9093627b6c3235a69c55bad3bd1deee40813cd0a8d0a82.33c483181ffc4c7cbdd0b733245bcc9b479f14f3b2e892f635fe03f4f3a41495.h5
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
You should probably TRAIN this model on a down-stream task to be 

We then define a function `compute_metrics()` that will be used to compute metrics at evaluation. It takes an `EvalPrediction` and returns a dictionary mapping metric names (strings) to metric values. In our case, we just return the accuracy.

In [29]:
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"acc": (preds == p.label_ids).mean()}
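
The function can be checked in isolation before handing it to the trainer. The sketch below repeats `compute_metrics()` and calls it with a small hand-made object. `FakeEvalPrediction` is a hypothetical stand-in that only mimics the two attributes the function reads (`predictions` and `label_ids`), and the logits and labels are invented for the example.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in exposing the two attributes that compute_metrics() uses.
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])


def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"acc": (preds == p.label_ids).mean()}


# Made-up logits for four examples and two classes.
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 0, 0])

result = compute_metrics(FakeEvalPrediction(logits, labels))
# Predicted classes are [0, 1, 1, 0], so three of four match: accuracy 0.75.
print(result["acc"])
```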

In [30]:
# We define a tensorboard writer 
writer = tf.summary.create_file_writer("tblogs")

trainer = TFTrainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    compute_metrics = compute_metrics,
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    tb_writer=writer
)

[INFO|trainer_tf.py:117] 2020-12-17 10:46:48,393 >> You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
[INFO|trainer_tf.py:125] 2020-12-17 10:46:48,394 >> To use comet_ml logging, run `pip/conda install comet_ml` see https://www.comet.ml/docs/python-sdk/huggingface/


We start the training and then run the evaluation. On a single-GPU system, the training takes around 6-7 minutes to complete.

In [31]:
trainer.train()


[INFO|trainer_tf.py:546] 2020-12-17 10:46:51,625 >> ***** Running training *****
[INFO|trainer_tf.py:547] 2020-12-17 10:46:51,626 >>   Num examples = 2000
[INFO|trainer_tf.py:549] 2020-12-17 10:46:51,626 >>   Num Epochs = 1
[INFO|trainer_tf.py:550] 2020-12-17 10:46:51,627 >>   Instantaneous batch size per device = 16
[INFO|trainer_tf.py:551] 2020-12-17 10:46:51,628 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer_tf.py:554] 2020-12-17 10:46:51,629 >>   Gradient Accumulation steps = 1
[INFO|trainer_tf.py:555] 2020-12-17 10:46:51,630 >>   Steps per epoch = 125
[INFO|trainer_tf.py:556] 2020-12-17 10:46:51,630 >>   Total optimization steps = 125
[INFO|trainer_tf.py:320] 2020-12-17 10:49:40,118 >> ***** Running Evaluation *****
[INFO|trainer_tf.py:321] 2020-12-17 10:49:40,125 >>   Num examples = 500
[INFO|trainer_tf.py:322] 2020-12-17 10:49:40,126 >>   Batch size = 64
[INFO|trainer_tf.py:422] 2020-12-17 10:52:31,955 >> {'eval_loss': 0.7798129320144653

In [None]:
trainer.evaluate()

Let's see how it performs on our test set. 

In [None]:
preds = trainer.predict(test_dataset)

The output from predict is logits, so we use a softmax to turn the values into probabilities and then use np.argmax to select the label with the largest probability.

In [None]:
tf_predictions = tf.nn.softmax(preds.predictions, axis=-1)

In [None]:
y_preds = np.argmax(tf_predictions, axis=-1)
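
The two steps above can be illustrated with NumPy alone. The sketch below uses made-up logits and a hand-written `softmax` helper (not the TensorFlow function used in the cells above). It also shows that, because softmax preserves the ordering of the values in each row, taking the argmax of the raw logits gives the same labels, so the softmax is only needed when you want the probabilities themselves.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


# Made-up logits for three examples and two classes.
example_logits = np.array([[1.5, -0.5], [-2.0, 0.3], [0.2, 0.1]])

probs = softmax(example_logits)
labels_from_probs = np.argmax(probs, axis=-1)
labels_from_logits = np.argmax(example_logits, axis=-1)

print(probs.sum(axis=-1))   # each row sums to 1 (up to rounding)
print(labels_from_probs)    # [0 1 0]
print(labels_from_logits)   # [0 1 0]
```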

In [None]:
from sklearn.metrics import classification_report

print(classification_report(preds.label_ids, y_preds))

## Try out the model

Now let's try out our model with our own sentence. 

In [None]:
test_sentence = "I don't see how people can sit through this hour-long movie!"
#test_sentence = "This movie is in every sense flawless."
inputs = tokenizer(test_sentence, return_tensors="tf")
#print(inputs)
# The first element of the model output holds the logits, with shape (1, num_labels)
logits = model(inputs)[0]
# Index of the most probable class: 0 = negative, 1 = positive
print(np.argmax(tf.nn.softmax(logits, axis=-1)))

**Exercise:**

- You can try to use BERT base-cased pretrained model and see if you can get better performance. 
- Try to use BERT base-uncased pretrained model and see if you get better or worse performance.
- You can try using a larger number of training samples. 
- Try multi-class classification using this [dataset](https://sdaai-bucket.s3-ap-southeast-1.amazonaws.com/datasets/news.csv), which groups news titles into 4 categories: e (entertainment), b (business), t (tech), m (medical/health). The original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/News+Aggregator).
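
For the multi-class exercise, the category letters need to become integer labels, and *num_labels* needs to match the number of categories. The following is one possible starting point, not a full solution: the names `label2id` and `id2label` and the sample list `raw_categories` are illustrative, and the column names in the CSV file are not assumed here.

```python
# The four categories described in the exercise.
CATEGORIES = ["e", "b", "t", "m"]

# Map each category letter to an integer ID, and back again.
label2id = {category: idx for idx, category in enumerate(CATEGORIES)}
id2label = {idx: category for category, idx in label2id.items()}

# Made-up sample of category letters, standing in for the dataset's label column.
raw_categories = ["b", "t", "e", "m", "b"]
encoded_labels = [label2id[c] for c in raw_categories]

num_labels = len(label2id)  # value to pass as num_labels in the model config

print(encoded_labels)  # [1, 2, 0, 3, 1]
print(num_labels)      # 4
```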