<a href="https://colab.research.google.com/github/nyp-sit/iti107/blob/main/wip/bert-finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT for Text Classification

One way to use BERT for a downstream task such as text classification is to fine-tune the pretrained model.

In this lab, we will take a pretrained DistilBERT model and fine-tune it on custom training data for a text classification task.

At the end of this session, you will be able to:
- prepare data and use the model-specific tokenizer to format it for the model
- configure the transformer model for fine-tuning
- train the model for binary and multi-class text classification


## Install the Hugging Face Transformers library

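If you are running this notebook in Google Colab, you will need to install the Hugging Face Transformers library, as it is not part of the standard environment. The version is pinned because the notebook relies on `TFTrainer`, which later releases of the library deprecate.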

In [None]:
!pip install transformers==4.6.1

In [None]:
import numpy as np
import tensorflow as tf
import pandas as pd

from transformers import (
    AutoConfig,
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
    TFTrainer,
    TFTrainingArguments,
    TFDistilBertForSequenceClassification
)
from transformers.utils import logging as hf_logging
from sklearn.model_selection import train_test_split

# Set the logging level to INFO and use the default log handler and log format
hf_logging.set_verbosity_info()
hf_logging.enable_default_handler()
hf_logging.enable_explicit_format()

## Data Preparation

In [None]:
!wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv
!wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv

In [None]:
train_df = pd.read_csv('imdb_train.csv')
test_df = pd.read_csv('imdb_test.csv')

In [None]:
train_df.head()

The training set has 40000 samples. We will use only a small subset (e.g. 2000 samples) to fine-tune our pretrained model. Similarly, we will use a smaller test set to evaluate the model. We use the DataFrame's `sample()` method to randomly select a subset of samples.

In [None]:
TRAIN_SIZE = 2000
TEST_SIZE = 200 

train_df = train_df.sample(n=TRAIN_SIZE)
test_df = test_df.sample(n=TEST_SIZE)

We now convert the text labels into the numeric values 0 (negative) and 1 (positive).

In [None]:
train_df['sentiment'] =  train_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
test_df['sentiment'] =  test_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
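
The lambda above maps every value other than 'negative' to 1, which would silently include typos or missing values. A stricter variant is sketched below in plain Python. The `LABEL2ID` dictionary and the `encode_label` helper are hypothetical names introduced for illustration only; they are not part of the original notebook.

```python
# Hypothetical mapping and helper, for illustration only.
LABEL2ID = {"negative": 0, "positive": 1}


def encode_label(label):
    """Return the integer id for a sentiment label, failing loudly on unknown values."""
    try:
        return LABEL2ID[label]
    except KeyError:
        raise ValueError(f"Unexpected sentiment label: {label!r}") from None


labels = ["positive", "negative", "negative"]
encoded = [encode_label(label) for label in labels]
print(encoded)  # [1, 0, 0]
```

With pandas, the same dictionary could be passed to `Series.map()`, which yields NaN for values that are not in the mapping, so unexpected labels are easy to spot.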

In [None]:
train_df.sentiment.value_counts()

In [None]:
train_texts = train_df['review']
train_labels = train_df['sentiment']
test_texts = test_df['review']
test_labels = test_df['sentiment']

In [None]:
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

## Tokenization

We will now load the DistilBERT tokenizer for the pretrained model "distilbert-base-uncased". The tokenizer produces input tokens in the form the DistilBERT model expects: it automatically adds the \[CLS\] token at the start of the token sequence and the \[SEP\] token at the end, and it generates the attention mask that marks the padded positions in the input sequence.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

The DistilBERT tokenizer (identical to the BERT tokenizer) uses a WordPiece vocabulary of close to 30000 tokens, and the pretrained model has an embedding for each of them. Each token has its own id, so the tokenizer also maps tokens to those ids.

In [None]:
print(f"Tokenizer vocab size = {tokenizer.vocab_size}")
print(list(tokenizer.vocab.keys())[6000:6020])

Let us take a closer look at the output of the tokenization process. 

The tokenizer returns a dictionary with two items, 'input_ids' and 'attention_mask'. 'input_ids' contains the ids of the tokens, while 'attention_mask' marks real tokens with 1 and padded positions with 0. If you use the BERT tokenizer, there will be an additional item called 'token_type_ids'.

Words that are not in the vocabulary as a whole are broken up into sub-word tokens. For example, a tokenizer might split 'Transformer' into 'Trans' and '##former', and 'Processing' into 'Process' and '##ing'; the exact split you see depends on the vocabulary, and the uncased model lowercases the text first. The '##' prefix means that the token should be attached to the previous one.

We also see that the tokenizer added \[CLS\] to the beginning of the token sequence and \[SEP\] to the end.

In [None]:
test_sentence = "Transformer is really good for Natural Language Processing."

encoding = tokenizer(test_sentence, padding=True, truncation=True)
print(f"Encoding keys:  {encoding.keys()}\n")

print(f"token ids: {encoding['input_ids']}\n")
print(f"attention_mask: {encoding['attention_mask']}\n")
print(f"tokens: {tokenizer.convert_ids_to_tokens(encoding['input_ids'])}")
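
To make the '##' splitting more concrete, here is a minimal sketch of greedy longest-match-first sub-word lookup, which is the general idea behind WordPiece tokenization at inference time. The toy vocabulary and the helper names `wordpiece_tokenize` and `encode` are invented for this example; the real tokenizer has a much larger vocabulary and additional text normalization.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the previous piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Toy vocabulary, invented for this example.
toy_vocab = {"trans", "##former", "process", "##ing", "good"}

print(encode("Transformer processing good", toy_vocab))
# ['[CLS]', 'trans', '##former', 'process', '##ing', 'good', '[SEP]']
```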


Now let's go ahead and tokenize our texts. Before we do so, we need to convert each pandas Series to a list, as the tokenizer cannot work with a pandas Series or DataFrame directly.

In [None]:
train_texts = train_texts.to_list()
train_labels = train_labels.to_list()
val_texts = val_texts.to_list()
val_labels = val_labels.to_list()
test_texts = test_texts.to_list()
test_labels = test_labels.to_list()

In [None]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)
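
With `padding=True`, each batch is padded to its longest sequence, and with `truncation=True`, sequences longer than the model's maximum length are cut. The effect on 'input_ids' and 'attention_mask' can be pictured with the toy sketch below. The helper names, the token ids, and the small `max_length` are made up for illustration; the real tokenizer handles this internally.

```python
def truncate(ids, max_length):
    """Cut a sequence to max_length while keeping its final token (the [SEP] position)."""
    if len(ids) <= max_length:
        return list(ids)
    return ids[:max_length - 1] + ids[-1:]


def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate every sequence, then pad all of them to the longest one in the batch."""
    truncated = [truncate(ids, max_length) for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks a padded position.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences; 101 and 102 stand in for [CLS] and [SEP].
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
encoded_batch = pad_and_truncate(batch, max_length=4)

print(encoded_batch["input_ids"])       # [[101, 7, 8, 102], [101, 7, 102, 0]]
print(encoded_batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```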

We then create a TensorFlow dataset from the encodings and the labels.

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

## Fine-tuning the model

Now let us fine-tune the pre-trained model by training it with our custom dataset.  

We will instantiate the pretrained model 'distilbert-base-uncased' using `TFAutoModelForSequenceClassification`, passing `num_labels=2` to indicate that we want to train a 2-class (binary) classifier.

The model is a `tf.keras.Model` subclass, so you can train it with the Keras API such as `fit()`, or use a TensorFlow custom training loop if you want more control over the training. The Transformers library, however, provides a `TFTrainer` class that abstracts away the complex training loop and supports distributed training on multi-GPU systems. We will use it to train our model.

To use the trainer, we need to set up the training arguments such as the *number of epochs*, *batch size*, *warm-up steps* (commonly used when training Transformer models), *weight decay* (used by the Adam optimizer for regularization), *learning rate*, etc.

In [None]:
training_args = TFTrainingArguments("./my_train", 
                                    evaluation_strategy='steps',
                                    eval_steps=50,
                                    num_train_epochs=1,
                                    logging_steps=10)
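
The *warm-up steps* mentioned above can be pictured with a small sketch. It assumes a linear warm-up followed by a linear decay to zero, which is a common choice when fine-tuning Transformer models; the notebook itself does not state which schedule the trainer applies, so treat the function `lr_at_step` and the numbers below as illustrative assumptions.

```python
def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warm-up followed by linear decay (assumed schedule)."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up from 0 to base_lr over the warm-up phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr down to 0 at the last step.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


# Illustrative values only: 100 training steps, 10 of them warm-up.
for step in (0, 5, 10, 55, 100):
    print(step, lr_at_step(step, base_lr=5e-5, warmup_steps=10, total_steps=100))
```

The learning rate starts at zero, reaches its peak at the end of the warm-up phase, and then shrinks again, which avoids large updates to the pretrained weights in the first few steps.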

In [None]:
with training_args.strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",num_labels=2)

We then define a function `compute_metrics()` that will be used to compute metrics at evaluation. It takes in an `EvalPrediction` and returns a dictionary mapping metric names to values. In our case, we just return the accuracy.

In [None]:
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"acc": (preds == p.label_ids).mean()}
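
To see what this computes, here is a dependency-free sketch of the same calculation on hand-made logits. `FakeEvalPrediction` is a hypothetical stand-in for the object the trainer passes in, and `accuracy_from_logits` is an illustrative pure-Python rewrite, not the function used in the notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for the EvalPrediction object passed by the trainer.
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])


def accuracy_from_logits(p):
    """Pick the highest-scoring class per row and compare it with the true labels."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in p.predictions]
    correct = sum(int(pred == label) for pred, label in zip(preds, p.label_ids))
    return {"acc": correct / len(p.label_ids)}


fake = FakeEvalPrediction(
    predictions=[[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]],
    label_ids=[0, 1, 0, 0],
)
print(accuracy_from_logits(fake))  # {'acc': 0.75}
```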

In [None]:
# We define a tensorboard writer 
writer = tf.summary.create_file_writer("tblogs")

trainer = TFTrainer(
    model=model, 
    args=training_args, 
    compute_metrics=compute_metrics,
    train_dataset=train_dataset, 
    eval_dataset=val_dataset,
    tb_writer=writer
)

We start the training and then run the evaluation. On a single-GPU system, the training will take around 6-7 minutes to complete.

In [None]:
trainer.train()


In [None]:
%reload_ext tensorboard 
%tensorboard --logdir tblogs

In [None]:
trainer.evaluate()

Let's see how it performs on our test set. 

In [None]:
preds = trainer.predict(test_dataset)

The output from `predict()` is logits, so we use a softmax to turn the values into probabilities and then use `np.argmax` to select the label with the largest probability.

In [None]:
tf_predictions = tf.nn.softmax(preds.predictions, axis=-1)

In [None]:
y_preds = np.argmax(tf_predictions, axis=-1)
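
The two steps above can be reproduced by hand on a couple of made-up logit rows. This is a plain-Python sketch with illustrative helper names (`softmax`, `argmax`), not the TensorFlow implementation.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for two examples and two classes.
logits = [[2.0, -1.0], [0.1, 0.3]]
probs = [softmax(row) for row in logits]

for row, p in zip(logits, probs):
    print([round(x, 3) for x in p], "-> label", argmax(p), "(from logits:", argmax(row), ")")
```

Because softmax preserves the ordering of the values, taking the argmax of the logits directly gives the same labels; the softmax step is only needed when you want the probabilities themselves.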

In [None]:
from sklearn.metrics import classification_report

print(classification_report(preds.label_ids, y_preds))

In [None]:
model.save_pretrained('./finetuned_model/')

## Try out the model

Now let's try out our model with our own sentence. 

We first load our saved fine-tuned model.

In [None]:
my_model = TFAutoModelForSequenceClassification.from_pretrained(
        "./finetuned_model")

In [None]:
test_sentence = "I don't see how people can sit through this hour-long movie!"
#test_sentence = "The movie, though flawed, is still interesting enough."
#test_sentence = 'The movie kept me on my toes all the time.'
inputs = tokenizer(test_sentence, return_tensors="tf")
out = my_model(inputs)
print(out)

In [None]:
print(np.argmax(tf.nn.softmax(out.logits, axis=-1)))

**Exercise:**

- Try the DistilBERT base-cased pretrained model and see if you get better or worse performance.
- Try using a larger number of training samples.
- Try multi-class classification using this [dataset](https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/it3103/news.csv) that groups news titles into 4 categories: e (entertainment), b (business), t (tech), m (medical/health). The original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/News+Aggregator).
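
As a starting point for the multi-class exercise, the four category codes need integer ids, and the model needs `num_labels` set to the number of categories. The sketch below shows one possible mapping; the names `CATEGORY2ID`, `ID2CATEGORY`, and `NUM_LABELS` and the ordering of the ids are assumptions made for illustration.

```python
# One possible mapping from the category codes to integer ids (the order is arbitrary).
CATEGORY2ID = {"e": 0, "b": 1, "t": 2, "m": 3}
ID2CATEGORY = {idx: code for code, idx in CATEGORY2ID.items()}
NUM_LABELS = len(CATEGORY2ID)  # value to pass as num_labels when loading the model

codes = ["b", "t", "e", "m", "b"]
label_ids = [CATEGORY2ID[code] for code in codes]
print(label_ids)   # [1, 2, 0, 3, 1]
print(NUM_LABELS)  # 4
```

The reverse mapping `ID2CATEGORY` is useful after prediction, to turn the predicted class id back into a readable category code.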