<a href="https://colab.research.google.com/github/pgurazada/advances-in-nlp/blob/main/transfer_learning_finetune_distilbert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective

Illustrate how to fine-tune DistilBERT (a smaller, distilled variant of BERT) for sentiment classification using the Hugging Face `transformers` package.

Note: Run this notebook on a GPU runtime. With a larger GPU, you can increase the size of the training sample.

# Setup

In [1]:
! pip install -q datasets==2.20.0 \
                 accelerate==0.33.0

In [2]:
import numpy as np
import pandas as pd

import tensorflow as tf

from transformers import create_optimizer

from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import TFAutoModelForSequenceClassification

from sklearn.model_selection import train_test_split
from datasets import Dataset

# Method

As with transfer learning for images, a pretrained DistilBERT model is fine-tuned in three steps:

- Import a pretrained model from Hugging Face and attach a classifier head.
- Freeze the base DistilBERT model and fine-tune only the classifier head.
- Unfreeze the base model and fine-tune the entire model.

The fine-tuned model is then ready for inference.

# Data

In [3]:
data_file = '/content/drive/MyDrive/PES-NLP/Session/labeled_sentiments_data.tsv'

In [4]:
data_df = pd.read_csv(data_file, sep='\t')

In [5]:
data_df.shape

(25000, 3)

In [6]:
data_df.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


# Prepare Dataset

In [7]:
train_df, test_df = train_test_split(data_df, test_size=0.2)

In [8]:
sample_train_dataset = Dataset.from_pandas(train_df.sample(1000), split='train')
sample_validation_dataset = Dataset.from_pandas(test_df.sample(1000), split='valid')

In [9]:
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

In [10]:
def preprocess_function(examples):
    return tokenizer(examples["review"], truncation=True)

In [11]:
tokenized_train_dataset = sample_train_dataset.map(preprocess_function, batched=True)
tokenized_validation_dataset = sample_validation_dataset.map(preprocess_function, batched=True)

In [12]:
# The transformers models expect the target column to be named 'labels',
# so copy the 'sentiment' column into it.
def add_labels(examples):
    examples['labels'] = examples['sentiment']
    return examples

In [13]:
tokenized_train_dataset = tokenized_train_dataset.map(add_labels, batched=True)
tokenized_validation_dataset = tokenized_validation_dataset.map(add_labels, batched=True)

In [14]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
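Because the tokenizer was called with truncation but without padding, the reviews still have different lengths; the collator pads them batch by batch. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are invented, and a padding id of 0 is an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Pad each token-id list to the longest one in the batch.

    Returns the padded ids plus an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids below are invented for illustration, not real tokenizer output.
short_review = [101, 7, 8, 102]
long_review = [101, 7, 8, 9, 10, 11, 102]

batch = pad_batch([short_review, long_review])
print(batch["input_ids"][0])       # [101, 7, 8, 102, 0, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 0, 0, 0]
```

Padding to the longest sequence in each batch, rather than to a fixed global length, keeps short batches small and saves computation.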

In [15]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [16]:
batch_size = 16
num_epochs = 5

In [17]:
tokenized_train_dataset, tokenized_validation_dataset

(Dataset({
     features: ['id', 'sentiment', 'review', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 1000
 }),
 Dataset({
     features: ['id', 'sentiment', 'review', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 1000
 }))

In [18]:
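# Number of full training batches per epoch (1000 // 16 = 62).
batches_per_epoch = len(tokenized_train_dataset) // batch_size
# The decay schedule spans num_epochs epochs; later steps run at the minimum learning rate.
total_train_steps = int(batches_per_epoch * num_epochs)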

optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    min_lr_ratio=0.001,
    num_warmup_steps=0,
    num_train_steps=total_train_steps
)
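To see what these settings imply, here is a minimal sketch of the learning-rate curve, assuming `create_optimizer` applies a linear (power 1) decay from `init_lr` down to `init_lr * min_lr_ratio` when there is no warmup. The function `linear_decay_lr` is a hypothetical stand-in, not part of the library.

```python
def linear_decay_lr(step, init_lr=2e-5, min_lr_ratio=0.001, total_steps=310):
    """Linearly decay the learning rate, then hold it at the minimum."""
    end_lr = init_lr * min_lr_ratio
    progress = min(step, total_steps) / total_steps  # clamp once the schedule ends
    return (init_lr - end_lr) * (1.0 - progress) + end_lr


batches_per_epoch = 1000 // 16               # 62 full batches per epoch
total_train_steps = batches_per_epoch * 5    # 310 steps for num_epochs = 5

for step in (0, 155, 310, 620):
    print(step, linear_decay_lr(step, total_steps=total_train_steps))
```

Under this assumption the rate reaches its floor after 310 steps, so any training that continues beyond five epochs proceeds at the minimum learning rate.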

# Build Model

In [19]:
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [20]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0 (unused)
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


# Fine-Tune

Freeze the base DistilBERT model so that only the classifier head is trained.

In [21]:
model.layers[0].trainable = False

In [22]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0 (unused)
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 592130 (2.26 MB)
Non-trainable params: 66362880 (253.15 MB)
_________________________________________________________________
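The trainable count in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. The short calculation below assumes a hidden size of 768 for DistilBERT, which is consistent with the layer sizes reported above; `dense_params` is a helper introduced only for this check.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


hidden_size = 768   # assumed DistilBERT hidden size
num_labels = 2

pre_classifier = dense_params(hidden_size, hidden_size)
classifier = dense_params(hidden_size, num_labels)
head_params = pre_classifier + classifier

base_params = 66362880  # frozen DistilBERT layer, taken from the summary

print(pre_classifier)             # 590592
print(classifier)                 # 1538
print(head_params)                # 592130
print(base_params + head_params)  # 66955010
```

With the base frozen, fewer than 1% of the parameters are updated during the first training phase.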


In [23]:
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)
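The loss is configured with `from_logits=True` because the model outputs raw scores rather than probabilities. As a rough illustration of what that loss computes for integer labels, here is a plain-Python sketch (not the Keras implementation); the function names and logits are made up for the example.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


def batch_loss(batch_logits, labels):
    """Average the per-example losses over a batch."""
    losses = [sparse_categorical_crossentropy(l, y) for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Invented logits for two reviews; labels use 0 = NEGATIVE, 1 = POSITIVE.
example_logits = [[2.0, -1.0], [0.0, 0.0]]
example_labels = [0, 1]
print(batch_loss(example_logits, example_labels))
```

A confident correct prediction (first example) contributes a small loss, while an undecided one (second example) contributes log 2.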

We use two callbacks: model checkpointing, which saves the weights whenever validation accuracy reaches a new best, and early stopping, which halts training if validation accuracy does not improve for 4 consecutive epochs.

In [24]:
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model",
    monitor="val_accuracy",
    mode="max",
    save_best_only=True,
    save_weights_only=True
)

earlystopping = tf.keras.callbacks.EarlyStopping(
    patience=4,
    monitor="val_accuracy",
    restore_best_weights=True
)
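The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the behaviour, not the Keras code; `early_stopping_epoch` is a hypothetical helper and the accuracy values are invented.

```python
def early_stopping_epoch(val_accuracies, patience=4):
    """Return (epoch at which training stops, epoch with the best accuracy).

    Epochs are numbered from 1. Training stops once `patience` consecutive
    epochs pass without a strict improvement over the best value so far.
    """
    best = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            best, best_epoch, wait = accuracy, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_accuracies), best_epoch


# Invented validation accuracies, for illustration only.
history = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.77, 0.79]
print(early_stopping_epoch(history))  # (7, 3)
```

Note that a tie with the best value does not reset the counter, and with `restore_best_weights=True` the model ends up with the weights from the best epoch rather than the last one.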

In [25]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_train_dataset,
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
    tokenizer=tokenizer
)

In [26]:
tf_val_set = model.prepare_tf_dataset(
    tokenized_validation_dataset,
    shuffle=False,  # keep validation order fixed and evaluate every example
    batch_size=batch_size,
    collate_fn=data_collator,
    tokenizer=tokenizer
)

In [27]:
model.fit(
    tf_train_set,
    validation_data=tf_val_set,
    epochs=50,
    callbacks=[checkpoint, earlystopping]
)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50


<tf_keras.src.callbacks.History at 0x79f51281ac50>

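Unfreeze the base DistilBERT model and continue training. Keras applies a changed `trainable` flag only when the model is compiled again, so the model is recompiled here with a fresh optimizer, which also restarts the learning-rate schedule.

In [28]:
model.layers[0].trainable = True
optimizer, schedule = create_optimizer(init_lr=2e-5, min_lr_ratio=0.001, num_warmup_steps=0, num_train_steps=total_train_steps)
model.compile(optimizer=optimizer, loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=["accuracy"])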

In [29]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [30]:
model.fit(
    tf_train_set,
    validation_data=tf_val_set,
    epochs=50,
    callbacks=[checkpoint, earlystopping]
)


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50


<tf_keras.src.callbacks.History at 0x79f5129d9420>

# Inference

In [33]:
test_inputs = [
    "Awesome movie",
    "Great movie, great plot"
]

In [34]:
tokenized_inputs = tokenizer(test_inputs, return_tensors="np", padding="longest")

outputs = model(tokenized_inputs).logits

classifications = np.argmax(outputs, axis=1)

print(classifications)

[0 1]
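The printed class ids map to labels through `id2label`, so `[0 1]` reads as NEGATIVE for the first input and POSITIVE for the second. The sketch below shows one way to turn logits into labels with probabilities; plain Python lists stand in for the logits tensor, the logit values are invented, and `decode` is a hypothetical helper.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(batch_logits):
    """Return (label, probability) for the highest-scoring class of each row."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        class_id = max(range(len(probs)), key=probs.__getitem__)
        results.append((id2label[class_id], probs[class_id]))
    return results


# Invented logits for two inputs, for illustration only.
example_logits = [[0.4, -0.3], [-1.2, 1.5]]
for label, prob in decode(example_logits):
    print(f"{label}: {prob:.2f}")
```

Reporting the probability alongside the label makes low-confidence predictions, such as a short positive review classified as negative, easier to spot.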
