# Hyperparameter tuning of DistilBERT with Classification Head

Link to Colab Notebook: https://colab.research.google.com/drive/1M3twE8OhurhJ8h5IQLE-yNliqdEXHxu0?authuser=3#scrollTo=1ysSqAbGQ0Kb

This notebook aims to address a limitation of the following paper:
Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to Fine-Tune BERT for Text Classification? Computation and Language (Cs.CL). https://doi.org/10.48550/arXiv.1905.05583

The authors did not conduct hyperparameter tuning for their neural networks. For our project, we choose values for the dropout rate and the LEARNING_RATE variable used in the 'DistillBERT_finetuning_2_target.ipynb' notebook, since the learning rate affects convergence while the dropout rate affects the model's ability to generalise and can affect its training time. To keep the search cheap, we train on a 500-row subset of the training data, validate on a further 51 rows, and search with Bayesian optimisation.

In [1]:
!pip install bayesian-optimization

Collecting bayesian-optimization
  Downloading bayesian_optimization-1.4.3-py3-none-any.whl (18 kB)
Collecting colorama>=0.4.6 (from bayesian-optimization)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: colorama, bayesian-optimization
Successfully installed bayesian-optimization-1.4.3 colorama-0.4.6


In [2]:
# Import libraries
from google.colab import drive
import numpy as np
import datetime
import pandas as pd
import torch
import transformers
from torch.utils.data import Dataset, DataLoader, SequentialSampler
from transformers import DistilBertModel, DistilBertTokenizer
import re

# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# Import data. train_split was shuffled when it was created, so there is no need to shuffle again before extracting the mini-set (500 rows) for hyperparameter tuning.
drive.mount('/content/drive')
train_split = pd.read_csv('/content/drive/MyDrive/train_data.csv')

Mounted at /content/drive


In [3]:
# Extract a 500-row training subset and a 51-row validation subset.
# .copy() makes each subset an independent DataFrame, so the column
# assignments below do not raise pandas' SettingWithCopyWarning.
mini_set = train_split[0:500].copy()
val_set = train_split[500:551].copy()

# Convert to binary classification: ratings below 5 map to 0, all others to 1
def good_bad(row):
  if row < 5:
    return 0
  else:
    return 1

mini_set['Sentiment'] = mini_set['Sentiment'].apply(good_bad)
val_set['Sentiment'] = val_set['Sentiment'].apply(good_bad)


In [5]:
from bayes_opt import BayesianOptimization
import warnings

# Ignore warnings in run logs
warnings.filterwarnings("ignore")

# Model and custom dataset classes similar to main notebook
# Define Custom Dataset
class CustomDataset(Dataset):
    ''' Map-style dataset that cleans each review text and tokenises it for DistilBERT '''

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.Content = dataframe.Text.to_numpy()
        self.targets = dataframe.Sentiment.to_numpy()
        self.max_len = max_len

    # __len__ and __getitem__ methods to create map-style dataset to be interfaced by torch DataLoader method
    def __len__(self):
        return len(self.Content)

    def __getitem__(self, index):
        # Data preprocessing: remove HTML tags, URLs (http and https), leftover "br" tokens and redundant whitespace
        Content = re.sub(r'<[^>]+>', '', self.Content[index])
        Content = re.sub(r'https?://\S+|www\.\S+', '', Content)
        Content = re.sub(r'\bbr\s', '', Content)
        Content = " ".join(Content.split())

        rating = self.targets[index]

        # Tokenisation of text
        inputs = self.tokenizer.encode_plus(
            Content,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,
            padding='max_length',
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(rating, dtype=torch.int)
        }

class DistillBERTClass(torch.nn.Module):
    def __init__(self, dropout_val):
        super(DistillBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(dropout_val)
        self.classifier = torch.nn.Linear(768, 2)

    # Note: DistilBERT outputs a tuple where the first element at index 0
    # represents the hidden-state at the output of the model's last layer.
    # It is a tensor of shape (batch_size, sequence_length, hidden_size=768)
    def forward(self, input_ids, attention_mask):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

# Function to count the number of correct predictions in a batch
def calcuate_accu(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct

# Training Parameters
MAX_LEN = 512
EPOCHS = 5
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Create validation set (fixed for all experiments)
test_params = {'batch_size': 1,
                'shuffle': False,
                'sampler': SequentialSampler(val_set),
                'num_workers': 0
                }
val_data = CustomDataset(val_set, tokenizer, MAX_LEN)
testing_loader = DataLoader(val_data, **test_params)


def train(lr_exponent_val, dropout_val):
    # Ensure dropout_val and lr_exponent_val are discrete: the optimiser proposes
    # continuous values, so truncate them and convert to actual model parameters
    dropout = 0.1*int(dropout_val)
    lr_exponent_val = int(lr_exponent_val)
    lr = 10**-lr_exponent_val

    # Create Dataset and Dataloader
    paramtune_set = CustomDataset(mini_set, tokenizer, MAX_LEN)
    train_params = {'batch_size': 4,
                    'shuffle': True,
                    'num_workers': 0
                    }
    paramtune_loader = DataLoader(paramtune_set, **train_params)

    # Initialize model
    model = DistillBERTClass(dropout)
    model.to(device)

    # Creating the loss function and optimizer
    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=lr)

    # Training loop over mini_set
    for epoch in range(EPOCHS):
        model.train()
        for _,data in enumerate(paramtune_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)

            outputs = model(ids, mask)
            loss = loss_function(outputs, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Evaluating model accuracy over the validation set
    model.eval()
    n_correct,nb_val_examples = 0,0
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            outputs = model(ids, mask)
            big_val, big_idx = torch.max(outputs.data, dim=1)
            n_correct += calcuate_accu(big_idx, targets)
            nb_val_examples+=targets.size(0)

    run_accu = (n_correct*100)/nb_val_examples

    return run_accu

# Parameters to tune (dropout rate and learning rate exponent)
pbounds = {
    'dropout_val': (1,6),
    'lr_exponent_val': (3, 7),
    }

optimizer = BayesianOptimization(
    f=train,
    pbounds=pbounds,
    verbose=2,
    random_state=1,
)

# Bayesian optimisation. init_points sets the number of randomly sampled points (15) evaluated before the guided search starts.
# This diversifies the exploration space and increases the chance of finding the global maximum.
# n_iter sets the number of Bayesian optimisation iterations to run. The total number of evaluations is init_points + n_iter.
optimizer.maximize(init_points=15, n_iter=15)

|   iter    |  target   | dropou... | lr_exp... |
-------------------------------------------------
| 1         | 84.31     | 3.085     | 5.881     |
| 2         | 58.82     | 1.001     | 4.209     |
| 3         | 43.14     | 1.734     | 3.369     |
| 4         | 78.43     | 1.931     | 4.382     |
| 5         | 86.27     | 2.984     | 5.155     |
| 6         | 82.35     | 3.096     | 5.741     |
| 7         | 78.43     | 2.022     | 6.512     |
| 8         | 84.31     | 1.137     | 5.682     |
| 9         | 84.31     | 3.087     | 5.235     |
| 10        | 56.86     | 1.702     | 3.792     |
| 11        | 78.43     | 5.004     | 6.873     |

In [6]:
print(optimizer.max)

{'target': 88.23529411764706, 'params': {'dropout_val': 1.33403703170643, 'lr_exponent_val': 5.137147476858718}}


Note: The parameter values in the table above are not the final hyperparameters. Each value is truncated to an integer and then transformed into the appropriate form: a dropout_val of 1.334 means p=0.1 was used for the dropout layer, and an lr_exponent_val of 5.137 means lr=1e-5 was used for the AdamW optimiser. (See the first 3 lines of the `train` function for the transformations.)
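
As a minimal sketch of that transformation, the helper below mirrors the first lines of `train` outside the training loop. The name `decode_params` is hypothetical and does not appear in the notebook. The sample inputs are values taken from the optimiser log and from `optimizer.max`.

```python
def decode_params(dropout_val, lr_exponent_val):
    """Hypothetical helper: truncate the raw optimiser values, then rescale.

    Mirrors the transformation at the top of `train` in this notebook.
    """
    dropout = 0.1 * int(dropout_val)      # 1.334 -> 1 -> p = 0.1
    lr = 10 ** -int(lr_exponent_val)      # 5.137 -> 5 -> lr = 1e-5
    return dropout, lr


# Raw values as logged by the optimiser (dropout_val, lr_exponent_val)
raw_points = [(1.334, 5.137), (2.984, 5.155), (5.004, 6.873)]

for raw in raw_points:
    dropout, lr = decode_params(*raw)
    print(raw, "->", f"p={dropout:.1f}", f"lr={lr:.0e}")
```

Because of the truncation, every raw point that shares the same integer parts is trained with the same dropout rate and learning rate.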

The `print(optimizer.max)` cell prints the best hyperparameters found by the search. The dropout rate of 0.1 aligns with related works. Hence, a learning rate of 1e-5 and a dropout rate of p=0.1 will be used for model training.
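
For a sense of scale, the reported accuracies can be converted back into counts of correct predictions. The sketch below assumes the validation set has 51 rows, as implied by `train_split[500:551]`, and inverts the `run_accu` formula from `train`. The helper name `correct_count` is hypothetical.

```python
# Number of validation rows, from val_set = train_split[500:551]
N_VAL = 551 - 500


def correct_count(accuracy_pct, n_examples=N_VAL):
    """Hypothetical helper: invert run_accu = n_correct * 100 / n_examples."""
    return round(accuracy_pct * n_examples / 100)


best = 88.23529411764706   # 'target' reported by optimizer.max
logged = 86.27             # iteration 5 in the optimiser log

print("best run:", correct_count(best), "of", N_VAL, "correct")
print("iteration 5:", correct_count(logged), "of", N_VAL, "correct")
print("one example is worth", round(100 / N_VAL, 2), "percentage points")
```

With a validation set of this size, each example contributes roughly two percentage points of accuracy.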