# BERT Fine-Tuning

The aim of this project is to **fine-tune BERT for a text classification task**.

We'll be using the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) (SST-2) dataset of movie reviews and a smaller version of BERT - [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) - developed by HuggingFace.

Our ultimate goal is to improve the accuracy of the model that we developed in the project [Text Classification with BERT](https://github.com/j-n-t/transformers/blob/master/text_classification_with_BERT.ipynb).

We'll start by loading and checking the data.

### Step 1

#### 1. Perform initial imports

In [1]:
import numpy as np
import pandas as pd

#### 2. Load data

We'll be using the training dataset from our previous project, [Text Classification with BERT](https://github.com/j-n-t/transformers/blob/master/text_classification_with_BERT.ipynb).

In [2]:
df_train = pd.read_csv('./.data/sst/tsv/train.tsv', sep='\t')

#### 3. Check data

In [3]:
df_train.head()

,review,label
0,the rock is destined to be the 21st century 's...,positive
1,the gorgeously elaborate continuation of `` th...,positive
2,singer\/composer bryan adams contributes a sle...,positive
3,yet the act is still charming here .,positive
4,whether or not you 're enlightened by any of d...,positive


In [4]:
len(df_train)

6920

In [5]:
df_train['label'].value_counts()

positive    3610
negative    3310
Name: label, dtype: int64

For performance reasons, we'll be using only **half of the training dataset** to fine-tune our model.

In [6]:
df = df_train.sample(frac=0.5, random_state=42).reset_index(drop=True)

In [7]:
df.head(10)

,review,label
0,e.t. works because its flabbergasting principa...,positive
1,"from the dull , surreal ache of mortal awarene...",positive
2,a disoriented but occasionally disarming saga ...,positive
3,the only type of lives this glossy comedy-dram...,negative
4,"in the affable maid in manhattan , jennifer lo...",positive
5,"a must for fans of british cinema , if only be...",positive
6,earnest and heartfelt but undernourished and p...,positive
7,"` anyone with a passion for cinema , and indee...",positive
8,"we never feel anything for these characters , ...",negative
9,- style cross-country adventure ... it has spo...,positive


In [8]:
len(df)

3460

In [9]:
df['label'].value_counts()

positive    1849
negative    1611
Name: label, dtype: int64

We have 3460 reviews, of which 1849 are positive and 1611 are negative.

The next step is to use **DistilBERT to tokenize our reviews**.

### Step 2

#### 1. Perform necessary imports

In [10]:
import torch
from transformers import DistilBertTokenizer

#### 2. Load pretrained tokenizer

In [11]:
# using DistilBERT
tokenizer_class, pretrained_weights = (DistilBertTokenizer, 'distilbert-base-uncased')

# load pretrained tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

#### 3. Tokenize reviews

We can start by using the `encode` method to **determine the length of the longest review** after the special tokens have been added. These special tokens include the **\[CLS\]** token - id 101 - at the beginning of each review and the **\[SEP\]** token - id 102 - at the end.

In [12]:
input_ids = df['review'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [13]:
input_ids.head()

0    [101, 1041, 1012, 1056, 1012, 2573, 2138, 2049...
1    [101, 2013, 1996, 10634, 1010, 16524, 12336, 1...
2    [101, 1037, 4487, 21748, 25099, 2094, 2021, 56...
3    [101, 1996, 2069, 2828, 1997, 3268, 2023, 1950...
4    [101, 1999, 1996, 21358, 7011, 3468, 10850, 19...
Name: review, dtype: object

In [14]:
max([len(review) for review in input_ids])

80

In [15]:
np.rint(np.mean([len(review) for review in input_ids]))

25.0

The longest review has 80 tokens and our reviews have on average 25 tokens. We'll set **80 as the maximum length**.

Instead of using the `encode` method we can use the `encode_plus` method. This method will return a dictionary with the encoded sequence and some additional information.

We set the **maximum length to 80, pad all the reviews and get an attention mask** for every review. We also set the `return_tensors` parameter to 'pt' to get PyTorch tensors.

In [16]:
encoded_dict = df['review'].apply((lambda x: tokenizer.encode_plus(x, add_special_tokens=True, 
                                                                   max_length=80, 
                                                                   pad_to_max_length=True, 
                                                                   return_attention_mask=True, 
                                                                   return_tensors='pt')))

In [17]:
encoded_dict[0]

{'input_ids': tensor([[  101,  1041,  1012,  1056,  1012,  2573,  2138,  2049, 13109,  7875,
           4059, 14083,  2075, 27928,  1010,  2403,  1011,  2095,  1011,  2214,
           2728,  6097,  2532, 18533,  2239,  1010,  1020,  1011,  2095,  1011,
           2214,  3881,  6287,  5974,  1998,  2184,  1011,  2095,  1011,  2214,
           2888,  2726,  1010,  8054,  2149,  1997,  1996,  4598,  1997,  1996,
           7968,  1010, 15536, 10431,  2098, 10367,  2013,  1037,  2521,  9497,
           4774,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0]])}
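
To make the padding and the attention mask concrete, here is a minimal plain-Python sketch of what the call above produces for a single review. The helper name `pad_and_mask` is hypothetical (it is not part of the `transformers` API); the ids 101, 102 and 0 for `[CLS]`, `[SEP]` and padding are the ones visible in the outputs above.

```python
# Hypothetical helper, for illustration only: it mimics the effect of
# adding special tokens and padding one review to a fixed length.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids seen in the outputs above


def pad_and_mask(token_ids, max_length=80):
    """Wrap token ids in [CLS]/[SEP], pad to max_length, build the mask."""
    ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    if len(ids) > max_length:
        # truncate, but keep [SEP] as the last token
        ids = ids[:max_length - 1] + [SEP_ID]
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [PAD_ID] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * n_real + [0] * n_pad
    return input_ids, attention_mask


# word-piece ids of one short review from this dataset (special tokens stripped)
review_ids = [2013, 1996, 10634, 1010, 16524, 12336, 1997, 9801,
              7073, 19391, 1037, 23751, 2839, 6533, 1012]

input_ids_example, attention_mask_example = pad_and_mask(review_ids)
print(len(input_ids_example), sum(attention_mask_example))
```

The 15 word pieces become 17 tokens once `[CLS]` and `[SEP]` are added, so the mask holds 17 ones followed by 63 zeros.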

We can now store our **input ids** and **attention masks** in two lists and convert them to the necessary format.

In [18]:
input_ids_list = []
attention_masks_list = []

for encoded in encoded_dict:
    input_ids_list.append(encoded['input_ids'])
    attention_masks_list.append(encoded['attention_mask'])

In [19]:
input_ids_list[:2]

[tensor([[  101,  1041,  1012,  1056,  1012,  2573,  2138,  2049, 13109,  7875,
           4059, 14083,  2075, 27928,  1010,  2403,  1011,  2095,  1011,  2214,
           2728,  6097,  2532, 18533,  2239,  1010,  1020,  1011,  2095,  1011,
           2214,  3881,  6287,  5974,  1998,  2184,  1011,  2095,  1011,  2214,
           2888,  2726,  1010,  8054,  2149,  1997,  1996,  4598,  1997,  1996,
           7968,  1010, 15536, 10431,  2098, 10367,  2013,  1037,  2521,  9497,
           4774,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0]]),
 tensor([[  101,  2013,  1996, 10634,  1010, 16524, 12336,  1997,  9801,  7073,
          19391,  1037, 23751,  2839,  6533,  1012,   102,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,

In [20]:
# concatenate the list of tensors

input_ids = torch.cat(input_ids_list, dim=0)
attention_masks = torch.cat(attention_masks_list, dim=0)

In [21]:
input_ids

tensor([[  101,  1041,  1012,  ...,     0,     0,     0],
        [  101,  2013,  1996,  ...,     0,     0,     0],
        [  101,  1037,  4487,  ...,     0,     0,     0],
        ...,
        [  101,  2035,  1996,  ...,     0,     0,     0],
        [  101,  1037, 20161,  ...,     0,     0,     0],
        [  101,  1996,  4164,  ...,     0,     0,     0]])

In [22]:
len(input_ids)

3460

In [23]:
attention_masks

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])

In [24]:
len(attention_masks)

3460

We now have our 3460 reviews and corresponding attention masks in the desired format. Let's convert our labels as well.

#### 4. Convert labels to tensors

In [25]:
df['label'].value_counts()

positive    1849
negative    1611
Name: label, dtype: int64

In [26]:
label_to_num = {'label': {'positive': 1, 'negative': 0}}

In [27]:
df.replace(label_to_num, inplace=True)

In [28]:
df['label'].value_counts()

1    1849
0    1611
Name: label, dtype: int64

In [29]:
labels = torch.tensor(df['label'].values)

In [30]:
labels

tensor([1, 1, 1,  ..., 0, 1, 1])

In [31]:
type(labels)

torch.Tensor

In [32]:
len(labels)

3460

Our labels are now stored in a PyTorch tensor. Let's check one example of what we have done so far.

#### 5. Check original review and corresponding tensors

In [33]:
print('Original review: ', df['review'][0]+'\n')
print('Token IDs:', input_ids[0])
print('\n')
print('Attention Mask:', attention_masks[0])
print('\n')
print('Label:', labels[0])

Original review:  e.t. works because its flabbergasting principals , 14-year-old robert macnaughton , 6-year-old drew barrymore and 10-year-old henry thomas , convince us of the existence of the wise , wizened visitor from a faraway planet .

Token IDs: tensor([  101,  1041,  1012,  1056,  1012,  2573,  2138,  2049, 13109,  7875,
         4059, 14083,  2075, 27928,  1010,  2403,  1011,  2095,  1011,  2214,
         2728,  6097,  2532, 18533,  2239,  1010,  1020,  1011,  2095,  1011,
         2214,  3881,  6287,  5974,  1998,  2184,  1011,  2095,  1011,  2214,
         2888,  2726,  1010,  8054,  2149,  1997,  1996,  4598,  1997,  1996,
         7968,  1010, 15536, 10431,  2098, 10367,  2013,  1037,  2521,  9497,
         4774,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0])


Attention Mask: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 

Our reviews are now tokenized and padded, and for each one of them we have the corresponding token ids, attention mask and label stored as PyTorch tensors.

It's now time to **split our dataset into a training set and a validation set**. This will be our step 3.

### Step 3

#### 1. Perform necessary imports

In [34]:
from torch.utils.data import TensorDataset, random_split

#### 2. Create dataset made of tensors

In [35]:
dataset = TensorDataset(input_ids, attention_masks, labels)

In [36]:
dataset[0]

(tensor([  101,  1041,  1012,  1056,  1012,  2573,  2138,  2049, 13109,  7875,
          4059, 14083,  2075, 27928,  1010,  2403,  1011,  2095,  1011,  2214,
          2728,  6097,  2532, 18533,  2239,  1010,  1020,  1011,  2095,  1011,
          2214,  3881,  6287,  5974,  1998,  2184,  1011,  2095,  1011,  2214,
          2888,  2726,  1010,  8054,  2149,  1997,  1996,  4598,  1997,  1996,
          7968,  1010, 15536, 10431,  2098, 10367,  2013,  1037,  2521,  9497,
          4774,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]),
 tensor(1))

In [37]:
type(dataset[0])

tuple

For each review, our dataset stores a tuple of PyTorch tensors of the form (input_ids, attention_mask, label).

#### 3. Split dataset

In [38]:
train_size = int(0.7 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print(f'{train_size} labeled reviews for training')
print(f'{val_size} labeled reviews for validation')

2422 labeled reviews for training
1038 labeled reviews for validation


We now have a training dataset consisting of 70% of our data and a validation dataset with the other 30%.
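
`random_split` shuffles the indices before cutting them, so the split above changes from run to run unless a seed is fixed; in PyTorch this can be done by passing a seeded `torch.Generator` through the `generator` argument of `random_split`. The plain-Python sketch below only illustrates the idea (shuffle the indices with a fixed seed, then slice). The helper `seeded_split` and the seed 42 are assumptions for illustration, and the sketch will not reproduce the exact split PyTorch would make.

```python
import random


def seeded_split(n_items, train_fraction=0.7, seed=42):
    """Return (train_indices, val_indices) for a reproducible random split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    train_size = int(train_fraction * n_items)
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = seeded_split(3460)
print(len(train_idx), len(val_idx))
```

Calling `seeded_split(3460)` twice returns the same two index lists, which makes validation results comparable across runs.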

The next thing we should do is **create generators that feed the data to the model in batches**, so that it never has to process the entire dataset at once.

### Step 4

#### 1. Perform necessary imports

In [39]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

#### 2. Create a training generator and a validation generator

In [40]:
# for fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32
# A.3 of https://arxiv.org/pdf/1810.04805.pdf
batch_size = 32

train_dataloader = DataLoader(train_dataset, 
                              sampler = RandomSampler(train_dataset), # select batches randomly
                              batch_size = batch_size)

# for validation the order doesn't matter - we'll read them sequentially
val_dataloader = DataLoader(val_dataset, 
                            sampler = SequentialSampler(val_dataset), # select batches sequentially
                            batch_size = batch_size)
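
Conceptually, each data loader is an iterator that hands out one batch at a time. The following plain-Python sketch (the generator `iter_batches` is a hypothetical stand-in, not the `DataLoader` implementation) shows the batching logic and why the last training batch is smaller than the others.

```python
import random


def iter_batches(n_items, batch_size, shuffle=False, seed=0):
    """Yield lists of indices, one batch at a time."""
    indices = list(range(n_items))
    if shuffle:  # plays the role of RandomSampler
        random.Random(seed).shuffle(indices)
    for start in range(0, n_items, batch_size):
        yield indices[start:start + batch_size]


# 2422 training reviews in batches of 32, as in this notebook
train_batches = list(iter_batches(2422, 32, shuffle=True))
print(len(train_batches))      # number of batches per epoch
print(len(train_batches[-1]))  # size of the final, partial batch
```

With 2422 reviews and a batch size of 32 this gives 76 batches, the last of which holds only 22 reviews.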

We now have everything we need to start **training our model**. This will be our **step 5**.

### Step 5

#### 1. Perform necessary imports

In [41]:
from transformers import DistilBertForSequenceClassification
from transformers import AdamW, get_linear_schedule_with_warmup

We'll be using [DistilBertForSequenceClassification](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification).

Our model will be a **DistilBert Model transformer with a sequence classification head on top**.

#### 2. Load pretrained model

In [42]:
model = DistilBertForSequenceClassification.from_pretrained(pretrained_weights, 
                                                            num_labels = 2, # binary classification
                                                            output_attentions = False, 
                                                            output_hidden_states = False)

After loading our model, we need to **set our optimizer**.

#### 3. Set optimizer

In [43]:
# AdamW from the huggingface library
# https://huggingface.co/transformers/main_classes/optimizer_schedules.html#adamw-pytorch

optimizer = AdamW(model.parameters(), lr = 2e-5, eps = 1e-8)

For fine-tuning BERT on a specific task, the authors recommend a learning rate (Adam) of 5e-5, 3e-5 or 2e-5, and training the model for 2 to 4 epochs. You can check Appendix A.3 of the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf) for more details.

The [Adam paper](https://arxiv.org/pdf/1412.6980.pdf) suggests $10^{-8}$  as a good value for epsilon.
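
To show where the learning rate and epsilon enter, here is a sketch of one AdamW update for a single scalar parameter, written in plain Python. The function name `adamw_step`, the beta values (0.9 and 0.999, the defaults suggested in the Adam paper) and `weight_decay=0.0` are assumptions for illustration; the library optimizer applies the same kind of update to every tensor in `model.parameters()`.

```python
import math


def adamw_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # eps keeps the division stable when v_hat is close to zero
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    # decoupled weight decay: applied to the parameter, not to the gradient
    param = param - lr * weight_decay * param
    return param, m, v


p, m, v = adamw_step(param=1.0, grad=1.0, m=0.0, v=0.0, t=1)
print(p)
```

On the first step the bias-corrected ratio has magnitude close to 1, so the parameter moves by roughly `lr` whatever the size of the gradient.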

We can now define the total number of epochs and the total number of training steps, and create a [learning rate scheduler](https://huggingface.co/transformers/main_classes/optimizer_schedules.html#transformers.get_linear_schedule_with_warmup).

#### 4. Create learning rate scheduler

In [44]:
epochs = 4

# total number of training steps is [number of batches] x [number of epochs] 
total_steps = len(train_dataloader) * epochs

# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # default value in run_glue.py
                                            num_training_steps = total_steps)

As we've seen before, we have 2422 labeled reviews for training. Since we defined a batch size of 32, each epoch takes 76 training steps: 2422 / 32 ≈ 75.7, which rounds up to 76 because the last, smaller batch still counts as a step.

We can easily check this by getting the length of our `train_dataloader`.

In [45]:
len(train_dataloader)

76

Since we are training our model for 4 epochs, we have a total number of training steps of 76 x 4 = 304.
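
With `num_warmup_steps = 0`, the scheduler decays the learning rate linearly from its initial value to zero over the 304 steps. A minimal sketch of that multiplier, assuming linear warmup followed by linear decay (the function `linear_schedule_factor` is a hypothetical re-implementation for illustration, not the library code):

```python
import math


def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the initial learning rate at a given step."""
    if step < num_warmup_steps:
        # linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # linear decay from 1 to 0
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


steps_per_epoch = math.ceil(2422 / 32)   # 76
total_steps = steps_per_epoch * 4        # 304
base_lr = 2e-5

for step in (0, 152, 304):
    print(step, base_lr * linear_schedule_factor(step, total_steps))
```

Halfway through training (step 152) the learning rate is half of its initial value, and it reaches zero at step 304.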

Before starting the training of our model, we can also define **two helper functions** to **calculate the accuracy** of our model and to **format elapsed times**.

#### 5. Define helper functions

In [46]:
# function to calculate the accuracy of our model

def flat_accuracy(preds, labels):
    
    # extract prediction from logits
    pred_flat = np.argmax(preds, axis=1).flatten()
    
    # extract label from labels
    labels_flat = labels.flatten()
    
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [47]:
# function to format elapsed times

import datetime

def format_time(elapsed):
    
    # round to the nearest second
    elapsed_rounded = int(round((elapsed)))
    
    # format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
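
As a library-free illustration of what `flat_accuracy` computes, the sketch below takes the arg-max of each row of logits and compares it with the label. It also shows a subtlety of averaging per-batch accuracies, as the validation loop further down does: when the last batch is smaller, the mean of the batch accuracies can differ slightly from the accuracy over all examples. The logits and batch sizes here are made up for the example.

```python
def batch_accuracy(logits, labels):
    """Fraction of rows whose arg-max class equals the label."""
    preds = [row.index(max(row)) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# two made-up batches of different sizes (4 and 2 examples)
batch_a = ([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]], [1, 0, 0, 0])
batch_b = ([[0.9, -0.3], [-0.5, 0.7]], [0, 1])

acc_a = batch_accuracy(*batch_a)   # 3 of 4 correct
acc_b = batch_accuracy(*batch_b)   # 2 of 2 correct

mean_of_batches = (acc_a + acc_b) / 2   # what averaging per batch gives
pooled = (3 + 2) / (4 + 2)              # accuracy over all 6 examples
print(mean_of_batches, pooled)
```

The gap is small when only one batch is short, but it is worth keeping in mind when comparing accuracies computed in different ways.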

We're finally ready to **start training our model**.

### Step 6

#### 1. Perform necessary imports

In [48]:
import time

#### 2. Train model

In [49]:
# list to store training and validation loss, accuracy and elapsed times
training_stats = []

# time at the beginning of training to measure total training time
total_t0 = time.time()

for epoch in range(0, epochs):
    
    # TRAINING

    print('\n')
    print(f'++++++++++ Epoch {epoch+1} / {epochs} ++++++++++')
    print('\n')
    print('Training...'+'\n')

    # time at the beginning of training epoch to measure this epoch training time
    t0 = time.time()

    # reset the total loss for this epoch
    total_train_loss = 0

    # put the model into training mode
    model.train()

    for step, batch in enumerate(train_dataloader):

        # progress update every 20 batches
        if step % 20 == 0 and step != 0:
            # calculate elapsed time
            elapsed = format_time(time.time() - t0)
            
            # print progress
            print(f'  Batch {step}  of  {len(train_dataloader)}  -->  Elapsed time: {elapsed}')

        # unpack this training batch from our dataloader
        b_input_ids = batch[0] # input ids
        b_input_masks = batch[1] # attention masks
        b_labels = batch[2] # labels

        # clear any previously calculated gradients
        model.zero_grad()        

        # perform a forward pass (evaluate the model on this training batch)
        # https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification
        # the first two outputs are the loss and the logits (classification scores before the activation function is applied)
        # indexing works whether the model returns a tuple or an output object
        outputs = model(input_ids = b_input_ids, 
                        attention_mask = b_input_masks, 
                        labels = b_labels)
        loss, logits = outputs[0], outputs[1]

        # accumulate the training loss over all of the batches
        # the .item() method just returns the value of the tensor as a standard Python number
        # https://pytorch.org/docs/stable/tensors.html#torch.Tensor.item
        total_train_loss += loss.item()

        # perform a backward pass to calculate the gradients
        loss.backward()

        # clip the norm of the gradients to 1.0
        # this is to help prevent the "exploding gradients" problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # update parameters and take a step using the computed gradient
        optimizer.step()

        # update the learning rate
        scheduler.step()

    # average training loss over all of the batches
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # elapsed time for this epoch
    training_time = format_time(time.time() - t0)

    print('\n')
    print(f'  Average training loss: {avg_train_loss:.3f}')
    print(f'  Elapsed time for training epoch: {training_time}')
    
    
    # VALIDATION
    
    print('\n')
    print('Validation...'+'\n')

    t0 = time.time()

    # put the model in evaluation mode
    model.eval()

    # create total_accuracy and total_val_loss variables
    # and set them to zero
    total_accuracy = 0
    total_val_loss = 0

    # evaluate data for one epoch
    for batch in val_dataloader:
        
        # unpack this validation batch from our dataloader
        b_input_ids = batch[0]
        b_input_masks = batch[1]
        b_labels = batch[2]
        
        # we don't need the gradients, so we don't build the computation graph
        with torch.no_grad():        

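            # perform a forward pass
            # indexing works whether the model returns a tuple or an output object
            outputs = model(input_ids = b_input_ids, 
                            attention_mask = b_input_masks, 
                            labels = b_labels)
            loss, logits = outputs[0], outputs[1]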
            
        # accumulate the validation loss over all of the batches
        total_val_loss += loss.item()

        # convert logits and labels to numpy arrays
        # (label_ids avoids overwriting the `labels` tensor created in step 2)
        logits = logits.numpy()
        label_ids = b_labels.numpy()

        # calculate the accuracy for this batch of reviews
        # and accumulate it over all batches
        total_accuracy += flat_accuracy(logits, label_ids)
        

    # average validation accuracy after this epoch of training
    avg_val_accuracy = total_accuracy / len(val_dataloader)

    # average validation loss over all of the batches
    avg_val_loss = total_val_loss / len(val_dataloader)
    
    # elapsed time for the validation
    validation_time = format_time(time.time() - t0)
    
    print(f'  Accuracy: {avg_val_accuracy:.3f}')
    print(f'  Average validation loss: {avg_val_loss:.3f}')
    print(f'  Elapsed time for validation: {validation_time}')

    # append statistics from this epoch to the training_stats list
    training_stats.append({'epoch': epoch + 1, 
                           'Training Loss': avg_train_loss, 
                           'Validation Loss': avg_val_loss, 
                           'Accuracy': avg_val_accuracy, 
                           'Training Time': training_time, 
                           'Validation Time': validation_time})

print('\n')
print('Training complete!')

print(f'Total training time {format_time(time.time()-total_t0)} (h:mm:ss)')



++++++++++ Epoch 1 / 4 ++++++++++


Training...

  Batch 20  of  76  -->  Elapsed time: 0:02:49
  Batch 40  of  76  -->  Elapsed time: 0:05:39
  Batch 60  of  76  -->  Elapsed time: 0:08:32


  Average training loss: 0.462
  Elapsed time for training epoch: 0:10:45


Validation...

  Accuracy: 0.864
  Average validation loss: 0.355
  Elapsed time for validation: 0:01:27


++++++++++ Epoch 2 / 4 ++++++++++


Training...

  Batch 20  of  76  -->  Elapsed time: 0:02:52
  Batch 40  of  76  -->  Elapsed time: 0:05:43
  Batch 60  of  76  -->  Elapsed time: 0:08:34


  Average training loss: 0.222
  Elapsed time for training epoch: 0:10:51


Validation...

  Accuracy: 0.863
  Average validation loss: 0.350
  Elapsed time for validation: 0:01:25


++++++++++ Epoch 3 / 4 ++++++++++


Training...

  Batch 20  of  76  -->  Elapsed time: 0:02:51
  Batch 40  of  76  -->  Elapsed time: 0:05:40
  Batch 60  of  76  -->  Elapsed time: 0:08:31


  Average training loss: 0.135
  Elapsed time for traini
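
The training loop clips the gradient norm to 1.0 before each optimizer step. As a rough plain-Python sketch of what norm clipping does (the helper `clip_by_global_norm` is hypothetical and works on a flat list of numbers rather than on model parameters): compute the overall L2 norm of the gradients and, if it exceeds the maximum, rescale all of them by the same factor.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, tiny=1e-6):
    """Rescale grads so their overall L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # scale is 1.0 (no change) when the norm is already small enough
    scale = min(1.0, max_norm / (total_norm + tiny))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0])
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, clipping changes the size of the update but not its direction.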

Our model is trained and it is now time to evaluate it. This will be our final step.

### Step 7

#### 1. Create dataframe with training stats

In [50]:
df_stats = pd.DataFrame(data=training_stats)

In [51]:
# set epoch as index
df_stats = df_stats.set_index('epoch')

# display floats with 3 decimal places
pd.set_option('display.precision', 3)

In [52]:
df_stats

epoch,Training Loss,Validation Loss,Accuracy,Training Time,Validation Time
1,0.462,0.355,0.864,0:10:45,0:01:27
2,0.222,0.350,0.863,0:10:51,0:01:25
3,0.135,0.373,0.875,0:10:47,0:01:24
4,0.094,0.387,0.881,0:10:52,0:01:22


Our model shows some signs of **overfitting**, given that our **training loss keeps decreasing** while our **validation loss increases in epochs 3 and 4**.

We get a **maximum accuracy value of 88.1%**, significantly higher than the value we obtained in our previous project, [Text Classification with BERT](https://github.com/j-n-t/transformers/blob/master/text_classification_with_BERT.ipynb).

Even if we consider our results for **epoch 2**, where the **validation loss is lowest** (0.350), the accuracy of 86.3% shows that we've managed to successfully fine-tune our model for this classification task, even though our training set was relatively small.
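
Picking the epoch with the lowest validation loss is a common way to decide which version of the model to keep. A minimal sketch using the values from the table above (the list literal mirrors `training_stats`, restricted to the relevant columns; saving a checkpoint per epoch is not part of this notebook and is only assumed here):

```python
# validation loss and accuracy per epoch, copied from the table above
stats = [
    {'epoch': 1, 'Validation Loss': 0.355, 'Accuracy': 0.864},
    {'epoch': 2, 'Validation Loss': 0.350, 'Accuracy': 0.863},
    {'epoch': 3, 'Validation Loss': 0.373, 'Accuracy': 0.875},
    {'epoch': 4, 'Validation Loss': 0.387, 'Accuracy': 0.881},
]

# the row with the smallest validation loss
best = min(stats, key=lambda row: row['Validation Loss'])
print(f"Best epoch by validation loss: {best['epoch']} "
      f"(accuracy {best['Accuracy']:.3f})")
```

Selecting by validation loss rather than by accuracy favours the epoch before the model starts to overfit, at the cost of a slightly lower accuracy in this run.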

If you want to know more about BERT fine-tuning, please check the tutorial [BERT Fine-Tuning Tutorial with PyTorch](http://mccormickml.com/2019/07/22/BERT-fine-tuning/) by Chris McCormick and Nick Ryan.