# 13 - Introduction to Transfer Learning
Prepared by Jan Christian Blaise Cruz

DLSU Machine Learning Group

In this simple demo notebook, we'll see how pretrained Transformers are adapted to downstream tasks through transfer learning.

# Preliminaries

Make sure that you have a GPU. This notebook was tested with a Tesla P100 (16GB GPU). If you use a smaller GPU, make sure to adjust your batch sizes later to ensure that your data will fit in the GPU.

In [None]:
!nvidia-smi

Mon Sep 21 14:14:15 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Download the iMDB dataset, then install the HuggingFace Transformers package. This gives us many prewritten wrappers and helper functions for loading pretrained Transformer models.

In [None]:
!wget https://s3.us-east-2.amazonaws.com/blaisecruz.com/datasets/imdb/imdb.zip
!unzip imdb.zip && rm imdb.zip
!pip install transformers

We'll use our usual imports, in addition to some new modules from the Transformers package.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as datautils

from transformers import BertTokenizer, BertForSequenceClassification, BertForMaskedLM, DistilBertForSequenceClassification
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW  # transformers' own AdamW is deprecated in later releases

import time
from tqdm import tqdm

import numpy as np
import pandas as pd

np.random.seed(42)
torch.manual_seed(42)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Data Processing and Pretraining

Tokenization for Transformers is mainly done with subword methods such as byte-pair encoding (BPE). BERT in particular uses a closely related method called WordPiece. To load the pretrained tokenizer that matches a specific pretrained model, we'll use the following.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Here's a sample sentence to display how the tokenizer handles sequences.

In [None]:
s = "I liked all the jokes hahaha! Taadasafaunknowntoken."

Encode it into the vocabulary's indexes. Each tokenizer in HuggingFace Transformers has its own vocabulary that it uses internally.

In [None]:
out = tokenizer.encode(s)
print(out)

[101, 146, 3851, 1155, 1103, 13948, 5871, 2328, 2328, 106, 22515, 7971, 3202, 8057, 12660, 2728, 6540, 18290, 1424, 119, 102]


Here's the tokenized version.

In [None]:
print([tokenizer.decode([idx]) for idx in out])

['[CLS]', 'I', 'liked', 'all', 'the', 'jokes', 'ha', '##ha', '##ha', '!', 'Ta', '##ada', '##sa', '##fa', '##unk', '##no', '##wn', '##tok', '##en', '.', '[SEP]']
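The `##` prefixes above mark pieces that continue a word. WordPiece splits each word greedily, always taking the longest piece it can find in the vocabulary. The sketch below illustrates that lookup in plain Python. The names `wordpiece` and `toy_vocab` are hypothetical, and the vocabulary is a made-up toy, not the one shipped with `bert-base-cased`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (an assumption for illustration only)
toy_vocab = {"ha", "##ha", "joke", "##s", "like", "##d"}

print(wordpiece("hahaha", toy_vocab))  # ['ha', '##ha', '##ha']
print(wordpiece("jokes", toy_vocab))   # ['joke', '##s']
print(wordpiece("xyz", toy_vocab))     # ['[UNK]']
```

Because rare words decompose into known pieces, the tokenizer rarely needs the unknown token, which is why the nonsense word in the sample sentence still gets encoded.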


BERT uses a pretraining scheme called Masked Language Modeling (MLM). We won't pretrain BERT in this notebook (it'd take months on one GPU to perform). Instead, we'll illustrate how MLM works.

To load a pretrained model, we simply do the following.

In [None]:
model = BertForMaskedLM.from_pretrained('bert-base-cased')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


MLM's task is to predict words under the [MASK] tokens. We'll encode a sentence here.

In [None]:
s = "The quick brown fox [MASK] over the lazy dog."
input_tensor = tokenizer.encode(s)

print([tokenizer.decode([idx]) for idx in input_tensor])

input_tensor = torch.LongTensor(input_tensor).unsqueeze(0)

['[CLS]', 'The', 'quick', 'brown', 'fox', '[MASK]', 'over', 'the', 'lazy', 'dog', '.', '[SEP]']


Then pass it to the model.

In [None]:
with torch.no_grad():
    out = model(input_tensor)[0]

This gives us our logits.

In [None]:
out.shape

torch.Size([1, 12, 28996])

Let's see the predictions for each token.

In [None]:
preds = list(out.argmax(2).squeeze(0).numpy())

We can see that the model filled in the mask with a verb that it thinks is a likely word for that mask.

In [None]:
print([tokenizer.decode([idx]) for idx in preds])

['.', 'The', 'quick', 'brown', 'fox', 'loomed', 'over', 'the', 'lazy', 'dog', '.', '.']
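The argmax keeps only the single highest-scoring vocabulary entry per position. To see which alternatives the model weighed for a masked position, we can rank the softmax probabilities instead. This is a minimal sketch in plain Python: `top_k`, `toy_vocab`, and `toy_logits` are hypothetical, and the scores are invented for illustration rather than taken from the model.

```python
import math

def top_k(logits, vocab, k=3):
    """Return the k most probable tokens with their softmax probabilities."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores for the masked position over a four-word toy vocabulary
toy_vocab = ["jumped", "loomed", "ran", "banana"]
toy_logits = [2.0, 2.5, 1.0, -3.0]

for token, prob in top_k(toy_logits, toy_vocab):
    print(f"{token}: {prob:.3f}")
```

With the real model, the same ranking would be applied to the slice of the logits at the `[MASK]` position, whose last dimension spans the full vocabulary.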


To train MLM, we simply have to optimize a cross entropy loss.

In [None]:
criterion = nn.CrossEntropyLoss()

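We calculate the loss the same way that we do for normal language modeling: flatten the logits and the targets, then compare them token by token. Note that this demo uses the input ids (with `[MASK]` still in place) as targets purely for illustration. Actual MLM training uses the original, unmasked tokens as labels and scores only the masked positions.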

In [None]:
with torch.no_grad():
    loss = criterion(out.flatten(0, 1), input_tensor.flatten(0))

Here's our loss.

In [None]:
loss.item()

4.252188205718994
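During actual MLM pretraining, the inputs are corrupted on the fly and the loss is scored only where tokens were masked. Here is a minimal sketch of that data preparation in plain Python. The function `mask_tokens` is hypothetical, the `[MASK]` id of 103 is an assumption about the `bert-base-cased` vocabulary, and the recipe is simplified: BERT selects about 15% of tokens and replaces most, not all, of them with `[MASK]`.

```python
import random

IGNORE_INDEX = -100  # default ignore_index of nn.CrossEntropyLoss

def mask_tokens(ids, mask_id, special_ids, prob=0.15, seed=0):
    """Replace random non-special tokens with mask_id and build MLM labels."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in ids:
        if tok not in special_ids and rng.random() < prob:
            inputs.append(mask_id)       # the model sees [MASK] here
            labels.append(tok)           # and must recover the original token
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)  # unmasked positions do not count
    return inputs, labels

# [CLS] I liked all the jokes [SEP], using ids from the encoding shown earlier
ids = [101, 146, 3851, 1155, 1103, 13948, 102]
inputs, labels = mask_tokens(ids, mask_id=103, special_ids={101, 102}, prob=0.5)
print(inputs)
print(labels)
```

Setting the labels of unmasked positions to `-100` lets `nn.CrossEntropyLoss` skip them, so only the masked tokens contribute to the gradient.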

# Sentiment Classification

For this section, we'll finetune a pretrained DistilBERT model for sentiment classification on the iMDB dataset. Let's load the dataset.

In [None]:
df = pd.read_csv('imdb/train.csv').sample(frac=1.0, random_state=42)
text, labels = list(df['text']), list(df['sentiment'])

HuggingFace Transformers lets us tokenize and encode a text in a single call.

In [None]:
out = tokenizer(text[0], padding='max_length', truncation=True, max_length=512)

This gives us our input ids padded to a maximum sequence length.

In [None]:
print(len(out['input_ids']))
print(out['input_ids'])

512
[101, 2038, 1376, 11826, 119, 146, 1108, 7805, 1199, 2076, 1104, 10729, 5367, 2523, 1133, 1184, 146, 1400, 1108, 3600, 1603, 11826, 1115, 5671, 3839, 1104, 1412, 1159, 119, 2082, 10008, 1292, 5558, 1195, 1138, 1106, 1243, 1154, 1103, 1171, 2650, 2801, 1177, 1195, 1209, 1719, 1631, 12775, 1111, 1172, 1137, 11571, 1165, 1234, 1838, 2033, 1841, 119, 184, 1216, 14723, 1757, 1303, 119, 2160, 1128, 1267, 170, 1374, 2650, 1133, 1152, 1541, 1178, 12254, 1114, 1103, 3981, 1116, 119, 5723, 1112, 1103, 2252, 1676, 1120, 1103, 18976, 2133, 1395, 1108, 13825, 119, 1284, 1486, 1172, 1177, 1195, 1180, 1198, 1293, 7856, 1103, 6516, 1959, 1108, 1105, 1293, 1107, 11470, 19568, 1103, 1207, 4556, 9477, 1108, 119, 1284, 1267, 1103, 1376, 1873, 2566, 1272, 1131, 1209, 1138, 170, 1304, 1353, 1133, 1696, 1648, 1224, 1107, 1103, 2523, 1165, 1155, 26913, 7610, 5768, 119, 157, 3048, 1162, 6945, 1335, 5123, 26977, 1116, 1272, 1195, 1444, 1113, 1107, 2440, 1106, 1815, 1103, 4928, 3075, 119, 1109, 2213, 2564, 1

The tokenizer also returns an attention mask and token type ids.

The attention mask tells the model which positions hold real tokens (1) and which hold padding (0), so that padding tokens receive no attention. Token type ids identify which sequence a token belongs to. In tasks like entailment, where there are two input sequences, this helps the model tell sentence 1 from sentence 2.

In [None]:
print(out['attention_mask'])
print(out['token_type_ids'])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
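To make the role of the mask concrete, here is a minimal sketch in plain Python of padding a sequence and then using the mask inside a softmax. The helpers `pad_and_mask` and `masked_softmax` are hypothetical, the padding id of 0 is an assumption about the BERT vocabulary, and the truncation rule is a simplification of what the tokenizer does.

```python
import math

def pad_and_mask(ids, max_length, pad_id=0):
    """Truncate or right-pad ids to max_length and build the matching mask."""
    if len(ids) > max_length:
        # Simplified truncation: keep the final token (e.g. [SEP]) in place
        ids = ids[:max_length - 1] + [ids[-1]]
    n_pad = max_length - len(ids)
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask

def masked_softmax(scores, mask):
    """Softmax that gives zero weight to positions where mask is 0.

    Assumes at least one position is unmasked.
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

ids, mask = pad_and_mask([101, 2038, 102], max_length=5)
print(ids)   # [101, 2038, 102, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]

weights = masked_softmax([1.0, 2.0, 3.0, 4.0, 5.0], mask)
print(weights[3], weights[4])  # 0.0 0.0
```

Even though the padded positions carry the largest raw scores in this toy example, they end up with zero attention weight.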

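We'll tokenize and encode the entire dataset. The library's fast tokenizers (such as `BertTokenizerFast`) run on a backend written in Rust and are typically much faster than a hand-written Python tokenizer. The `BertTokenizer` class loaded earlier is the pure-Python implementation.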

In [None]:
stime = time.time()
tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=512)
print("Time elapsed: {:.2f}s".format(time.time() - stime))

Time elapsed: 54.74s


Here's the number of samples that are tokenized.

In [None]:
len(tokenized['input_ids'])

25000

As usual, we'll split them into training and validation sets.

In [None]:
tr_sz = int(len(text) * 0.7) 

X_train_input, X_train_mask, X_train_types = tokenized['input_ids'][:tr_sz], tokenized['attention_mask'][:tr_sz], tokenized['token_type_ids'][:tr_sz]
X_valid_input, X_valid_mask, X_valid_types = tokenized['input_ids'][tr_sz:], tokenized['attention_mask'][tr_sz:], tokenized['token_type_ids'][tr_sz:]
y_train, y_valid = labels[:tr_sz], labels[tr_sz:]

Convert them to PyTorch tensors afterwards.

In [None]:
X_train_input, X_train_mask, X_train_types = torch.LongTensor(X_train_input), torch.LongTensor(X_train_mask), torch.LongTensor(X_train_types)
X_valid_input, X_valid_mask, X_valid_types = torch.LongTensor(X_valid_input), torch.LongTensor(X_valid_mask), torch.LongTensor(X_valid_types)
y_train, y_valid = torch.LongTensor(y_train), torch.LongTensor(y_valid)

Then make dataloaders.

Make sure to adjust your batch size depending on the memory capacity of your GPU.

In [None]:
bs = 16

train_set = datautils.TensorDataset(X_train_input, X_train_mask, X_train_types, y_train)
valid_set = datautils.TensorDataset(X_valid_input, X_valid_mask, X_valid_types, y_valid)

train_sampler = datautils.RandomSampler(train_set)
train_loader = datautils.DataLoader(train_set, batch_size=bs, sampler=train_sampler)
valid_loader = datautils.DataLoader(valid_set, batch_size=bs, shuffle=False)

Here's one batch.

In [None]:
x_in, x_ma, x_ty, y = next(iter(train_loader))
print(x_in.shape)

torch.Size([16, 512])
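As a quick sanity check, the number of batches per epoch follows from the dataset size, the 70/30 split, and the batch size. The arithmetic below is a small sketch that assumes the 25,000 samples and batch size of 16 used in this notebook. The variable names are illustrative.

```python
import math

n_samples = 25000
batch_size = 16

n_train = int(n_samples * 0.7)  # same split rule as tr_sz
n_valid = n_samples - n_train

# The last batch may be smaller, so round up
train_batches = math.ceil(n_train / batch_size)
valid_batches = math.ceil(n_valid / batch_size)
last_train_batch = n_train - (train_batches - 1) * batch_size

print(n_train, n_valid)              # 17500 7500
print(train_batches, valid_batches)  # 1094 469
print(last_train_batch)              # 12
```

These counts are what `len(train_loader)` and `len(valid_loader)` should report, and the final batch of each loader holds fewer than 16 samples.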


# Finetuning

Finetuning for a task is simple: we just have to attach a head to the pretrained model. HuggingFace Transformers provides prewritten modules that are essentially a pretrained model plus a head. We'll use ```BertForSequenceClassification``` for sentiment classification.

In [None]:
model = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)
criterion = nn.CrossEntropyLoss()

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

Passing our inputs to our model will give us our logits.

In [None]:
with torch.no_grad():
    out = model(input_ids=x_in, attention_mask=x_ma, token_type_ids=x_ty)[0]

Here's the output shape.

In [None]:
out.shape

torch.Size([16, 2])

Taking the argmax over dimension 1 (the class dimension) gives us the predictions.

In [None]:
out.argmax(1)

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

At this stage, the model's head still isn't trained, even though the transformer itself has been pretrained, which is why every sample above gets the same prediction. We need to finetune this model in order to induce the correct biases.

For context, here are the correct answers.

In [None]:
y

tensor([1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1])

The loss is computed the same way as in the RNN-based sentiment classification task.

In [None]:
with torch.no_grad():
    loss = criterion(out, y)
print(loss.item())

0.6767595410346985
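The value above sits close to ln 2 ≈ 0.693, which is the cross entropy of a two-class classifier that assigns equal scores to both classes. The sketch below works this out in plain Python. The function `cross_entropy` is a hypothetical single-sample version of what `nn.CrossEntropyLoss` computes.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one sample."""
    top = max(logits)  # subtract the max for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[target]

# Equal logits mean the classifier is guessing at chance
chance_loss = cross_entropy([0.0, 0.0], target=1)
print(round(chance_loss, 4))  # 0.6931

# A confident, correct prediction drives the loss toward zero
print(cross_entropy([-4.0, 4.0], target=1) < 0.01)  # True
```

A starting loss near this chance level is a useful sign that the freshly initialized head has not learned anything yet.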


# Training

Here's an accuracy function to help us calculate performance.

In [None]:
def accuracy(out, y): 
    with torch.no_grad():
        return torch.mean((out.argmax(1) == y).float()).item()
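One detail to keep in mind when this per-batch accuracy is averaged over batches: a smaller final batch counts as much as a full one. The sketch below contrasts the two ways of averaging in plain Python. The helper names and the toy batch counts are hypothetical.

```python
def mean_of_batches(batches):
    """Average the per-batch accuracies, treating every batch equally."""
    return sum(correct / size for correct, size in batches) / len(batches)

def sample_weighted(batches):
    """Accuracy over all samples, so each sample counts once."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size

# (correct predictions, batch size): one full batch and one short final batch
batches = [(16, 16), (2, 4)]

print(mean_of_batches(batches))  # 0.75
print(sample_weighted(batches))  # 0.9
```

With over a thousand batches and only one short batch, the difference is tiny, so the simpler per-batch average is a reasonable choice for monitoring.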

To conserve GPU memory, we'll use a DistilBERT model instead of a full BERT model. This is a compressed version of BERT that's smaller and lighter. We'll discuss model compression in a future session. For all intents and purposes, this performs the same role as standard BERT, but it's just smaller.

In [None]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2).to(device)
criterion = nn.CrossEntropyLoss()

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier

We'll set which parameters receive weight decay, the number of epochs, the optimizer settings, and the scheduler settings.

We're using AdamW, a variant of Adam with decoupled weight decay, together with a linear schedule with warmup. The schedule raises the learning rate linearly from 0 to its peak value over the first 10% of training steps, then decays it linearly back to 0.

In [None]:
weight_decay = 1e-8
learning_rate = 5e-5
epochs = 3

no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     "weight_decay": weight_decay},
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0}
]

optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(0.1 * steps), num_training_steps=steps)
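The shape of this schedule is easy to reproduce by hand. The sketch below computes the learning rate at a given step in plain Python. `linear_warmup_lr` is a hypothetical stand-in for the multiplier the scheduler applies, and the step counts assume 1094 batches per epoch for 3 epochs.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        remaining = total_steps - step
        factor = max(0.0, remaining / max(1, total_steps - warmup_steps))
    return base_lr * factor

total = 1094 * 3           # assumed: batches per epoch times epochs
warmup = int(0.1 * total)  # first 10% of steps

# Start, middle of warmup, peak, and end of training
for step in (0, warmup // 2, warmup, total):
    print(step, linear_warmup_lr(step, 5e-5, warmup, total))
```

The rate starts at zero, reaches the peak of 5e-5 when warmup ends, and returns to zero on the last step, which is why `scheduler.step()` is called once per batch rather than once per epoch.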

Then train.

In [None]:
for e in range(1, epochs + 1):
    train_loss, train_acc = 0, 0
    
    model.train()
    for x_in, x_ma, x_ty, y in tqdm(train_loader):
        # DistilBERT takes no token_type_ids, so x_ty stays unused
        x_in, x_ma, y = x_in.to(device), x_ma.to(device), y.to(device)

        out = model(input_ids=x_in, attention_mask=x_ma)[0]
        loss = criterion(out, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        train_loss += loss.item()
        train_acc += accuracy(out, y)
    train_loss /= len(train_loader)
    train_acc /= len(train_loader)

    valid_loss, valid_acc = 0, 0
    
    model.eval()
    with torch.no_grad():
        for x_in, x_ma, x_ty, y in tqdm(valid_loader):
            x_in, x_ma, y = x_in.to(device), x_ma.to(device), y.to(device)

            out = model(input_ids=x_in, attention_mask=x_ma)[0]
            loss = criterion(out, y)

            valid_loss += loss.item()
            valid_acc += accuracy(out, y)
    valid_loss /= len(valid_loader)
    valid_acc /= len(valid_loader)

    print("\nEpoch {:3} | Train Loss {:.4f} | Train Acc {:.4f} | Valid Loss {:.4f} | Valid Acc {:.4f}".format(e, train_loss, train_acc, valid_loss, valid_acc))

100%|██████████| 1094/1094 [07:53<00:00,  2.31it/s]
100%|██████████| 469/469 [01:06<00:00,  7.06it/s]
  0%|          | 0/1094 [00:00<?, ?it/s]


Epoch   1 | Train Loss 0.3226 | Train Acc 0.8561 | Valid Loss 0.2269 | Valid Acc 0.9147


100%|██████████| 1094/1094 [07:54<00:00,  2.31it/s]
100%|██████████| 469/469 [01:06<00:00,  7.06it/s]
  0%|          | 0/1094 [00:00<?, ?it/s]


Epoch   2 | Train Loss 0.1479 | Train Acc 0.9452 | Valid Loss 0.2260 | Valid Acc 0.9181


100%|██████████| 1094/1094 [07:54<00:00,  2.31it/s]
100%|██████████| 469/469 [01:06<00:00,  7.07it/s]


Epoch   3 | Train Loss 0.0435 | Train Acc 0.9874 | Valid Loss 0.2888 | Valid Acc 0.9221





Our final validation accuracy is at 92%, which is way higher than our previous benchmark using RNNs!