# Fine-tuning RoBERTa with LoRA on the Yelp Reviews polarity dataset
- This code is inspired by the original implementation of LoRA: https://github.com/microsoft/LoRA
- In this notebook, the RoBERTa base pretrained model is used: https://huggingface.co/FacebookAI/roberta-base

# Dependencies


In [None]:
%%capture
!pip install transformers datasets

In [None]:
from datasets import load_dataset
from transformers import RobertaModel, RobertaTokenizer

import torch
import torch.nn.functional as F

In [None]:
from torch.utils.data import DataLoader, Dataset

# Data

The models will be fine-tuned on the [yelp_polarity](https://huggingface.co/datasets/yelp_polarity) dataset.

In [None]:
ds = load_dataset('yelp_polarity')


In [None]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 560000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 38000
    })
})


In [None]:
print(ds['train'][0]['text'])
print(ds['train'][0]['label'])

Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars.
0


To save on time and memory, create a smaller subset of the full dataset.

In [None]:
TRAIN_SUBSET_SIZE = 30000
TEST_SUBSET_SIZE = 5000

In [None]:
train_dataset = ds['train'].shuffle(seed=42).select(range(TRAIN_SUBSET_SIZE))
test_dataset = ds['test'].shuffle(seed=42).select(range(TEST_SUBSET_SIZE))

### Tokenize

In [None]:
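# RoBERTa's byte-level BPE vocabulary is case-sensitive, so no lowercasing is applied.
# Truncation is requested per call in tokenize_text below.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")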


In [None]:
print("Tokenizer max input length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)

Tokenizer max input length: 512
Tokenizer vocabulary size: 50265


In [None]:
MAX_LENGTH = 288

In [None]:
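def tokenize_text(batch):
  # Pad every example to MAX_LENGTH so all rows have the same length,
  # regardless of which map() batch they were tokenized in.
  return tokenizer(batch["text"],
                   padding="max_length",
                   truncation=True,
                   return_token_type_ids=True,
                   max_length=MAX_LENGTH)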

In [None]:
tokenized_train_dataset = train_dataset.map(tokenize_text, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_text, batched=True)


In [None]:
print(tokenized_train_dataset)

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 30000
})


In [None]:
print(tokenized_test_dataset)

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 5000
})


In [None]:
# Clear some memory
del ds

In [None]:
columns=["label", "input_ids", "attention_mask", "token_type_ids"]

tokenized_train_dataset.set_format("torch", columns=columns)
tokenized_test_dataset.set_format("torch", columns=columns)

### Dataset class

In [None]:
BATCH_SIZE = 16

In [None]:
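class MyDataset(Dataset):
  """Thin wrapper exposing a tokenized Hugging Face dataset to a PyTorch DataLoader."""
  def __init__(self, dataset, partition_key):
    # partition_key labels the split at the call site but is not used internally.
    self.dataset = dataset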

  def __getitem__(self, index):
    return self.dataset[index]

  def __len__(self):
    return self.dataset.num_rows

In [None]:
train_data = MyDataset(tokenized_train_dataset, partition_key="train")
test_data = MyDataset(tokenized_test_dataset, partition_key="test")

### Set up DataLoaders

In [None]:
train_loader = DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(dataset=test_data, batch_size=BATCH_SIZE)
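As a quick sanity check on the loaders, the number of batches follows directly from the subset sizes and `BATCH_SIZE`. The sketch below is plain arithmetic and does not touch the real loaders; the helper names are illustrative only. It relies on `DataLoader` keeping the final partial batch, which is its default behavior (`drop_last=False`).

```python
import math

def num_batches(num_examples, batch_size):
    # With drop_last=False, the last partial batch is kept, hence the ceiling.
    return math.ceil(num_examples / batch_size)

def last_batch_size(num_examples, batch_size):
    # Size of the final batch: the remainder, or a full batch if it divides evenly.
    remainder = num_examples % batch_size
    return remainder if remainder else batch_size

train_batches = num_batches(30_000, 16)
test_batches = num_batches(5_000, 16)
train_last = last_batch_size(30_000, 16)
test_last = last_batch_size(5_000, 16)

print(train_batches, test_batches)  # 1875 313
print(train_last, test_last)        # 16 8
```

These counts match the `1875` and `313` totals shown in the training and evaluation logs later in the notebook.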

# RoBERTa base model
First, as a baseline for the LoRA comparison, I'll fully fine-tune the [RoBERTa](https://huggingface.co/FacebookAI/roberta-base) pretrained base model with some classification layers added on top of it.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

The custom class below combines the RoBERTa model with fully connected layers for classification.

In [None]:
class RobertaWithClassification(torch.nn.Module):
  def __init__(self):
    super(RobertaWithClassification, self).__init__()
    self.roberta = RobertaModel.from_pretrained("roberta-base")
    self.linear = torch.nn.Linear(768, 768)
    self.activation = torch.nn.ReLU()
    self.dropout = torch.nn.Dropout(0.3)
    self.classifier = torch.nn.Linear(768, 2)

  # output of the roberta model:
  # https://huggingface.co/transformers/v3.2.0/main_classes/output.html#basemodeloutputwithpooling
  def forward(self, input_ids, attention_mask, token_type_ids):
    output_with_pooling = self.roberta(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    hidden_state = output_with_pooling[0]
    pooler = hidden_state[:,0]
    pooler = self.linear(pooler)
    pooler = self.activation(pooler)
    pooler = self.dropout(pooler)
    output = self.classifier(pooler)
    return output

In [None]:
model = RobertaWithClassification()


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model.to(device)

RobertaWithClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (Layer

### Total trainable parameters

In [None]:
def count_parameters(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [None]:
base_param_count = count_parameters(model)
print(base_param_count)

125237762


### Fine-tuning

In [None]:
import time

In [None]:
lr = 1e-5
EPOCHS = 3

In [None]:
def get_accuracy(y_pred, targets):
  # argmax over the raw logits selects the same class as argmax over log-softmax
  predictions = y_pred.argmax(dim=1)
  accuracy = (predictions == targets).sum() / len(targets)
  return accuracy
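The predicted class depends only on which logit is largest in each row. Log-softmax subtracts the same constant from every logit in a row, so it never changes that ordering. The following is a small plain-Python sketch of this point, using made-up logits and hypothetical helper names rather than the notebook's tensors.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax for one row of logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def argmax(values):
    # Index of the largest value in the list.
    return max(range(len(values)), key=values.__getitem__)

# Made-up logits for three examples and two classes.
logits = [[2.0, -1.0], [0.3, 0.9], [-4.0, -3.5]]

preds_from_logits = [argmax(row) for row in logits]
preds_from_log_probs = [argmax(log_softmax(row)) for row in logits]

print(preds_from_logits)     # [0, 1, 1]
print(preds_from_log_probs)  # [0, 1, 1]
```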

In [None]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=lr)
loss_function = torch.nn.CrossEntropyLoss()

In [None]:
def train(model, train_loader, epochs, optimizer):
  total_time = 0

  for epoch in range(epochs):
    interval = max(1, len(train_loader) // 5)  # avoid modulo by zero on tiny loaders

    total_train_loss = 0
    total_train_acc = 0

    start = time.time()

    model.train()
    for batch_idx, batch in enumerate(train_loader):
      optimizer.zero_grad()

      input_ids = batch["input_ids"].to(device)
      attention_mask = batch["attention_mask"].to(device)
      token_type_ids = batch["token_type_ids"].to(device)
      labels = batch["label"].to(device)

      outputs = model(input_ids,
                      attention_mask=attention_mask,
                      token_type_ids=token_type_ids)

      loss = loss_function(outputs, labels)
      acc = get_accuracy(outputs, labels)

      total_train_loss += loss.item()
      total_train_acc += acc.item()

      loss.backward()
      optimizer.step()

      if (batch_idx + 1) % interval == 0:
        print("Batch: %s/%s | Training loss: %.4f | accuracy: %.4f" % (batch_idx+1, len(train_loader), loss, acc))

    train_loss = total_train_loss / len(train_loader)
    train_acc = total_train_acc / len(train_loader)

    end = time.time()
    hours, remainder = divmod(end - start, 3600)
    minutes, seconds = divmod(remainder, 60)

    print(f"Epoch: {epoch+1} train loss: {train_loss:.4f} train acc: {train_acc:.4f}")
    print("Epoch time elapsed: {:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))
    print("")

    total_time += (end - start)

  # Get the average time per epoch
  average_time_per_epoch = total_time / epochs
  hours, remainder = divmod(average_time_per_epoch, 3600)
  minutes, seconds = divmod(remainder, 60)

  print("Average time per epoch: {:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))

In [None]:
train(model, train_loader, EPOCHS, optimizer)

Batch: 375/1875 | Training loss: 0.0164 | accuracy: 1.0000
Batch: 750/1875 | Training loss: 0.0276 | accuracy: 1.0000
Batch: 1125/1875 | Training loss: 0.0189 | accuracy: 1.0000
Batch: 1500/1875 | Training loss: 0.0116 | accuracy: 1.0000
Batch: 1875/1875 | Training loss: 0.0857 | accuracy: 1.0000
Epoch: 1 train loss: 0.1393 train acc: 0.9458
Epoch time elapsed: 00:22:40.62

Batch: 375/1875 | Training loss: 0.0225 | accuracy: 1.0000
Batch: 750/1875 | Training loss: 0.0084 | accuracy: 1.0000
Batch: 1125/1875 | Training loss: 0.2103 | accuracy: 0.8750
Batch: 1500/1875 | Training loss: 0.0315 | accuracy: 1.0000
Batch: 1875/1875 | Training loss: 0.0302 | accuracy: 1.0000
Epoch: 2 train loss: 0.0743 train acc: 0.9737
Epoch time elapsed: 00:22:43.61

Batch: 375/1875 | Training loss: 0.0093 | accuracy: 1.0000
Batch: 750/1875 | Training loss: 0.0756 | accuracy: 1.0000
Batch: 1125/1875 | Training loss: 0.0147 | accuracy: 1.0000
Batch: 1500/1875 | Training loss: 0.0081 | accuracy: 1.0000
Batch: 1

### Evaluation

In [None]:
def evaluate(model, test_loader):
  interval = max(1, len(test_loader) // 5)  # avoid modulo by zero on tiny loaders

  total_test_loss = 0
  total_test_acc = 0

  model.eval()
  with torch.no_grad():
    for batch_idx, batch in enumerate(test_loader):
      input_ids = batch["input_ids"].to(device)
      attention_mask = batch["attention_mask"].to(device)
      token_type_ids = batch["token_type_ids"].to(device)
      labels = batch["label"].to(device)

      outputs = model(input_ids,
                      attention_mask=attention_mask,
                      token_type_ids=token_type_ids)
      loss = loss_function(outputs, labels)
      acc = get_accuracy(outputs, labels)

      total_test_loss += loss.item()
      total_test_acc += acc.item()

      if (batch_idx + 1) % interval == 0:
        print("Batch: %s/%s | Test loss: %.4f | accuracy: %.4f" % (batch_idx+1, len(test_loader), loss, acc))

  test_loss = total_test_loss / len(test_loader)
  test_acc = total_test_acc / len(test_loader)

  print(f"Test loss: {test_loss:.4f} acc: {test_acc:.4f}")
  print("")

In [None]:
evaluate(model, test_loader)

Batch: 62/313 | Test loss: 0.0343 | accuracy: 1.0000
Batch: 124/313 | Test loss: 0.0125 | accuracy: 1.0000
Batch: 186/313 | Test loss: 0.3581 | accuracy: 0.9375
Batch: 248/313 | Test loss: 0.0010 | accuracy: 1.0000
Batch: 310/313 | Test loss: 0.0006 | accuracy: 1.0000
Test loss: 0.0835 acc: 0.9730
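One detail of `evaluate` is worth keeping in mind: it averages the per-batch accuracies, and the last test batch holds only 8 examples instead of 16. That small batch therefore carries the same weight as a full one. The sketch below uses hypothetical correct-counts (not the notebook's actual predictions) to show how the two ways of averaging can differ slightly.

```python
def mean_of_batch_accuracies(correct_counts, batch_sizes):
    # What evaluate() computes: every batch contributes equally.
    return sum(c / n for c, n in zip(correct_counts, batch_sizes)) / len(batch_sizes)

def per_example_accuracy(correct_counts, batch_sizes):
    # Every example contributes equally.
    return sum(correct_counts) / sum(batch_sizes)

# 5000 test examples with batch size 16: 312 full batches plus one batch of 8.
batch_sizes = [16] * 312 + [8]
# Hypothetical outcome: all full batches correct, half of the last batch wrong.
correct_counts = [16] * 312 + [4]

batch_mean = mean_of_batch_accuracies(correct_counts, batch_sizes)
example_mean = per_example_accuracy(correct_counts, batch_sizes)

print(f"{batch_mean:.4f}")    # 0.9984
print(f"{example_mean:.4f}")  # 0.9992
```

With only one partial batch out of 313, the gap is small, so the reported test accuracies remain a close approximation of the per-example accuracy.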



# Fine-tuning RoBERTa with LoRA Layers

### LoRA Layer

In [None]:
import math

In [None]:
class LoRALayer(torch.nn.Module):
  def __init__(self, in_dim, out_dim, r, alpha):
    super().__init__()
    self.r = r
    self.alpha = alpha

    # Initialize A to kaiming uniform following code: https://github.com/microsoft/LoRA/blob/main/loralib/layers.py
    self.A = torch.nn.Parameter(torch.empty(r, in_dim))
    # Initialize B to zeros.
    self.B = torch.nn.Parameter(torch.empty(out_dim, r))
    torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
    torch.nn.init.zeros_(self.B)

    self.scaling = self.alpha / self.r

  def forward(self, x):
    x = self.scaling * (x @ self.A.transpose(0, 1) @ self.B.transpose(0, 1))
    return x
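In equation form, the layer above computes a scaled low-rank update. The notation here is introduced for illustration: $x$ is an input row vector, $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$ are the two trainable matrices, and $W$, $b$ are the weight and bias of the wrapped linear layer.

$$
\Delta(x) = \frac{\alpha}{r}\, x A^{\top} B^{\top}
$$

When this update is added to the output of the original linear layer, the result is

$$
h = x W^{\top} + b + \frac{\alpha}{r}\, x A^{\top} B^{\top}
$$

Since $B$ is initialized to zeros, $\Delta(x) = 0$ at the start of training, so the wrapped layer initially behaves exactly like the pretrained one. Each wrapped layer adds $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters.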

In [None]:
class LinearWithLoRA(torch.nn.Module):
  def __init__(self, linear, r, alpha):
    super().__init__()
    self.linear = linear
    self.lora = LoRALayer(
        linear.in_features, linear.out_features, r, alpha
    )

  def forward(self, x):
    return self.linear(x) + self.lora(x)
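The effect of the zero initialization of `B` can be checked on a toy example. The sketch below re-declares a compact version of `LoRALayer` so that it runs on its own, and uses small made-up dimensions (8 inputs, 4 outputs, rank 2) rather than the 768-dimensional RoBERTa projections.

```python
import math
import torch

class LoRALayer(torch.nn.Module):
    """Compact copy of the notebook's layer: scaling * x A^T B^T."""
    def __init__(self, in_dim, out_dim, r, alpha):
        super().__init__()
        self.A = torch.nn.Parameter(torch.empty(r, in_dim))
        self.B = torch.nn.Parameter(torch.zeros(out_dim, r))
        torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x):
        return self.scaling * (x @ self.A.T @ self.B.T)

torch.manual_seed(0)
linear = torch.nn.Linear(8, 4)        # toy stand-in for a query/value projection
lora = LoRALayer(8, 4, r=2, alpha=4)  # toy rank and alpha
x = torch.randn(3, 8)

with torch.no_grad():
    base_out = linear(x)
    # At initialization B is all zeros, so the LoRA branch contributes nothing.
    same_at_init = torch.allclose(base_out, linear(x) + lora(x))
    # Once B moves away from zero, the wrapped output starts to differ.
    lora.B.add_(0.1)
    changed_after_update = not torch.allclose(base_out, linear(x) + lora(x))

print(same_at_init, changed_after_update)
```

This is why wrapping a pretrained layer does not disturb its outputs before any training step has been taken.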

In [None]:
lora_model = RobertaWithClassification()

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


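Freeze all existing model parameters. Because this loop runs over the whole model, it also freezes the newly initialized classification head (`linear` and `classifier`), so only the LoRA matrices added below will be trained.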

In [None]:
for param in lora_model.parameters():
  param.requires_grad = False

Add LoRA to the query and value projections in every self-attention layer

In [None]:
from functools import partial

In [None]:
lora_r = 16
lora_alpha = lora_r * 2

assign_lora = partial(LinearWithLoRA, r=lora_r, alpha=lora_alpha)

In [None]:
print(lora_model)

RobertaWithClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (Layer

In [None]:
for layer in lora_model.roberta.encoder.layer:
  layer.attention.self.query = assign_lora(layer.attention.self.query)
  layer.attention.self.value = assign_lora(layer.attention.self.value)

Total trainable parameters with LoRA layers

In [None]:
lora_param_count = count_parameters(lora_model)
print("Model with LoRA param count:", lora_param_count)
print("Base model param count:", base_param_count)
print(str(base_param_count // lora_param_count) + " times fewer trainable parameters than the base model")

Model with LoRA param count: 589824
Base model param count: 125237762
212 times fewer trainable parameters than the base model


Compared to full fine-tuning of the base model, the model with the LoRA layers has far fewer trainable parameters: about 590K versus 125M. The frozen RoBERTa weights are still present and used in the forward pass.
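The LoRA parameter count can be reproduced by hand from the configuration used in this notebook: 12 encoder layers, two wrapped projections per layer (query and value), 768-dimensional inputs and outputs, and rank 16. The sketch below is plain arithmetic; the function name is illustrative only.

```python
def lora_trainable_params(num_layers, wrapped_per_layer, in_dim, out_dim, r):
    # Each wrapped linear layer adds A (r x in_dim) and B (out_dim x r).
    per_module = r * in_dim + out_dim * r
    return num_layers * wrapped_per_layer * per_module

lora_params = lora_trainable_params(num_layers=12, wrapped_per_layer=2,
                                    in_dim=768, out_dim=768, r=16)
base_params = 125_237_762  # trainable parameters reported for full fine-tuning

print(lora_params)                               # 589824
print(base_params // lora_params)                # 212
print(f"{100 * lora_params / base_params:.2f}%") # 0.47%
```

The result matches the printed count exactly, which confirms that nothing other than the LoRA matrices is being trained.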

In [None]:
lora_model.to(device)

RobertaWithClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): LinearWithLoRA(
                (linear): Linear(in_features=768, out_features=768, bias=True)
                (lora): LoRALayer()
              )
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): LinearWithLoRA(
                (linear): Linear(in_features=768, out_features=768, bias=True)
                (lora): LoRALayer()
              )
              (dro

### Fine-tuning

In [None]:
lr = 1e-5
EPOCHS = 3

In [None]:
optimizer_lora = torch.optim.Adam(params=lora_model.parameters(), lr=lr)
loss_function = torch.nn.CrossEntropyLoss()

In [None]:
train(lora_model, train_loader, EPOCHS, optimizer_lora)

Batch: 375/1875 | Training loss: 0.6787 | accuracy: 0.4375
Batch: 750/1875 | Training loss: 0.5992 | accuracy: 0.6875
Batch: 1125/1875 | Training loss: 0.1386 | accuracy: 1.0000
Batch: 1500/1875 | Training loss: 0.2557 | accuracy: 0.8750
Batch: 1875/1875 | Training loss: 0.1141 | accuracy: 0.9375
Epoch: 1 train loss: 0.4168 train acc: 0.7664
Epoch time elapsed: 00:17:04.75

Batch: 375/1875 | Training loss: 0.0992 | accuracy: 1.0000
Batch: 750/1875 | Training loss: 0.1296 | accuracy: 0.9375
Batch: 1125/1875 | Training loss: 0.2146 | accuracy: 0.9375
Batch: 1500/1875 | Training loss: 0.4281 | accuracy: 0.8750
Batch: 1875/1875 | Training loss: 0.2361 | accuracy: 0.8750
Epoch: 2 train loss: 0.1529 train acc: 0.9511
Epoch time elapsed: 00:17:04.99

Batch: 375/1875 | Training loss: 0.2602 | accuracy: 0.9375
Batch: 750/1875 | Training loss: 0.0311 | accuracy: 1.0000
Batch: 1125/1875 | Training loss: 0.2048 | accuracy: 0.8750
Batch: 1500/1875 | Training loss: 0.0291 | accuracy: 1.0000
Batch: 1

### Evaluation

In [None]:
evaluate(lora_model, test_loader)

Batch: 62/313 | Test loss: 0.1162 | accuracy: 0.9375
Batch: 124/313 | Test loss: 0.0343 | accuracy: 1.0000
Batch: 186/313 | Test loss: 0.2203 | accuracy: 0.9375
Batch: 248/313 | Test loss: 0.0185 | accuracy: 1.0000
Batch: 310/313 | Test loss: 0.0147 | accuracy: 1.0000
Test loss: 0.1140 acc: 0.9617



# Results Comparisons

**Average time per epoch**
- Without LoRA: 22 minutes 42.57 seconds
- With LoRA: 17 minutes 4.92 seconds

**Test set Accuracy**
- Without LoRA: 97.30%
- With LoRA: 96.17%

The LoRA model trains about 212x fewer parameters than the fully fine-tuned model (roughly 590K versus 125M) and takes about 25% less time per epoch. After 3 epochs of training, it reaches a test set accuracy of 96.17%, compared to 97.30% for full fine-tuning.
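The timing comparison can be worked out from the two averages listed above. The sketch below converts both to seconds and reports the difference in two ways, since "faster" can be read either as time saved or as a speedup factor. The helper name is illustrative only.

```python
def to_seconds(minutes, seconds):
    # Convert a minutes/seconds pair to seconds.
    return 60 * minutes + seconds

full_ft = to_seconds(22, 42.57)  # average epoch time without LoRA
lora_ft = to_seconds(17, 4.92)   # average epoch time with LoRA

time_saved = 1 - lora_ft / full_ft  # fraction of epoch time saved
speedup = full_ft / lora_ft         # how many times faster an epoch runs

print(f"time saved per epoch: {100 * time_saved:.1f}%")  # 24.8%
print(f"speedup factor: {speedup:.2f}x")                 # 1.33x
```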