# S&DS 617 Advances in Large Language Models: Theory and Applications: Assignment 3

**Deadline**

Assignment 3 is due Monday, April 7th at 1:30pm.

**Submission**

Submit your assignment as a .pdf on Gradescope. On Gradescope, there are 2 assignments, one where you will submit a pdf file and one where you will submit the corresponding .ipynb that generated it. 
Note: The problems in each homework assignment are numbered. When submitting the pdf on Gradescope, please select the correct pages that correspond to each problem. 

To produce the .pdf, do the following to preserve the cell structure of the notebook:
- Go to "File" at the top-left of your Jupyter Notebook
- Under "Download as", select "HTML (.html)"
- After the .html has downloaded, open it and then select "File" and "Print"
- From the print window, select the option to save as a .pdf

In this assignment, we will fine-tune BERT using LoRA on the Yelp reviews dataset to predict whether a review is positive or negative. Some starter code has been provided to you below. 

### Load data

In [2]:
# Uncomment to install relevant packages if not already installed
!pip install transformers
!pip install torch torchvision
!pip install datasets
!pip install huggingface_hub
!pip install peft==0.4.0 accelerate==0.26.0


Collecting torchvision
  Downloading torchvision-0.21.0-cp311-cp311-macosx_11_0_arm64.whl (1.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
Collecting torch
  Downloading torch-2.6.0-cp311-none-macosx_11_0_arm64.whl (66.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m66.5/66.5 MB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: torch, torchvision
  Attempting uninstall: torch
    Found existing installation: torch 2.5.1
    Uninstalling torch-2.5.1:
      Successfully uninstalled torch-2.5.1
Successfully installed torch-2.6.0 torchvision-0.21.0
Collecting peft==0.4.0
  Downloading peft-0.4.0-py3-none-any.whl (72 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.9/72.9 kB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting accelerate==0.26.0
  Downloading accelerate-0.26.0-py3-none-any.whl (270 kB)
[

In [4]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.50.3-py3-none-any.whl (10.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.2/10.2 MB[0m [31m504.6 kB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.49.0
    Uninstalling transformers-4.49.0:
      Successfully uninstalled transformers-4.49.0
Successfully installed transformers-4.50.3


In [1]:
import datasets
import requests, tarfile, os
import torch
import pandas as pd
from torch.utils.data import DataLoader, Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import roc_auc_score

In [2]:
# URL of the Yelp reviews dataset
_DOWNLOAD_URL = "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz"

# Function to download and extract the dataset
def download_and_extract_dataset(url):
    # Download the dataset
    local_file = "yelp_review_polarity_csv.tgz"
    response = requests.get(url)
    open(local_file, 'wb').write(response.content)

    # Extract the dataset
    tar = tarfile.open(local_file, "r:gz")
    tar.extractall()
    tar.close()

# Function to load the dataset into pandas DataFrame
def load_dataset():
    # Download and extract if dataset directory doesn't exist
    if not os.path.exists("yelp_review_polarity_csv"):
        download_and_extract_dataset(_DOWNLOAD_URL)

    # Load train and test datasets
    train_data = pd.read_csv("yelp_review_polarity_csv/train.csv", header=None, names=["label", "text"])
    test_data = pd.read_csv("yelp_review_polarity_csv/test.csv", header=None, names=["label", "text"])
    
    # Subsample down
    train_data = train_data.sample(n=5000, random_state=11)
    test_data = test_data.sample(n=1000, random_state=94)
    
    # Adjust labels to be zero-based if necessary
    train_data['label'] -= 1
    test_data['label'] -= 1

    return train_data, test_data

# Load the data
train_data, test_data = load_dataset()

# Display the first few rows of the train and test sets
print(train_data.head())
print(test_data.head())

        label                                               text
478660      0  Not a good place. WAY over proceed for what it...
330978      1  They have the best cocktails! They are all mad...
473311      1  Great pizza. The owner use to work for metro s...
320559      1  Wow. What a great show!\n\nThere are several d...
526693      0  Me and my boyfriend started coming here after ...
       label                                               text
9423       1  Dirty martinis with blue cheese olives!!!!!!!!...
29439      0  My husband had gone to a different location th...
7064       1  Very tasty food! The lady's offer prompt and s...
18986      0  Very so so burgers. Best to head elsewhere for...
16556      0  As a New Yorker, I was looking forward to tryi...


In [3]:
train_data.iloc[0]['text']

'Not a good place. WAY over proceed for what it is, just a COMPLETE waste of time. Just go ANYWHERE else'

In [4]:
# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Dataset class for Yelp Polarity Reviews
class YelpPolarityReviewDataset(Dataset):
    def __init__(self, data, tokenizer, max_len):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Check if the index is valid
        if idx < 0 or idx >= len(self.data):
            raise IndexError(f"Index {idx} is out of bounds for dataset with length {len(self.data)}")
        review = self.data.iloc[idx]['text']
        label = self.data.iloc[idx]['label']
        encoding = self.tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Create datasets
max_len = 128
train_dataset = YelpPolarityReviewDataset(train_data, tokenizer, max_len)
test_dataset = YelpPolarityReviewDataset(test_data, tokenizer, max_len)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

In [5]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=1, lora_alpha=1, lora_dropout=0.1
)

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

model = get_peft_model(model, lora_config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# Function to compute AUC for binary classification
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert to probabilities using sigmoid (since it's binary classification)
    probs = np.exp(logits) / (1 + np.exp(logits))
    # For binary classification, use the probabilities of the positive class (assumed to be at index 1)
    probs = probs[:, 1]
    # Calculate AUC
    auc = roc_auc_score(labels, probs)
    return {"auc": auc}

# Training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",
    num_train_epochs=15,
)

# Initialize the Trainer
trainer = Trainer(
    model=model, 
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Epoch,Training Loss,Validation Loss,Auc
1,0.6679,0.545021,0.821097
2,0.4971,0.327098,0.936059
3,0.3532,0.314753,0.950798
4,0.3024,0.296714,0.956203
5,0.3014,0.300503,0.958212
6,0.2829,0.283367,0.960604
7,0.2812,0.288458,0.961805
8,0.2786,0.284383,0.963149
9,0.2689,0.280952,0.964021
10,0.2875,0.283177,0.964697


TrainOutput(global_step=9375, training_loss=0.31794892008463543, metrics={'train_runtime': 14936.8093, 'train_samples_per_second': 5.021, 'train_steps_per_second': 0.628, 'total_flos': 4935544243200000.0, 'train_loss': 0.31794892008463543, 'epoch': 15.0})

## Problem 1: Comparison of Trainable Parameters
Compare the total number of trainable parameters between the standard fine-tuning approach and the Low-Rank Adaptation (LoRA) method. Detail how LoRA alters the model’s parameter complexity compared to traditional fine-tuning. Discuss the implications of these changes on the model's training efficiency, accuracy, and computational demand. Consider including a discussion on parameter sharing and model capacity in your answer. Under what scenarios might one method be preferred over the other, considering factors such as available computational resources, the complexity of the task, and dataset size?

## Problem 2: Hyperparameter Experimentation with LoRA
In your experimentation with the LoRA configuration, document the process and present your findings in a structured manner. Describe the specific LoRA hyperparameters you adjusted, including the rank (`r`) and the learning rate scale (`lora_alpha`). Test a range of values for each and analyze how changing these hyperparameters affects the model's performance in terms of training convergence speed and final model accuracy. Provide graphs or tables to support your findings