# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> Data Science 2: Advanced Topics in Data Science
## Homework 5 Part 1: Fine Tuning Llama3


**Harvard University**<br/>
**Spring 2025**<br/>
**Instructors**: Pavlos Protopapas, Natesh Pillai, and Chris Gumb


<hr style="height:2pt">

In [1]:
# RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get(
    "https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/"
    "content/styles/cs109.css"
).text
HTML(styles)

In [2]:
%%capture
!pip install datasets
!pip install -U bitsandbytes
import bitsandbytes as bnb
import os
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')
os.environ['HF_TOKEN'] = hf_token

In [3]:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np

from datasets import Dataset
from scipy.special import softmax
from sklearn.preprocessing import LabelEncoder
from transformers import (
    BitsAndBytesConfig,
    LlamaPreTrainedModel,
    LlamaModel,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
    set_seed
)
from transformers.modeling_outputs import SequenceClassifierOutput
from huggingface_hub import notebook_login, login

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from sklearn.metrics import log_loss, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from matplotlib import pyplot as plt

import time
import random
import os
import json
import gc
from collections import namedtuple

def setup_seed(seed):
    '''Sets the seed of the entire notebook so results are the same every time we run.'''
    random.seed(seed)
    set_seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)

setup_seed(42)

In [4]:
# Some PyTorch settings
print(f"PyTorch version: {torch.__version__}")
print(f"GPU availability: \n{torch.cuda.is_available()}\n")

TRAIN_CSV = "data/chatbot_arena_conversations.csv"
model_path = "meta-llama/Llama-3.2-3B-Instruct"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
MAX_LENGTH = 1024

PyTorch version: 2.6.0+cu124
GPU availability: 
True



In [5]:
# measure notebook runtime
time_start = time.time()

<div style = "background: lightsalmon; border: thin solid black; border-radius: 2px; padding: 5px">

### Instructions
- To submit your notebook, follow the instructions given on the Canvas assignment page.
- Plots should be legible and interpretable *without having to refer to the code that generated them*. They should include labels for the $x$- and $y$-axes as well as a descriptive title and/or legend when appropriate.
- When asked to interpret a visualization, do not simply describe it (e.g., "the curve has a steep slope up"), but instead explain what you believe the plot *means*.
- Autograding tests are mostly to help you debug. The tests are not exhaustive so simply passing all tests may not be sufficient for full credit.
- The use of *extremely* inefficient or error-prone code (e.g., copy-pasting nearly identical commands rather than looping) may result in only partial credit.
- We have tried to include all the libraries you may need to do the assignment in the imports cell provided below. Please get course staff approval before importing any additional 3rd party libraries.
- Enable scrolling output on cells with very long output.
- Feel free to add additional code or markdown cells as needed.
- Ensure your code runs top to bottom without error and passes all tests by restarting the kernel and running all cells (note that this can take a few minutes).
- **You should do a "Restart Kernel and Run All Cells" before submitting to ensure (1) your notebook actually runs and (2) all output is visible**
</div>


<a id="contents"></a>

## Notebook Contents

- [**PART 1: Fine Tuning LLM**](#part1)


## About this Homework

In this homework, we will explore how to fine-tune an LLM for a classification task. In particular, we will fine-tune a [Llama 3.2 3B model](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).

- In [PART 1](#part1), we will build a classifier for the [Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations).


**IMPORTANT NOTES:**

- LLMs are insanely computationally intensive.
- **We highly recommend that you train your model on a system using GPUs. For this, we recommend using the [GPU-enabled Jupyter environment](https://ood.huit.harvard.edu/pun/sys/dashboard/batch_connect/sys/ood-jupyterlab-spack-conda/cs1090b/session_contexts/new) provided to you as part of this course.**
- Models that take hours to train on CPUs can be trained in just minutes when using GPUs.
- **To avoid getting frustrated by having to re-train your models every time you run your notebook, you should save your trained model weights for later use.** Model history dictionaries can also be saved to disk with `pickle` and checked with an `if not` condition. This is a great way to check if the model weights exist before training, preventing redundant retraining. Please, think of the penguins! 🐧
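
The save-and-check pattern described in the last note can be sketched as follows. This is a minimal illustration rather than course-provided code: `train_model`, `load_or_train`, and the file name `history.pkl` are hypothetical stand-ins, and a real notebook would also save the model weights with the library's own save method, not only a history dictionary.

```python
import os
import pickle
import tempfile

def train_model():
    """Hypothetical stand-in for an expensive training run."""
    return {"loss": [0.9, 0.7, 0.6], "val_loss": [1.0, 0.8, 0.75]}

def load_or_train(path):
    """Return (history, was_trained); train only if no cached file exists."""
    if not os.path.exists(path):
        history = train_model()
        with open(path, "wb") as f:
            pickle.dump(history, f)
        return history, True
    with open(path, "rb") as f:
        return pickle.load(f), False

cache_dir = tempfile.mkdtemp()
history_path = os.path.join(cache_dir, "history.pkl")  # hypothetical file name

first_history, first_trained = load_or_train(history_path)    # trains and saves
second_history, second_trained = load_or_train(history_path)  # loads from disk
```

The second call skips training entirely because the file already exists, which is the behaviour you want when re-running the notebook from top to bottom.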

**KERNEL CRASHES:**

If your kernel crashes as you attempt to train your model, please check the following items:
- Models with too many parameters might not fit in GPU memory. Try reducing the size of your model.
- A large `batch_size` will attempt to load too many samples in GPU memory. Avoid using a very large batch size.
- Avoid creating multiple copies of the data.



<a id="part1"></a>
    
<!-- <div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA"> -->

# PART 1: Fine-Tuning LLMs (50 Points)


<a id="part1intro"></a>

## Overview

[Return to contents](#contents)

In this question, you will work with the [Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) to build a preference prediction model. Your task is to predict which response users will prefer when presented with two competing answers generated by different large language models (LLMs). The dataset contains a curated subset of conversations from the [Chatbot Arena platform](https://lmarena.ai/), where human users evaluate responses from various LLMs in head-to-head comparisons. You will fine-tune the Llama 3.2 3B Instruct model so that it can analyze the user's prompt and the two generated responses and determine which response users are more likely to prefer.

Learn more about Chatbot Arena here -> [https://arxiv.org/abs/2403.04132](https://arxiv.org/abs/2403.04132)

[Llama 3.2 Models](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices//)

**Note:** The dataset contains conversations that may be unsafe, offensive, or upsetting.

<div class="alert alert-success" style="color: #333; background-color: #e8fffb; border-color: #bcfff2; border-width: 1px; border-radius: 3px; padding: 10px;">

<a id="q11"></a>

<b>1.1 Initial Preprocessing</b>
<hr>

<b>Q1.1.1 - Preprocessing</b>

<a id="q111"></a>

- We provide the code for this part. It lives in a separate script, which is executed just once.
```python
%%writefile dataset_format.py
## preprocessing code here
```
- The `dataset_format.py` script does the following:

- Loads the [Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) using [HuggingFace Datasets](https://huggingface.co/docs/datasets/v3.4.1/en/package_reference/loading_methods#datasets.load_dataset)
- Converts the dataset to a pandas dataframe
- Filters the dataset for English-language conversations only
- Separates conversations into `prompt`, `response_a` (the response from model_a) and `response_b` (the response from model_b). Take a closer look at `prompt`, `response_a` and `response_b`: they are all lists, and some lists contain multiple prompts. These are called multi-turn prompts.
- To simplify the problem, drops all rows without a clear winner, i.e., rows where `winner` is `tie` or `tie (bothbad)`. The dataset contains far fewer samples where the winner is a tie.
- Keeps only single-turn conversations. Multi-turn conversations can be very long and are much harder to predict correctly.
- Applies `utf-8` encoding/decoding for consistency and to avoid potential tokenization errors.

</div>


In [6]:
%%writefile dataset_format.py

import numpy as np
import pandas as pd
import json

from tqdm.auto import tqdm
tqdm.pandas()

# pip install datasets
from datasets import load_dataset

# Load the chatbot arena conversations dataset from HuggingFace
dataset = load_dataset("lmsys/chatbot_arena_conversations", token=True)
print(dataset)
# Convert the dataset to a pandas DataFrame
df = dataset["train"].to_pandas()
df = df[df.language=='English'].reset_index(drop=True)
print(df.shape)

def separate_conversations(conv):
    user_texts      = [x['content'] for x in conv if x['role'] == 'user']
    assistant_texts = [x['content'] for x in conv if x['role'] == 'assistant']

    return user_texts, json.dumps(assistant_texts)


df['prompt_a'], df['response_a'] = zip(*df.conversation_a.progress_apply(separate_conversations))
df['prompt_b'], df['response_b'] = zip(*df.conversation_b.progress_apply(separate_conversations))


assert (df['prompt_a'] == df['prompt_b']).all()

df['prompt'] = df['prompt_a'].progress_apply(json.dumps)

def one_hot_encode(winner):
    return pd.Series([int('model_a'==winner), int('model_b'==winner), int('tie'==winner or 'tie (bothbad)'==winner)])

df[['winner_model_a', 'winner_model_b', 'winner_tie']] = df.winner.progress_apply(one_hot_encode)

cols = ['question_id', 'model_a', 'model_b', 'prompt', 'response_a',
        'response_b', 'winner_model_a', 'winner_model_b', 'winner_tie']

df = pd.DataFrame(df[cols].copy().values, columns=cols).reset_index(drop=True)

## Remove ties
df = df[df.winner_tie==0].reset_index(drop=True)
df.drop(columns=['winner_tie'], inplace=True)

df['id'] = df.index

df['prompt_length'] = df['prompt'].apply(lambda x: len(eval(x)))
df['response_a_length'] = df['response_a'].apply(lambda x: len(eval(x)))
df['response_b_length'] = df['response_b'].apply(lambda x: len(eval(x)))

df = df[df['prompt_length'] == 1].reset_index(drop=True)

df['prompt'] = df['prompt'].apply(lambda x: eval(x)[0])
df['response_a'] = df['response_a'].apply(lambda x: eval(x)[0])
df['response_b'] = df['response_b'].apply(lambda x: eval(x)[0])

df['prompt'] = df['prompt'].apply(lambda x: x.encode('utf-8', 'ignore').decode('utf-8'))
df['response_a'] = df['response_a'].apply(lambda x: x.encode('utf-8', 'ignore').decode('utf-8'))
df['response_b'] = df['response_b'].apply(lambda x: x.encode('utf-8', 'ignore').decode('utf-8'))

df.drop(columns=['prompt_length', 'response_a_length', 'response_b_length'], inplace=True)

print(df.shape)
print(df.head(1).T)

df.to_csv('data/chatbot_arena_conversations.csv', index=False)


Writing dataset_format.py


In [7]:
if not os.path.exists('data/chatbot_arena_conversations.csv'):
    !mkdir -p data
    !python dataset_format.py

README.md: 100% 7.00k/7.00k [00:00<00:00, 29.9MB/s]
(…)-00000-of-00001-cced8514c7ed782a.parquet: 100% 41.6M/41.6M [00:00<00:00, 88.6MB/s]
Generating train split: 100% 33000/33000 [00:00<00:00, 34948.69 examples/s]
DatasetDict({
    train: Dataset({
        features: ['question_id', 'model_a', 'model_b', 'winner', 'judge', 'conversation_a', 'conversation_b', 'turn', 'anony', 'language', 'tstamp', 'openai_moderation', 'toxic_chat_tag'],
        num_rows: 33000
    })
})
(29206, 13)
100% 29206/29206 [00:00<00:00, 43554.11it/s]
100% 29206/29206 [00:00<00:00, 53193.07it/s]
100% 29206/29206 [00:00<00:00, 144597.20it/s]
100% 29206/29206 [00:04<00:00, 6296.50it/s] 
(18320, 9)
                                                                0
question_id                      58210e39b3fd4441a2bd4a518bb44c2d
model_a                                                chatglm-6b
model_b                                                 koala-13b
prompt            What is the difference between OpenCL and

<div class="alert alert-success" style="color: #333; background-color: #e8fffb; border-color: #bcfff2; border-width: 1px; border-radius: 3px; padding: 10px;">
    
<b>Q1.1.2</b>

<a id="q112"></a>
**Further preprocessing and loading dataset**
- Load the dataset from the `data/chatbot_arena_conversations.csv` file into a pandas dataframe.
- Create a binary `label` column from the winner columns (`winner_model_a` or `winner_model_b`).
Note: `winner_model_b==1` => `label=1` => model_b is the winner.
- Perform preliminary EDA
- Display dataset shape, `.head()` and target class ratio.

</div>

In [8]:
df = pd.read_csv('data/chatbot_arena_conversations.csv')
df['label'] = (df['winner_model_b']==1).astype(int)

# preliminary EDA
print("Dataset shape:", df.shape)
dist = df['label'].value_counts().rename({0:'model_a wins', 1:'model_b wins'})
print("\nClass distribution:")
print(pd.DataFrame({
    'count':    dist,
    'fraction': dist / len(df)
})) # we find that the classes are approximately evenly split which is ideal

df.head()

Dataset shape: (18320, 10)

Class distribution:
              count  fraction
label                        
model_a wins   9252  0.505022
model_b wins   9068  0.494978


Unnamed: 0,question_id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,id,label
0,58210e39b3fd4441a2bd4a518bb44c2d,chatglm-6b,koala-13b,What is the difference between OpenCL and CUDA?,OpenCL and CUDA are two different programming ...,OpenCL and CUDA are both programming languages...,0,1,0,1
1,90bfd142157948aba01931726c888e7f,koala-13b,oasst-pythia-12b,"Fuji vs. Nikon, which is better?",Both Fuji and Nikon are popular camera brands ...,This is a subjective question and the answer d...,0,1,1,1
2,a7c5accc53e649a3bc6b2e41d962ebc4,vicuna-13b,oasst-pythia-12b,How to build an arena for chatbots?,Building an arena for chatbots can be done by ...,Building an arena for chatbots is a great way ...,0,1,2,1
3,adf27e819a3c494cb6e993f0c660e097,vicuna-13b,koala-13b,When is it today?,"I'm sorry, I cannot determine the current date...","Today is February 23, 2023.",1,0,3,0
4,c0fc42c6f5f14f2aa5a89f71f8553730,vicuna-13b,koala-13b,Count from 1 to 10 with step = 3,"1, 4, 7, 10\n\nCounting with a step of 3 means...","1, 4, 7, 10",1,0,4,0


<div class="alert alert-success" style="color: #333; background-color: #e8fffb; border-color: #bcfff2; border-width: 1px; border-radius: 3px; padding: 10px;">
    
<b>Q1.1.3 - Filter the dataset</b>

<a id="q113"></a>
Instantiate the tokenizer using `AutoTokenizer`. Analyze the token lengths of prompts and responses using the tokenizer. Filter the dataset to remove very long sequences that might cause memory issues during training. Specifically, keep only rows where the prompt has fewer than 100 tokens and both `response_a` and `response_b` have fewer than 400 tokens each.
Display the number of samples in the dataset with `dataframe.shape`.

Tokenizer settings:
- Be sure to set `tokenizer.pad_token = tokenizer.eos_token`
- Also recommended as good practice: `tokenizer.padding_side = "right"` and `tokenizer.add_eos_token = True`

*Note:* We limit the number of tokens because longer texts require more GPU memory and computation time. Assuming the JupyterHub GPU, a total maximum length of 1024 tokens per sample should suffice.

</div>

In [9]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct", use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# function to help count tokens
def count_tokens(text):
    tokens = tokenizer(str(text), add_special_tokens=True).input_ids
    return len(tokens)

df['prompt_len'] = df['prompt'].apply(count_tokens)
df['resp_a_len'] = df['response_a'].apply(count_tokens)
df['resp_b_len'] = df['response_b'].apply(count_tokens)

mask = (
    (df['prompt_len'] < 100) &
    (df['resp_a_len'] < 400) &
    (df['resp_b_len'] < 400)
)
train = df[mask].reset_index(drop=True)

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

In [10]:
print("\nDataset shape after filtering:", train.shape)
print("\nLabel distribution after filtering:")
print(train.label.value_counts())


Dataset shape after filtering: (14223, 13)

Label distribution after filtering:
label
0    7161
1    7062
Name: count, dtype: int64


<!-- BEGIN QUESTION -->

<div class="alert alert-success" style="color: #333; background-color: #e8fffb; border-color: #bcfff2; border-width: 1px; border-radius: 3px; padding: 10px;">
<b>Q1.1.4</b>


<a id="q114"></a>
- At this point you should have ~14K samples. To save on training time, randomly sample 600 samples of each label, i.e., 600 samples with winner='model_a' and 600 samples with winner='model_b'. Be sure to set random_state=42.

- Split the dataset into train and validation sets with an 80/20 split, stratified by label, with random_state=42.

- Sort the validation set by total token length in descending order (ascending=False), i.e., sum prompt_tokens_len, response_a_tokens_len and response_b_tokens_len, then sort by that sum.
Why should we sort the validation set?

</div>


In [11]:
df_a = train[train.label == 0].sample(600, random_state=42)
df_b = train[train.label == 1].sample(600, random_state=42)
df_combined = pd.concat([df_a, df_b]).reset_index(drop=True)

train_df, val_df = train_test_split(
    df_combined,
    test_size=0.2,
    stratify=df_combined.label,
    random_state=42
)

val_df['total_len'] = (
    val_df.prompt_len +
    val_df.resp_a_len +
    val_df.resp_b_len
)
val_df = val_df.sort_values('total_len', ascending=False).reset_index(drop=True)

By sorting with `ascending=False`, evaluation processes the longest validation examples first. This helps with debugging: if an out-of-memory error is going to occur, it occurs on the first batch rather than partway through evaluation. If the first batch works, the remaining (shorter) examples will most likely work as well, at least as far as memory and length are concerned. Sorting also places examples of similar length in the same batch, which reduces padding when batches are padded dynamically.
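
The effect of length-sorting on padding can be illustrated with a small sketch. It assumes dynamic padding, where each batch is padded only to its own longest sequence; the token lengths are made up and the helper `padded_tokens` is hypothetical. The saving does not apply when every example is padded to a fixed `max_length`.

```python
def padded_tokens(lengths, batch_size):
    """Total tokens processed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

# Made-up token lengths for eight validation examples (illustrative only)
lengths = [700, 120, 650, 90, 400, 150, 380, 100]
batch_size = 4

unsorted_cost = padded_tokens(lengths, batch_size)
sorted_cost = padded_tokens(sorted(lengths, reverse=True), batch_size)

print(unsorted_cost, sorted_cost)  # 4400 3400
```

In the sorted order the first batch also contains the single longest example, so the peak memory requirement is hit immediately.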

In [12]:
print("\nTraining set shape:", train_df.shape)
print("Validation set shape:", val_df.shape)
val_df.head()


Training set shape: (960, 13)
Validation set shape: (240, 14)


Unnamed: 0,question_id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,id,label,prompt_len,resp_a_len,resp_b_len,total_len
0,6b31320632de4e0bbf9b63d3cc78c58e,claude-v1,oasst-pythia-12b,A table-tennis championship for $2^n$ players ...,* There are $2^n$ players total in the tournam...,To calculate the probability that two players ...,1,0,11885,0,69,392,306,767
1,129c71554ba942b3b3105c15932d4df7,gpt-3.5-turbo,chatglm-6b,"write me a story about Fur, who is a lonely sc...",Fur was a lonely school girl who had a hard ti...,Fur was a lonely school girl with no friends. ...,1,0,14012,0,49,377,338,764
2,5cb092594ca7413a9f7f57126de525ed,claude-v1,oasst-pythia-12b,What are some issues with the science of Back ...,There are several issues with the science depi...,Back to the Future is a 1985 American science ...,1,0,4300,0,15,356,386,757
3,eaf930cd08f943729b70dbbfcafd40e6,chatglm-6b,oasst-pythia-12b,I am a public account manager of a bank. How c...,"As a public account manager of a bank, it is y...","Sure, I'd be happy to help! To analyze the dat...",0,1,4442,1,31,353,365,749
4,20df2f0fbf504783a9fa3e31c9066165,wizardlm-13b,claude-instant-v1,explain the effects of diphenhydramine on the ...,Diphenhydramine is an antihistamine medication...,Diphenhydramine is an antihistamine medication...,0,1,19469,1,14,349,375,738


<div class="alert alert-success" style="color: #333; background-color: #e8fffb; border-color: #bcfff2; border-width: 1px; border-radius: 3px; padding: 10px;">

<b>Q1.2</b>

<a id="q12">**Creating a Tokenized Dataset for Model Training** </a>

Next, you'll create a tokenized dataset using the [HuggingFace Datasets](https://huggingface.co/docs/datasets/index) library, specifically [from_pandas()](https://huggingface.co/docs/datasets/v3.4.1/en/package_reference/main_classes#datasets.Dataset.from_pandas).

However, we cannot train a model on raw text. We have to tokenize the text, and the tokenization function must return `input_ids`, `attention_mask` and the corresponding `labels` for every sample.

We provide a skeleton below; fill it in appropriately.

```python

def tokenize(example, tokenizer):
    # 1. Format the input text with prompt and both responses. You have the freedom to prepend text with for eg: "Prompt:" and/or "Response A:" etc.

    # Combine the prompt and responses into a single text string
    
    # 2. Tokenize the text using the tokenizer
    # Make sure to set appropriate parameters for padding, truncation, etc. Set max_length to 1024
    
    # 3. Extract the input_ids and attention_mask
    
    # 4. Get the label from the example
    
    return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }

def load_data(df, tokenizer):
    # 1. Create a Dataset from the pandas DataFrame
    raw_datasets = Dataset.from_pandas(...)
    
    # 2. Apply the tokenize function to each example in the dataset
    # Make sure to:
    # - Pass the tokenizer to the function
    # - Remove original columns that are no longer needed
    # - Handle any other necessary parameters
    tokenized_datasets = raw_datasets.map(
        tokenize,
        ...
    )

    return tokenized_datasets
```

Expected output :

```python
print(tokenized_train)
print(tokenized_val)
```

```python
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 960
})
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 240
})
```
</div>


In [13]:
def tokenize(example, tokenizer):
    # 1. Combine the prompt and responses into a single text string
    text = (
        f"Prompt: {example['prompt']}\n\n"
        f"Response A: {example['response_a']}\n\n"
        f"Response B: {example['response_b']}"
    )

    # 2. Tokenize the text using the tokenizer (with max_length to 1024)
    tokens = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_attention_mask=True,
        add_special_tokens=True,
    )

    # 3. Extract the input_ids and attention_mask
    input_ids = tokens["input_ids"]
    attention_mask = tokens["attention_mask"]

    # 4. Get the label from the example
    label = example["label"]

    return {
        "input_ids":      input_ids,
        "attention_mask": attention_mask,
        "labels":         label,
    }

def load_data(df, tokenizer):
    # 1. Create a Dataset from the pandas DataFrame
    raw_datasets = Dataset.from_pandas(df, preserve_index=False)

    # 2. Apply the tokenize function to each example in the dataset
    # Make sure to:
    # - Pass the tokenizer to the function
    # - Remove original columns that are no longer needed
    # - Handle any other necessary parameters
    tokenized = raw_datasets.map(
        tokenize,
        fn_kwargs={"tokenizer": tokenizer},
        remove_columns=raw_datasets.column_names,
        batched=False,
    )
    return tokenized

tokenized_train = load_data(train_df, tokenizer)
tokenized_val   = load_data(val_df, tokenizer)

Map:   0%|          | 0/960 [00:00<?, ? examples/s]

Map:   0%|          | 0/240 [00:00<?, ? examples/s]

In [14]:
print(tokenized_train)
print(tokenized_val)

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 960
})
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 240
})


<div class="alert alert-success" style="color: #333; background-color: #e8fffb; border-color: #bcfff2; border-width: 1px; border-radius: 3px; padding: 10px;">

<b>Q1.3 Setup LoraConfig and BitsAndBytesConfig </b>

Fine-tuning large language models like Llama 3.2 can be computationally expensive and memory-intensive. To address this, we use parameter-efficient fine-tuning techniques such as LoRA (Low-Rank Adaptation), which trains a small number of added parameters while the original weights stay frozen, significantly reducing memory usage and training time while maintaining performance. Additionally, quantization through `BitsAndBytesConfig` lets us load the model with reduced precision (e.g., 4-bit or 8-bit), which further reduces memory usage and makes training feasible with limited resources (GPUs).

<a id="q13"></a>

**[Lora Config](https://huggingface.co/docs/peft/en/package_reference/lora#peft.LoraConfig)**

We recommend the following options:

```python
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias='none',
inference_mode=False,
task_type=TaskType.SEQ_CLS,
target_modules=['q_proj', 'k_proj', 'v_proj',]
```

(a) Ask the following question to [Claude.ai](https://claude.ai) and [gemini.google.com](https://gemini.google.com/). Note: We ask two AI models in case one of them hallucinates. The following reference documentation is also recommended: https://huggingface.co/docs/peft/main/en/conceptual_guides/lora

Explain the significance of the following parameters: `r`, `lora_alpha` and `target_modules`

Write the answer in your own words, concisely, in 3-5 sentences.


(b) **Quantization** - Apply 4-bit quantization using BitsAndBytesConfig. Recommended settings -

```python
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
```


</div>
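
A rough back-of-the-envelope estimate shows why 4-bit loading matters. The parameter count below (about 3.2 billion) is an assumption, and the helper `weight_memory_gb` is hypothetical; the estimate counts weights only and ignores quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 3.2e9  # assumed approximate parameter count for a "3B" model

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit (nf4)", 4)]:
    print(f"{name:>12}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under this assumption the 16-bit figure is about 6.4 GB, in line with the two checkpoint shards (4.97 GB and 1.46 GB) that this notebook downloads when loading the model, while 4-bit weights need roughly a quarter of that.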


In [15]:
# 4-bit quantization using recommended config settings
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    # skip_module_names=["classifier", "llama.embed_tokens"]
)

# LoRA config using recommended config settings
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    inference_mode=False,
    task_type=TaskType.SEQ_CLS,
    target_modules=["q_proj", "k_proj", "v_proj"],
)

I asked Claude and Gemini. In LoRA, `r` is the rank of the low-rank matrices that are added alongside the frozen weights. The lower the rank, the fewer trainable parameters and the more efficient the training, but the more limited the adapter's expressivity. `lora_alpha` is a scaling factor: the LoRA update is multiplied by `lora_alpha / r` before it is combined with the output of the original pre-trained weights, so it controls how strongly the adapter influences the model. Finally, `target_modules` specifies which layers of the model receive LoRA adapters; here these are the query, key, and value projections of the attention layers.
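
The roles of `r` and `lora_alpha` can be made concrete with a toy sketch of the LoRA update, written in plain Python so it runs anywhere. The matrix sizes, the helper names `matmul` and `lora_weight`, and the initialization scale are illustrative assumptions, not the real model's dimensions or the PEFT library's internals.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w0, a, b, r, lora_alpha):
    """Effective weight W0 + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

random.seed(0)
d_out, d_in, r, lora_alpha = 6, 8, 2, 4  # toy sizes, not the real model's

w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, r x d_in
b = [[0.0] * r for _ in range(d_out)]                                   # trainable, d_out x r, starts at zero

full_params = d_out * d_in        # parameters in the frozen weight
lora_params = r * (d_in + d_out)  # parameters LoRA actually trains

# With B initialised to zero, the adapted weight equals the pretrained one
w_start = lora_weight(w0, a, b, r, lora_alpha)
print(full_params, lora_params)  # 48 28
```

At these toy sizes the saving is modest, but `r * (d_in + d_out)` grows linearly with the layer width while `d_out * d_in` grows quadratically, so the gap widens quickly for real attention projections.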

<div class="alert alert-success" style="color: #333; background-color: #e8fffb; border-color: #bcfff2; border-width: 1px; border-radius: 3px; padding: 10px;">

<a id="q14"></a>
**1.4 Custom Classification Head for LLaMA**

In this question, you'll implement a custom classification head for the pre-trained LLaMA model. Your task is to complete the implementation of the classifier by subclassing `LlamaPreTrainedModel`.

Your custom model should:
- Accept inputs (input_ids, attention_mask, labels, etc) and pass them through the pre-trained LLaMA model [LlamaModel](https://huggingface.co/docs/transformers/v4.49.0/en/model_doc/llama#transformers.LlamaModel).
- Extract embeddings from the last token of the sequence (based on attention mask).
- Classify embeddings into two labels using a linear layer `nn.Linear`.
- Output predictions using the Hugging Face standard SequenceClassifierOutput.

- Your class will resemble [LlamaForSequenceClassification](https://github.com/huggingface/transformers/blob/a22a4378d97d06b7a1d9abad6e0086d30fdea199/src/transformers/models/llama/modeling_llama.py#L893)

- We provide you with the loss function below.


Your implementation should match the following structure:

```python

class CS1090B_LlamaForClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        # YOUR CODE HERE
        pass
    
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        position_ids=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):

        # YOUR CODE HERE


        # Calculate loss
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states if output_hidden_states else None,
            attentions=outputs.attentions if output_attentions else None,
        )

```

</div>
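
The last-token pooling step can be illustrated without any model. The sketch below assumes right-padded inputs (as set by `tokenizer.padding_side = "right"`); the helper names `last_token_indices` and `pool_last_token` and the toy data are illustrative, and a real implementation would do the same with tensor operations.

```python
def last_token_indices(attention_mask):
    """Index of the last real (non-padding) token in each right-padded row."""
    return [max(sum(row) - 1, 0) for row in attention_mask]

def pool_last_token(hidden_states, attention_mask):
    """Pick one vector per sequence: the hidden state at its last real token."""
    indices = last_token_indices(attention_mask)
    return [seq[i] for seq, i in zip(hidden_states, indices)]

# Two toy sequences padded to length 4; 1 marks a real token, 0 marks padding
attention_mask = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
# Toy hidden states with hidden_size = 2: position p in sequence s is [s, p]
hidden_states = [[[s, p] for p in range(4)] for s in range(2)]

pooled = pool_last_token(hidden_states, attention_mask)
print(pooled)  # [[0, 2], [1, 1]]
```

With left padding the same sum would point at the wrong position, since the last real token would then always sit at the final index, so the padding side matters for this step.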

In [16]:
class CS1090B_LlamaForClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = getattr(config, "num_labels", 2)
        self.llama = LlamaModel(config)

        # classification head
        self.classifier = nn.Linear(config.hidden_size, self.num_labels)

        # initialize weights
        self.post_init()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        position_ids=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        # run inputs through LLaMA backbone
        outputs = self.llama(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        # sequence_output: (batch_size, seq_len, hidden_size)
        sequence_output = outputs.last_hidden_state

        # find index of last non-padded token for each example
        #    attention_mask is 1 for real tokens, 0 for padding
        #    sum over dim=1 gives lengths, subtract 1 to get last index
        lengths = attention_mask.sum(dim=1) - 1  # shape: (batch_size,)
        lengths = lengths.clamp(min=0)          # avoid negative indices

        # gather the embeddings at those last-token positions
        batch_range = torch.arange(sequence_output.size(0), device=sequence_output.device)
        pooled_embeddings = sequence_output[batch_range, lengths, :]  # (batch_size, hidden_size)

        # classify
        logits = self.classifier(pooled_embeddings)  # (batch_size, num_labels)

        # compute loss
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        # return in HF standard format
        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states if output_hidden_states else None,
            attentions=outputs.attentions if output_attentions else None,
        )

model = CS1090B_LlamaForClassification.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)
model.cuda()
model = get_peft_model(model, lora_config)

Some weights of CS1090B_LlamaForClassification were not initialized from the model checkpoint at meta-llama/Llama-3.2-3B-Instruct and are newly initialized: ['classifier.bias', 'classifier.weight', 'llama.embed_tokens.weight', 'llama.layers.0.input_layernorm.weight', 'llama.layers.0.mlp.down_proj.weight', 'llama.layers.0.mlp.gate_proj.weight', 'llama.layers.0.mlp.up_proj.weight', 'llama.layers.0.post_attention_layernorm.weight', 'llama.layers.0.self_attn.k_proj.weight', 'llama.layers.0.self_attn.o_proj.weight', 'llama.layers.0.self_attn.q_proj.weight', 'llama.layers.0.self_attn.v_proj.weight', 'llama.layers.1.input_layernorm.weight', 'llama.layers.1.mlp.down_proj.weight', 'llama.layers.1.mlp.gate_proj.weight', 'llama.layers.1.mlp.up_proj.weight', 'llama.layers.1.post_attention_layernorm.weight', 'llama.layers.1.self_attn.k_proj.weight', 'llama.layers.1.self_attn.o_proj.weight', 'llama.layers.1.self_attn.q_proj.weight', 'llama.layers.1.self_attn.v_proj.weight', 'llama.layers.10.input_la

NotImplementedError: 

In [None]:
model.llama = prepare_model_for_kbit_training(model.llama)

model.llama.cuda()

model.llama = get_peft_model(model.llama, lora_config)

model.llama.print_trainable_parameters()
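
As a rough sanity check on what `print_trainable_parameters()` reports, the number of weights LoRA adds can be estimated by hand: an adapter of rank `r` on a linear layer of shape `(d_out, d_in)` contributes `r * (d_in + d_out)` trainable parameters. The sketch below is illustrative only. The helper name `lora_param_count`, the layer shapes, the rank, and the layer count are hypothetical placeholders, not the values from Q1.3 or the real Llama configuration.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

# Hypothetical setup (placeholders, not the Q1.3 config):
# adapters on two square projections per layer, 4 layers, rank 8.
hidden_size = 512
num_layers = 4
rank = 8
targets_per_layer = [(hidden_size, hidden_size), (hidden_size, hidden_size)]

per_layer = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in targets_per_layer)
total = per_layer * num_layers
print(f"LoRA parameters per layer: {per_layer}")
print(f"LoRA parameters in total:  {total}")
```

The figure reported by PEFT can differ from such an estimate if other modules are also left trainable.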

In [None]:
## Test your model
model_path = "meta-llama/Llama-3.2-3b-Instruct"

if 0:
    model = CS1090B_LlamaForClassification.from_pretrained(
            model_path,
            quantization_config=quantization_config,
            device_map="auto",
        )

    text = "Hello CS1090B!!"
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding="max_length",
        max_length=16,
        truncation=True
    )

    outputs = model(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device)
        )
    print(outputs.logits)

    del model, outputs, inputs, text
    gc.collect()
    torch.cuda.empty_cache()
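
The last-token pooling inside `forward` can be checked on a toy batch. The sketch below uses NumPy rather than PyTorch so it runs without a GPU; the indexing pattern mirrors the `sequence_output[batch_range, lengths, :]` expression in the model. It assumes right padding, and the array values are made up for illustration.

```python
import numpy as np

# Toy "hidden states": batch of 2 sequences, 4 positions, hidden size 3.
# Each vector is filled with 10 * example_index + position so it is easy to read.
batch_size, seq_len, hidden_size = 2, 4, 3
sequence_output = np.zeros((batch_size, seq_len, hidden_size))
for b in range(batch_size):
    for t in range(seq_len):
        sequence_output[b, t, :] = 10 * b + t

# Right-padded attention mask: 1 for real tokens, 0 for padding.
attention_mask = np.array([
    [1, 1, 1, 0],   # 3 real tokens, so the last real index is 2
    [1, 1, 0, 0],   # 2 real tokens, so the last real index is 1
])

# Same recipe as in forward(): sequence length minus one, clamped at zero.
last_index = np.clip(attention_mask.sum(axis=1) - 1, 0, None)
batch_range = np.arange(batch_size)
pooled = sequence_output[batch_range, last_index, :]  # shape (batch_size, hidden_size)

print(last_index)
print(pooled)
```

With left padding the same sum would point at the wrong position, so the tokenizer's `padding_side` matters for this pooling strategy.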

<div class="alert alert-success" style="color: #333; background-color: #e8fffb; border-color: #bcfff2; border-width: 1px; border-radius: 3px; padding: 10px;">

<a id="q15">**Q1.5 Metric and Training** </a>


<b>Q1.5.1 - Implementing Evaluation Metrics</b>

<a id="q151"></a>

Next, we are going to customize [`compute_metrics()`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.compute_metrics), which will be used when we train our model.

Implement a `compute_metrics` function that takes evaluation predictions and calculates performance metrics. This function should:
- Take evaluation predictions as input (containing predictions and labels)
- Apply softmax to the raw prediction logits
- Calculate accuracy by comparing the predicted labels to the true labels
- Calculate log loss (cross-entropy) between the predicted probabilities and true labels
- Return a dictionary containing both metrics


```python
def compute_metrics(eval_pred):
    # Your code here

    return {
            "accuracy": accuracy,
            "log_loss": loss
        }
```


</div>

In [None]:
def compute_metrics(eval_pred):
    """
    Compute accuracy and log loss (cross-entropy) for a classification task.

    Args:
        eval_pred: either a tuple (logits, labels) or an EvalPrediction
                   with attributes .predictions and .label_ids.

    Returns:
        dict with keys "accuracy" and "log_loss".
    """
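    # accept both an EvalPrediction-like object and a plain (logits, labels) tuple
    if hasattr(eval_pred, "predictions"):
        logits, labels = eval_pred.predictions, eval_pred.label_ids
    else:
        logits, labels = eval_pred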

    # turn logits into probabilities via softmax
    exp_scores = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

    preds = np.argmax(probs, axis=1)
    accuracy = accuracy_score(labels, preds)
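    # pass the full label set so log_loss still works if a class is absent from `labels`
    loss = log_loss(labels, probs, labels=np.arange(probs.shape[1]))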

    return {
        "accuracy": accuracy,
        "log_loss": loss
    }
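
To see what the two metrics measure, the computation can be reproduced by hand on a tiny example. This is a sketch with made-up logits and labels; it uses only NumPy, the helper name `stable_softmax` is introduced here for illustration, and it is not part of the graded solution.

```python
import numpy as np

def stable_softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Made-up logits for 3 examples and 2 classes, with their true labels.
logits = np.array([[2.0, 0.0],
                   [0.0, 1.0],
                   [1000.0, 1001.0]])   # large values would overflow a naive exp
labels = np.array([0, 1, 0])

probs = stable_softmax(logits)
preds = probs.argmax(axis=1)

# Accuracy is the fraction of examples whose argmax matches the label.
accuracy = (preds == labels).mean()

# Log loss is the mean negative log-probability assigned to the true class.
true_class_probs = probs[np.arange(len(labels)), labels]
log_loss_value = -np.log(true_class_probs).mean()

print(probs.sum(axis=1))
print(accuracy, log_loss_value)
```

Accuracy only looks at the argmax, while log loss also penalizes confident wrong predictions, which is why the two can move in different directions during training.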

In [None]:
## Test your compute_metrics function

# Simulate predictions (logits) and true labels
num_samples = 5
num_classes = 2
np.random.seed(42)
logits = np.random.randn(num_samples, num_classes)
labels = np.random.randint(0, num_classes, size=num_samples)

# Create a mock object to simulate eval_pred
EvalPred = namedtuple("EvalPred", ["predictions", "label_ids"])
eval_pred = EvalPred(predictions=logits, label_ids=labels)

print(compute_metrics(eval_pred))
del num_samples, num_classes, logits, labels, eval_pred

<div class="alert alert-success" style="color: #333; background-color: #e8fffb; border-color: #bcfff2; border-width: 1px; border-radius: 3px; padding: 10px;">

<b>Q1.5.2 - Model Training </b>

<a id="q152"></a>


Implement a complete training pipeline for the LLaMA classification model using the HuggingFace Trainer API. Your implementation should:

 - Load and prepare the tokenized datasets created earlier i.e. call `load_data()`
 - Initialize the CS1090B_LlamaForClassification model with the appropriate quantization configuration
 - Prepare the model for k-bit training [Link](https://huggingface.co/docs/peft/v0.15.0/en/package_reference/peft_model#peft.prepare_model_for_kbit_training). Recommended setting `use_gradient_checkpointing=False`
 - Apply the LoRA adapters using the configuration from Q1.3 [Link](https://huggingface.co/docs/peft/v0.15.0/en/package_reference/peft_model#peft.get_peft_model)
 - Print number of trainable parameters - `model.print_trainable_parameters()`
 - Configure training arguments with appropriate hyperparameters (learning rate, batch size, etc.) [Link](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments)

```python
# Recommended parameters
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=4,
    warmup_steps=50,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    eval_strategy="epoch",  # called `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    learning_rate=1e-4,
    report_to="none",
)
```

 - Create a data collator for padding [Link](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding)

 - Initialize the Trainer with the model, training arguments, tokenizer, datasets, data collator and evaluation metrics.
 - Execute the training process
 - Save the trained model
 - Evaluate the model on the validation set and report the results. Validation accuracy should be above 0.6.

 Note: Pay careful attention to batch sizes and gradient checkpointing settings, as these can significantly impact training time and memory usage. Expected training time on JupyterHub is approximately 35 minutes.

</div>
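
The role of the padding collator can be illustrated without any library: each batch is padded only up to the length of its longest example, and an attention mask records which positions are real. The function below is a simplified stand-in written for illustration (the name `pad_batch` and the pad id `0` are assumptions). It is not how `DataCollatorWithPadding` is implemented, which also handles tensors, labels, and tokenizer settings.

```python
def pad_batch(examples, pad_token_id=0):
    """Right-pad a list of {'input_ids': [...]} dicts to the longest one in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    input_ids, attention_mask = [], []
    for ex in examples:
        ids = ex["input_ids"]
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for two examples of different lengths.
batch = pad_batch([
    {"input_ids": [5, 6, 7]},
    {"input_ids": [8, 9]},
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed `max_length` keeps short batches small, which helps with the memory and training-time concerns raised in the note above.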

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=4,
    warmup_steps=50,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    learning_rate=1e-4,
    report_to="none",
)

model.add_adapter(lora_config, adapter_name="lora_1")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),  # pad each batch dynamically
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("./results")

results = trainer.evaluate()   # runs on tokenized_val
print("Validation results:", results)

<div class="alert alert-success" style="color: #333; background-color: #e8fffb; border-color: #bcfff2; border-width: 1px; border-radius: 3px; padding: 10px;">

<b>Q1.6 - Predict and Display Confusion Matrix </b>

<a id="q16"></a>

Run the cell below to display the confusion matrix on the validation set.
</div>

In [None]:
# `tokenized_val` is a HuggingFace Dataset object
output = trainer.predict(tokenized_val)
logits = output.predictions
print(logits.shape)

predicted_classes = np.argmax(logits, axis=1)

true_labels = output.label_ids


cm = confusion_matrix(true_labels, predicted_classes)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap="Blues")
plt.title("Confusion Matrix on Validation Set")
plt.show()
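
Accuracy and per-class recall and precision can be read directly off a confusion matrix. The sketch below uses a made-up 2x2 matrix rather than the model's actual results, and follows scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Made-up confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[40, 10],
               [15, 35]])

# Correct predictions lie on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Recall: of the examples that truly belong to a class, how many were found.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Precision: of the examples predicted as a class, how many were right.
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print(f"accuracy: {accuracy:.3f}")
print("recall:", recall_per_class)
print("precision:", precision_per_class)
```

Comparing the off-diagonal cells shows which class the model confuses more often, which a single accuracy number hides.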


In [None]:
time_end = time.time()
print(f"It took {(time_end - time_start)/60:.2f} minutes for this notebook to run")

**This concludes Part 1 👏**

**Please continue to Parts 2 & 3 in the second notebook**