# Finetuning Llama 3.2 3B for text classification (IMDb reviews)

In this tutorial, we will finetune a Llama 3.2 3B model for text classification on the IMDb movie reviews dataset. This is a binary sentiment classification task: each review is labeled as positive or negative.

Llama is an openly available, autoregressive large language model that can be finetuned for various tasks, including text classification.

We will use parameter-efficient finetuning (PEFT) with low-rank adaptation (LoRA). Instead of updating all of the model's weights, LoRA trains a small set of additional low-rank matrices while the original weights stay frozen, which makes finetuning more efficient and cost-effective.

After finetuning, we can store only the LoRA parameters, which are much smaller than the full model parameters.
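To make the size difference concrete, here is a minimal sketch of the parameter arithmetic for a single linear layer. The 3072 x 3072 layer shape is an illustrative assumption rather than a value taken from this tutorial, and `full_param_count` and `lora_param_count` are hypothetical helpers.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Weights in a dense (d_out x d_in) linear layer, bias ignored."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Weights in the two LoRA factors: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Illustrative layer shape (an assumption) and a LoRA rank of 64
d_in, d_out, r = 3072, 3072, 64
full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the adapter for this one layer holds roughly 4% of the weights of the layer it adapts, which is why saving only the adapter is so much smaller than saving the full model.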

This tutorial is based on the following online [tutorial](https://www.datacamp.com/tutorial/fine-tuning-llama-3-1).

The first step is to accept the license for the model weights. You can do this on Hugging Face, Meta, or Kaggle; it requires only minimal information about the user. Once access is granted, you can download the model weights from the model hub. In this tutorial, we will use the Llama 3.2 3B model from Hugging Face.

The packages we will use for this tutorial are:

- transformers
- accelerate
- bitsandbytes
- peft
- trl 

Make sure to install these packages before running the code:

```bash
pip install transformers accelerate bitsandbytes peft trl
```

We will use the `transformers` library to load the Llama 3.2 3B model and finetune it for text classification.

The `accelerate` library will help us manage the training process and use the GPU for faster training.

The `bitsandbytes` library will help us load the model with 4-bit quantization, which uses far less memory than full 32-bit precision.
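As a rough illustration of why lower-precision weights matter, the sketch below estimates the memory needed for the weights alone at several bit widths. It assumes "3B" means about 3 billion parameters and ignores activations, optimizer state, and quantization overhead; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3e9  # assumption: "3B" taken as roughly 3 billion parameters

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

These are back-of-the-envelope figures only; real memory use during training is higher.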

In [1]:
import os
from dotenv import load_dotenv

os.environ["CUDA_VISIBLE_DEVICES"] = "2"

load_dotenv()

HF_token = os.getenv("HF_token")
CACHE_DIR = os.getenv("CACHE_DIR")
OUTPUT_DIR = "./llama_finetuned"
WANDB_PROJECT = "llama-imdb-finetuning"

os.makedirs(OUTPUT_DIR, exist_ok=True)  # create the output directory if it does not exist yet



In [2]:
# libs imports
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer, SFTConfig
from trl import setup_chat_format
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from accelerate import Accelerator

In [6]:
# important step when using accelerator to set the device index so that the model and the peft adapter are loaded on the same device


device_index = Accelerator().process_index

print(f"Using device index: {device_index}")

device_map = {"": device_index}

Using device index: 0


In this step, we will load and process the dataset. To simplify things, we will use the same IMDb dataset that we used with BERT. Recall that this is a dataset of 50,000 movie reviews from the Internet Movie Database, each labeled as positive or negative.

After loading the dataset, we need to prepend our instruction prompt to each review, since we are finetuning an instruction-tuned chat model.

The training prompt structure is as follows:

```
classify the sentiment of the following movie review as positive or negative. Respond with 'positive' or 'negative' and nothing else.
review: {text}
sentiment_label: {label}
```

The test prompt structure is as follows:

```
classify the sentiment of the following movie review as positive or negative. Respond with 'positive' or 'negative' and nothing else.
review: {text}
sentiment_label:
```

In this way, the test prompt ends exactly where the label would appear, so the model is asked to generate the sentiment label as its next tokens.
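The relationship between the two prompt formats can be checked with a small sketch. `build_prompt` is a hypothetical, simplified helper (not the function used in this tutorial) that shows the test prompt is a prefix of the training prompt, with the label as the only continuation.

```python
INSTRUCTION = (
    "classify the sentiment of the following movie review as positive or negative. "
    "Respond with 'positive' or 'negative' and nothing else."
)


def build_prompt(review, label=None):
    """Return the training prompt when a label is given, else the test prompt."""
    prompt = f"{INSTRUCTION}\nreview: {review}\nsentiment_label:"
    if label is not None:
        prompt += f" {label}"
    return prompt


train_prompt = build_prompt("a wonderful film", "positive")
test_prompt = build_prompt("a wonderful film")

# The test prompt is a prefix of the training prompt; the remainder is the label.
print(train_prompt.startswith(test_prompt))
print(repr(train_prompt[len(test_prompt):]))
```

Because training and test prompts share the same prefix, the finetuned model sees at inference time the same context it learned to complete during training.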



In [7]:
from bert_model import load_imdb_data

def generate_prompt(item):
    label = 'positive' if item['label'] == 1 else 'negative'
    prompt = f"""classify the sentiment of the following movie review as positive or negative. Respond with 'positive' or 'negative' and nothing else.

    review: {item['review']}
    sentiment_label: {label}
    """.strip()
    return {"text": prompt}

def generate_test_prompt(item):
    prompt = f"""classify the sentiment of the following movie review as positive or negative. Respond with 'positive' or 'negative' and nothing else.

    review: {item['review']}
    sentiment_label:
    """.strip()
    return {"text": prompt}

train_texts, train_labels, val_texts, val_labels, test_texts, test_labels = load_imdb_data(dataset_fraction=0.1)

train_dict = {
    'review': train_texts,
    'label': train_labels,
}

val_dict = {
    'review': val_texts,
    'label': val_labels,
}

test_dict = {
    'review': test_texts,
    'label': test_labels,
}

train_dataset = Dataset.from_dict(train_dict)
val_dataset = Dataset.from_dict(val_dict)
test_dataset = Dataset.from_dict(test_dict)

train_dataset = train_dataset.map(generate_prompt)
val_dataset = val_dataset.map(generate_prompt)
test_dataset = test_dataset.map(generate_test_prompt)

# print a sample of the dataset
print(train_dataset[0])


loading imdb dataset...
using 10.0% of dataset
train label distribution: Counter({1: 1258, 0: 1242})
test label distribution: Counter({1: 1258, 0: 1242})
train samples: 2500
validation samples: 500
test samples: 2000
example review: An unusually straight-faced actioner played by a cast and filmed by a director who obviously took the material seriously. Imperfect, as is to be expected from a film clearly shot on a tight budget, bu...
example label: 1 (positive)


{'review': 'An unusually straight-faced actioner played by a cast and filmed by a director who obviously took the material seriously. Imperfect, as is to be expected from a film clearly shot on a tight budget, but the drama is involving-- it\'s one of those films that when it gets repeated ad nauseum on Cinemax 2 or More Max or whatever they call it, you end up watching 40 minute blocks when you\'re supposed to be going to work. Along W/ "Deathstalker 2", "Chopping Mall", and "The Assault", a reminder that Wynorski is a much more talented director than many of his fellow low-budget brethern, who has a real ability to pace a genre film, when he actually\'s interested in the material (i.e., don\'t bother watching any of his Shannon Tweed flicks with a 3 or a 4 after the title!) Actors who\'ve had too little to do recently (Mancuso, Ford, even Gary Sandy for chrissakes) really put their all into some of their best roles in years -- as for Grieco, he has the right look, although his acting

Now we will load the model and tokenizer. We will use the Llama 3.2 3B Instruct model from the Hugging Face model hub. The model card can be found [here](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).

In [8]:
base_model_name = "meta-llama/Llama-3.2-3B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

# pin the whole model to the device chosen in device_map (rather than "auto") - very important for quantized training
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map=device_map,  # use current device instead of auto
    torch_dtype="float16",
    quantization_config=bnb_config,
    cache_dir=CACHE_DIR,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

tokenizer.pad_token_id = tokenizer.eos_token_id

Before finetuning, let's evaluate the model on the test set to measure its baseline performance.

In [18]:
def predict(test_dataset, model, tokenizer):
    y_pred = []
    categories = ["positive", "negative"]

    # build the generation pipeline once instead of once per example
    pipe = pipeline(task="text-generation",
                    model=model,
                    tokenizer=tokenizer,
                    max_new_tokens=2,
                    temperature=0.1)

    for i in tqdm(range(len(test_dataset))):
        prompt = test_dataset[i]["text"]
        result = pipe(prompt)
        # keep only the text generated after the final "sentiment_label:" marker
        answer = result[0]['generated_text'].split("sentiment_label:")[-1].strip()

        # determine the predicted category ("none" if neither label appears)
        for category in categories:
            if category.lower() in answer.lower():
                y_pred.append(category)
                break
        else:
            y_pred.append("none")

    return y_pred



y_pred = predict(test_dataset, model, tokenizer)

print(y_pred)
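The label-extraction step inside `predict` can be exercised on its own, without loading the model. The sketch below isolates that logic in a hypothetical helper, `extract_label`, and runs it on made-up generated strings.

```python
def extract_label(generated_text, categories=("positive", "negative")):
    """Map generated text to a category, or "none" if no category is found."""
    # Only the text after the last "sentiment_label:" marker is the model's answer,
    # so label words that appear inside the review itself are ignored.
    answer = generated_text.split("sentiment_label:")[-1].strip().lower()
    for category in categories:
        if category in answer:
            return category
    return "none"


# Made-up examples of pipeline output (prompt followed by the generated continuation)
print(extract_label("review: not negative at all\nsentiment_label: Positive"))
print(extract_label("review: dull\nsentiment_label: I think"))
```

The second example shows why a "none" bucket is needed: with only a couple of new tokens generated, the model may not produce either label.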

Next, we evaluate the baseline model's performance on the test set.

In [50]:
def evaluate(y_true, y_pred):
    labels = ["positive", "negative"]
    mapping = {label: idx for idx, label in enumerate(labels)}
    
    def map_func(x):
        return mapping.get(x, -1)  # Map to -1 if not found (e.g. a "none" prediction)
    
    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Generate accuracy report
    unique_labels = set(y_true_mapped)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]
        label_y_true = [y_true_mapped[i] for i in label_indices]
        label_y_pred = [y_pred_mapped[i] for i in label_indices]
        label_accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {labels[label]}: {label_accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true_mapped, y_pred=y_pred_mapped, target_names=labels, labels=list(range(len(labels))))
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped, labels=list(range(len(labels))))
    print('\nConfusion Matrix:')
    print(conf_matrix)

y_true = test_dataset["label"]
y_true = ["positive" if label == 1 else "negative" for label in y_true]
evaluate(y_true, y_pred)


Accuracy: 0.930
Accuracy for label positive: 0.957
Accuracy for label negative: 0.903

Classification Report:
              precision    recall  f1-score   support

    positive       0.91      0.96      0.93      1006
    negative       0.96      0.90      0.93       994

   micro avg       0.93      0.93      0.93      2000
   macro avg       0.93      0.93      0.93      2000
weighted avg       0.93      0.93      0.93      2000


Confusion Matrix:
[[963  42]
 [ 95 898]]


As you can see, the baseline model already reaches 93.0% accuracy without any finetuning, which suggests that this task is relatively easy for an instruction-tuned model. Let's see if we can push it a bit further by finetuning.
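The numbers in the baseline report can be reproduced by hand from the confusion matrix and the support column. The sketch below copies those values from the output above; rows are true labels and columns are predicted labels, which is the scikit-learn convention.

```python
# Values copied from the baseline output above
conf_matrix = [[963, 42],   # true positive reviews: predicted positive, negative
               [95, 898]]   # true negative reviews: predicted positive, negative
labels = ["positive", "negative"]
support = {"positive": 1006, "negative": 994}


def recall(i):
    """Correct predictions for class i divided by all true examples of class i."""
    return conf_matrix[i][i] / support[labels[i]]


def precision(i):
    """Correct predictions for class i divided by everything predicted as class i."""
    return conf_matrix[i][i] / sum(row[i] for row in conf_matrix)


total = sum(support.values())
accuracy = (conf_matrix[0][0] + conf_matrix[1][1]) / total
# Examples missing from the matrix were predicted as neither label
unparsed = total - sum(sum(row) for row in conf_matrix)

for i, name in enumerate(labels):
    print(f"{name}: precision={precision(i):.2f} recall={recall(i):.3f}")
print(f"accuracy={accuracy:.4f} unparsed={unparsed}")
```

The matrix rows sum to 1005 and 993 while the support is 1006 and 994, which is consistent with two predictions that matched neither label. This would also explain why the report shows a `micro avg` row instead of an `accuracy` row.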

We will set up finetuning in the following steps:

1. Collect the names of all linear modules in the model (the 4-bit linear layers provided by the bitsandbytes library).
2. Configure LoRA to target these linear modules.
3. Create a trainer object, using the trl library, that handles (and abstracts) the training loop.
4. Train and evaluate the model.

In [9]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
modules = find_all_linear_names(model)
modules

['up_proj', 'down_proj', 'o_proj', 'v_proj', 'k_proj', 'q_proj', 'gate_proj']
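The name handling in `find_all_linear_names` reduces each fully qualified module name to its last component and removes duplicates. A minimal sketch on plain strings follows; the module names are hypothetical examples shaped like those of a Llama-style model, and `leaf_names` is an illustrative helper.

```python
def leaf_names(module_names, exclude=("lm_head",)):
    """Return the unique last path components, minus any excluded names."""
    leaves = {name.split(".")[-1] for name in module_names}
    return sorted(leaves - set(exclude))


# Hypothetical fully qualified module names
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.gate_proj",
    "lm_head",
]
print(leaf_names(example_names))
```

Deduplicating matters because the same projection name repeats in every transformer layer, while LoRA's `target_modules` only needs each name once.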

Before we start training, we want to keep track of our training process and evals. In this tutorial, we will use Weights & Biases (wandb), an experiment tracking tool with a nice UI and a lot of features.

```bash
pip install wandb
```

In [10]:

import wandb

wandb.login()

# start a single tracked run for this training job
run = wandb.init(project=WANDB_PROJECT, job_type="training")

[34m[1mwandb[0m: Currently logged in as: [33mkennygu[0m ([33mkennygu-georgetown-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


In [11]:
# main training loop

output_dir= OUTPUT_DIR

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)


sft_config = SFTConfig(
    output_dir=output_dir,                    # directory to save and repository id
    num_train_epochs=1,                       # number of training epochs
    per_device_train_batch_size=1,            # batch size per device during training
    gradient_accumulation_steps=8,            # number of batches to accumulate gradients over before each optimizer update
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    logging_steps=1,                         
    learning_rate=2e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="wandb",                  # report metrics to w&b
    do_eval=True,
    eval_strategy="steps",              # run evaluation at a fixed step interval
    eval_steps=0.05,                    # a value below 1 is treated as a fraction of the total training steps
    dataset_text_field="text",
    max_seq_length=512,
    packing=False,
    dataset_kwargs={"add_special_tokens": False, "append_concat_token": False},
    run_name="llama-imdb-finetuning_e1",
)


trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    processing_class=tokenizer
)

# name wb run
run.name = "llama-imdb-finetuning_e1"


average_tokens_across_devices is set to True but it is invalid when world size is1. Turn it to False automatically.


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
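The `warmup_ratio=0.03` and `lr_scheduler_type="cosine"` settings describe a learning rate that ramps up linearly and then decays along a cosine curve. The function below is a simplified sketch of that shape, not the library's exact implementation. The warmup rounding and the example `total_steps=1000` are assumptions, and `cosine_lr` is a hypothetical helper.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumption: round up
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative run length (an assumption, not this tutorial's step count)
for step in (0, 15, 30, 515, 1000):
    print(step, f"{cosine_lr(step, total_steps=1000):.2e}")
```

The rate starts at zero, reaches the peak of 2e-4 when warmup ends, passes through half the peak midway through the decay, and returns to zero at the final step.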


In [12]:
# now it's time to start the training loop
trainer.train()

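| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 16 | 2.8191 | 2.717163 |
| 32 | 2.6223 | 2.678006 |
| 48 | 2.6516 | 2.632457 |
| 64 | 2.8801 | 2.625839 |
| 80 | 2.7354 | 2.619114 |
| 96 | 2.7021 | 2.615744 |
| 112 | 2.534 | 2.613143 |
| 128 | 2.4517 | 2.610193 |
| 144 | 2.5211 | 2.608157 |
| 160 | 2.617 | 2.606247 |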


TrainOutput(global_step=313, training_loss=2.6217566930447904, metrics={'train_runtime': 1023.7772, 'train_samples_per_second': 2.442, 'train_steps_per_second': 0.306, 'total_flos': 1.3084378739656704e+16, 'train_loss': 2.6217566930447904})
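The reported `global_step=313` and the 16-step spacing of the validation rows both follow from the training configuration. The sketch below redoes that arithmetic. It assumes a single device and that partial accumulation windows and fractional step counts are rounded up, which is consistent with the numbers reported here but is not taken from the library source.

```python
import math

# Values from the dataset output and the SFTConfig above
num_train_samples = 2500
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 1
eval_steps_ratio = 0.05

# One optimizer update consumes this many samples (single device assumed)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Assumption: a final, partially filled accumulation window still counts as a step
steps_per_epoch = math.ceil(num_train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Assumption: a fractional eval_steps is scaled by the total steps and rounded up
eval_interval = math.ceil(total_steps * eval_steps_ratio)

print(effective_batch_size, total_steps, eval_interval)
```

With these settings, each optimizer update sees 8 reviews, one epoch takes 313 updates, and evaluation runs every 16 updates.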

In [13]:
# finish wandb run and set the model to use cache for inference
wandb.finish()
model.config.use_cache = True

Run history:

| Metric | History |
|--------|---------|
| eval/loss | █▆▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁ |
| eval/mean_token_accuracy | ▁▃▆▆▇▇▇▇▇▇▇████████ |
| eval/num_tokens | ▁▁▂▂▃▃▃▄▄▄▅▅▆▆▆▇▇██ |
| eval/runtime | ▁▆██▇▇▆▆███▆▇▇▇▆▇▇▆ |
| eval/samples_per_second | █▃▁▁▂▂▃▂▁▂▁▃▂▂▂▃▂▂▃ |
| eval/steps_per_second | █▃▁▁▂▂▂▂▁▁▁▂▂▂▂▂▂▂▃ |
| train/epoch | ▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▅▆▆▇▇▇▇█ |
| train/global_step | ▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████ |
| train/grad_norm | ▅▃▃▃█▆▅▆▅█▃▄▆▄▃▄▅▃▄▆▃▃▄▅▅▃▃▅▄▄▅▅▄▄▅▂▅▃▄▁ |
| train/learning_rate | ████████▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁ |

Run summary:

| Metric | Value |
|--------|-------|
| eval/loss | 2.59766 |
| eval/mean_token_accuracy | 0.48478 |
| eval/num_tokens | 727007.0 |
| eval/runtime | 26.9277 |
| eval/samples_per_second | 18.568 |
| eval/steps_per_second | 2.34 |
| total_flos | 1.3084378739656704e+16 |
| train/epoch | 1.0 |
| train/global_step | 313.0 |
| train/grad_norm | 0.20922 |


In [14]:
# save the LoRA adapter (via the trainer's PEFT-wrapped model) and the tokenizer for inference (later)
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

('./llama_finetuned/tokenizer_config.json',
 './llama_finetuned/special_tokens_map.json',
 './llama_finetuned/chat_template.jinja',
 './llama_finetuned/tokenizer.json')

Let's evaluate our finetuned model on the test set.

In [19]:
# evaluate the finetuned model on the test set (similar to what we did in the baseline model)

y_true = test_dataset["label"]
y_true = ["positive" if label == 1 else "negative" for label in y_true]
y_pred = predict(test_dataset, model, tokenizer)
evaluate(y_true, y_pred)

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.

Accuracy: 0.963
Accuracy for label positive: 0.973
Accuracy for label negative: 0.952

Classification Report:
              precision    recall  f1-score   support

    positive       0.95      0.97      0.96      1006
    negative       0.97      0.95      0.96       994

    accuracy                           0.96      2000
   macro avg       0.96      0.96      0.96      2000
weighted avg       0.96      0.96      0.96      2000


Confusion Matrix:
[[979  27]
 [ 48 946]]
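To put the two runs side by side, the sketch below recomputes accuracy and error counts from both confusion matrices reported in this tutorial. `summarize` is a hypothetical helper, and any prediction that is not on the diagonal (including unparsed outputs missing from the matrix) is counted as an error.

```python
def summarize(conf_matrix, total):
    """Return (accuracy, error_count) given a 2x2 confusion matrix and test set size."""
    correct = conf_matrix[0][0] + conf_matrix[1][1]
    return correct / total, total - correct


# Confusion matrices copied from the baseline and finetuned outputs above
baseline_acc, baseline_errors = summarize([[963, 42], [95, 898]], 2000)
finetuned_acc, finetuned_errors = summarize([[979, 27], [48, 946]], 2000)

relative_error_reduction = (baseline_errors - finetuned_errors) / baseline_errors

print(f"baseline:  acc={baseline_acc:.4f} errors={baseline_errors}")
print(f"finetuned: acc={finetuned_acc:.4f} errors={finetuned_errors}")
print(f"relative error reduction: {relative_error_reduction:.1%}")
```

On this 2,000-review test sample, one epoch of LoRA finetuning cut the number of misclassified reviews from 139 to 75, a reduction of a little under half.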
