## Llama finetuning to evaluate math questions - Deep Learning Midterm

### Kaushik Mellacheruvu
### Srirama Bulusu

Adapted from the [official Unsloth notebook](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=MKX_XKs_BNZR).

In [1]:
# %%capture
# This cell will take time
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"


In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## Load model and wrap with LoRA adapters

The chosen LoRA hyperparameters:
1. r = 64
2. lora_alpha = 64
3. use_rslora = True
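
With `use_rslora = True`, the low-rank update is scaled by `lora_alpha / sqrt(r)` instead of the standard `lora_alpha / r`. The sketch below only works through that arithmetic for the values listed above. The helper name `lora_scaling` is illustrative and is not part of Unsloth or PEFT.

```python
import math

def lora_scaling(r: int, lora_alpha: float, use_rslora: bool) -> float:
    """Return the multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        # Rank-stabilized LoRA divides by sqrt(r) so the update does not shrink as r grows
        return lora_alpha / math.sqrt(r)
    # Standard LoRA divides by r
    return lora_alpha / r

standard = lora_scaling(r=64, lora_alpha=64, use_rslora=False)
rank_stabilized = lora_scaling(r=64, lora_alpha=64, use_rslora=True)
print(standard, rank_stabilized)  # 1.0 8.0
```

With r = 64 and lora_alpha = 64, the rank-stabilized multiplier is eight times the standard one.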

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
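
The trainable-parameter count that the trainer reports later in this notebook (167,772,160) can be reproduced from the LoRA configuration. This is a minimal sketch, assuming the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), which are not stated in this notebook.

```python
# Assumed Llama 3.1 8B dimensions (from the published model config, not this notebook)
hidden_size = 4096
intermediate_size = 14336
kv_dim = 8 * 128  # 8 key/value heads x 128 head dim (grouped-query attention)
num_layers = 32
r = 64  # LoRA rank used in this notebook

# (in_features, out_features) of each adapted linear layer
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

# LoRA adds A (r x in) and B (out x r) to each layer: r * (in + out) parameters
params_per_layer = sum(r * (n_in + n_out) for n_in, n_out in target_shapes.values())
total_trainable = params_per_layer * num_layers
print(f"{total_trainable:,}")  # 167,772,160
```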


## Competition dataset

In [5]:
# download and load competition dataset

from datasets import load_dataset
dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp")
# print and see dataset
dataset


DatasetDict({
    train: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 1000000
    })
    test: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 10000
    })
})

The prompt was updated so that the task is stated explicitly to the model.
The solution is included in the prompt to give the model more context during training.

In [6]:
prompt = """You are a skilled mathematician. Your task is to assess the accuracy of a given answer to a math question based on the explanation provided. Carefully check both the answer and explanation. If both are correct, respond with 'True'. If either the answer or explanation is incorrect, respond with 'False'.

### Question:
{}

### Answer:
{}

### Explanation
{}

### Output:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    questions = examples["question"]
    answers   = examples["answer"]
    solutions = examples["solution"]
    labels    = examples["is_correct"]
    texts = []
    # Distinct loop names avoid shadowing the batch lists and the built-in input()
    for question, answer, solution, label in zip(questions, answers, solutions, labels):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(question, answer, solution, label) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

In [7]:
# Process the training dataset and generate prompt for each datapoint
train_dataset = dataset['train'].shuffle(seed=3407)
train_dataset = train_dataset.map(formatting_prompts_func, batched = True,)

train_valid_split = train_dataset.train_test_split(test_size=0.005, seed=3407)

# Access the splits
train_data = train_valid_split['train']
valid_data = train_valid_split['test']

print(f"Training data size: {len(train_data)}")
print(f"Validation data size: {len(valid_data)}")


Training data size: 995000
Validation data size: 5000


In [8]:
# Print a sample training example
train_data['text'][0]

"You are a skilled mathematician. Your task is to assess the accuracy of a given answer to a math question based on the explanation provided. Carefully check both the answer and explanation. If both are correct, respond with 'True'. If either the answer or explanation is incorrect, respond with 'False'.\n\n### Question:\nMark is baking bread. He has to let it rise for 120 minutes twice. He also needs to spend 10 minutes kneading it and 30 minutes baking it. How many minutes does it take Mark to finish making the bread?\n\n### Answer:\n280\n\n### Explanation\nLet's solve this problem using Python's sympy library.\n<llm-code>\nimport sympy as sp\n\n# each rise takes 120 minutes\nrise_time_one = 120\nrise_time_two = 120\nrise_time = rise_time_one + rise_time_two\n\n# kneading takes 10 minutes and baking takes 30 minutes\nbaking_time = 30\nkneading_time = 10\n\n# add all times to get total time\ntotal_time = rise_time + kneading_time + baking_time\ntotal_time\n</llm-code>\n<llm-code-output

## SFT

The chosen parameters:
1. batch size = 4
2. gradient accumulation steps = 32
3. warm up steps = 40
4. max steps = 500
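
These settings determine how much of the training split the run actually sees. The sketch below works through the arithmetic, using the single GPU and the 995,000-example training split reported elsewhere in this notebook.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 32
max_steps = 500
num_gpus = 1                  # single GPU, as reported by the trainer banner
num_train_examples = 995_000  # training split size after the 0.5% validation hold-out

# One optimizer step consumes batch_size * accumulation_steps examples per GPU
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / num_train_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
print(f"Fraction of one epoch: {fraction_of_epoch:.3f}")
```

So 500 steps cover 64,000 examples, which is roughly 6.4% of one pass over the training split.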

In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

training_args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 32,
        warmup_steps = 40,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 500,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        eval_steps=50,  # Evaluate every 50 steps
        evaluation_strategy="steps",
        output_dir = "outputs",
        report_to="wandb",
        run_name="llama-finetuning-run"
    )

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset=train_data,
    eval_dataset=valid_data,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = training_args
)

max_steps is given, it will override any value given in num_train_epochs
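
With `lr_scheduler_type = "linear"`, `warmup_steps = 40`, and `max_steps = 500`, the learning rate ramps up to 2e-4 and then decays linearly to zero. The function below is a plain-Python sketch of that shape under the usual linear warmup and linear decay rule. It is not the Transformers implementation, and the name `linear_warmup_decay_lr` is illustrative.

```python
def linear_warmup_decay_lr(step: int, base_lr: float = 2e-4,
                           warmup_steps: int = 40, max_steps: int = 500) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warmup down to 0 at max_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)

for step in (0, 20, 40, 270, 500):
    print(step, linear_warmup_decay_lr(step))
```

The peak rate is reached at step 40, and half of it remains at step 270, the midpoint of the decay phase.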


Initialize Weights & Biases (wandb) to track the training and validation loss.

In [15]:
import wandb
import os

wandb.init(
    project="llama-finetuning",
    entity="kaushikmelch-new-york-university",
    config={
        "learning_rate": 2e-4,
        "per_device_train_batch_size": 4,
        "gradient_accumulation_steps": 32,
        "max_steps": 500,
        "warmup_steps": 40,
    }
)


[34m[1mwandb[0m: Currently logged in as: [33mkaushikmelch[0m ([33mkaushikmelch-new-york-university[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [16]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 995,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 32
\        /    Total batch size = 128 | Total steps = 500
 "-____-"     Number of trainable parameters = 167,772,160


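| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 0.6343        | 0.630645        |
| 100  | 0.5822        | 0.591511        |
| 150  | 0.5482        | 0.55257         |
| 200  | 0.5305        | 0.520559        |
| 250  | 0.5098        | 0.496463        |
| 300  | 0.4647        | 0.47325         |
| 350  | 0.4565        | 0.452602        |
| 400  | 0.4567        | 0.436098        |
| 450  | 0.4312        | 0.42386         |
| 500  | 0.4507        | 0.417399        |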
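
The validation loss falls at every evaluation point in the log above. A short sketch that summarizes the trend, with the values copied from that log (the variable names are illustrative):

```python
# (step, training loss, validation loss) copied from the training log above
history = [
    (50, 0.6343, 0.630645),
    (100, 0.5822, 0.591511),
    (150, 0.5482, 0.55257),
    (200, 0.5305, 0.520559),
    (250, 0.5098, 0.496463),
    (300, 0.4647, 0.47325),
    (350, 0.4565, 0.452602),
    (400, 0.4567, 0.436098),
    (450, 0.4312, 0.42386),
    (500, 0.4507, 0.417399),
]

val_losses = [row[2] for row in history]
# True if each evaluation improved on the previous one
monotonic_decrease = all(a > b for a, b in zip(val_losses, val_losses[1:]))
relative_drop = (val_losses[0] - val_losses[-1]) / val_losses[0]
best_step = min(history, key=lambda row: row[2])[0]

print(f"Validation loss decreased at every evaluation: {monotonic_decrease}")
print(f"Relative drop from step 50 to step 500: {relative_drop:.1%}")
print(f"Step with the lowest validation loss: {best_step}")
```

Because the lowest validation loss is at the final step, the run had not yet plateaued when it stopped at 500 steps.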


In [17]:
# Log final stats
wandb.log({"final_loss": trainer_stats.training_loss})

# Save and upload the model to WandB
trainer.save_model("outputs/llama_finetuned_model")
artifact = wandb.Artifact("llama_model", type="model")
artifact.add_dir("outputs")
wandb.log_artifact(artifact)

[34m[1mwandb[0m: Adding directory to artifact (./outputs)... Done. 10.6s


<Artifact llama_model>

## Inference for one datapoint

In [18]:
# Sample inference data point
test_dataset = dataset['test']

sample_ques = test_dataset['question'][0]
sample_ans = test_dataset['answer'][0]
sample_sol = test_dataset['solution'][0]

In [19]:
print(test_dataset[0])

{'question': 'The Parker family needs to leave the house by 5 pm for a dinner party. Mrs. Parker was waiting to get into the bathroom at 2:30 pm. Her oldest daughter used the bathroom for 45 minutes and her youngest daughter used the bathroom for another 30 minutes. Then her husband used it for 20 minutes. How much time will Mrs. Parker have to use the bathroom to leave on time?', 'is_correct': True, 'answer': '205', 'solution': "Let's solve this problem using Python code.\n<llm-code>\nminutes_per_hour = 60\nminutes_left_before_5 = 5 * minutes_per_hour\ntotal_time_spent_by_family = 45 + 30 + 20\nminutes_before_5_after_family = minutes_left_before_5 - total_time_spent_by_family\nminutes_before_5_after_family\n</llm-code>\n<llm-code-output>\n205\n</llm-code-output>\nThus Mrs. Parker will have \\boxed{205} minutes in the bathroom before the family leaves."}


In [20]:
# Running inference on a single test example
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
input_prompt = prompt.format(
        sample_ques, # question
        sample_ans, # given answer
        sample_sol, # given solution
        "", # output - leave this blank for generation! The LLM will generate whether it is True or False
    )

print("Input Prompt:\n", input_prompt)
inputs = tokenizer(
[
    input_prompt
], return_tensors = "pt").to("cuda")

input_shape = inputs['input_ids'].shape
input_token_len = input_shape[1] # index 1 is the sequence length (index 0 is the batch)
outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
# you can get the whole generated text by uncommenting the below line
# text_generated = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Decode only the newly generated tokens, skipping the prompt tokens
response = tokenizer.batch_decode([outputs[0][input_token_len:]], skip_special_tokens=True)
response

Input Prompt:
 You are a skilled mathematician. Your task is to assess the accuracy of a given answer to a math question based on the explanation provided. Carefully check both the answer and explanation. If both are correct, respond with 'True'. If either the answer or explanation is incorrect, respond with 'False'.

### Question:
The Parker family needs to leave the house by 5 pm for a dinner party. Mrs. Parker was waiting to get into the bathroom at 2:30 pm. Her oldest daughter used the bathroom for 45 minutes and her youngest daughter used the bathroom for another 30 minutes. Then her husband used it for 20 minutes. How much time will Mrs. Parker have to use the bathroom to leave on time?

### Answer:
205

### Explanation
Let's solve this problem using Python code.
<llm-code>
minutes_per_hour = 60
minutes_left_before_5 = 5 * minutes_per_hour
total_time_spent_by_family = 45 + 30 + 20
minutes_before_5_after_family = minutes_left_before_5 - total_time_spent_by_family
minutes_before_5_a

['True']
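
The model returns its verdict as text, so downstream code has to map that string to a boolean. The helper below is one hedged way to do this; `parse_verdict` is a hypothetical name and is not used elsewhere in this notebook. It tolerates surrounding whitespace and letter case, and returns `None` when the text starts with neither label.

```python
from typing import Optional

def parse_verdict(response: str) -> Optional[bool]:
    """Map generated text to True/False, or None if neither label leads the text."""
    text = response.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None  # unexpected output; the caller decides how to handle it

print(parse_verdict("True"), parse_verdict(" false\n"), parse_verdict("maybe"))
```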

## Saving model

In [21]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [22]:
# Reload the saved LoRA adapters for inference
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # Directory the adapters were saved to above
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference


==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## Inference for entire test dataset


In [None]:
import gc
def clear_gpu_memory():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

Batch the test data in groups of 16 for faster evaluation.
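
The batch boundaries follow from stepping through the indices in strides of 16. The sketch below uses a hypothetical helper, `batch_ranges`, together with the 10,000-row test split reported earlier in this notebook.

```python
def batch_ranges(num_examples: int, batch_size: int):
    """Yield (start, end) index pairs that cover range(num_examples) in order."""
    for start in range(0, num_examples, batch_size):
        # The last batch is clipped so it never runs past the dataset
        yield start, min(start + batch_size, num_examples)

ranges = list(batch_ranges(10_000, 16))
print(len(ranges), ranges[:3], ranges[-1])
```

For 10,000 examples this gives 625 batches, starting at indices 0, 16, 32, and so on.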

In [None]:
import pandas as pd

batch_size = 16
results = []

# Decoder-only models should be left-padded for batched generation, so that
# every prompt ends exactly where generation begins.
tokenizer.padding_side = "left"

for start_idx in range(0, len(test_dataset), batch_size):
    batch = test_dataset.select(range(start_idx, min(start_idx + batch_size, len(test_dataset))))
    input_prompts = [
        prompt.format(entry['question'], entry['answer'], entry['solution'], "") for entry in batch
    ]
    inputs = tokenizer(input_prompts, return_tensors="pt", padding=True).to("cuda")
    # After padding, every row has the same length, so a single offset separates
    # the prompt tokens from the generated tokens for the whole batch.
    input_token_len = inputs['input_ids'].shape[1]

    outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
    generated_responses = [
        tokenizer.decode(output[input_token_len:], skip_special_tokens=True).strip()
        for output in outputs
    ]

    for idx, (entry, response) in enumerate(zip(batch, generated_responses)):
        # True when the generated label matches the dataset's is_correct field
        is_correct = response.lower() == str(entry['is_correct']).lower()
        results.append((start_idx + idx, bool(is_correct)))

    print(f"Processed batch starting at index {start_idx}")
    clear_gpu_memory()


df = pd.DataFrame(results, columns=["ID", "is_correct"])
df.to_csv("inference_results.csv", index=False)

print("Inference complete. Results saved to inference_results.csv.")

Processed batch starting at index 0
Processed batch starting at index 16
Processed batch starting at index 32
Processed batch starting at index 48
Processed batch starting at index 64
Processed batch starting at index 80
Processed batch starting at index 96
Processed batch starting at index 112
Processed batch starting at index 128
Processed batch starting at index 144
Processed batch starting at index 160
Processed batch starting at index 176
Processed batch starting at index 192
Processed batch starting at index 208
Processed batch starting at index 224
Processed batch starting at index 240
Processed batch starting at index 256
Processed batch starting at index 272
Processed batch starting at index 288
Processed batch starting at index 304
Processed batch starting at index 320
Processed batch starting at index 336
Processed batch starting at index 352
Processed batch starting at index 368
Processed batch starting at index 384
Processed batch starting at index 400
Processed batch star