# Finetuning LLaMA 3.2 1B on Python LeetCode Solutions

This notebook demonstrates how I fine-tuned the 1B-parameter LLaMA 3.2 model using the [Unsloth](https://github.com/unslothai/unsloth) library for efficient fine-tuning. The goal is to teach the model to generate clean Python solutions to LeetCode-style algorithm problems from natural-language problem descriptions.

Adapted from:
- https://www.youtube.com/watch?v=YZW3pkIR-YE&t=505s

Other sources:
- https://www.youtube.com/watch?v=bZcKYiwtw1I&t=572s
- https://huggingface.co/docs/trl/en/sft_trainer#format-your-input-prompts
- https://stackoverflow.com/questions/1663807/how-do-i-iterate-through-two-lists-in-parallel
- https://discuss.huggingface.co/t/guide-the-best-way-to-calculate-the-perplexity-of-fixed-length-models/193/2
- https://stackoverflow.com/questions/59209086/calculate-perplexity-in-pytorch
- https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html
- https://en.wikipedia.org/wiki/Perplexity
- https://huggingface.co/docs/transformers/perplexity
- https://huggingface.co/spaces/evaluate-metric/bleu
- https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b/

Dataset used: 
https://huggingface.co/datasets/LimYeri/LeetCode_Python_Solutions_v2

All imports and libraries needed for this project.

In [1]:
import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import standardize_sharegpt
from datasets import load_dataset
from transformers import TextStreamer
from tqdm import tqdm
import math
import random


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!


# Load Base Llama 3.2 1B Model
- Use 4-bit quantization, which greatly reduces GPU memory usage.

In [2]:
max_seq_length = 2048 
dtype = None # None for auto detection.
load_in_4bit = True # Use 4bit quantization 

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    GRID A100X-20C. Num GPUs = 1. Max memory: 19.996 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


# Applying Low Rank Adaptation
- LoRA freezes the base weights and trains only small low-rank adapter matrices attached to the attention and MLP projection layers. Combined with gradient checkpointing, this keeps training light enough for limited hardware.

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.19 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
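
To make "a small number of parameters" concrete, here is a rough sketch that counts the LoRA adapter weights for the configuration above. The layer dimensions (hidden size 2048, key/value projection size 512, MLP size 8192, 16 decoder layers) are my assumptions about the Llama 3.2 1B architecture, not values read from this notebook, and the helper name `lora_param_count` is hypothetical. Each adapted linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` trainable weights.

```python
# Assumed Llama 3.2 1B dimensions (not read from the notebook's model config).
HIDDEN = 2048      # hidden size
KV_DIM = 512       # key/value projection size (grouped-query attention)
MLP_DIM = 8192     # intermediate (MLP) size
NUM_LAYERS = 16    # decoder layers

# (input dim, output dim) for every projection listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Count LoRA weights: one (d_in x r) and one (r x d_out) matrix per projection."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


print(lora_param_count(16))  # 11272192
```

Under these assumptions, `r = 16` gives 11,272,192 adapter weights, which agrees with the trainable-parameter count Unsloth prints when training starts.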


# Formatting the Data
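- This prepares the dataset for fine-tuning by converting it into a format Llama 3.2 is compatible with.
- Each solution is reduced to the code inside its fenced code blocks, starting at the first `def`, with inline comments and blank lines removed.
- Use the Llama 3.1-style chat template for formatting prompts.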

In [4]:


tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(example):
    # get the two parts of the dataset we care about - the question and the python function
    question = example["question_content"]
    content = example["content"]

    # keep just the code for finetuning: no inline comments or explanations
    lines = content.strip().splitlines()
    
    code_lines = []
    in_code_block = False
    seen_function = False

    for line in lines:
        stripped = line.strip()

        # a code fence (with or without a language tag) toggles the in-code-block state
        if stripped.startswith("```"):
            in_code_block = not in_code_block
            continue

        if not in_code_block:
            continue

        # find out if it is the start of a function
        if stripped.startswith("def "):
            seen_function = True

        # now we are in a function block 
        if seen_function:
            new_line = ""
            # track whether we are inside a string and which quote character (single/double) opened it
            in_string = False
            quote = None
            # loop over each character
            for char in line:
                # handle each string start and end
                if char in {"'", '"'}:
                    if in_string and char == quote:
                        in_string = False
                    elif not in_string:
                        in_string = True
                        quote = char
                # stop if we run into # for inline comments
                if not in_string and char == "#":
                    break
                new_line += char
            # append lines that are not empty
            if new_line.strip():
                code_lines.append(new_line.rstrip())
    # cleaned code - join all lines
    code = "\n".join(code_lines).strip()
    
    # return none if not code
    if not code:
        return {"text": None}

    # follow llama's chat template to format. 
    prompt = tokenizer.apply_chat_template(
        [
            {
                "role": "user",
                "content": f"{question.strip()}\n\nWrite a Python function to solve the above problem.",
            },
            {
                "role": "assistant",
                "content": code,
            },
        ],
        tokenize=False,
        add_generation_prompt=False,
    )

    return {"text": prompt}


# load the full dataset
full_dataset = load_dataset("LimYeri/LeetCode_Python_Solutions_v2", split="train")

# split into train and test (90% train 10% test)
dataset_split = full_dataset.train_test_split(test_size=0.1, seed=42)

train_dataset = dataset_split["train"]
test_dataset = dataset_split["test"]

# preprocess train set
train_dataset = standardize_sharegpt(train_dataset)
train_dataset = train_dataset.map(formatting_prompts_func, batched=False)

# preprocess test set
test_dataset = standardize_sharegpt(test_dataset)
test_dataset = test_dataset.map(formatting_prompts_func, batched=False)

# function to filter out invalid samples (missing, non-string, or blank text)
def is_valid(example):
    text = example.get("text")
    return isinstance(text, str) and bool(text.strip())

# filter train and test sets
train_dataset = train_dataset.filter(is_valid)
test_dataset = test_dataset.filter(is_valid)
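
The comment-stripping step inside `formatting_prompts_func` is easier to check in isolation. Below is a minimal sketch of the same per-line logic pulled out into a standalone helper; the name `strip_inline_comment` is hypothetical and is not part of the notebook. Like the original loop, it resets its quote tracking on every line, so it does not handle escaped quotes or multi-line (triple-quoted) strings.

```python
def strip_inline_comment(line: str) -> str:
    """Return `line` without a trailing # comment, ignoring # inside quotes."""
    kept = []
    in_string = False
    quote = None
    for char in line:
        # a quote either opens a string or closes the one currently open
        if char in {"'", '"'}:
            if in_string and char == quote:
                in_string = False
            elif not in_string:
                in_string = True
                quote = char
        # a # outside of a string starts a comment: drop the rest of the line
        if not in_string and char == "#":
            break
        kept.append(char)
    return "".join(kept).rstrip()


print(strip_inline_comment('total += 1  # running count'))   # total += 1
print(strip_inline_comment('tag = "#python"  # a hashtag'))  # tag = "#python"
```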

Double-check the data before proceeding.

In [5]:
print("Example from training dataset:")
print(train_dataset[50]["text"])

print("Example from test dataset:")
print(test_dataset[50]["text"])

Example from training dataset:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

You are given an integer array `nums` where the `ith` bag contains `nums[i]` balls. You are also given an integer `maxOperations`.

You can perform the following operation at most `maxOperations` times:

*   Take any bag of balls and divide it into two new bags with a **positive** number of balls.
    *   For example, a bag of `5` balls can become two new bags of `1` and `4` balls, or two new bags of `2` and `3` balls.

Your penalty is the **maximum** number of balls in a bag. You want to **minimize** your penalty after the operations.

Return _the minimum possible penalty after performing the operations_.

**Example 1:**

**Input:** nums = \[9\], maxOperations = 2
**Output:** 3
**Explanation:** 
- Divide the bag with 9 balls into two bags of sizes 6 and 3. \[**9**\] -> \[6,3\].
- 

# Initialize SFTTrainer

This block contains the training arguments for our finetuning. The configuration uses:
- A per-device batch size of 2 with 4 gradient accumulation steps, for an effective batch size of 8
- A linear learning rate scheduler with a peak learning rate of `2e-4` and 5 warmup steps
- 8-bit AdamW optimizer to reduce memory usage
- Mixed-precision training (FP16 or BF16 depending on hardware support)
- 3 epochs of training over the dataset

Other options:
- packing=False: disables packing multiple sequences together 
- dataset_text_field='text': the dataset contains pre-formatted input/output examples under the "text" key
- output_dir="outputs": directory for saving checkpoints
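
As a sanity check on these settings, the number of optimizer steps can be estimated by hand. This is a rough sketch: the 14,143 training examples are taken from the training log further down, the helper name `total_optimizer_steps` is hypothetical, and the rounding is my approximation of how the trainer counts steps rather than its actual implementation.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps: batches per epoch, grouped by accumulation, times epochs."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# batch size 2, gradient accumulation 4, 3 epochs, single GPU
print(total_optimizer_steps(14_143, 2, 4, 3))  # 5304
```

The estimate of 5,304 steps matches the "Total steps" value in the training log.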



In [8]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, 
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        
        num_train_epochs = 3, 
        #max_steps = 2000,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 100,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Verify the tokenized dataset before starting training.

In [9]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nOur hero Teemo is attacking an enemy Ashe with poison attacks! When Teemo attacks Ashe, Ashe gets poisoned for a exactly `duration` seconds. More formally, an attack at second `t` will mean Ashe is poisoned during the **inclusive** time interval `[t, t + duration - 1]`. If Teemo attacks again **before** the poison effect ends, the timer for it is **reset**, and the poison effect will end `duration` seconds after the new attack.\n\nYou are given a **non-decreasing** integer array `timeSeries`, where `timeSeries[i]` denotes that Teemo attacks Ashe at second `timeSeries[i]`, and an integer `duration`.\n\nReturn _the **total** number of seconds that Ashe is poisoned_.\n\n**Example 1:**\n\n**Input:** timeSeries = \\[1,4\\], duration = 2\n**Output:** 4\n**Explanation:** Teemo's attacks on Ashe

In [10]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = GRID A100X-20C. Max memory = 19.996 GB.
1.49 GB of memory reserved.


# Start Finetuning

In [11]:
trainer_stats = trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 14,143 | Num Epochs = 3 | Total steps = 5,304
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192/1,000,000,000 (1.13% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
100,0.8682
200,0.6715
300,0.6335
400,0.5991
500,0.5594
600,0.551
700,0.4921
800,0.4898
900,0.4694
1000,0.442


# Inference on the Finetuned Model


- After training, we can test the model by feeding it natural-language prompts and observing the Python code it generates. Below are examples where the model is asked to solve problems from LeetCode.

- We use the llama-3.1 chat template to structure the prompt. The output is then decoded back into human-readable text using the tokenizer.


In [12]:

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "solve th 2sum problem leetcode,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 0.7, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nsolve th 2sum problem leetcode,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou are given four integers `num1`, `num2`, `minimum`, and `maximum`. Each of the integers `num1` and `num2` contains **exactly** `two` digits.\n\nReturn _the minimum sum that the two numbers can be added as integers to have_. Return `0`']

In [13]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": """
    Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target.

You may assume that each input would have exactly one solution, and you may not use the same element twice.

You can return the answer in any order.
    """},]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 512,
                   use_cache = True, temperature = 0.7, min_p = 0.1)

def twoSum(self, nums: List[int], target: int) -> List[int]:
        for i in range(len(nums)):
            for j in range(i+1,len(nums)):
                if nums[i]+nums[j]==target:
                    return [i,j]<|eot_id|>


In [14]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": """
    Given the `root` of a binary tree, return _the zigzag level order traversal of its nodes' values_. (i.e., from left to right, then right to left for the next level and alternate between).
    """},]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 512,
                   use_cache = True, temperature = 0.7, min_p = 0.1)

def zigzagLevelOrder(self, root: Optional[TreeNode]) -> List[List[int]]:
        ans = []
        q = collections.deque()
        q.append(root)
        while q:
            temp = []
            for i in range(len(q)):
                node = q.popleft()
                if node.left: q.append(node.left)
                if node.right: q.append(node.right)
                temp.append(node.val)
            ans.append(temp)
        return ans<|eot_id|>


# Evaluating Model Performance: Perplexity

To measure how well our finetuned LLaMA model learned the LeetCode task, we calculate **perplexity**, a common metric for language modeling.

Perplexity reflects how surprised the model is by the correct answer. The lower the score, the better the model's predictive ability.

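- It evaluates the model's ability to predict each next token given the tokens before it.
- Lower values mean the model's predictions align more closely with the expected output.
- Perplexity is computed over 1000 examples sampled from the held-out test split, using a sliding window (stride) approach.

Limitation:
- The score covers the full formatted text (system header, problem statement, and solution), not just the solution tokens, and a low perplexity does not guarantee that generated code is correct. As a result, it measures how well the model fits this data format, but not real-world performance.

### Formula: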

**Perplexity = exp(−(1/N) · Σᵢ log P(xᵢ | x₁, …, xᵢ₋₁))**

Where:
- **N** is the number of tokens
- **P(xᵢ | x₁, …, xᵢ₋₁)** is the probability the model assigns to the *i-th* token given the tokens before it
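
A tiny worked example of the formula, using made-up per-token probabilities rather than real model outputs (the helper name `perplexity` is just for illustration):

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that gives every correct token probability 0.25 is as uncertain
# as a uniform choice among 4 options, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Giving every correct token probability 0.8 yields a perplexity of about 1.25.
print(perplexity([0.8, 0.8, 0.8, 0.8]))
```

Read the other way, a perplexity of P corresponds to a geometric-mean per-token probability of 1/P.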


In [15]:

# get max input length
max_length = model.config.max_position_embeddings

device = model.device

# https://huggingface.co/docs/datasets/v1.2.0/processing.html used this to reference the shuffle() function
shuffled_dataset = test_dataset.shuffle(seed = 42).select(range(1000)) # the range here is the number of samples we are testing.

# accumulators to keep track of total negative log likelihood and total number of tokens used
nll_sum = 0.0
n_tokens = 0

stride = 512
model.eval()

# loop over each prompt
# reference for explanation of code: https://huggingface.co/docs/transformers/perplexity

for example in tqdm(shuffled_dataset):
    prompt = example['text']
    encodings = tokenizer(prompt, return_tensors="pt")
    input_ids = encodings.input_ids
    seq_len = input_ids.size(1)

    prev_end_loc = 0

    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc
        inputs = input_ids[:, begin_loc:end_loc].to(device)

        targets = inputs.clone()
        targets[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(inputs, labels = targets)
            n_log = outputs.loss

        num_valid_tokens = (targets != -100).sum().item() 
        num_loss_tokens = num_valid_tokens - inputs.size(0)
        nll_sum += n_log * num_loss_tokens
        n_tokens += num_loss_tokens

        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

avg_nll = nll_sum / n_tokens  # average negative log-likelihood per token
ppl = torch.exp(avg_nll)

print(f"perplexity: {ppl}")

100%|██████████| 1000/1000 [00:46<00:00, 21.35it/s]

perplexity: 1.2483659982681274
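
To see what the sliding-window loop above does to a single sequence, here is a small sketch that reproduces only its index arithmetic, with no model involved. The function name `sliding_windows` and the example lengths are made up for illustration.

```python
def sliding_windows(seq_len, max_length, stride):
    """Return (begin, end, trg_len) for each window the evaluation loop would score."""
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        # only the tokens after the previous window's end count toward the loss
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return windows


print(sliding_windows(seq_len=1200, max_length=1024, stride=512))
# [(0, 1024, 1024), (512, 1200, 176)]
print(sliding_windows(seq_len=300, max_length=1024, stride=512))
# [(0, 300, 300)]
```

In the second window of the first example, tokens 512 to 1023 serve only as context and are masked with `-100`, so each token is scored exactly once. If, as I would expect, `max_position_embeddings` is larger than every example in this dataset, each example is scored in a single window.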





# Analysis of perplexity
- A perplexity of 1.0 is the theoretical minimum: it would mean the model assigns probability 1 to every token in the text.
- This model achieved a perplexity of approximately **1.25**, which means it predicts the test text with high confidence and suggests that the finetuning was successful.
- This aligns with qualitative inspection of the generated code, which looks mostly syntactically and semantically valid. However, the correctness of the generated code has not been tested quantitatively.

In [16]:
# Save Model locally.
trainer.model.save_pretrained("models/llama-leetcode-4bit")
tokenizer.save_pretrained("models/llama-leetcode-4bit")


('models/llama-leetcode-4bit/tokenizer_config.json',
 'models/llama-leetcode-4bit/special_tokens_map.json',
 'models/llama-leetcode-4bit/tokenizer.json')

# Evaluation of Generated Code with BLEU

To evaluate the quality of the model's generated solutions, I used the BLEU score (Bilingual Evaluation Understudy). BLEU is a widely used machine-translation metric that measures how similar a model's output is to one or more reference texts.

### How BLEU Works:
- It compares n-gram overlaps between the model's generated output and the reference text
- A score of 1.0 means a perfect match; 0.0 means no overlap at all
- BLEU penalizes outputs that are shorter than the reference with a brevity penalty
- While not designed for analyzing code, this metric serves as a rough measure of how close the generated code is to the reference solutions
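
The n-gram overlap idea can be illustrated with a toy example. This is a simplified sketch of clipped n-gram precision using whitespace tokens; the helper names are hypothetical, and the Hugging Face `evaluate` implementation uses its own tokenization and combines several n-gram orders.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference (counts clipped)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / sum(cand_counts.values())


reference = "return a + b".split()
candidate = "return b + a".split()
print(clipped_precision(candidate, reference, 1))  # 1.0
print(clipped_precision(candidate, reference, 2))  # 0.0
```

The two lines compute the same sum, yet they share every unigram and no bigram. This is why BLEU can rate functionally equivalent code as a poor match.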

### How I Used It:
- I passed 500 LeetCode problems from my test set through the model to generate answers
- Each output was compared against the reference solution from the dataset
- The BLEU score was computed using the Hugging Face `evaluate` library

In [17]:
!uv pip install evaluate

Using Python 3.12.3 environment at: /home/exouser/.venv
Audited 1 package in 13ms


In [40]:
import evaluate
bleu = evaluate.load("bleu")

predictions = []
references = []
shuffled_dataset = test_dataset.shuffle(seed = 42).select(range(500)) # the range here is the number of samples we are testing.

# Loop through examples
for example in tqdm(shuffled_dataset):
    prompt = example["question_content"].strip()

    text = example["text"]
    if "<|start_header_id|>assistant<|end_header_id|>" in text:
        reference_code = text.split("<|start_header_id|>assistant<|end_header_id|>")[1].split("<|eot_id|>")[0].strip()
    else:
        reference_code = ""
    
    if not reference_code:
        continue

    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": f"{prompt}\n\nWrite a Python function to solve the above problem."}],
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=512,
            use_cache=True,
            temperature=0.7, 
            min_p=0.1,
            eos_token_id=tokenizer.eos_token_id,
        )

    # decode only the newly generated tokens (everything after the prompt),
    # so the prompt text never leaks into the prediction
    generated_tokens = outputs[0][inputs.shape[-1]:]
    decoded_output = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

    predictions.append(decoded_output)
    references.append([reference_code])

100%|██████████| 500/500 [18:17<00:00,  2.19s/it]


Print a few examples to double-check the outputs: the last two references first, then the last two predictions.

In [43]:
for reference in references[-2:]:
    print(reference[0])

for prediction in predictions[-2:]:
    print(prediction)

def largest_merge(word1: str, word2: str) -> str:
    merge = []
    while word1 or word2:
        if word1 > word2:
            merge.append(word1[0])
            word1 = word1[1:]
        else:
            merge.append(word2[0])
            word2 = word2[1:]
    return ''.join(merge)
def areOccurrencesEqual(self, s: str) -> bool:
        d={}
        for i in s:
            if i in d:
                d[i]+=1
            else:
                d[i]=1
        t=d[s[0]]
        for v in d.values():
            if v!=t:
                return False
        return True
def is_sum_of_two_equal_to_three(nums):
    count = 0
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == 3:
                count += 1
    return count == 1
def areOccurrencesSame(s: str) -> bool:
    char_count = {}
    for c in s:
        char_count[c] = char_count.get(c, 0) + 1
    return all(x == y for x, y in char_count.items())


In [42]:
# calculate BLEU score.
results = bleu.compute(predictions=predictions, references=references)
print("BLEU Score:", results)

BLEU Score: {'bleu': 0.23789626033025443, 'precisions': [0.5693364135387745, 0.33044148833622516, 0.23164367840980446, 0.1782282249414443], 'brevity_penalty': 0.8013504751096447, 'length_ratio': 0.8186944752915047, 'translation_length': 47183, 'reference_length': 57632}


# Analysis of BLEU Score

The model achieved a BLEU score of **0.238**, which, according to Google's guidance on interpreting BLEU scores, falls in the range described as "The gist is clear, but has significant grammatical errors". This suggests that the finetuned model picks up the basic structure and syntax of the reference solutions, but its output often differs substantially from them. Because BLEU only measures n-gram overlap, it cannot tell whether a generated solution is actually correct.

  
- **Length Ratio:**  
  - 0.82: generated solutions are about 18% shorter than the reference solutions overall (47,183 vs. 57,632 tokens)

- **Brevity Penalty:**  
  - 0.80: the model is under-generating, so BLEU scales the score down because the outputs are shorter than the references
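
The reported components fit together in the standard BLEU formula: the brevity penalty multiplied by the geometric mean of the four n-gram precisions. The sketch below recomputes the final score from the values printed above; it is only a consistency check on the reported numbers, not a re-run of the evaluation, and the helper name `brevity_penalty` is my own.

```python
import math

# values copied from the BLEU output above
precisions = [0.5693364135387745, 0.33044148833622516,
              0.23164367840980446, 0.1782282249414443]
translation_length = 47183
reference_length = 57632


def brevity_penalty(candidate_len, reference_len):
    """1.0 when the output is at least as long as the reference, else exp(1 - ref/cand)."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


bp = brevity_penalty(translation_length, reference_length)
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

print(round(bp, 4))             # 0.8014
print(round(bp * geo_mean, 4))  # 0.2379
```

Without the brevity penalty the score would be about 0.30, so the shorter outputs account for a noticeable share of the gap.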

### Interpretation:
I will have to evaluate BLEU scores on the base model with the same data to see whether finetuning actually made a difference, which I will get into later.