# Analysis of Base Model
- In this notebook, I load the base (not finetuned) Llama 3.2 1B Instruct model to evaluate its performance.
- For the translations, I loaded the same dataset I used for the English-Spanish translations, to measure this model's BLEU score.
- For the LeetCode problems, I used generative AI to write concise prompts based on examples from my dataset.
 

In [1]:
import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import standardize_sharegpt
from datasets import load_dataset
from transformers import TextStreamer
from tqdm import tqdm
!uv pip install evaluate
import evaluate

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
Using Python 3.12.3 environment at: /home/exouser/.venv
Audited 1 package in 6ms


# Load Base Model

In [2]:
base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# set up template
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

# enable Unsloth's faster inference mode (this notebook only generates, it does not train)
FastLanguageModel.for_inference(base_model)

print("loaded model")

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    GRID A100X-20C. Num GPUs = 1. Max memory: 19.996 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
loaded model


# Load EN-ES Dataset

In [6]:
dataset = load_dataset("hadi-myi2/TatoebaEN-ES", split="train")

def formatting_prompts_func(example):
    prompts = []
    
    for english, spanish in zip(example["EN"], example["ES"]):
        convo = [
            {
                "role": "user",
                "content": f"{english.strip()}\n\n Translate the english content above to spanish.",
            },
            {
                "role": "assistant",
                "content": spanish.strip(),
            }
        ]
        prompt_text = tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        prompts.append(prompt_text)
    
    return { "text": prompts }
dataset = dataset.map(formatting_prompts_func, batched = True,)

dataset[5]['text']

Map: 100%|██████████| 271340/271340 [00:09<00:00, 27489.66 examples/s]


'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nMuiriel is 20 now.\n\n Translate the english content above to spanish.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nMuiriel tiene 20 años ahora.<|eot_id|>'

# Calculate BLEU for Base Model - For EN-ES Translation

In [8]:
bleu = evaluate.load("bleu")
shuffled_dataset = dataset.shuffle(seed = 42).select(range(1000))
predictions = []
references = []
for example in tqdm(shuffled_dataset):
    prompt = example["EN"].strip()
    reference_translation = example["ES"].strip()

    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": f"{prompt}\n\nTranslate the english content above to spanish. give just the translation and nothing else"}],
        tokenize=True,
        add_generation_prompt= True,
        return_tensors="pt"
    ).to(base_model.device)

    with torch.no_grad():
        outputs = base_model.generate(
            input_ids=inputs,
            max_new_tokens=128,
            use_cache=True,
            temperature=0.7,
            min_p=0.1,
            eos_token_id=tokenizer.eos_token_id,
        )

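    # decode only the newly generated tokens, so the prompt (or a literal
    # "assistant" inside the translation) cannot leak into the prediction
    generated_tokens = outputs[0][inputs.shape[-1]:]
    decoded_output = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

    # keep only the first line of the answer
    decoded_output = decoded_output.split("\n")[0].strip()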


    predictions.append(decoded_output)
    references.append([reference_translation])

  0%|          | 0/1000 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 1000/1000 [07:28<00:00,  2.23it/s]


In [9]:
results = bleu.compute(predictions=predictions, references=references)
print("BLEU Score:", results)

BLEU Score: {'bleu': 0.1462780564425135, 'precisions': [0.39728427507665354, 0.19558898957086682, 0.12448728465955701, 0.08413583375570198], 'brevity_penalty': 0.8660465529559768, 'length_ratio': 0.8742660199131989, 'translation_length': 6849, 'reference_length': 7834}


Manually inspect the first few predictions to check whether the score is plausible.

In [13]:
for prediction in predictions[:10]: 
    print(prediction)

La vida ha llegado a su fin.
No content fue proporcionado para traducir.
Patience
Perro amaba pescados.
No gané el caso.
No se proporcionó el contenido original.
Tuvimos que casarnos.
No deje que su tono le dispare
Tom y Mary unieron al mar.
No tengo información sobre la contenido de la página.


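# BLEU Analysis - EN-ES Base Llama 3.2 1B
- This BLEU score (about 0.146) is weak.
- The model gets some words right (unigram precision is about 0.40) but struggles with structure (4-gram precision is about 0.08).
- Comparing the predictions with the references manually showed that many translations were far off, and there were signs of hallucination: several outputs claim that no content was provided instead of translating.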
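
As a sanity check on the numbers above, the reported score can be rebuilt from the other fields in the `evaluate` output. The sketch below assumes the standard BLEU definition (geometric mean of the four n-gram precisions, multiplied by the brevity penalty). `bleu_from_parts` is a hypothetical helper name used for illustration, not part of `evaluate`.

```python
import math


def bleu_from_parts(precisions, translation_length, reference_length):
    """Rebuild a corpus BLEU score from its n-gram precisions and lengths."""
    # Brevity penalty: 1.0 when the predictions are at least as long as the
    # references, otherwise exp(1 - reference_length / translation_length).
    if translation_length >= reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - reference_length / translation_length)

    # Without smoothing, a zero precision makes the whole score zero.
    if min(precisions) <= 0:
        return 0.0

    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty * geo_mean


# Values copied from the EN-ES BLEU output above.
en_es_precisions = [
    0.39728427507665354,
    0.19558898957086682,
    0.12448728465955701,
    0.08413583375570198,
]
en_es_bleu = bleu_from_parts(en_es_precisions, 6849, 7834)
print(round(en_es_bleu, 4))  # 0.1463, matching the reported 'bleu' value
```

The brevity penalty of about 0.87 explains part of the low score: the predictions are shorter overall than the references, which is consistent with the one-word and refusal-style outputs seen in the manual check.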

# Load LeetCode Style Q&A Dataset

In [3]:
def formatting_prompts_func(example):
    # get the two parts of the dataset we care about - the question and the python function
    question = example["question_content"]
    content = example["content"]

    # for finetuning to work correctly, we need just the code, with no inline comments or explanations
    lines = content.strip().splitlines()
    
    code_lines = []
    in_code_block = False
    seen_function = False

    for line in lines:
        stripped = line.strip()

        # a fence line (``` or ```python) toggles whether we are inside a code block
        if stripped.startswith("```"):
            in_code_block = not in_code_block
            continue

        if not in_code_block:
            continue

        # find out if it is the start of a function
        if stripped.startswith("def "):
            seen_function = True

        # now we are in a function block 
        if seen_function:
            new_line = ""
            # track whether we are inside a string, and which quote character (single/double) opened it
            in_string = False
            quote = None
            # loop over each character
            for char in line:
                # handle each string start and end
                if char in {"'", '"'}:
                    if in_string and char == quote:
                        in_string = False
                    elif not in_string:
                        in_string = True
                        quote = char
                # stop if we run into # for inline comments
                if not in_string and char == "#":
                    break
                new_line += char
            # append lines that are not empty
            if new_line.strip():
                code_lines.append(new_line.rstrip())
    # cleaned code - join all lines
    code = "\n".join(code_lines).strip()
    
    # return none if not code
    if not code:
        return {"text": None}

    # follow llama's chat template to format. 
    prompt = tokenizer.apply_chat_template(
        [
            {
                "role": "user",
                "content": f"{question.strip()}\n\nWrite a Python function to solve the above problem.",
            },
            {
                "role": "assistant",
                "content": code,
            },
        ],
        tokenize=False,
        add_generation_prompt=False,
    )

    return {"text": prompt}


# load the full dataset
full_dataset = load_dataset("LimYeri/LeetCode_Python_Solutions_v2", split="train")

# split into train and test (90% train 10% test)
dataset_split = full_dataset.train_test_split(test_size=0.1, seed=42)

train_dataset = dataset_split["train"]
test_dataset = dataset_split["test"]

# preprocess train set
train_dataset = standardize_sharegpt(train_dataset)
train_dataset = train_dataset.map(formatting_prompts_func, batched=False)

# preprocess test set
test_dataset = standardize_sharegpt(test_dataset)
test_dataset = test_dataset.map(formatting_prompts_func, batched=False)

# function to filter out invalid samples
def is_valid(example):
    text = example.get("text")
    # keep only non-empty strings (None and whitespace-only text are dropped)
    return isinstance(text, str) and bool(text.strip())

# filter train and test sets
train_dataset = train_dataset.filter(is_valid)
test_dataset = test_dataset.filter(is_valid)

Map: 100%|██████████| 14160/14160 [00:03<00:00, 3977.90 examples/s]
Map: 100%|██████████| 1574/1574 [00:00<00:00, 3947.27 examples/s]
Filter: 100%|██████████| 14160/14160 [00:00<00:00, 81364.03 examples/s]
Filter: 100%|██████████| 1574/1574 [00:00<00:00, 64469.78 examples/s]
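
The comment-stripping step inside `formatting_prompts_func` above is easy to get subtly wrong, so it helps to exercise it in isolation. The sketch below pulls the same character scan into a standalone helper; the name `strip_inline_comment` is hypothetical. Like the original loop, it assumes simple single-line strings and does not handle escaped quotes or triple-quoted strings.

```python
def strip_inline_comment(line: str) -> str:
    """Drop a trailing `#` comment from a line, ignoring `#` inside quotes."""
    kept = []
    in_string = False  # are we currently inside a quoted string?
    quote = None       # which quote character opened that string
    for char in line:
        if char in {"'", '"'}:
            if in_string and char == quote:
                in_string = False
            elif not in_string:
                in_string = True
                quote = char
        # a `#` outside of a string starts a comment: stop copying here
        if not in_string and char == "#":
            break
        kept.append(char)
    return "".join(kept).rstrip()


print(strip_inline_comment("x = 1  # set x"))
print(strip_inline_comment('s = "a#b"  # the hash inside the string survives'))
```

A line that consists only of a comment comes back as an empty string, which is why the original loop skips lines whose cleaned version is empty.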


# Calculate BLEU for Base Model - For LeetCode Style Q&A
- For some reason, generation was painfully slow here, so I stuck with 100 examples instead of the 500 that were used to test BLEU on the finetuned LeetCode model.

In [8]:
bleu = evaluate.load("bleu")

predictions = []
references = []
shuffled_dataset = test_dataset.shuffle(seed = 42).select(range(100)) # the range here is the number of samples we are testing.

# Loop through examples
for example in tqdm(shuffled_dataset):
    prompt = example["question_content"].strip()

    text = example["text"]
    if "<|start_header_id|>assistant<|end_header_id|>" in text:
        reference_code = text.split("<|start_header_id|>assistant<|end_header_id|>")[1].split("<|eot_id|>")[0].strip()
    else:
        reference_code = ""
    
    if not reference_code:
        continue

    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": f"{prompt}\n\nWrite a Python function to solve the above problem."}],
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(base_model.device)

    with torch.no_grad():
        outputs = base_model.generate(
            input_ids=inputs,
            max_new_tokens=512,
            use_cache=True,
            temperature=0.7, 
            min_p=0.1,
            eos_token_id=tokenizer.eos_token_id,
        )

    # decode only the newly generated tokens, so the prompt is not part of the prediction
    generated_tokens = outputs[0][inputs.shape[-1]:]
    # keep the model's full answer (code plus any surrounding explanation)
    decoded_output = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

    predictions.append(decoded_output)
    references.append([reference_code])

100%|██████████| 100/100 [10:31<00:00,  6.32s/it]


In [9]:
for reference in references[:2]:
    print(reference[0])

for prediction in predictions[:2]:
    print(prediction)

def validPalindrome(self, s: str) -> bool:
        n = len(s)
        if s == s[::-1]:
            return True
        i = 0
        j=n-1
        while(i<=j):
            if s[i] != s[j]:
                if s[i+1: j+1][::-1] == s[i+1: j+1] or s[i: j] == s[i: j][::-1]:
                    return True
                else:
                    return False
            else:
                i += 1
                j -= 1
def uniqueMorseRepresentations(self, words: List[str]) -> int:
        conversion = [".-","-...","-.-.","-..",".","..-.","--.","....","..",".---","-.-",".-..","--","-.","---",".--.","--.-",".-.","...","-","..-","...-",".--","-..-","-.--","--.."]
        alphabet = 'abcdefghijklmnopqrstuvwxyz'
        d = {}
        for i in range(len(alphabet)):
            d[alphabet[i]] = conversion[i]
        transformations = []
        for word in words:
            trans = ''
            for char in word:
                trans += d[char]
            if trans not in transformations:
 

In [10]:
# calculate BLEU score.
results = bleu.compute(predictions=predictions, references=references)
print("BLEU Score:", results)

BLEU Score: {'bleu': 0.05720113865610646, 'precisions': [0.1830039834413809, 0.07280273564958625, 0.03684995681645686, 0.02180587262851295], 'brevity_penalty': 1.0, 'length_ratio': 3.4475361278161745, 'translation_length': 38409, 'reference_length': 11141}


# Analysis of BLEU Scores of LeetCode Style Q&A Problems
- This shows where BLEU scores fall short.
- The model performs decently in this scenario, but its outputs are verbose and contain extra explanation rather than just the solution. The `length_ratio` of about 3.45 means the predictions are roughly 3.4 times as long as the references, which pulls down every n-gram precision.
- The manual analysis later in this notebook shows that the base model performs better than the finetuned model. These BLEU scores alone would suggest otherwise.
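
One way to reduce the verbosity penalty described above would be to score only the code portion of each prediction. The sketch below was not run as part of this notebook. It assumes the model wraps its solution in a Markdown code fence, as it does in the streamed sample output in the manual-inspection section, and `extract_first_code_block` is a hypothetical helper name.

```python
import re

# Build the fence marker programmatically so this example can itself
# live inside a fenced code block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_first_code_block(generated: str) -> str:
    """Return the first fenced code block, or the stripped text if none is found."""
    match = FENCE_RE.search(generated)
    if match:
        return match.group(1).strip()
    # No complete fence (for example, generation was cut off by max_new_tokens):
    # fall back to the full text so no prediction is silently dropped.
    return generated.strip()


sample = (
    "Here is a Python solution:\n\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE + "\n\n"
    + "This solution works by adding the two numbers."
)
print(extract_first_code_block(sample))
```

Applying a helper like this to each `decoded_output` before appending it to `predictions` would make the comparison closer to code-versus-code, although BLEU would still reward surface overlap rather than correctness.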

# Analysis of Performance on LeetCode Style Q&A Problems (Mostly Manual)
- I decided to manually inspect a few LeetCode problems.
- This should be enough to gauge whether the finetuning made a difference, but it isn't conclusive or quantitative.

In [None]:
# to generate these prompts, I gave ChatGPT some examples from the LeetCode dataset and asked it to write concise prompts
prompts = [
    "You are given an integer array `nums` and an integer `k`. Find the maximum subarray sum of all the subarrays of `nums` that meet the following conditions:\n- The length of the subarray is `k`, and\n- All the elements of the subarray are **distinct**.\nReturn the maximum subarray sum of all the subarrays that meet the conditions. If no subarray meets the conditions, return `0`.",
    
    "You are given an array `nums` consisting of positive integers. You are also given an integer array `queries` of size `m`. For the `ith` query, you want to make all of the elements of `nums` equal to `queries[i]`. You can perform the following operation on the array any number of times:\n- Increase or decrease an element of the array by `1`.\nReturn an array `answer` of size `m` where `answer[i]` is the **minimum** number of operations to make all elements of `nums` equal to `queries[i]`.",
    
    "You are a professional robber planning to rob houses along a street. Each house has a certain amount of money stashed, the only constraint stopping you from robbing each of them is that adjacent houses have security systems connected and it will automatically contact the police if two adjacent houses were broken into on the same night.\nGiven an integer array `nums` representing the amount of money of each house, return the maximum amount of money you can rob tonight **without alerting the police**.",
    
    "Given an integer number `n`, return the difference between the product of its digits and the sum of its digits.",
    
    "Design a data structure that follows the constraints of a **Least Recently Used (LRU) cache**.\nImplement the `LRUCache` class:\n- `LRUCache(int capacity)` initializes the LRU cache with positive size `capacity`.\n- `int get(int key)` returns the value of the key if the key exists, otherwise return -1.\n- `void put(int key, int value)` update the value of the key if it exists. Otherwise, add the key-value pair to the cache. If the number of keys exceeds the capacity, evict the least recently used key.\nThe functions `get` and `put` must each run in O(1) average time complexity.",
    
    "Given a 0-indexed integer array `nums`, return `true` if it can be made **strictly increasing** after removing **exactly one** element, or `false` otherwise. If the array is already strictly increasing, return `true`.",
    
    "You are given an integer array `nums` and an integer `target`.\nYou want to build an **expression** out of nums by adding one of the symbols `'+'` and `'-'` before each integer in nums and then concatenate all the integers.\nReturn the number of different **expressions** that you can build, which evaluates to `target`.",
    
    "Alice and Bob want to water `n` plants in their garden. Each plant needs a specific amount of water. They water the plants simultaneously from opposite ends with watering cans of given capacities.\nWhenever one can isn’t enough, they refill it.\nIn case they meet at the same plant, the one with more water should water it.\nReturn the **number of times** they have to refill to water all the plants.",
    
    "Given two binary trees `original` and `cloned` and a reference to a node `target` in the original tree, return a reference to the same node in the cloned tree.",
    
    "In some array `arr`, the values were in arithmetic progression: the values `arr[i + 1] - arr[i]` are all equal for every `0 <= i < arr.length - 1`.\nA value from `arr` was removed that **was not the first or last value in the array**.\nGiven `arr`, return the removed value.",
    
    "The Fibonacci numbers form a sequence such that each number is the sum of the two preceding ones, starting from `0` and `1`.\nGiven `n`, calculate `F(n)`.",
    
    "You are given a 2D integer array `questions` where each question earns you points and blocks the next `brainpower[i]` questions if you solve it.\nReturn the **maximum** points you can earn for the exam.",
    
    "In a row of dominoes, `tops[i]` and `bottoms[i]` represent the top and bottom halves of the ith domino.\nWe may rotate the ith domino. Return the minimum number of rotations so that all the values in `tops` are the same, or all the values in `bottoms` are the same.\nIf it cannot be done, return `-1`.",
    
    "You are given a string `s` consisting only of letters `'a'` and `'b'`. In a single step you can remove one **palindromic subsequence** from `s`.\nReturn the **minimum** number of steps to make the given string empty.",
    
    "You are given several logs, each with a unique ID and a timestamp in format `Year:Month:Day:Hour:Minute:Second`.\nImplement a `LogSystem` class to `put` logs and `retrieve` logs that fall within a time range, with precision defined by a `granularity`.",
    
    "You are given an integer array `nums`. Pick a number to delete and earn points equal to that number. Then, all elements equal to `nums[i] - 1` and `nums[i] + 1` are also deleted.\nReturn the maximum number of points you can earn by repeating this process.",
    
    "The cafeteria offers sandwiches and students queue to eat. Each student either takes the sandwich on top of the stack or goes to the end of the queue.\nThis continues until no student wants the top sandwich.\nReturn the number of students that are unable to eat.",
    
    "You are given an array `security`, where `security[i]` is the number of guards on day `i`, and an integer `time`.\nReturn all days where the `time` days before are non-increasing and the `time` days after are non-decreasing.",
    
    "Given a string `s`, find the first non-repeating character in it and return its index. If it does not exist, return `-1`.",
    
    "Implement the `myAtoi(string s)` function, which converts a string to a 32-bit signed integer following rules similar to C/C++'s `atoi`.",
    
    "You are given two string arrays `words1` and `words2`.\nA string `a` in `words1` is universal if for every string `b` in `words2`, `b` is a subset of `a`.\nReturn an array of all the universal strings in `words1`.",
    
    "You are given an array `nums` and an integer `k`.\nFind the longest subsequence of `nums` that is strictly increasing and the difference between adjacent elements is at most `k`.",
    
    "Given an array `nums` consisting of `2n` elements in the form `[x1,x2,...,xn,y1,y2,...,yn]`, return the array in the form `[x1,y1,x2,y2,...,xn,yn]`."
]

# print like before, but this time in a loop
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

for prompt in prompts:
    messages = [
        {
            "role": "user",
            "content": prompt
        }
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(base_model.device)

    _ = base_model.generate(
        input_ids=inputs,
        streamer=text_streamer,
        max_new_tokens=512,
        use_cache=True,
        temperature=0.7,
        min_p=0.1,
    )
    print("\n")
    print("end of example") 
    print("\n")

Here is a Python solution for the given problem:

```python
def maxSubarraySum(nums, k):
    # If the length of the subarray is more than k, return 0
    if len(nums) - k + 1 > k:
        return 0

    # Initialize the maximum sum and the current sum
    max_sum = float('-inf')
    current_sum = 0

    # Iterate over the subarray of length k
    for i in range(len(nums) - k + 1):
        # If the current sum is less than 0, reset it to 0
        if current_sum < 0:
            current_sum = 0
        # Update the current sum
        current_sum += nums[i + k]

        # Update the maximum sum
        max_sum = max(max_sum, current_sum)

    return max_sum
```

This solution works by iterating over the subarray of length k and keeping track of the current sum. If the current sum is less than 0, it resets it to 0. This ensures that the maximum sum is not negative.

Here's an explanation of the solution:

1.  Check if the length of the subarray is more than k. If it is, return 0 because t

# Analysis of Manually Generated LeetCode Solution Code from Base Model vs Finetuned Model
- Results from the finetuned model are in "4. Testing-finetuned.ipynb".
- Despite some initial promise, the finetuned model's performance is disappointing.
- It occasionally shows improved formatting: just code with no unnecessary text, matching the way it was finetuned. It also mimics Python syntax more cleanly than the base model, including type hints. However, its actual logic is obviously broken most of the time or completely incoherent.
- In many cases, it fails to solve even basic problems correctly, restates the prompt instead of generating code, or outputs nonsense wrapped in proper function signatures.
- The base model is actually better and gets a lot right. The LeetCode finetuning actually made performance worse.
- Finetuning made the model sound smarter without actually making it smarter, and the handful of successful generations are outweighed by the completely incoherent outputs.
- While calculating BLEU scores, the finetuned model's predictions were actually a lot better. I attribute that to the prompts: they were in exactly the format the model was finetuned on, and I did not notice any issues with the prompt leaking into the output or with inputs being repeated.