# Finetuning Llama 3.2 1B on English - Spanish Translations
This notebook demonstrates how I fine-tuned the 1B-parameter Llama 3.2 model using the [Unsloth](https://github.com/unslothai/unsloth) library for efficient fine-tuning. The goal is to teach the model to generate accurate Spanish translations of English sentences, using an English sentence followed by a short translation instruction as input.

adapted from: 
- https://www.youtube.com/watch?v=YZW3pkIR-YE&t=505s

other sources:
- https://www.youtube.com/watch?v=bZcKYiwtw1I&t=572s
- https://huggingface.co/docs/trl/en/sft_trainer#format-your-input-prompts
- https://stackoverflow.com/questions/1663807/how-do-i-iterate-through-two-lists-in-parallel
- https://discuss.huggingface.co/t/guide-the-best-way-to-calculate-the-perplexity-of-fixed-length-models/193/2
- https://stackoverflow.com/questions/59209086/calculate-perplexity-in-pytorch
- https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html
- https://en.wikipedia.org/wiki/Perplexity
- https://huggingface.co/docs/transformers/perplexity
- https://huggingface.co/spaces/evaluate-metric/bleu
- https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b/
- https://cloud.google.com/translate/docs/advanced/automl-evaluate


Dataset used: 
- https://huggingface.co/datasets/hadi-myi2/TatoebaEN-ES
- This data is originally from https://tatoeba.org/en/. I downloaded English-Spanish sentence pairs from there and uploaded them to Hugging Face.

All imports and libraries needed for this project:

In [8]:
import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import standardize_sharegpt
from datasets import load_dataset
from transformers import TextStreamer
from tqdm import tqdm



# Load Base Llama 3.2 1B Model
- Use 4-bit quantization. This greatly reduces GPU memory usage.
- The checkpoint loaded below is the Instruct variant, `unsloth/Llama-3.2-1B-Instruct`.
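
As a rough back-of-the-envelope sketch of why 4-bit loading helps, the snippet below estimates the memory needed for the weights alone at different precisions. The parameter count of 1 billion is a nominal figure (an assumption taken from the "1B" in the model name), and the estimate ignores quantization constants, activations, and optimizer state, so real usage will be higher.

```python
# Rough estimate of weight memory at different precisions.
# ASSUMPTION: a nominal 1e9 parameters; real checkpoints differ somewhat,
# and quantization metadata, activations, and optimizer state are ignored.
NOMINAL_PARAMS = 1_000_000_000


def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Return the size of the weights in GiB for a given bit width."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


for label, bits in [("fp32", 32), ("bf16/fp16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gib(NOMINAL_PARAMS, bits):.2f} GiB")
```

Under these assumptions, 4-bit weights take a quarter of the space of 16-bit weights.
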

In [9]:

max_seq_length = 2048 
dtype = None # None for auto detection.
load_in_4bit = True # Use 4bit quantization 

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    GRID A100X-20C. Num GPUs = 1. Max memory: 19.996 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


# Applying Low Rank Adaptation
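- LoRA trains small low-rank adapter matrices attached to the attention and MLP projection layers while the original weights stay frozen. Gradient checkpointing is also enabled to save memory. Together these keep training lightweight enough for limited hardware.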
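
To make "a small number of parameters" concrete, here is a minimal sketch that counts the LoRA parameters for rank 16 across the seven target modules. The layer shapes are assumptions taken from the public Llama 3.2 1B configuration (hidden size 2048, MLP size 8192, 16 layers, 8 key/value heads of size 64); they are not read from the loaded model here.

```python
# Count LoRA parameters: each adapted linear layer of shape (d_in -> d_out)
# gets two matrices, A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out).
# ASSUMPTION: layer shapes below are taken from the public Llama 3.2 1B
# config (hidden 2048, MLP 8192, 16 layers, 8 KV heads of size 64).
R = 16
HIDDEN = 2048
INTERMEDIATE = 8192
KV_DIM = 8 * 64  # grouped-query attention: k/v projections are narrower
NUM_LAYERS = 16

module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(R * (d_in + d_out) for d_in, d_out in module_shapes.values())
total_lora_params = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora_params:,}")
```

Under these assumptions the total comes to 11,272,192, which matches the trainable-parameter count Unsloth reports in the training log further down.
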

In [10]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, 
    bias = "none",    
    use_gradient_checkpointing = "unsloth", 
    random_state = 3407,
    use_rslora = False,  
    loftq_config = None, 
)

# Formatting the Data
- This prepares the dataset for fine-tuning by converting it into a format Llama 3.2 is compatible with.
- Use the Llama 3.1-style chat template for formatting prompts.
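
The formatting function in the next cell is applied with `map(..., batched=True)`, which passes a dictionary of columns (each a list) and expects a dictionary of new columns back. The sketch below illustrates that contract on toy data. `simple_template` is a hypothetical stand-in for `tokenizer.apply_chat_template` and does not reproduce the real Llama 3.1 special tokens.

```python
# Illustration of the batched-map contract: input is a dict of columns,
# output is a dict of new columns with one entry per input row.
# HYPOTHETICAL: simple_template stands in for tokenizer.apply_chat_template
# and does NOT reproduce the real Llama 3.1 special tokens.
def simple_template(user_text: str, assistant_text: str) -> str:
    return f"[user]\n{user_text}\n[assistant]\n{assistant_text}"


def format_batch(batch: dict) -> dict:
    texts = []
    # zip walks the two columns in parallel, pairing each source with its target
    for english, spanish in zip(batch["EN"], batch["ES"]):
        instruction = f"{english.strip()}\n\nTranslate the english content above to spanish."
        texts.append(simple_template(instruction, spanish.strip()))
    return {"text": texts}


# Toy rows with stray whitespace to show the effect of strip()
toy_batch = {
    "EN": ["Verga is a famous writer. ", "Good morning."],
    "ES": ["Verga es un escritor famoso.", " Buenos días."],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```
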

In [11]:
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(example):
    prompts = []
    
    for english, spanish in zip(example["EN"], example["ES"]):
        convo = [
            {
                "role": "user",
                "content": f"{english.strip()}\n\n Translate the english content above to spanish.",
            },
            {
                "role": "assistant",
                "content": spanish.strip(),
            }
        ]
        prompt_text = tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        prompts.append(prompt_text)
    
    return { "text": prompts }



# load the full dataset
full_dataset = load_dataset("hadi-myi2/TatoebaEN-ES", split="train")

# split into train and test (90% train 10% test)
dataset_split = full_dataset.train_test_split(test_size=0.1, seed=42)

train_dataset = dataset_split["train"]
test_dataset = dataset_split["test"]

# preprocess train set
train_dataset = standardize_sharegpt(train_dataset)
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)

# preprocess test set
test_dataset = standardize_sharegpt(test_dataset)
test_dataset = test_dataset.map(formatting_prompts_func, batched=True)

Map: 100%|██████████| 244206/244206 [00:10<00:00, 23004.06 examples/s]
Map: 100%|██████████| 27134/27134 [00:01<00:00, 22635.40 examples/s]


Check how the data turned out.

In [28]:
print(test_dataset[5]['text'])
print("eoe")
train_dataset[5]['text']

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Verga is a famous writer.

 Translate the english content above to spanish.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Verga es un escritor famoso.<|eot_id|>
eoe


'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nToday is the day of my predestined meeting.\n\n Translate the english content above to spanish.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHoy es el día de mi cita predestinada.<|eot_id|>'

# Initialize SFTTrainer

This block contains the training arguments for the fine-tuning run.

This configuration uses:
- A per-device batch size of 2 with 4 gradient accumulation steps (effective batch size of 8)
- Linear learning rate scheduler with a peak learning rate of `2e-4` and 5 warmup steps
- 8-bit AdamW optimizer to reduce memory usage
- Mixed-precision training (FP16 or BF16 depending on hardware support)
- 5000 training steps, which covers about 40,000 of the 244,206 training examples (different from the LeetCode fine-tuning, because this dataset is much larger)

Other options:
- packing=False: disables packing multiple sequences together 
- dataset_text_field='text': the dataset contains pre-formatted input/output examples under the "text" key
- output_dir="outputs": directory for saving checkpoints
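
The arithmetic behind these settings can be sketched as follows. The schedule function mirrors the usual linear-warmup-then-linear-decay formula and is an illustration only; the exact values the trainer uses at each step may differ slightly.

```python
# Effective batch size and dataset coverage for the settings listed above.
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
NUM_GPUS = 1
MAX_STEPS = 5000
WARMUP_STEPS = 5
PEAK_LR = 2e-4
NUM_TRAIN_EXAMPLES = 244_206  # size of the 90% training split

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
examples_seen = effective_batch * MAX_STEPS
epoch_fraction = examples_seen / NUM_TRAIN_EXAMPLES


def linear_lr(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS))


print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen:,} ({epoch_fraction:.1%} of one epoch)")
for step in (0, 5, 2500, 5000):
    print(f"step {step:>4}: lr = {linear_lr(step):.2e}")
```

With these numbers the run sees roughly a sixth of the training split, so it stops well short of a full epoch.
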


In [17]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, 
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        #num_train_epochs = 1, 
        max_steps = 5000,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 100,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", 
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2): 100%|██████████| 244206/244206 [00:22<00:00, 11095.04 examples/s]


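Make sure the prompts look good after tokenization by decoding them. The result should match the previous block where I printed the dataset, apart from an extra leading `<|begin_of_text|>` token: the formatted text already contains one, and the tokenizer most likely prepends another.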

In [18]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nToday is the day of my predestined meeting.\n\n Translate the english content above to spanish.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHoy es el día de mi cita predestinada.<|eot_id|>'

In [19]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = GRID A100X-20C. Max memory = 19.996 GB.
2.914 GB of memory reserved.


# Start Finetuning

In [20]:
trainer_stats = trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 244,206 | Num Epochs = 1 | Total steps = 5,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192/1,000,000,000 (1.13% trained)


Step,Training Loss
100,1.058
200,0.5703
300,0.586
400,0.6003
500,0.5804
600,0.5865
700,0.5803
800,0.5682
900,0.5716
1000,0.5505


Unsloth: Will smartly offload gradients to save VRAM!


# Inference on the Finetuned Model


- After training, we can test the model by feeding it English sentences and inspecting the Spanish text it generates. Below are two examples where the model is asked to translate English text.

- We use the llama-3.1 chat template to structure the prompt. The output is then decoded back into human readable text using the tokenizer.
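
The generation calls below pass `temperature` and `min_p`. As a simplified illustration of what those two settings do to the next-token distribution (not the library's actual implementation), the sketch below applies temperature scaling and then min-p filtering to a list of hypothetical logits.

```python
import math


def min_p_filter(logits, temperature=0.7, min_p=0.1):
    """Return (kept_indices, renormalized_probs) after temperature + min-p."""
    # Temperature rescales logits: <1 sharpens, >1 flattens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Min-p keeps tokens whose probability is at least min_p * top probability.
    threshold = min_p * max(probs)
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    kept_mass = sum(probs[i] for i in kept)
    return kept, [probs[i] / kept_mass for i in kept]


toy_logits = [2.0, 1.0, 0.0, -3.0]  # hypothetical scores for four tokens
for temp in (0.7, 1.5):
    kept, probs = min_p_filter(toy_logits, temperature=temp, min_p=0.1)
    print(f"temperature={temp}: kept tokens {kept}")
```

With these toy logits, the lower temperature leaves two candidate tokens and the higher temperature leaves three, which is why a high temperature such as 1.5 tends to give more varied output.
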


In [21]:

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) 

messages = [
    {"role": "user", "content": "Translate this english text to spanish: 'I am a computer science major at Illinois Wesleyan University'"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, 
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTranslate this english text to spanish: 'I am a computer science major at Illinois Wesleyan University'<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nTal vez el texto siguiente en inglés: 'Soy un estudiante en ingeniería de informática de Illinois Wesleyan University'.<|eot_id|>"]

More testing:

In [22]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": """
    Translate this english text to spanish: Muriel is 20 years old now 
    """},]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 512,
                   use_cache = True, temperature = 0.7, min_p = 0.1)

Muriel es de 20 años ahora.<|eot_id|>


# Evaluating Model Performance: Perplexity

To measure how well our finetuned Llama model learned the English-to-Spanish translation task, we calculate **perplexity**, a common metric for language modeling.

Perplexity reflects how surprised the model is by the correct answer. The lower the score, the better the model's predictive ability.

- Evaluates the model’s ability to predict each next token.
- Lower values = better fluency and alignment with expected output
- Computed over 1000 examples sampled from the held-out test split, using a sliding window (stride) approach

Limitation:
- Perplexity is computed over the entire chat-formatted text (system header, English prompt, and Spanish translation), not over the translation alone. The template tokens are highly predictable, so the score likely looks better than a translation-only perplexity would.
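
The sliding-window bookkeeping can be sketched without a model. The helper below reproduces only the index arithmetic of the evaluation loop (window start, window end, and how many new tokens each window scores). The name `sliding_windows` and the small sizes used are illustrative assumptions.

```python
def sliding_windows(seq_len: int, max_length: int, stride: int):
    """Return (begin, end, trg_len) for each window over a token sequence."""
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # tokens not yet scored by an earlier window
        windows.append((begin, end, trg_len))
        prev_end = end
        if end == seq_len:
            break
    return windows


# A sequence longer than the window: overlapping windows, each scoring
# only the tokens that earlier windows did not cover.
print(sliding_windows(seq_len=10, max_length=4, stride=2))

# A short sentence (HYPOTHETICAL sizes): everything fits in one window.
print(sliding_windows(seq_len=60, max_length=2048, stride=512))
```

Because the translation pairs here are short, each example is expected to fit in a single window, so the stride mainly matters for longer texts.
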
### Formula:

**Perplexity = exp(−1/N * Σ log P(xᵢ))**
Where:
- **N** is the number of tokens
- **P(xᵢ)** is the predicted probability of the *i-th* token, given the tokens before it
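
As a small worked example of the formula, the sketch below computes perplexity from a handful of hypothetical token probabilities. It also shows the inverse reading: perplexity is the reciprocal of the geometric-mean probability assigned to the correct tokens. The value 1.71 used at the end is the score reported later in this notebook.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# HYPOTHETICAL probabilities the model gave to three correct tokens.
toy_probs = [0.5, 0.25, 0.125]
print(f"toy perplexity: {perplexity(toy_probs):.3f}")

# Perplexity is the inverse of the geometric-mean token probability,
# so a perplexity of 1.71 corresponds to a mean probability near 1/1.71.
print(f"mean token probability at ppl 1.71: {1 / 1.71:.3f}")
```
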

In [23]:
# get max input length
max_length = model.config.max_position_embeddings

device = model.device

# https://huggingface.co/docs/datasets/v1.2.0/processing.html used this to reference the shuffle() function
shuffled_dataset = test_dataset.shuffle(seed = 42).select(range(1000)) # the range here is the number of samples we are testing.

# accumulators to keep track of total negative log likelihood and total number of tokens used
nll_sum = 0.0
n_tokens = 0
stride = 512

model.eval()

# loop over each prompt
# reference for explanation of code: https://huggingface.co/docs/transformers/perplexity
for example in tqdm(shuffled_dataset):
    prompt = example['text']
    encodings = tokenizer(prompt, return_tensors="pt")
    input_ids = encodings.input_ids
    seq_len = input_ids.size(1)

    prev_end_loc = 0

    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc
        inputs = input_ids[:, begin_loc:end_loc].to(device)

        targets = inputs.clone()
        targets[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(inputs, labels = targets)
            n_log = outputs.loss

        num_valid_tokens = (targets != -100).sum().item() 
        num_loss_tokens = num_valid_tokens - inputs.size(0)
        nll_sum += n_log * num_loss_tokens
        n_tokens += num_loss_tokens

        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

avg_nll = nll_sum / n_tokens  # average negative log-likelihood per token
ppl = torch.exp(avg_nll)

print(f"perplexity: {ppl}")

100%|██████████| 1000/1000 [00:40<00:00, 24.42it/s]

perplexity: 1.710835576057434





# Analysis of perplexity
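- A perplexity of 1.0 is the lower bound: it would mean the model assigns probability 1 to every correct next token.
- This model achieved a perplexity of approximately **1.7** on the sampled test examples, which suggests it predicts the formatted translation text well. No perplexity was measured for the model before finetuning, so this number alone does not show how much the finetuning helped.
- This aligns with qualitative inspection of the generated translations, which look mostly fluent and accurate. The BLEU evaluation later in this notebook provides a quantitative check of translation quality.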

# Save model locally

In [24]:
trainer.model.save_pretrained("models/llama-enes-4bit")
tokenizer.save_pretrained("models/llama-enes-4bit")


('models/llama-enes-4bit/tokenizer_config.json',
 'models/llama-enes-4bit/special_tokens_map.json',
 'models/llama-enes-4bit/tokenizer.json')

- Install the `evaluate` library to run BLEU

In [25]:
!uv pip install evaluate

Using Python 3.12.3 environment at: /home/exouser/.venv
Audited 1 package in 6ms


## Evaluation of Translation with BLEU

To evaluate the quality of the model's translations, I used the BLEU score (Bilingual Evaluation Understudy). BLEU is a widely used metric in machine translation that measures how similar the model’s output is to one or more reference translations.

### How BLEU Works:
- It compares n-gram overlaps between the model's generated output and the reference text
- A score of 1.0 means a perfect match; 0.0 means no overlap at all
- BLEU penalizes short or incomplete translations with a brevity penalty
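
The two ingredients described above, clipped n-gram precision and the brevity penalty, can be sketched in a few lines. This is a simplified single-reference illustration with whitespace tokenization, not the `evaluate` library's implementation, which tokenizes more carefully and aggregates counts over the whole corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


def brevity_penalty(candidate_len, reference_len):
    """1.0 when the candidate is longer than the reference, else exp(1 - r/c)."""
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


reference = "the cat is on the mat".split()
repetitive = "the the the the the the the".split()
truncated = "the cat is".split()

# Clipping stops a repeated word from earning credit more often than it
# appears in the reference.
print(clipped_precision(repetitive, reference, 1))
# A truncated output has perfect precision but is punished for its length.
print(clipped_precision(truncated, reference, 1))
print(brevity_penalty(len(truncated), len(reference)))
```
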

### How I Used It:
- I passed English sentences through the model to generate Spanish translations
- Each output was compared against the Spanish version from the dataset
- The BLEU score was computed using the Hugging Face evaluate library

In [26]:
import evaluate

# load BLEU from HF's evaluate library
bleu = evaluate.load("bleu")

# initialize empty lists to store predictions, outputs from the model, and references, which are corresponding translations from the dataset
predictions = []
references = []
# loop through 1000 random examples from the dataset
for example in tqdm(shuffled_dataset):
    # get english and spanish pairs
    prompt = example["EN"].strip()
    reference_translation = example["ES"].strip()

    # format inputs so they can be fed in the model
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": f"{prompt}\n\nTranslate the english content above to spanish."}],
        tokenize=True,
        add_generation_prompt= True,
        return_tensors="pt"
    ).to(model.device)

    # Run inference with the model
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=128,
            use_cache=True,
            temperature=0.7,
            min_p=0.1,
            eos_token_id=tokenizer.eos_token_id,
        )
    # decode the output to text
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # fixed some issues relating to how the output was formatted
    if "assistant" in decoded_output:
        decoded_output = decoded_output.split("assistant")[-1].strip()
    # take only the first line, rest of it might be irrelevant. Most translations should be 1 sentence.
    decoded_output = decoded_output.split("\n")[0].strip()

    # store in a list to calculate BLEU later
    predictions.append(decoded_output)
    references.append([reference_translation])

100%|██████████| 1000/1000 [05:29<00:00,  3.04it/s]


In [27]:
# calculate BLEU score.
results = bleu.compute(predictions=predictions, references=references)
print("BLEU Score:", results)

BLEU Score: {'bleu': 0.4211602070536959, 'precisions': [0.6927583198304873, 0.47629218282785135, 0.3554706956666113, 0.26824418373434084], 'brevity_penalty': 1.0, 'length_ratio': 1.0171146044624746, 'translation_length': 8023, 'reference_length': 7888}


# Analysis of BLEU Score

The model achieved a BLEU score of **0.421**, which is a strong result given the simplicity of the finetuning setup. While scores of **0.6+** are typically considered near-perfect for machine translation, a score of 0.42 suggests that the model produces outputs that are largely accurate and fluent.

  
- **Length Ratio:**  
  - 1.02: generated translations are about 2% longer than the reference translations on average

- **Brevity Penalty:**  
  - 1.0: the model is not under-generating; translations are sufficiently complete
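
As a sanity check on how these numbers fit together, the sketch below recombines the values printed by `bleu.compute()` above, assuming the standard BLEU-4 definition with uniform weights: the brevity penalty multiplied by the geometric mean of the four n-gram precisions.

```python
import math

# Values copied from the bleu.compute() output above.
precisions = [0.6927583198304873, 0.47629218282785135,
              0.3554706956666113, 0.26824418373434084]
translation_length = 8023
reference_length = 7888

length_ratio = translation_length / reference_length
# Brevity penalty is 1.0 whenever the output is longer than the reference.
bp = 1.0 if length_ratio > 1.0 else math.exp(1 - 1 / length_ratio)
# BLEU-4 is the brevity penalty times the geometric mean of the precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu_score = bp * geo_mean

print(f"length ratio: {length_ratio:.4f}")
print(f"BLEU: {bleu_score:.4f}")
```

The recomputed score agrees with the reported `bleu` value, and the length ratio above 1 explains why no brevity penalty was applied.
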

### Interpretation:
Together, these metrics indicate that the fine-tuning was successful, especially considering the limited scale of the model.