# Lab 2: Compare Three Models

This notebook compares:
1. **Base Model** (no fine-tuning)
2. **Baseline Fine-Tuned** (r=16, lr=2e-4)
3. **Optimized Fine-Tuned** (best from grid search)

All three are evaluated on the first 500 examples of the held-out test set (10,000 examples in total).

In [None]:
%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes

In [None]:
from google.colab import drive
import os

drive.mount('/content/drive')
MODEL_SAVE_DIR = "/content/drive/MyDrive/lab2_models"
print(f"Models: {MODEL_SAVE_DIR}")

Mounted at /content/drive
Models: /content/drive/MyDrive/lab2_models


In [None]:
max_seq_length = 2048
dtype = None
load_in_4bit = True

In [None]:
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt, get_chat_template

dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)

# Same split as training
train_test_split = dataset.train_test_split(test_size=0.2, seed=42)
temp = train_test_split['test']
val_test_split = temp.train_test_split(test_size=0.5, seed=42)
test_dataset = val_test_split['test']

print(f"Test set: {len(test_dataset)} examples")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Test set: 10000 examples
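The two-stage split above holds out 20% of the data and then halves that holdout, which gives an 80/10/10 partition. A minimal sketch of the arithmetic in plain Python follows; `split_sizes` is a hypothetical helper, not part of `datasets`, and it assumes the other half of the holdout (`val_test_split['train']`) serves as the validation set. The real `train_test_split` may round differently when sizes do not divide evenly.

```python
def split_sizes(n_total, holdout_frac=0.2, test_frac_of_holdout=0.5):
    """Return (train, validation, test) sizes for a two-stage split."""
    n_holdout = round(n_total * holdout_frac)          # first split: 20% held out
    n_test = round(n_holdout * test_frac_of_holdout)   # second split: half of the holdout
    return n_total - n_holdout, n_holdout - n_test, n_test

# FineTome-100k has 100,000 rows according to the log above
print(split_sizes(100_000))  # (80000, 10000, 10000)
```

The test size agrees with the `Test set: 10000 examples` line printed by the cell.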


In [None]:
from unsloth import FastLanguageModel

# Get tokenizer
_, tokenizer = FastLanguageModel.from_pretrained(
    model_name=os.path.join(MODEL_SAVE_DIR, "baseline_lora_model"),
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    return {"text": [tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=False) for c in convos]}

test_dataset = test_dataset.map(formatting_prompts_func, batched=True)
print("✓ Dataset formatted")

==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.11.4 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


✓ Dataset formatted


In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq
from tqdm import tqdm

def evaluate_model(model, tokenizer, dataset, num_samples=500, batch_size=4):
    """Return the average loss and perplexity on the first `num_samples` examples of `dataset`."""
    model.eval()
    eval_dataset = dataset.select(range(min(num_samples, len(dataset))))

    def tokenize(examples):
        # Truncate to max_seq_length; padding is deferred to the collator
        return tokenizer(examples["text"], truncation=True, max_length=max_seq_length,
                        padding=False, return_tensors=None)

    eval_dataset = eval_dataset.map(tokenize, batched=True, remove_columns=eval_dataset.column_names)
    # Drop examples that tokenized to an empty sequence
    eval_dataset = eval_dataset.filter(lambda x: x["input_ids"] and len(x["input_ids"]) > 0)

    # The collator pads each batch to its longest sequence
    collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True, return_tensors="pt")
    loader = DataLoader(eval_dataset, batch_size=batch_size, collate_fn=collator)

    total_loss, total_samples = 0, 0

    with torch.no_grad():
        for batch in tqdm(loader, desc="Evaluating"):
            batch = {k: v.to(model.device) for k, v in batch.items() if v is not None}
            if "input_ids" not in batch or batch["input_ids"].size(0) == 0:
                continue

            # Labels are a plain copy of input_ids, so padded positions also count toward the loss
            batch["labels"] = batch["input_ids"].clone()
            loss = model(**batch).loss

            # Weight each batch's mean loss by its number of sequences (not tokens)
            total_loss += loss.item() * batch["input_ids"].size(0)
            total_samples += batch["input_ids"].size(0)
            del batch, loss

    if total_samples == 0:
        return {'loss': float('inf'), 'perplexity': float('inf')}

    # Perplexity is the exponential of the average cross-entropy loss
    avg_loss = total_loss / total_samples
    return {'loss': avg_loss, 'perplexity': torch.exp(torch.tensor(avg_loss)).item()}

print("✓ Evaluation function defined")

✓ Evaluation function defined
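`evaluate_model` copies `input_ids` into `labels` unchanged, so positions that the collator filled with padding are scored like real tokens. A common alternative is to set those label positions to `-100`, the index that PyTorch's cross-entropy loss ignores by default, and to average over real tokens only. The sketch below shows the idea in plain Python with made-up per-token losses; `mask_padding` and `token_weighted_loss` are hypothetical helpers, and the one-position shift between inputs and labels in a causal language model is left out for brevity.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

def token_weighted_loss(per_token_losses, labels):
    """Average the loss over positions whose label is not ignored."""
    kept = [loss for loss, label in zip(per_token_losses, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Hypothetical right-padded sequence: 3 real tokens followed by 2 pad tokens (id 0)
input_ids = [11, 12, 13, 0, 0]
attention_mask = [1, 1, 1, 0, 0]
per_token_losses = [2.0, 1.0, 3.0, 9.0, 9.0]  # made-up values; pad positions score badly

labels = mask_padding(input_ids, attention_mask)
masked = token_weighted_loss(per_token_losses, labels)
unmasked = sum(per_token_losses) / len(per_token_losses)
print(labels)            # [11, 12, 13, -100, -100]
print(masked, unmasked)  # 2.0 4.8
```

In this toy case the unmasked average is more than twice the masked one, which illustrates how padding can inflate a reported loss when batches mix short and long sequences.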


## 1. Evaluate Base Model (No Fine-Tuning)

In [None]:
print("="*80)
print("LOADING BASE MODEL")
print("="*80)

base_model, _ = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

base_metrics = evaluate_model(base_model, tokenizer, test_dataset, num_samples=500)

print(f"\nBase Model (No Fine-Tuning):")
print(f"  Loss:       {base_metrics['loss']:.4f}")
print(f"  Perplexity: {base_metrics['perplexity']:.2f}")
print("="*80)

import gc

# Free GPU memory before loading the next model
del base_model
gc.collect()
torch.cuda.empty_cache()
print("\n✓ Cleaned up\n")

LOADING BASE MODEL
Evaluating: 100%|██████████| 125/125 [03:00<00:00,  1.45s/it]



Base Model (No Fine-Tuning):
  Loss:       6.9441
  Perplexity: 1037.00

✓ Cleaned up
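The two reported numbers are not independent: `evaluate_model` derives perplexity as the exponential of the average loss. A small check in plain Python, using the rounded loss printed above (the helper name `perplexity` is only for illustration):

```python
import math

def perplexity(avg_loss):
    """Perplexity is exp(average cross-entropy loss), with the loss measured in nats."""
    return math.exp(avg_loss)

base_loss = 6.9441  # loss reported above, rounded to four decimals
print(f"{perplexity(base_loss):.2f}")
```

The result agrees with the reported perplexity of 1037.00 up to the rounding of the loss. Because the relationship is exponential, small differences in loss translate into large differences in perplexity.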



## 2. Evaluate Baseline Fine-Tuned Model

In [None]:
print("="*80)
print("LOADING BASELINE FINE-TUNED MODEL")
print("="*80)

baseline_model, _ = FastLanguageModel.from_pretrained(
    model_name=os.path.join(MODEL_SAVE_DIR, "baseline_lora_model"),
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

baseline_metrics = evaluate_model(baseline_model, tokenizer, test_dataset, num_samples=500)

print(f"\nBaseline Fine-Tuned (r=16, lr=2e-4):")
print(f"  Loss:       {baseline_metrics['loss']:.4f}")
print(f"  Perplexity: {baseline_metrics['perplexity']:.2f}")
print("="*80)

del baseline_model
torch.cuda.empty_cache()
gc.collect()
print("\n✓ Cleaned up\n")

LOADING BASELINE FINE-TUNED MODEL
Evaluating: 100%|██████████| 125/125 [02:54<00:00,  1.39s/it]



Baseline Fine-Tuned (r=16, lr=2e-4):
  Loss:       6.7692
  Perplexity: 870.63

✓ Cleaned up



## 3. Evaluate Optimized Fine-Tuned Model

In [None]:
print("="*80)
print("LOADING OPTIMIZED FINE-TUNED MODEL")
print("="*80)

optimized_model, _ = FastLanguageModel.from_pretrained(
    model_name=os.path.join(MODEL_SAVE_DIR, "optimized_lora_model"),
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

optimized_metrics = evaluate_model(optimized_model, tokenizer, test_dataset, num_samples=500)

print(f"\nOptimized Fine-Tuned (grid search):")
print(f"  Loss:       {optimized_metrics['loss']:.4f}")
print(f"  Perplexity: {optimized_metrics['perplexity']:.2f}")
print("="*80 + "\n")

LOADING OPTIMIZED FINE-TUNED MODEL
Evaluating: 100%|██████████| 125/125 [02:53<00:00,  1.39s/it]


Optimized Fine-Tuned (grid search):
  Loss:       6.8003
  Perplexity: 898.12






## Final Comparison

In [None]:
print("\n" + "="*80)
print("FINAL COMPARISON")
print("="*80)

print("\n1. Base Model (No Fine-Tuning):")
print(f"   Loss:       {base_metrics['loss']:.4f}")
print(f"   Perplexity: {base_metrics['perplexity']:.2f}")

print("\n2. Baseline Fine-Tuned (r=16, lr=2e-4):")
print(f"   Loss:       {baseline_metrics['loss']:.4f}")
print(f"   Perplexity: {baseline_metrics['perplexity']:.2f}")

print("\n3. Optimized Fine-Tuned (grid search):")
print(f"   Loss:       {optimized_metrics['loss']:.4f}")
print(f"   Perplexity: {optimized_metrics['perplexity']:.2f}")

# Improvements
base_to_baseline = (base_metrics['perplexity'] - baseline_metrics['perplexity']) / base_metrics['perplexity'] * 100
base_to_optimized = (base_metrics['perplexity'] - optimized_metrics['perplexity']) / base_metrics['perplexity'] * 100
baseline_to_optimized = (baseline_metrics['perplexity'] - optimized_metrics['perplexity']) / baseline_metrics['perplexity'] * 100

print("\n" + "-"*80)
print("IMPROVEMENTS:")
print("-"*80)
print(f"Base → Baseline FT:      {base_to_baseline:+7.2f}%")
print(f"Base → Optimized FT:     {base_to_optimized:+7.2f}%")
print(f"Baseline → Optimized FT: {baseline_to_optimized:+7.2f}%")

print("\n" + "-"*80)
print("INTERPRETATION:")
print("-"*80)

if base_to_baseline > 5:
    print("✓ Fine-tuning significantly improved over base model")
elif base_to_baseline > 0:
    print("→ Fine-tuning slightly improved over base model")
else:
    print("⚠️ Fine-tuning did not improve over base model")

if baseline_to_optimized > 2:
    print("✓ Hyperparameter optimization helped significantly")
elif baseline_to_optimized > 0:
    print("→ Hyperparameter optimization helped slightly")
else:
    print("⚠️ Hyperparameter optimization did not help")
    print("   (With 60 steps on 1k samples, differences may be noise)")

print("\n" + "="*80)


FINAL COMPARISON

1. Base Model (No Fine-Tuning):
   Loss:       6.9441
   Perplexity: 1037.00

2. Baseline Fine-Tuned (r=16, lr=2e-4):
   Loss:       6.7692
   Perplexity: 870.63

3. Optimized Fine-Tuned (grid search):
   Loss:       6.8003
   Perplexity: 898.12

--------------------------------------------------------------------------------
IMPROVEMENTS:
--------------------------------------------------------------------------------
Base → Baseline FT:       +16.04%
Base → Optimized FT:      +13.39%
Baseline → Optimized FT:   -3.16%

--------------------------------------------------------------------------------
INTERPRETATION:
--------------------------------------------------------------------------------
✓ Fine-tuning significantly improved over base model
⚠️ Hyperparameter optimization did not help
   (With 60 steps on 1k samples, differences may be noise)
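The percentages above are relative reductions in perplexity, so a positive value means the second model is better. They can be reproduced from the rounded perplexities in the output. The sketch below uses a hypothetical helper, `relative_reduction`, which mirrors the formula in the comparison cell.

```python
def relative_reduction(reference, candidate):
    """Percent reduction in perplexity from reference to candidate (positive = better)."""
    return (reference - candidate) / reference * 100

# Perplexities as printed in the comparison output (rounded to two decimals)
ppl = {"base": 1037.00, "baseline": 870.63, "optimized": 898.12}

base_to_baseline = relative_reduction(ppl["base"], ppl["baseline"])
base_to_optimized = relative_reduction(ppl["base"], ppl["optimized"])
baseline_to_optimized = relative_reduction(ppl["baseline"], ppl["optimized"])

for label, value in [("Base → Baseline FT", base_to_baseline),
                     ("Base → Optimized FT", base_to_optimized),
                     ("Baseline → Optimized FT", baseline_to_optimized)]:
    print(f"{label}: {value:+.2f}%")
```

Note that the interpretation thresholds (5% and 2%) are fixed cutoffs on this percentage, not a statistical significance test.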



## Summary

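This notebook compared three models on the first 500 examples of the held-out test set:
- Base model (no fine-tuning): loss 6.9441, perplexity 1037.00
- Baseline fine-tuned (r=16, lr=2e-4): loss 6.7692, perplexity 870.63, a 16.04% reduction in perplexity relative to the base model
- Optimized fine-tuned (grid search): loss 6.8003, perplexity 898.12, which is 13.39% lower than the base model but 3.16% higher than the baseline fine-tune

Both fine-tuned models improved on the base model, but the grid-search configuration did not beat the baseline hyperparameters on this sample. With 60 training steps on 1k samples, the gap between the two fine-tuned models may be noise.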
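The interpretation cell warns that the gap between the two fine-tuned models may be noise. One way to probe this is a paired bootstrap over per-example losses. The sketch below is an illustration under stated assumptions: `evaluate_model` returns only aggregate metrics, so it would need to be extended to collect one loss per example, and the losses used here are synthetic, not results from this notebook. `paired_bootstrap_ci` is a hypothetical helper.

```python
import random
import statistics

def paired_bootstrap_ci(losses_a, losses_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(losses_b - losses_a) over paired examples."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(losses_a, losses_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample examples with replacement and record the mean difference
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi

# Synthetic per-example losses, for illustration only
data_rng = random.Random(42)
baseline_losses = [data_rng.gauss(6.77, 1.0) for _ in range(500)]
optimized_losses = [x + data_rng.gauss(0.03, 0.5) for x in baseline_losses]

mean_diff, lo, hi = paired_bootstrap_ci(baseline_losses, optimized_losses)
print(f"mean diff = {mean_diff:+.4f}, 95% CI = [{lo:+.4f}, {hi:+.4f}]")
```

If the interval contains zero, the data do not distinguish the two models at that confidence level. Pairing matters here because both models are scored on the same examples, which removes most of the per-example variance.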