# Lab 2: Fine-Tuning Llama 3.2 1B with LoRA and Hyperparameter Grid Search

This notebook trains two models:
1. **Baseline fine-tuned model** with default hyperparameters (r=16, lr=2e-4, alpha=16)
2. **Optimized fine-tuned model** with hyperparameters selected via grid search

**Note:** We use grid search instead of the successive halving algorithm (SHA) due to GPU memory constraints on the free T4 Colab tier.

**Evaluation is done in a separate notebook.**

**To run:** Use a **free** Tesla T4 Google Colab instance (Runtime → Change runtime type → GPU)

---
## 1. Setup

In [None]:
%%capture
# Install required packages
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes

In [None]:
from google.colab import drive
import os

# Mount Google Drive for persistent storage
drive.mount('/content/drive')

# Create directories - using _v2 suffix to not overwrite previous runs
CHECKPOINT_DIR = "/content/drive/MyDrive/lab2_checkpoints_v2"
MODEL_SAVE_DIR = "/content/drive/MyDrive/lab2_models_v2"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
os.makedirs(MODEL_SAVE_DIR, exist_ok=True)

print(f"Checkpoint directory: {CHECKPOINT_DIR}")
print(f"Model save directory: {MODEL_SAVE_DIR}")

Mounted at /content/drive
Checkpoint directory: /content/drive/MyDrive/lab2_checkpoints_v2
Model save directory: /content/drive/MyDrive/lab2_models_v2


---
## 2. Load Base Model and Tokenizer

In [None]:
from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = True

model_name = "unsloth/Llama-3.2-1B"

print(f"Model: {model_name}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Model: unsloth/Llama-3.2-1B


---
## 3. Load and Split Dataset

In [None]:
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt, get_chat_template

# Load FineTome dataset
print("Loading FineTome-100k dataset...")
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)

# Split: 80% train, 10% validation, 10% test
train_test_split = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split['train']
temp_dataset = train_test_split['test']

val_test_split = temp_dataset.train_test_split(test_size=0.5, seed=42)
val_dataset = val_test_split['train']
test_dataset = val_test_split['test']

print(f"\nDataset Split:")
print(f"  Training:   {len(train_dataset):6d}")
print(f"  Validation: {len(val_dataset):6d}")
print(f"  Test:       {len(test_dataset):6d}")

Loading FineTome-100k dataset...




Dataset Split:
  Training:    80000
  Validation:  10000
  Test:        10000
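
As a quick sanity check on the sizes printed above, the two-stage split can be reproduced with plain arithmetic. This is a minimal sketch: `split_sizes` is a hypothetical helper, not part of `datasets`, and it uses simple rounding, which may differ by one example from the library's behavior when the sizes do not divide evenly.

```python
def split_sizes(n_total, holdout_frac=0.2, test_frac_of_holdout=0.5):
    """Return (train, val, test) sizes for a two-stage split.

    Stage 1 holds out `holdout_frac` of the data.
    Stage 2 splits that holdout into validation and test.
    """
    n_holdout = round(n_total * holdout_frac)
    n_train = n_total - n_holdout
    n_test = round(n_holdout * test_frac_of_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


# FineTome-100k has 100,000 rows according to the loading log above.
train_n, val_n, test_n = split_sizes(100_000)
print(f"train={train_n}, val={val_n}, test={test_n}")
```

With `test_size=0.2` followed by `test_size=0.5`, the held-out 20% is halved, which gives the 80/10/10 split.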


In [None]:
# Load tokenizer and apply chat template
_, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def formatting_prompts_func(examples):
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}

print("Formatting datasets...")
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
val_dataset = val_dataset.map(formatting_prompts_func, batched=True)
test_dataset = test_dataset.map(formatting_prompts_func, batched=True)
print("✓ Done")

==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Formatting datasets...


✓ Done


---
## 4. Hyperparameter Grid Search

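We test eight hand-picked combinations of LoRA rank (`r`), learning rate (`lr`), and `lora_alpha`, and select the one with the lowest `eval_loss` reported by the trainer. Because the response-only masking is applied to the evaluation set as well, this loss is computed the same way as the training objective.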
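
To make "computed consistently with the training objective" concrete, here is a simplified sketch of response-only masking. It is an illustration under stated assumptions, not Unsloth's implementation: the real `train_on_responses_only` locates the instruction and response marker strings in the tokenized text, whereas this hypothetical `mask_non_assistant` helper receives segments that are already labeled by role. The value `-100` is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_assistant(segments):
    """Build (input_ids, labels) from a list of (role, token_ids) segments.

    Tokens from non-assistant segments keep their place in input_ids
    but get IGNORE_INDEX as label, so they contribute nothing to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy conversation with made-up token ids.
conversation = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user", [14]),
    ("assistant", [23, 24, 25]),
]
ids, labels = mask_non_assistant(conversation)
print(ids)
print(labels)
```

When the same masking is used for both the training and evaluation sets, `eval_loss` measures only how well the model predicts assistant tokens, which is the quantity being optimized.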

In [None]:
# Hyperparameter configurations to test
hyperparameter_configs = [
    {"r": 8,  "lr": 2e-4, "alpha": 16, "name": "config_1"},
    {"r": 16, "lr": 2e-4, "alpha": 16, "name": "config_2"},
    {"r": 32, "lr": 2e-4, "alpha": 32, "name": "config_3"},
    {"r": 16, "lr": 1e-4, "alpha": 16, "name": "config_4"},
    {"r": 16, "lr": 5e-5, "alpha": 16, "name": "config_5"},
    {"r": 32, "lr": 1e-4, "alpha": 32, "name": "config_6"},
    {"r": 32, "lr": 2e-4, "alpha": 64, "name": "config_7"},
    {"r": 32, "lr": 1e-4, "alpha": 64, "name": "config_8"},
]

print(f"Configurations to test: {len(hyperparameter_configs)}")
for i, c in enumerate(hyperparameter_configs, 1):
    print(f"  {i}. {c['name']}: r={c['r']}, lr={c['lr']:.0e}, alpha={c['alpha']}")

Configurations to test: 8
  1. config_1: r=8, lr=2e-04, alpha=16
  2. config_2: r=16, lr=2e-04, alpha=16
  3. config_3: r=32, lr=2e-04, alpha=32
  4. config_4: r=16, lr=1e-04, alpha=16
  5. config_5: r=16, lr=5e-05, alpha=16
  6. config_6: r=32, lr=1e-04, alpha=32
  7. config_7: r=32, lr=2e-04, alpha=64
  8. config_8: r=32, lr=1e-04, alpha=64
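
One way to read these configurations is through the LoRA scaling factor. In the standard LoRA formulation the low-rank update is multiplied by `alpha / r`, so two configurations with the same ratio apply the adapter at the same relative strength. The sketch below tabulates that ratio for the eight configurations. It assumes the default, non-rank-stabilized scaling, and `lora_scaling` is a hypothetical helper, not a library function.

```python
configs = [
    {"name": "config_1", "r": 8,  "alpha": 16},
    {"name": "config_2", "r": 16, "alpha": 16},
    {"name": "config_3", "r": 32, "alpha": 32},
    {"name": "config_4", "r": 16, "alpha": 16},
    {"name": "config_5", "r": 16, "alpha": 16},
    {"name": "config_6", "r": 32, "alpha": 32},
    {"name": "config_7", "r": 32, "alpha": 64},
    {"name": "config_8", "r": 32, "alpha": 64},
]


def lora_scaling(alpha, r):
    """Multiplier applied to the low-rank update in standard LoRA."""
    return alpha / r


scalings = {c["name"]: lora_scaling(c["alpha"], c["r"]) for c in configs}
for name, s in scalings.items():
    print(f"{name}: alpha/r = {s:.1f}")
```

Only `config_1`, `config_7`, and `config_8` use a ratio of 2; the rest use 1. That makes the search partly a comparison of adapter strength rather than of rank alone.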


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only
import gc

def train_configuration(config, train_subset, val_subset, steps):
    """
    Train a configuration and return the final eval_loss from the trainer.

    This ensures eval_loss is computed the same way as training loss
    (only on assistant responses via train_on_responses_only).
    """
    print(f"  Training {config['name']}: r={config['r']}, lr={config['lr']:.0e}, alpha={config['alpha']}")

    # Load fresh model
    model, _ = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )

    model = FastLanguageModel.get_peft_model(
        model,
        r=config["r"],
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
        lora_alpha=config["alpha"],
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
    )

    output_dir = os.path.join(CHECKPOINT_DIR, f"grid_search_{config['name']}")

    training_args = TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=steps,
        learning_rate=config["lr"],
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir=output_dir,
        # Evaluate at the end
        eval_strategy="steps",
        eval_steps=steps,
        per_device_eval_batch_size=4,
        save_strategy="no",  # Don't save checkpoints for grid search
        report_to="none",
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_subset,
        eval_dataset=val_subset,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
        dataset_num_proc=2,
        packing=False,
        args=training_args,
    )

    # Train only on assistant responses - applies to both train and eval
    trainer = train_on_responses_only(
        trainer,
        instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
        response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
    )

    trainer.train()

    # Get eval_loss from trainer's log history
    eval_loss = None
    for log_entry in reversed(trainer.state.log_history):
        if 'eval_loss' in log_entry:
            eval_loss = log_entry['eval_loss']
            break

    if eval_loss is None:
        eval_results = trainer.evaluate()
        eval_loss = eval_results['eval_loss']

    # Cleanup
    del model, trainer
    torch.cuda.empty_cache()
    gc.collect()

    return eval_loss

print("✓ Training function defined")

✓ Training function defined
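
The training arguments above combine `warmup_steps=5` with `lr_scheduler_type="linear"`. The sketch below shows the shape of such a schedule in plain Python. It is an assumption about the formula (linear ramp from zero during warmup, then linear decay to zero at the last step), written as a hypothetical `linear_warmup_lr` helper rather than a call into `transformers`.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Grid search settings from this notebook: 60 steps, 5 warmup steps.
for step in (0, 5, 30, 59):
    lr = linear_warmup_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60)
    print(f"step {step:2d}: lr = {lr:.2e}")

# Average multiplier over the whole run, as a fraction of the base rate.
mean_factor = sum(linear_warmup_lr(s, 1.0, 5, 60) for s in range(60)) / 60
print(f"mean lr / base lr = {mean_factor:.2f}")
```

Under this schedule the average learning rate over a 60-step run is about half the nominal value. This may partly explain why the lower-`lr` configurations make less progress within such a short budget.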


In [None]:
import time

# Grid search settings
N_TRAIN_SAMPLES = 1000
N_VAL_SAMPLES = 200
N_STEPS = 60

train_subset = train_dataset.select(range(N_TRAIN_SAMPLES))
val_subset = val_dataset.select(range(N_VAL_SAMPLES))

print("="*70)
print("HYPERPARAMETER GRID SEARCH")
print("="*70)
print(f"Training samples: {N_TRAIN_SAMPLES}, Val samples: {N_VAL_SAMPLES}, Steps: {N_STEPS}")
print("="*70)

grid_search_results = []

for idx, config in enumerate(hyperparameter_configs, 1):
    print(f"\n[{idx}/{len(hyperparameter_configs)}]")

    start_time = time.time()

    try:
        eval_loss = train_configuration(config, train_subset, val_subset, N_STEPS)
        elapsed = time.time() - start_time

        grid_search_results.append({
            'config': config,
            'eval_loss': eval_loss,
            'time': elapsed,
        })
        print(f"  ✓ eval_loss: {eval_loss:.4f} ({elapsed/60:.1f} min)")

    except Exception as e:
        print(f"  ✗ Failed: {e}")
        torch.cuda.empty_cache()
        gc.collect()

# Sort by eval_loss
grid_search_results.sort(key=lambda x: x['eval_loss'])

print(f"\n{'='*70}")
print("RESULTS (sorted by eval_loss)")
print(f"{'='*70}")
for i, r in enumerate(grid_search_results, 1):
    c = r['config']
    print(f"{i}. {c['name']}: eval_loss={r['eval_loss']:.4f} (r={c['r']}, lr={c['lr']:.0e}, alpha={c['alpha']})")

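# Guard against the case where every configuration failed
if not grid_search_results:
    raise RuntimeError("All grid search configurations failed; see the errors above.")

best_config = grid_search_results[0]['config']
print(f"\n🏆 Best: {best_config['name']} (r={best_config['r']}, lr={best_config['lr']:.0e}, alpha={best_config['alpha']})")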

HYPERPARAMETER GRID SEARCH
Training samples: 1000, Val samples: 200, Steps: 60

[1/8]
  Training config_1: r=8, lr=2e-04, alpha=16

Unsloth 2025.11.6 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 5,636,096 of 1,241,450,496 (0.45% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss
60,1.0671,1.071579


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


  ✓ eval_loss: 1.0716 (4.2 min)

[2/8]
  Training config_2: r=16, lr=2e-04, alpha=16
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss,Validation Loss
60,1.067,1.071389


  ✓ eval_loss: 1.0714 (3.9 min)

[3/8]
  Training config_3: r=32, lr=2e-04, alpha=32
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Step,Training Loss,Validation Loss
60,1.057,1.063497


  ✓ eval_loss: 1.0635 (3.7 min)

[4/8]
  Training config_4: r=16, lr=1e-04, alpha=16
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss,Validation Loss
60,1.09,1.090458


  ✓ eval_loss: 1.0905 (3.7 min)

[5/8]
  Training config_5: r=16, lr=5e-05, alpha=16
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss,Validation Loss
60,1.1218,1.120596


  ✓ eval_loss: 1.1206 (3.7 min)

[6/8]
  Training config_6: r=32, lr=1e-04, alpha=32
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Step,Training Loss,Validation Loss
60,1.0752,1.077945


  ✓ eval_loss: 1.0779 (3.7 min)

[7/8]
  Training config_7: r=32, lr=2e-04, alpha=64
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Step,Training Loss,Validation Loss
60,1.0497,1.057464


  ✓ eval_loss: 1.0575 (3.7 min)

[8/8]
  Training config_8: r=32, lr=1e-04, alpha=64
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Step,Training Loss,Validation Loss
60,1.0626,1.067951


  ✓ eval_loss: 1.0680 (3.7 min)

RESULTS (sorted by eval_loss)
1. config_7: eval_loss=1.0575 (r=32, lr=2e-04, alpha=64)
2. config_3: eval_loss=1.0635 (r=32, lr=2e-04, alpha=32)
3. config_8: eval_loss=1.0680 (r=32, lr=1e-04, alpha=64)
4. config_2: eval_loss=1.0714 (r=16, lr=2e-04, alpha=16)
5. config_1: eval_loss=1.0716 (r=8, lr=2e-04, alpha=16)
6. config_6: eval_loss=1.0779 (r=32, lr=1e-04, alpha=32)
7. config_4: eval_loss=1.0905 (r=16, lr=1e-04, alpha=16)
8. config_5: eval_loss=1.1206 (r=16, lr=5e-05, alpha=16)

🏆 Best: config_7 (r=32, lr=2e-04, alpha=64)
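
The "Trainable parameters" counts in the logs above can be reproduced by hand. A LoRA adapter on a linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` parameters. The sketch below applies that to the seven target modules. The layer dimensions are assumptions taken from the public Llama 3.2 1B configuration, not from this notebook, and `lora_trainable_params` is a hypothetical helper.

```python
# Assumed Llama 3.2 1B dimensions (from the public model config, not this notebook)
HIDDEN = 2048         # hidden size
INTERMEDIATE = 8192   # MLP intermediate size
KV_DIM = 512          # 8 key/value heads x head_dim 64 (grouped-query attention)
N_LAYERS = 16         # matches "patched 16 layers" in the log

# (d_in, d_out) for each LoRA target module in one transformer layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r):
    """Total LoRA parameters: each module adds A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in MODULE_SHAPES.values())
    return per_layer * N_LAYERS


for r in (8, 16, 32):
    print(f"r={r:2d}: {lora_trainable_params(r):,} trainable parameters")
```

Under these assumptions the counts scale linearly with `r` and match the logged values for `r=8`, `r=16`, and `r=32`.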


---
## 5. Train Final Models

Train both the optimized model (best hyperparameters from the grid search) and the baseline model (default hyperparameters) for 1,000 steps each, drawing batches from the full training split.

In [None]:
def train_final_model(config, output_name, train_data, max_steps=1000):
    """
    Train a final model on full training data with checkpointing.
    """
    print(f"\nTraining {output_name}...")
    print(f"  Config: r={config['r']}, lr={config['lr']:.0e}, alpha={config['alpha']}")
    print(f"  Training on {len(train_data)} examples for {max_steps} steps")

    model, _ = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )

    model = FastLanguageModel.get_peft_model(
        model,
        r=config["r"],
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
        lora_alpha=config["alpha"],
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
    )

    output_dir = os.path.join(CHECKPOINT_DIR, output_name)

    training_args = TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        max_steps=max_steps,
        learning_rate=config["lr"],
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=20,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir=output_dir,
        save_strategy="steps",
        save_steps=200,
        save_total_limit=3,
        report_to="none",
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_data,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
        dataset_num_proc=2,
        packing=False,
        args=training_args,
    )

    trainer = train_on_responses_only(
        trainer,
        instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
        response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
    )

    # Resume from checkpoint if exists
    checkpoint_path = None
    if os.path.exists(output_dir):
        checkpoints = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
        if checkpoints:
            latest = max(checkpoints, key=lambda x: int(x.split("-")[1]))
            checkpoint_path = os.path.join(output_dir, latest)
            print(f"  Resuming from: {checkpoint_path}")

    trainer.train(resume_from_checkpoint=checkpoint_path)
    print(f"  ✓ Training complete")

    return model

print("✓ Final training function defined")

✓ Final training function defined
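
The resume logic in `train_final_model` picks the checkpoint with the highest step number, not the last name in alphabetical order. The standalone sketch below demonstrates the difference on a temporary directory. `latest_checkpoint` is a hypothetical helper that follows the same idea and additionally skips names that do not end in a number.

```python
import os
import tempfile


def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> folder with the largest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    candidates = []
    for name in os.listdir(output_dir):
        prefix, _, suffix = name.partition("-")
        if prefix == "checkpoint" and suffix.isdigit():
            candidates.append((int(suffix), name))
    if not candidates:
        return None
    _, best_name = max(candidates)
    return os.path.join(output_dir, best_name)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-200", "checkpoint-800", "checkpoint-1000", "runs"):
        os.makedirs(os.path.join(tmp, name))
    found = latest_checkpoint(tmp)
    print(os.path.basename(found))                               # checkpoint-1000
    print(max(n for n in os.listdir(tmp) if n.startswith("checkpoint")))  # checkpoint-800
```

A plain string comparison would select `checkpoint-800` here, so comparing the step as an integer matters once training passes 1,000 steps.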


In [None]:
print("="*70)
print("TRAINING OPTIMIZED MODEL (best hyperparameters from grid search)")
print("="*70)

optimized_model = train_final_model(
    config=best_config,
    output_name="final_optimized",
    train_data=train_dataset,
    max_steps=1000
)

TRAINING OPTIMIZED MODEL (best hyperparameters from grid search)

Training final_optimized...
  Config: r=32, lr=2e-04, alpha=64
  Training on 80000 examples for 1000 steps
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 80,000 | Num Epochs = 1 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Step,Training Loss
20,1.0656
40,1.1269
60,1.0407
80,1.0374
100,1.0094
120,0.9904
140,1.0324
160,1.0282
180,1.0164
200,1.0588


  ✓ Training complete


In [None]:
# Save optimized model
optimized_save_path = os.path.join(MODEL_SAVE_DIR, "optimized_lora")
print(f"Saving optimized model to: {optimized_save_path}")
optimized_model.save_pretrained(optimized_save_path)
tokenizer.save_pretrained(optimized_save_path)
print("✓ Saved")

# Cleanup
del optimized_model
torch.cuda.empty_cache()
gc.collect()

Saving optimized model to: /content/drive/MyDrive/lab2_models_v2/optimized_lora
✓ Saved


25218

In [None]:
print("="*70)
print("TRAINING BASELINE MODEL (default hyperparameters)")
print("="*70)

baseline_config = {"r": 16, "lr": 2e-4, "alpha": 16, "name": "baseline"}

baseline_model = train_final_model(
    config=baseline_config,
    output_name="final_baseline",
    train_data=train_dataset,
    max_steps=1000
)

TRAINING BASELINE MODEL (default hyperparameters)

Training final_baseline...
  Config: r=16, lr=2e-04, alpha=16
  Training on 80000 examples for 1000 steps
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 80,000 | Num Epochs = 1 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss
20,1.095
40,1.1437
60,1.0522
80,1.0454
100,1.0145
120,0.995
140,1.0354
160,1.0312
180,1.0184
200,1.0605


In [None]:
# Save baseline model
baseline_save_path = os.path.join(MODEL_SAVE_DIR, "baseline_lora")
print(f"Saving baseline model to: {baseline_save_path}")
baseline_model.save_pretrained(baseline_save_path)
tokenizer.save_pretrained(baseline_save_path)
print("✓ Saved")

# Cleanup
del baseline_model
torch.cuda.empty_cache()
gc.collect()

---
## 6. Summary

### Models Saved

- `{MODEL_SAVE_DIR}/optimized_lora/` - Model with grid-search-optimized hyperparameters
- `{MODEL_SAVE_DIR}/baseline_lora/` - Model with default hyperparameters (r=16, lr=2e-4, alpha=16)

### Next Steps

Use the evaluation notebook to compare:
1. Base model (no fine-tuning)
2. Baseline fine-tuned model
3. Optimized fine-tuned model

In [None]:
print("="*70)
print("TRAINING COMPLETE")
print("="*70)
print(f"\nBest hyperparameters found:")
print(f"  r={best_config['r']}, lr={best_config['lr']:.0e}, alpha={best_config['alpha']}")
print(f"\nModels saved to:")
print(f"  Optimized: {MODEL_SAVE_DIR}/optimized_lora/")
print(f"  Baseline:  {MODEL_SAVE_DIR}/baseline_lora/")
print(f"\n→ Use the evaluation notebook to compare model performance.")

### Training Budget Justification

We train for **1000 steps** with an effective batch size of 8 (batch_size=2 × gradient_accumulation=4), which means the model sees ~8,000 examples. With a training set of ~80,000 examples, this corresponds to roughly **10% of one epoch**.

A full epoch would require ~10,000 steps, which is impractical given:
- **Colab time limits**: Free T4 instances have session timeouts and usage limits
- **Training time**: 1000 steps already takes significant time; 10x more would risk disconnections
- **Diminishing returns**: For comparing hyperparameter configurations, partial training is often sufficient to observe meaningful differences

This is an acceptable tradeoff for a lab assignment focused on demonstrating the hyperparameter optimization methodology. Longer training would likely improve absolute performance but is not necessary to validate that the grid search selects better hyperparameters than defaults.
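
The budget figures above follow from simple arithmetic, sketched below. `training_budget` is a hypothetical helper, and it assumes a single GPU and no sequence packing, as in this notebook.

```python
import math


def training_budget(n_examples, per_device_batch, grad_accum, max_steps, n_devices=1):
    """Summarize how much of the dataset a fixed step budget covers."""
    effective_batch = per_device_batch * grad_accum * n_devices
    examples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "examples_seen": examples_seen,
        "steps_per_epoch": math.ceil(n_examples / effective_batch),
        "epoch_fraction": examples_seen / n_examples,
    }


# Final training runs: 80,000 examples, 1,000 steps
final = training_budget(80_000, per_device_batch=2, grad_accum=4, max_steps=1000)
print(final)

# Grid search runs: 1,000-example subset, 60 steps
grid = training_budget(1_000, per_device_batch=2, grad_accum=4, max_steps=60)
print(grid)
```

By the same calculation, each grid search run covers a little under half of its 1,000-example subset.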