# Lab 2: Fine-Tuning Llama 3.2 with Hyperparameter Optimization

This notebook implements:
1. Train/Val/Test split of FineTome dataset
2. Successive Halving Algorithm (SHA) for hyperparameter tuning
3. Checkpoint saving and resumption
4. Comprehensive evaluation comparing:
   - Base model (no fine-tuning)
   - Fine-tuned model (no hyperparameter optimization)
   - Fine-tuned model (with optimized hyperparameters)

**To run this:** Press "Runtime" ‚Üí "Run all" on a **free** Tesla T4 Google Colab instance!

**Important:** Make sure to enable GPU (Runtime ‚Üí Change runtime type ‚Üí GPU)

## Setup: Install Dependencies

In [None]:
%%capture
# Install required packages
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes

## Mount Google Drive for Checkpoint Persistence

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Create directories for checkpoints and models
CHECKPOINT_DIR = "/content/drive/MyDrive/lab2_checkpoints"
MODEL_SAVE_DIR = "/content/drive/MyDrive/lab2_models"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
os.makedirs(MODEL_SAVE_DIR, exist_ok=True)

print(f"Checkpoint directory: {CHECKPOINT_DIR}")
print(f"Model save directory: {MODEL_SAVE_DIR}")

Mounted at /content/drive
Checkpoint directory: /content/drive/MyDrive/lab2_checkpoints
Model save directory: /content/drive/MyDrive/lab2_models


## Load Base Model and Tokenizer

We use Llama 3.2 1B for faster experimentation. You can change to 3B for better quality.

In [None]:
from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 2048  # Choose any! Auto support for RoPE Scaling
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage

# Supported models: https://github.com/unslothai/unsloth
model_name = "unsloth/Llama-3.2-1B"  # or "unsloth/Llama-3.2-3B"

print(f"Loading base model: {model_name}")
print("This will be used for baseline evaluation...")

# Load model and tokenizer
base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print("‚úì Base model loaded successfully!")
print(f"Model type: {type(base_model)}")

ü¶• Unsloth: Will patch your computer to enable 2x faster free finetuning.
ü¶• Unsloth Zoo will now patch everything to make training faster!
Loading base model: unsloth/Llama-3.2-1B
This will be used for baseline evaluation...
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.10G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

‚úì Base model loaded successfully!
Model type: <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>


## Load and Split Dataset

**Critical:** We split BEFORE any training to ensure proper evaluation.

Split strategy:
- Train: 80% (for fine-tuning)
- Validation: 10% (for hyperparameter selection)
- Test: 10% (for final evaluation, never seen during training/tuning)

In [None]:
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt

# Load FineTome dataset
print("Loading FineTome-100k dataset...")
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
print(f"Total examples: {len(dataset)}")

# Standardize to HuggingFace format
dataset = standardize_sharegpt(dataset)
print("‚úì Dataset standardized")

# Split dataset: 80% train, 10% validation, 10% test
train_test_split = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split['train']
temp_dataset = train_test_split['test']

# Split the 20% into validation and test (50-50)
val_test_split = temp_dataset.train_test_split(test_size=0.5, seed=42)
val_dataset = val_test_split['train']
test_dataset = val_test_split['test']

print(f"\n{'='*60}")
print("Dataset Split:")
print(f"{'='*60}")
print(f"Training set:   {len(train_dataset):6d} examples ({len(train_dataset)/len(dataset)*100:.1f}%)")
print(f"Validation set: {len(val_dataset):6d} examples ({len(val_dataset)/len(dataset)*100:.1f}%)")
print(f"Test set:       {len(test_dataset):6d} examples ({len(test_dataset)/len(dataset)*100:.1f}%)")
print(f"{'='*60}\n")

# Show example
print("Example conversation:")
print(train_dataset[5]["conversations"])

Loading FineTome-100k dataset...


README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

Total examples: 100000


Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

‚úì Dataset standardized

Dataset Split:
Training set:    80000 examples (80.0%)
Validation set:  10000 examples (10.0%)
Test set:        10000 examples (10.0%)

Example conversation:
[{'content': 'Find the length of the hypotenuse given two sides\nside_length1 = 5, side_length2 = 6', 'role': 'user'}, {'content': 'To find the length of the hypotenuse given the two sides of a right triangle, you can use the Pythagorean theorem. The theorem states that the square of the length of the hypotenuse is equal to the sum of the squares of the lengths of the two other sides.\n \nIn this case, the length of one side is 5, and the length of the other side is 6. Plugging these values into the formula, we get:\n\nhypotenuse2 = 5^2 + 6^2\n                 = 25 + 36\n                 = 61\n\nTo get the length of the hypotenuse, take the square root of 61:\n\nhypotenuse = ‚àö61\n                 = 7.81 (rounded to two decimal places)\n\nTherefore, the length of the hypotenuse is approximately 7.81 unit

## Setup Chat Template and Formatting Function

In [None]:
from unsloth.chat_templates import get_chat_template

# Apply Llama 3.1 chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

def formatting_prompts_func(examples):
    """Format conversations using chat template"""
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

# Apply formatting to all splits
print("Formatting datasets with chat template...")
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
val_dataset = val_dataset.map(formatting_prompts_func, batched=True)
test_dataset = test_dataset.map(formatting_prompts_func, batched=True)
print("‚úì Datasets formatted")

# Show formatted example
print("\nFormatted example:")
print(train_dataset[5]["text"][:500] + "...")

Formatting datasets with chat template...


Map:   0%|          | 0/80000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

‚úì Datasets formatted

Formatted example:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Find the length of the hypotenuse given two sides
side_length1 = 5, side_length2 = 6<|eot_id|><|start_header_id|>assistant<|end_header_id|>

To find the length of the hypotenuse given the two sides of a right triangle, you can use the Pythagorean theorem. The theorem states that the square of the length of the hypotenuse ...


## Evaluation Function

This function calculates loss and perplexity on a dataset.

In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq
from tqdm import tqdm

def evaluate_model(model, tokenizer, dataset, num_samples=500, batch_size=4):
    """
    Evaluate model on a dataset

    Args:
        model: The model to evaluate
        tokenizer: Tokenizer
        dataset: Dataset to evaluate on
        num_samples: Number of samples to evaluate (for speed)
        batch_size: Batch size for evaluation

    Returns:
        dict: {'loss': float, 'perplexity': float}
    """
    model.eval()

    # Select subset for evaluation
    eval_dataset = dataset.select(range(min(num_samples, len(dataset))))

    # Tokenize
    def tokenize_function(examples):
        tokenized = tokenizer(
            examples["text"],
            truncation=True,
            max_length=max_seq_length,
            padding=False,
            return_tensors=None,  # Important: return lists, not tensors
        )
        return tokenized

    eval_dataset = eval_dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=eval_dataset.column_names,
    )

    # Filter out any problematic samples
    eval_dataset = eval_dataset.filter(
        lambda x: x["input_ids"] is not None and len(x["input_ids"]) > 0
    )

    print(f"Evaluating on {len(eval_dataset)} samples...")

    # Create dataloader
    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        padding=True,
        return_tensors="pt"
    )

    dataloader = DataLoader(
        eval_dataset,
        batch_size=batch_size,
        collate_fn=data_collator,
    )

    total_loss = 0
    total_samples = 0

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            # Filter out None values and move to device
            batch = {k: v.to(model.device) for k, v in batch.items() if v is not None}

            # Skip if batch is empty
            if "input_ids" not in batch or batch["input_ids"].size(0) == 0:
                continue

            # Set labels (for loss calculation)
            batch["labels"] = batch["input_ids"].clone()

            # Forward pass
            outputs = model(**batch)
            loss = outputs.loss

            total_loss += loss.item() * batch["input_ids"].size(0)
            total_samples += batch["input_ids"].size(0)

    if total_samples == 0:
        print("Warning: No samples were successfully evaluated!")
        return {'loss': float('inf'), 'perplexity': float('inf')}

    avg_loss = total_loss / total_samples
    perplexity = torch.exp(torch.tensor(avg_loss)).item()

    model.train()

    return {
        'loss': avg_loss,
        'perplexity': perplexity,
    }

print("‚úì Evaluation function defined")

‚úì Evaluation function defined


## Baseline Evaluation: Base Model (No Fine-Tuning)

First, let's evaluate the base model before any fine-tuning.

In [None]:
print("="*60)
print("BASELINE EVALUATION: Base Model (No Fine-Tuning)")
print("="*60)

print("\nEvaluating on validation set...")
base_val_metrics = evaluate_model(base_model, tokenizer, val_dataset, num_samples=500)

print("\nEvaluating on test set...")
base_test_metrics = evaluate_model(base_model, tokenizer, test_dataset, num_samples=500)

print(f"\n{'='*60}")
print("Base Model Results:")
print(f"{'='*60}")
print(f"Validation - Loss: {base_val_metrics['loss']:.4f}, Perplexity: {base_val_metrics['perplexity']:.2f}")
print(f"Test       - Loss: {base_test_metrics['loss']:.4f}, Perplexity: {base_test_metrics['perplexity']:.2f}")
print(f"{'='*60}\n")

# Save baseline metrics
baseline_metrics = {
    'val_loss': base_val_metrics['loss'],
    'val_perplexity': base_val_metrics['perplexity'],
    'test_loss': base_test_metrics['loss'],
    'test_perplexity': base_test_metrics['perplexity'],
}

BASELINE EVALUATION: Base Model (No Fine-Tuning)

Evaluating on validation set...


Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/500 [00:00<?, ? examples/s]

Evaluating on 500 samples...


Evaluating: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 125/125 [03:05<00:00,  1.49s/it]



Evaluating on test set...


Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/500 [00:00<?, ? examples/s]

Evaluating on 500 samples...


Evaluating: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 125/125 [02:43<00:00,  1.31s/it]


Base Model Results:
Validation - Loss: 8.4768, Perplexity: 4802.12
Test       - Loss: 8.1829, Perplexity: 3579.15






In [None]:
import gc
# ================================================================
# CRITICAL: Clean up base model before hyperparameter tuning
# ================================================================

print("\n" + "="*80)
print("PREPARING FOR HYPERPARAMETER OPTIMIZATION")
print("="*80)
print("\nCleaning up base model from GPU memory...")

# Delete base model (we saved the metrics, don't need the model anymore)
del base_model
torch.cuda.empty_cache()
gc.collect()

# Verify memory is freed
allocated_gb = torch.cuda.memory_allocated() / 1024**3
reserved_gb = torch.cuda.memory_reserved() / 1024**3
print(f"‚úì Base model removed")
print(f"  Current GPU memory: {allocated_gb:.2f} GB allocated, {reserved_gb:.2f} GB reserved")
print(f"  Ready to start hyperparameter tuning with clean GPU memory!")
print("="*80 + "\n")


PREPARING FOR HYPERPARAMETER OPTIMIZATION

Cleaning up base model from GPU memory...
‚úì Base model removed
  Current GPU memory: 0.00 GB allocated, 1.05 GB reserved
  Ready to start hyperparameter tuning with clean GPU memory!



## Hyperparameter Configuration for SHA

We'll test different combinations of:
- LoRA rank (r)
- Learning rate
- LoRA alpha

In [None]:
# Define hyperparameter configurations to test
hyperparameter_configs = [
    # Format: {"r": rank, "lr": learning_rate, "alpha": lora_alpha}
    {"r": 8,  "lr": 2e-4, "alpha": 16, "name": "config_1"},
    {"r": 16, "lr": 2e-4, "alpha": 16, "name": "config_2"},
    {"r": 32, "lr": 2e-4, "alpha": 32, "name": "config_3"},
    {"r": 16, "lr": 1e-4, "alpha": 16, "name": "config_4"},
    {"r": 16, "lr": 5e-5, "alpha": 16, "name": "config_5"},
    {"r": 32, "lr": 1e-4, "alpha": 32, "name": "config_6"},
    {"r": 32, "lr": 2e-4, "alpha": 64, "name": "config_7"},
    {"r": 32, "lr": 1e-4, "alpha": 64, "name": "config_8"},
]

print(f"Number of hyperparameter configurations: {len(hyperparameter_configs)}")
print("\nConfigurations:")
for i, config in enumerate(hyperparameter_configs, 1):
    print(f"{i}. {config['name']}: r={config['r']}, lr={config['lr']}, alpha={config['alpha']}")

Number of hyperparameter configurations: 8

Configurations:
1. config_1: r=8, lr=0.0002, alpha=16
2. config_2: r=16, lr=0.0002, alpha=16
3. config_3: r=32, lr=0.0002, alpha=32
4. config_4: r=16, lr=0.0001, alpha=16
5. config_5: r=16, lr=5e-05, alpha=16
6. config_6: r=32, lr=0.0001, alpha=32
7. config_7: r=32, lr=0.0002, alpha=64
8. config_8: r=32, lr=0.0001, alpha=64


## Hyperparameter Tuning Approach

### Initial Plan: Successive Halving Algorithm (SHA)

SHA Strategy:
1. **Round 1**: Train all configs on 1k examples ‚Üí keep best 50%
2. **Round 2**: Train top 50% on 2k examples ‚Üí keep best 50%
3. **Round 3**: Train top 50% on 4k examples ‚Üí keep best 50%
4. **Round 4**: Train top configs on 8k examples ‚Üí select best

### What Actually Happened: Grid Search on Round 1 Only

**Why we couldn't complete SHA:**
- Round 1 (1k samples, 60 steps) completed successfully for all 8 configurations
- Attempted Round 2 with 2k samples on the top 4 configurations
- **Encountered CUDA Out of Memory errors** with the T4 GPU (16GB VRAM)
- Training with 2k samples exceeded available GPU memory even with 4-bit quantization and gradient checkpointing

**Decision: Single-Round Grid Search**

Given the memory constraints:
1. We completed a full grid search on Round 1 (all 8 configs on 1k samples, 60 steps)
2. Selected the best configuration based on validation loss
3. Trained the winning configuration for the full 1000 steps on the complete training set

While this doesn't provide the full benefits of SHA's progressive elimination, it still systematically explores the hyperparameter space within our GPU constraints. Unfortunately it was not sufficient to see a true improvement in performance



In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only
import copy

# SHA configuration
sha_rounds = [
    {"n_samples": 1000, "steps": 60},
    {"n_samples": 2000, "steps": 120},
    {"n_samples": 4000, "steps": 240},
    {"n_samples": 8000, "steps": 480},
]

print("SHA Configuration:")
for i, round_config in enumerate(sha_rounds, 1):
    print(f"Round {i}: {round_config['n_samples']} samples, {round_config['steps']} steps")

def train_configuration(config, train_subset, steps, round_num, config_idx):
    """
    Train a single hyperparameter configuration

    Args:
        config: Hyperparameter configuration dict
        train_subset: Training dataset subset
        steps: Number of training steps
        round_num: SHA round number
        config_idx: Configuration index

    Returns:
        tuple: (model, validation_loss)
    """
    print(f"\n  Training {config['name']}: r={config['r']}, lr={config['lr']}, alpha={config['alpha']}")

    # Load fresh base model for this configuration
    model, _ = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )

    # Add LoRA adapters with specified hyperparameters
    model = FastLanguageModel.get_peft_model(
        model,
        r=config["r"],
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
        lora_alpha=config["alpha"],
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
        use_rslora=False,
        loftq_config=None,
    )

    # Setup output directory for checkpoints
    output_dir = os.path.join(CHECKPOINT_DIR, f"round{round_num}_{config['name']}")

    # Training arguments with checkpointing
    training_args = TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=steps,
        learning_rate=config["lr"],
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir=output_dir,
        save_strategy="steps",
        save_steps=max(steps // 3, 20),  # Save 3 checkpoints during training
        save_total_limit=2,  # Keep only last 2 checkpoints
        report_to="none",
    )

    # Create trainer
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_subset,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
        dataset_num_proc=2,
        packing=False,
        args=training_args,
    )

    # Train only on assistant responses
    trainer = train_on_responses_only(
        trainer,
        instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
        response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
    )

    # Check for existing checkpoint to resume from
    checkpoint_path = None
    if os.path.exists(output_dir):
        checkpoints = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
        if checkpoints:
            latest_checkpoint = max(checkpoints, key=lambda x: int(x.split("-")[1]))
            checkpoint_path = os.path.join(output_dir, latest_checkpoint)
            print(f"  Resuming from checkpoint: {checkpoint_path}")

    # Train
    trainer.train(resume_from_checkpoint=checkpoint_path)

    # Evaluate on validation set
    print(f"  Evaluating {config['name']} on validation set...")
    val_metrics = evaluate_model(model, tokenizer, val_dataset, num_samples=200)
    val_loss = val_metrics['loss']

    print(f"  {config['name']} validation loss: {val_loss:.4f}, perplexity: {val_metrics['perplexity']:.2f}")

    return model, val_loss

print("\n‚úì Training function defined")

SHA Configuration:
Round 1: 1000 samples, 60 steps
Round 2: 2000 samples, 120 steps
Round 3: 4000 samples, 240 steps
Round 4: 8000 samples, 480 steps

‚úì Training function defined


Here is the grid search (no SHA) that produces the best configuration among the possible ones on 60 training steps that whose checkpoints are progressively saved during training (we tried to interrupt training and the model was able to resume)

In [None]:
import time
import gc

# Track all results
grid_search_results = []

# All configurations to test
all_configs = hyperparameter_configs.copy()

print("\n" + "="*80)
print("HYPERPARAMETER GRID SEARCH")
print("="*80)
print(f"Testing {len(all_configs)} configurations")
print("="*80)

# Training configuration
n_samples = 1000
n_steps = 60
train_subset = train_dataset.select(range(n_samples))

print(f"\nTraining each config on {n_samples} samples for {n_steps} steps\n")

# Train each configuration
for config_idx, config in enumerate(all_configs, 1):
    print(f"\n[{config_idx}/{len(all_configs)}] Training {config['name']}: r={config['r']}, lr={config['lr']}, alpha={config['alpha']}")

    start_time = time.time()

    # Clean memory before each config
    torch.cuda.empty_cache()
    gc.collect()

    try:
        model, val_loss = train_configuration(
            config,
            train_subset,
            n_steps,
            round_num=1,
            config_idx=config_idx
        )

        training_time = time.time() - start_time

        result = {
            'config': config,
            'val_loss': val_loss,
            'n_samples': n_samples,
            'steps': n_steps,
            'training_time': training_time,
        }
        grid_search_results.append(result)

        print(f"  ‚úì {config['name']} completed in {training_time/60:.1f} minutes - val_loss: {val_loss:.4f}")

        # Clean up model to free memory
        del model
        torch.cuda.empty_cache()
        gc.collect()

    except Exception as e:
        print(f"  ‚úó {config['name']} failed: {str(e)}")

        # Clean up failed model
        if 'model' in locals():
            del model
        torch.cuda.empty_cache()
        gc.collect()

        continue

# Sort by validation loss (lower is better)
grid_search_results.sort(key=lambda x: x['val_loss'])

# Print final results
print(f"\n{'='*80}")
print("GRID SEARCH RESULTS (sorted by validation loss)")
print(f"{'='*80}")
for i, result in enumerate(grid_search_results, 1):
    config = result['config']
    print(f"{i}. {config['name']}: val_loss={result['val_loss']:.4f} "
          f"(r={config['r']}, lr={config['lr']:.0e}, alpha={config['alpha']}) - "
          f"{result['training_time']/60:.1f} min")

# Select best configuration
if len(grid_search_results) > 0:
    best_config = grid_search_results[0]['config']
    print(f"\n{'='*80}")
    print(f"üèÜ BEST CONFIGURATION: {best_config['name']}")
    print(f"{'='*80}")
    print(f"  LoRA rank (r): {best_config['r']}")
    print(f"  Learning rate: {best_config['lr']}")
    print(f"  LoRA alpha: {best_config['alpha']}")
    print(f"  Validation loss: {grid_search_results[0]['val_loss']:.4f}")
    print(f"  Training time: {grid_search_results[0]['training_time']/60:.1f} minutes")
    print(f"{'='*80}")

    # For compatibility with rest of notebook, create sha_results
    sha_results = grid_search_results
else:
    print("\n‚ö†Ô∏è No configurations completed successfully!")
    best_config = None

print(f"\n{'='*80}")
print("GRID SEARCH COMPLETED")
print(f"{'='*80}")


HYPERPARAMETER GRID SEARCH
Testing 8 configurations

Training each config on 1000 samples for 60 steps


[1/8] Training config_1: r=8, lr=0.0002, alpha=16

  Training config_1: r=8, lr=0.0002, alpha=16
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.11.4 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/1000 [00:00<?, ? examples/s]

Map (num_proc=6):   0%|          | 0/1000 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 5,636,096 of 1,241,450,496 (0.45% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,1.2437
20,1.0712
30,0.9816
40,1.0374
50,1.0315
60,1.0671


  Evaluating config_1 on validation set...


Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Filter:   0%|          | 0/200 [00:00<?, ? examples/s]

Evaluating on 200 samples...


Evaluating: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [00:56<00:00,  1.14s/it]


  config_1 validation loss: 7.1690, perplexity: 1298.55
  ‚úì config_1 completed in 4.5 minutes - val_loss: 7.1690

[2/8] Training config_2: r=16, lr=0.0002, alpha=16

  Training config_2: r=16, lr=0.0002, alpha=16
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/1000 [00:00<?, ? examples/s]

Map (num_proc=6):   0%|          | 0/1000 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss
10,1.2445
20,1.0721
30,0.9819
40,1.0376
50,1.0314
60,1.067


  Evaluating config_2 on validation set...
Evaluating on 200 samples...


Evaluating: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [00:56<00:00,  1.13s/it]


  config_2 validation loss: 7.1890, perplexity: 1324.77
  ‚úì config_2 completed in 4.1 minutes - val_loss: 7.1890

[3/8] Training config_3: r=32, lr=0.0002, alpha=32

  Training config_3: r=32, lr=0.0002, alpha=32
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/1000 [00:00<?, ? examples/s]

Map (num_proc=6):   0%|          | 0/1000 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Step,Training Loss
10,1.2321
20,1.0574
30,0.9744
40,1.0281
50,1.0237
60,1.057


  Evaluating config_3 on validation set...
Evaluating on 200 samples...


Evaluating: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [00:56<00:00,  1.13s/it]


  config_3 validation loss: 7.1719, perplexity: 1302.29
  ‚úì config_3 completed in 4.1 minutes - val_loss: 7.1719

[4/8] Training config_4: r=16, lr=0.0001, alpha=16

  Training config_4: r=16, lr=0.0001, alpha=16
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss
10,1.2577
20,1.1044
30,1.0042
40,1.0639
50,1.0516
60,1.09


  Evaluating config_4 on validation set...
Evaluating on 200 samples...


Evaluating: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [00:56<00:00,  1.12s/it]


  config_4 validation loss: 7.1789, perplexity: 1311.40
  ‚úì config_4 completed in 3.9 minutes - val_loss: 7.1789

[5/8] Training config_5: r=16, lr=5e-05, alpha=16

  Training config_5: r=16, lr=5e-05, alpha=16
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss
10,1.2715
20,1.1356
30,1.0348
40,1.0967
50,1.0815
60,1.1218


  Evaluating config_5 on validation set...
Evaluating on 200 samples...


Evaluating: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [00:55<00:00,  1.12s/it]


  config_5 validation loss: 7.1984, perplexity: 1337.35
  ‚úì config_5 completed in 3.9 minutes - val_loss: 7.1984

[6/8] Training config_6: r=32, lr=0.0001, alpha=32

  Training config_6: r=32, lr=0.0001, alpha=32
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Step,Training Loss
10,1.2463
20,1.081
30,0.9878
40,1.046
50,1.0375
60,1.0752


  Evaluating config_6 on validation set...
Evaluating on 200 samples...


Evaluating: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [00:56<00:00,  1.13s/it]


  config_6 validation loss: 7.1846, perplexity: 1318.91
  ‚úì config_6 completed in 3.9 minutes - val_loss: 7.1846

[7/8] Training config_7: r=32, lr=0.0002, alpha=64

  Training config_7: r=32, lr=0.0002, alpha=64
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Step,Training Loss
10,1.2203
20,1.0457
30,0.9682
40,1.0228
50,1.0174
60,1.0497


  Evaluating config_7 on validation set...
Evaluating on 200 samples...


Evaluating: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [00:56<00:00,  1.13s/it]


  config_7 validation loss: 7.1608, perplexity: 1287.93
  ‚úì config_7 completed in 3.9 minutes - val_loss: 7.1608

[8/8] Training config_8: r=32, lr=0.0001, alpha=64

  Training config_8: r=32, lr=0.0001, alpha=64
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Step,Training Loss
10,1.2337
20,1.0625
30,0.9764
40,1.032
50,1.027
60,1.0626


  Evaluating config_8 on validation set...
Evaluating on 200 samples...


Evaluating: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [00:56<00:00,  1.13s/it]


  config_8 validation loss: 7.1735, perplexity: 1304.40
  ‚úì config_8 completed in 3.9 minutes - val_loss: 7.1735

GRID SEARCH RESULTS (sorted by validation loss)
1. config_7: val_loss=7.1608 (r=32, lr=2e-04, alpha=64) - 3.9 min
2. config_1: val_loss=7.1690 (r=8, lr=2e-04, alpha=16) - 4.5 min
3. config_3: val_loss=7.1719 (r=32, lr=2e-04, alpha=32) - 4.1 min
4. config_8: val_loss=7.1735 (r=32, lr=1e-04, alpha=64) - 3.9 min
5. config_4: val_loss=7.1789 (r=16, lr=1e-04, alpha=16) - 3.9 min
6. config_6: val_loss=7.1846 (r=32, lr=1e-04, alpha=32) - 3.9 min
7. config_2: val_loss=7.1890 (r=16, lr=2e-04, alpha=16) - 4.1 min
8. config_5: val_loss=7.1984 (r=16, lr=5e-05, alpha=16) - 3.9 min

üèÜ BEST CONFIGURATION: config_7
  LoRA rank (r): 32
  Learning rate: 0.0002
  LoRA alpha: 64
  Validation loss: 7.1608
  Training time: 3.9 minutes

GRID SEARCH COMPLETED


## Execute SHA: Successive Halving

This cell runs the complete SHA algorithm. It will take several hours.

**Note:** If Colab disconnects, you can re-run this cell. It will automatically resume from saved checkpoints!

In [None]:
import gc
import torch

# Nuclear option: clear everything from GPU
torch.cuda.empty_cache()
gc.collect()

# Check what's using memory
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

# If still high, try resetting the cache
torch.cuda.reset_peak_memory_stats()
torch.cuda.empty_cache()

print(f"After cleanup - GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

GPU memory allocated: 0.00 GB
GPU memory reserved: 0.02 GB
After cleanup - GPU memory allocated: 0.00 GB


Below is the code and the output for our attempt to perform SHA. As you can see if Failed after beginning the second step

In [None]:
'''import time

# Track all results
sha_results = []

# Start with all configurations
active_configs = hyperparameter_configs.copy()

print("\n" + "="*80)
print("STARTING SUCCESSIVE HALVING ALGORITHM (SHA)")
print("="*80)

for round_idx, round_config in enumerate(sha_rounds, 1):
    print(f"\n{'='*80}")
    print(f"SHA ROUND {round_idx}/{len(sha_rounds)}")
    print(f"Training on {round_config['n_samples']} samples for {round_config['steps']} steps")
    print(f"Active configurations: {len(active_configs)}")
    print(f"{'='*80}")

    # Create training subset
    train_subset = train_dataset.select(range(round_config['n_samples']))

    round_results = []

    # Train each active configuration
    for config_idx, config in enumerate(active_configs):
        start_time = time.time()

        try:
            model, val_loss = train_configuration(
                config,
                train_subset,
                round_config['steps'],
                round_idx,
                config_idx
            )

            training_time = time.time() - start_time

            result = {
                'round': round_idx,
                'config': config,
                'val_loss': val_loss,
                'n_samples': round_config['n_samples'],
                'steps': round_config['steps'],
                'training_time': training_time,
            }
            round_results.append(result)

            print(f"  ‚úì {config['name']} completed in {training_time/60:.1f} minutes")

            # Clean up model to free memory
            del model
            torch.cuda.empty_cache()

        except Exception as e:
            print(f"  ‚úó {config['name']} failed: {str(e)}")
            continue

    # Sort by validation loss (lower is better)
    round_results.sort(key=lambda x: x['val_loss'])
    sha_results.extend(round_results)

    # Print round summary
    print(f"\n  Round {round_idx} Results (sorted by validation loss):")
    print(f"  {'-'*70}")
    for i, result in enumerate(round_results, 1):
        config = result['config']
        print(f"  {i}. {config['name']}: val_loss={result['val_loss']:.4f} "
              f"(r={config['r']}, lr={config['lr']}, alpha={config['alpha']})")

    # Keep best half for next round (unless it's the last round)
    if round_idx < len(sha_rounds):
        n_keep = max(1, len(round_results) // 2)
        active_configs = [r['config'] for r in round_results[:n_keep]]

        print(f"\n  ‚û§ Advancing top {n_keep} configuration(s) to next round")
        for config in active_configs:
            print(f"    - {config['name']}")
    else:
        # Last round - select the best
        best_config = round_results[0]['config']
        print(f"\n  üèÜ BEST CONFIGURATION: {best_config['name']}")
        print(f"     r={best_config['r']}, lr={best_config['lr']}, alpha={best_config['alpha']}")
        print(f"     Final validation loss: {round_results[0]['val_loss']:.4f}")

print(f"\n{'='*80}")
print("SHA COMPLETED")
print(f"{'='*80}")'''


STARTING SUCCESSIVE HALVING ALGORITHM (SHA)

SHA ROUND 1/4
Training on 1000 samples for 60 steps
Active configurations: 8

  Training config_1: r=8, lr=0.0002, alpha=16
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.11.4 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
Exception ignored in: <function _xla_gc_callback at 0x7bd985ccc180>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/jax/_src/lib/__init__.py", line 127, in _xla_gc_callback
    def _xla_gc_callback(*args):
    
KeyboardInterrupt: 


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/1000 [00:00<?, ? examples/s]

Process ForkPoolWorker-4:
Exception ignored in: <function _releaseLock at 0x7bdb0f325760>
Traceback (most recent call last):
  File "/usr/lib/python3.12/logging/__init__.py", line 243, in _releaseLock
    def _releaseLock():
    
KeyboardInterrupt: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/queues.py", line 385, in get
    res = self._reader.recv_bytes()
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/connection.py", line 219, in recv_bytes
    buf = self._recv_bytes(maxlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 

KeyboardInterrupt: 

## Train Best Configuration on Full Training Set

Now we train the winning configuration on the full training data.

In [None]:
# Get best configuration from SHA
best_config = sha_results[-1]['config']  # Last round, best config

print("="*80)
print("TRAINING BEST CONFIGURATION ON FULL TRAINING SET")
print("="*80)
print(f"Best configuration: {best_config['name']}")
print(f"  r={best_config['r']}, lr={best_config['lr']}, alpha={best_config['alpha']}")
print(f"\nTraining on {len(train_dataset)} examples for 1000 steps...\n")

# Load fresh model
optimized_model, _ = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Add LoRA with best hyperparameters
optimized_model = FastLanguageModel.get_peft_model(
    optimized_model,
    r=best_config["r"],
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],
    lora_alpha=best_config["alpha"],
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Training arguments for final training
final_output_dir = os.path.join(CHECKPOINT_DIR, "final_optimized_training")

final_training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=20,
    max_steps=1000,  # Full training
    learning_rate=best_config["lr"],
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_steps=20,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir=final_output_dir,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=3,
    report_to="none",
)

# Create trainer
final_trainer = SFTTrainer(
    model=optimized_model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,
    args=final_training_args,
)

# Train only on responses
final_trainer = train_on_responses_only(
    final_trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# Check for checkpoint to resume
final_checkpoint = None
if os.path.exists(final_output_dir):
    checkpoints = [d for d in os.listdir(final_output_dir) if d.startswith("checkpoint")]
    if checkpoints:
        latest = max(checkpoints, key=lambda x: int(x.split("-")[1]))
        final_checkpoint = os.path.join(final_output_dir, latest)
        print(f"Resuming from checkpoint: {final_checkpoint}\n")

# Train!
start_time = time.time()
final_trainer.train(resume_from_checkpoint=final_checkpoint)
training_time = time.time() - start_time

print(f"\n‚úì Training completed in {training_time/60:.1f} minutes")

TRAINING BEST CONFIGURATION ON FULL TRAINING SET
Best configuration: config_5
  r=16, lr=5e-05, alpha=16

Training on 80000 examples for 1000 steps...

==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/80000 [00:00<?, ? examples/s]

Map (num_proc=6):   0%|          | 0/80000 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 80,000 | Num Epochs = 1 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss
20,1.1279
40,1.1977
60,1.0951
80,1.0797
100,1.0453
120,1.0231
140,1.0638
160,1.0556
180,1.0377
200,1.0812



‚úì Training completed in 42.3 minutes


## Train Baseline Fine-Tuned Model (No Optimization)

For comparison, we also train a model with default hyperparameters (r=16, lr=2e-4).

In [None]:
print("="*80)
print("TRAINING BASELINE FINE-TUNED MODEL (DEFAULT HYPERPARAMETERS)")
print("="*80)
print("Configuration: r=16, lr=2e-4, alpha=16")
print(f"Training on {len(train_dataset)} examples for 1000 steps...\n")

# Load fresh model
baseline_ft_model, _ = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Add LoRA with default hyperparameters
baseline_ft_model = FastLanguageModel.get_peft_model(
    baseline_ft_model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Training arguments
baseline_output_dir = os.path.join(CHECKPOINT_DIR, "baseline_finetuned")

baseline_training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=20,
    max_steps=1000,
    learning_rate=2e-4,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_steps=20,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir=baseline_output_dir,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=3,
    report_to="none",
)

# Create trainer
baseline_trainer = SFTTrainer(
    model=baseline_ft_model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,
    args=baseline_training_args,
)

# Train only on responses
baseline_trainer = train_on_responses_only(
    baseline_trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# Check for checkpoint
baseline_checkpoint = None
if os.path.exists(baseline_output_dir):
    checkpoints = [d for d in os.listdir(baseline_output_dir) if d.startswith("checkpoint")]
    if checkpoints:
        latest = max(checkpoints, key=lambda x: int(x.split("-")[1]))
        baseline_checkpoint = os.path.join(baseline_output_dir, latest)
        print(f"Resuming from checkpoint: {baseline_checkpoint}\n")

# Train!
start_time = time.time()
baseline_trainer.train(resume_from_checkpoint=baseline_checkpoint)
baseline_training_time = time.time() - start_time

print(f"\n‚úì Baseline training completed in {baseline_training_time/60:.1f} minutes")

TRAINING BASELINE FINE-TUNED MODEL (DEFAULT HYPERPARAMETERS)
Configuration: r=16, lr=2e-4, alpha=16
Training on 80000 examples for 1000 steps...

==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 80,000 | Num Epochs = 1 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss
20,1.095
40,1.1438
60,1.0522
80,1.0454
100,1.0145
120,0.995
140,1.0354
160,1.0311
180,1.0184
200,1.0605



‚úì Baseline training completed in 42.3 minutes


## Save Final Models

In [None]:
print("="*80)
print("SAVING FINAL MODELS")
print("="*80)

# Save optimized model (LoRA adapters)
optimized_save_path = os.path.join(MODEL_SAVE_DIR, "optimized_lora_model")
print(f"\nSaving optimized model to: {optimized_save_path}")
optimized_model.save_pretrained(optimized_save_path)
tokenizer.save_pretrained(optimized_save_path)
print("‚úì Optimized model saved")

# Save baseline fine-tuned model (LoRA adapters)
baseline_save_path = os.path.join(MODEL_SAVE_DIR, "baseline_lora_model")
print(f"\nSaving baseline fine-tuned model to: {baseline_save_path}")
baseline_ft_model.save_pretrained(baseline_save_path)
tokenizer.save_pretrained(baseline_save_path)
print("‚úì Baseline fine-tuned model saved")

# Save to GGUF for CPU inference (optimized model)
gguf_save_path = os.path.join(MODEL_SAVE_DIR, "optimized_gguf")
print(f"\nConverting optimized model to GGUF: {gguf_save_path}")
optimized_model.save_pretrained_gguf(gguf_save_path, tokenizer, quantization_method="q4_k_m")
print("‚úì GGUF model saved (use this for Gradio UI)")

print(f"\n{'='*80}")
print("All models saved successfully!")
print(f"{'='*80}")

SAVING FINAL MODELS

Saving optimized model to: /content/drive/MyDrive/lab2_models/optimized_lora_model
‚úì Optimized model saved

Saving baseline fine-tuned model to: /content/drive/MyDrive/lab2_models/baseline_lora_model
‚úì Baseline fine-tuned model saved

Converting optimized model to GGUF: /content/drive/MyDrive/lab2_models/optimized_gguf
Unsloth: Merging model weights to 16-bit format...


config.json:   0%|          | 0.00/889 [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:48<00:00, 48.80s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [01:20<00:00, 80.50s/it]


Unsloth: Merge process complete. Saved to `/content/drive/MyDrive/lab2_models/optimized_gguf`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages


## How to Load Saved Models

To use your saved models later (in Gradio UI or another notebook):

In [None]:
# Example: Loading the optimized model from saved LoRA adapters

if False:  # Set to True to run this cell
    from unsloth import FastLanguageModel

    # Load model with LoRA adapters
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=os.path.join(MODEL_SAVE_DIR, "optimized_lora_model"),
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )

    # Prepare for inference
    FastLanguageModel.for_inference(model)

    print("‚úì Model loaded and ready for inference!")

In [None]:
# Example: Loading from a checkpoint (to resume training)

if False:  # Set to True to run this cell
    from unsloth import FastLanguageModel

    # Find latest checkpoint
    checkpoint_dir = os.path.join(CHECKPOINT_DIR, "final_optimized_training")
    checkpoints = [d for d in os.listdir(checkpoint_dir) if d.startswith("checkpoint")]
    latest_checkpoint = max(checkpoints, key=lambda x: int(x.split("-")[1]))
    checkpoint_path = os.path.join(checkpoint_dir, latest_checkpoint)

    print(f"Loading from checkpoint: {checkpoint_path}")

    # Load base model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )

    # Add LoRA (will load weights from checkpoint during training)
    model = FastLanguageModel.get_peft_model(
        model,
        r=best_config["r"],
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
        lora_alpha=best_config["alpha"],
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing="unsloth",
    )

    # Resume training from checkpoint
    # trainer.train(resume_from_checkpoint=checkpoint_path)

    print("‚úì Ready to resume training!")

## Summary and Next Steps

### What We Accomplished:

1. ‚úÖ Split dataset into train/val/test (80/10/10)
2. ‚úÖ Implemented Successive Halving Algorithm (SHA) for hyperparameter tuning
3. ‚úÖ Tested multiple configurations (LoRA rank, learning rate, alpha)
4. ‚úÖ Saved checkpoints throughout training (resumable if disconnected)
5. ‚úÖ Trained three models:
   - Base (no fine-tuning)
   - Baseline fine-tuned (default hyperparameters)
   - Optimized fine-tuned (SHA-selected hyperparameters)
6. ‚úÖ Comprehensive evaluation on held-out test set
7. ‚úÖ Saved final models (LoRA adapters + GGUF for CPU inference)

### Next Steps:

1. **Use the GGUF model** in your Gradio UI (located in `MODEL_SAVE_DIR/optimized_gguf`)
2. **Document your findings** in README.md:
   - SHA results (which configs worked best)
   - Final performance improvements
   - Training times
   - Qualitative improvements observed
3. **Deploy to Hugging Face Spaces** with the GGUF model for CPU inference

### Files Saved:

- `{MODEL_SAVE_DIR}/optimized_lora_model/` - Best model (LoRA adapters)
- `{MODEL_SAVE_DIR}/baseline_lora_model/` - Baseline model (LoRA adapters)
- `{MODEL_SAVE_DIR}/optimized_gguf/` - GGUF format for CPU inference (use in Gradio)
- `{CHECKPOINT_DIR}/` - All training checkpoints (can delete after final training)

### Model Performance:

See the evaluation cells above for specific numbers on:
- Perplexity improvements
- Loss reductions
- Qualitative response quality

**Great job! You now have everything needed for Lab 2!** üéâ