# Notebook 1: Full Fine-Tuning with SmolLM2-135M

This notebook demonstrates full-parameter fine-tuning of a small model (SmolLM2-135M) with Unsloth.

## Key Concepts
- **Full Fine-Tuning**: Updates all model parameters (unlike LoRA, which trains only small adapter matrices and keeps the base weights frozen)
- **Model**: SmolLM2-135M - A tiny but capable model perfect for learning
- **Task**: Instruction following / Chat completion
- **Dataset**: Alpaca-style instruction dataset

## Video Recording Checklist
- [ ] Explain what full fine-tuning means
- [ ] Show model architecture and parameter count
- [ ] Walk through dataset format
- [ ] Explain training hyperparameters
- [ ] Show training progress and metrics
- [ ] Demonstrate inference before/after
- [ ] Export to Ollama

## Step 1: Install Unsloth

In [None]:
%%capture
# Install Unsloth and dependencies
!pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

## Step 2: Import Libraries

In [None]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Step 3: Load Model with Full Fine-Tuning Configuration

In [None]:
# Model configuration
max_seq_length = 2048  # Training sequence length; 2048 is plenty for Alpaca-sized examples
dtype = None  # Auto-detect: float16 on Tesla T4/V100, bfloat16 on Ampere or newer
load_in_4bit = False  # Full fine-tuning needs unquantized weights

# Load SmolLM2-135M model
# Note: pre-quantized checkpoints such as "unsloth/gemma-3-1b-it-unsloth-bnb-4bit"
# are meant for QLoRA; full fine-tuning needs an unquantized checkpoint.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/SmolLM2-135M-Instruct",  # 135M parameters - very small!
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    full_finetuning = True,  # Train all weights; requires a recent Unsloth release
)

print(f"Model loaded: {model.config._name_or_path}")
print(f"Model parameters: {model.num_parameters():,}")
print(f"Tokenizer vocab size: {len(tokenizer)}")

## Step 4: Configure for Full Fine-Tuning (NOT LoRA)

**Important**: Full fine-tuning means no LoRA adapters are attached, so we skip `FastLanguageModel.get_peft_model` entirely (a LoRA rank of 0 is not a valid configuration). Instead, we make sure every base-model weight is trainable.

In [None]:
# Configure model for full fine-tuning: no PEFT/LoRA wrapper
for param in model.parameters():
    param.requires_grad = True  # Train every weight, not just adapters

# Trade extra compute for lower activation memory
model.gradient_checkpointing_enable()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())

print("Model configured for FULL FINE-TUNING")
print(f"Trainable parameters: {trainable:,} of {total:,}")

## Step 5: Load and Prepare Dataset

We'll use a small instruction-following dataset. Format:
```
{
  "instruction": "What is the capital of France?",
  "input": "",
  "output": "The capital of France is Paris."
}
```

In [None]:
# Load Alpaca dataset (cleaned version with 52k instructions)
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Let's look at a few examples
print("Dataset size:", len(dataset))
print("\nFirst example:")
print(dataset[0])
print("\nDataset columns:", dataset.column_names)

## Step 6: Create Chat Template

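We format each example with the Alpaca prompt template. This is a plain-text instruction layout, not SmolLM2-Instruct's built-in ChatML-style chat template, so prompts at inference time must use the same Alpaca layout.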
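
As a minimal sketch of what this formatting step produces, the helper below fills an Alpaca-style template for a single record. The function name `format_alpaca`, the `"<eos>"` placeholder token, and the choice to drop the Input section when the input is empty are assumptions made for illustration; the notebook cell that follows always keeps the Input section and uses the tokenizer's real EOS token.

```python
# Illustrative sketch: format one Alpaca-style record as a training string.
# `format_alpaca` and the "<eos>" placeholder are hypothetical, not part of Unsloth.

TEMPLATE_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
TEMPLATE_NO_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def format_alpaca(record, eos_token="<eos>"):
    """Return the training text for one record, ending with the EOS token."""
    if record.get("input", "").strip():
        text = TEMPLATE_WITH_INPUT.format(**record)
    else:
        # Assumption: omit the Input section entirely when there is no input
        text = TEMPLATE_NO_INPUT.format(
            instruction=record["instruction"], output=record["output"]
        )
    return text + eos_token

example = {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris.",
}
formatted = format_alpaca(example)
print(formatted)
```

Appending the EOS token matters: without it the model never sees where a response ends and tends to keep generating past the answer.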

In [None]:
# Alpaca prompt template (plain text, not the tokenizer's built-in chat template)
chat_template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # End of sequence token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Fill the template; an empty input leaves the Input section blank
        text = chat_template.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    
    return {"text": texts}

# Apply formatting to dataset
dataset = dataset.map(formatting_prompts_func, batched=True)

# Show a formatted example
print("Formatted example:")
print(dataset[0]["text"])

## Step 7: Configure Training Arguments

Full fine-tuning updates every weight directly, so we use a smaller learning rate (5e-5) than is typical for LoRA (2e-4).
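
To make the schedule concrete, here is a rough sketch of linear warmup followed by linear decay, using the same numbers as the training arguments in this step. The helper name `linear_lr` is hypothetical, and the formula is a simplified stand-in for the scheduler the trainer actually builds, so treat the printed values as approximate.

```python
# Illustrative sketch of linear warmup + linear decay with this notebook's settings.
# `linear_lr` is a hypothetical helper, not a Transformers or Unsloth function.

def linear_lr(step, base_lr=5e-5, warmup_steps=100, total_steps=500):
    """Learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly to 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

per_device_batch = 4
grad_accum = 4
effective_batch = per_device_batch * grad_accum
examples_seen = effective_batch * 500  # optimizer steps * examples per step

for step in (0, 50, 100, 300, 500):
    print(f"step {step:>3}: lr = {linear_lr(step):.2e}")
print(f"effective batch size: {effective_batch}, examples seen: {examples_seen}")
```

With 500 steps and an effective batch of 16, training touches 8,000 examples, which is well short of one pass over the dataset.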

In [None]:
training_args = TrainingArguments(
    per_device_train_batch_size = 4,  # Batch size per GPU
    gradient_accumulation_steps = 4,  # Effective batch size = 4 * 4 = 16
    warmup_steps = 100,
    num_train_epochs = 1,  # Ignored here: max_steps takes precedence when it is set
    max_steps = 500,  # 500 steps * 16 examples = 8,000 examples (a fraction of one epoch)
    learning_rate = 5e-5,  # Lower LR for full fine-tuning vs 2e-4 for LoRA
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 10,
    optim = "adamw_8bit",  # Memory efficient optimizer
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    output_dir = "outputs/smollm2_full_finetuned",
    report_to = "none",  # Disable wandb/tensorboard for now
)

print("Training configuration:")
print(f"Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Total steps: {training_args.max_steps}")

## Step 8: Create Trainer and Start Training

In [None]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,  # Setting this to True can make training up to 5x faster for short sequences
    args = training_args,
)

print("Trainer created. Starting training...")
print("\n" + "="*50)
print("TRAINING IN PROGRESS")
print("="*50 + "\n")

In [None]:
# Start training
trainer_stats = trainer.train()

print("\n" + "="*50)
print("TRAINING COMPLETE!")
print("="*50)
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"Average training loss: {trainer_stats.metrics['train_loss']:.4f}")

## Step 9: Test Inference

Let's test the fine-tuned model!
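
The generation call in this step passes `temperature` and `top_p`. As a simplified sketch of what those two knobs do (not the exact implementation inside `generate`), the hypothetical helper `top_p_filter` below rescales logits by the temperature, then keeps the smallest set of tokens whose cumulative probability reaches `top_p`.

```python
# Illustrative sketch of temperature scaling followed by top-p (nucleus) filtering.
# `top_p_filter` is a hypothetical helper written for this explanation.
import math

def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by nucleus filtering."""
    # Lower temperature sharpens the distribution, higher flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the cumulative mass reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

nucleus = top_p_filter([2.0, 1.0, 0.0], temperature=1.0, top_p=0.9)
print(nucleus)
```

Both settings only matter when sampling is enabled; with greedy decoding the most likely token is always chosen.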

In [None]:
# Enable fast inference mode
FastLanguageModel.for_inference(model)

# Test prompts
test_prompts = [
    "What is the capital of Japan?",
    "Write a Python function to calculate factorial",
    "Explain machine learning in simple terms"
]

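for prompt in test_prompts:
    formatted_prompt = chat_template.format(prompt, "", "")
    inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,  # Required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
        use_cache=True
    )
    
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)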
    print("\n" + "="*70)
    print(f"PROMPT: {prompt}")
    print("-"*70)
    print(f"RESPONSE:\n{response}")
    print("="*70)

## Step 10: Save Model

In [None]:
# Save the full fine-tuned model
model.save_pretrained("smollm2_full_finetuned")
tokenizer.save_pretrained("smollm2_full_finetuned")

print("Model saved to: smollm2_full_finetuned/")

## Step 11: Export to Different Formats
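
Ollama loads a GGUF file through a Modelfile. As a rough sketch, the helper below builds one whose prompt template mirrors the Alpaca layout used for training. The function name `build_modelfile` and the GGUF path are placeholders; check the export directory for the file name that is actually written.

```python
# Illustrative sketch: build an Ollama Modelfile for an exported GGUF file.
# `build_modelfile` and the example path are hypothetical placeholders.

ALPACA_HEADER = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request."
)

def build_modelfile(gguf_path):
    """Return Modelfile text that wraps user prompts in the Alpaca layout."""
    # {{ .Prompt }} is Ollama's placeholder for the user's message
    template = (
        f"{ALPACA_HEADER}\n\n"
        "### Instruction:\n{{ .Prompt }}\n\n"
        "### Input:\n\n\n"
        "### Response:\n"
    )
    return (
        f"FROM {gguf_path}\n"
        f'TEMPLATE """{template}"""\n'
        'PARAMETER stop "### Instruction:"\n'
    )

modelfile_text = build_modelfile("./path/to/exported-model.gguf")
print(modelfile_text)
```

Saving this text as `Modelfile` and running `ollama create <name> -f Modelfile` registers the model with Ollama. Either the `q4_k_m` or the `f16` file can serve as the `FROM` target.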

In [None]:
# Export to GGUF format for llama.cpp
model.save_pretrained_gguf(
    "smollm2_full_finetuned_gguf",
    tokenizer,
    quantization_method = "q4_k_m"  # 4-bit quantization
)

print("Exported to GGUF format!")

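# Export an unquantized float16 GGUF (larger file, no quantization loss)
# Ollama can load either GGUF file; f16 simply preserves full weight precision
model.save_pretrained_gguf(
    "smollm2_full_finetuned_ollama",
    tokenizer,
    quantization_method = "f16"  # Float16, unquantized
)

print("Exported float16 GGUF for Ollama!")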

## Step 12: Upload to HuggingFace (Optional)

In [None]:
# Uncomment and fill in your HuggingFace username
# model.push_to_hub("your_username/smollm2-135m-alpaca-full-finetuned", token="YOUR_HF_TOKEN")
# tokenizer.push_to_hub("your_username/smollm2-135m-alpaca-full-finetuned", token="YOUR_HF_TOKEN")

print("To upload to HuggingFace, uncomment the code above and add your token")

## Summary

### What we accomplished:
1. Loaded SmolLM2-135M model (135 million parameters)
2. Configured for **full fine-tuning** (all parameters updated)
3. Fine-tuned on the Alpaca instruction dataset for 500 steps (about 8,000 of its ~52k examples)
4. Tested inference with custom prompts
5. Exported to GGUF (q4_k_m and f16) for use with llama.cpp and Ollama

### Key Differences from LoRA:
- **Full Fine-Tuning**: Updates ALL model parameters (135M)
- **LoRA**: Only updates adapter parameters (a few million at most for this model, depending on rank and target modules)
- **Memory**: Full FT requires more VRAM
- **Speed**: LoRA is faster to train
- **Quality**: Full FT can achieve better results but risks overfitting
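
The parameter gap can be checked with a little arithmetic. The sketch below assumes a Llama-style architecture with the dimensions listed in the constants (hidden size 576, 30 layers, and so on), tied input/output embeddings, and no bias terms. These values are assumptions for SmolLM2-135M and should be verified against `model.config`; the helper names are hypothetical.

```python
# Illustrative arithmetic: full fine-tuning vs. LoRA trainable parameter counts.
# All dimensions are assumed values for SmolLM2-135M; verify them with model.config.

HIDDEN = 576         # hidden size
KV_DIM = 192         # key/value projection width (grouped-query attention)
INTERMEDIATE = 1536  # MLP inner width
LAYERS = 30
VOCAB = 49152

# (input_dim, output_dim) of each linear projection in one transformer layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def full_param_count():
    """All weights: tied embeddings, per-layer projections and norms, final norm."""
    per_layer = sum(i * o for i, o in PROJECTIONS.values()) + 2 * HIDDEN
    return VOCAB * HIDDEN + LAYERS * per_layer + HIDDEN

def lora_param_count(rank):
    """LoRA adds A (rank x in) and B (out x rank) for each wrapped projection."""
    per_layer = sum(rank * (i + o) for i, o in PROJECTIONS.values())
    return LAYERS * per_layer

full = full_param_count()
for rank in (4, 8, 16):
    lora = lora_param_count(rank)
    print(f"r={rank:>2}: {lora:,} LoRA params ({100 * lora / full:.2f}% of {full:,})")
```

Under these assumptions, LoRA on every projection trains only a few percent of the weights even at rank 16, which is where most of its memory savings come from.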

### Next Steps:
- Compare with LoRA results in Notebook 2
- Try with larger models (Gemma-3-1B)
- Experiment with different learning rates
- Test on domain-specific datasets