# Notebook 1: Full Fine-Tuning with SmolLM2-135M

This notebook demonstrates full parameter fine-tuning using Unsloth.ai with a small model (SmolLM2-135M).

## Key Concepts
- **Full Fine-Tuning**: Updates all model parameters (vs LoRA which only updates adapters)
- **Model**: SmolLM2-135M - A tiny but capable model perfect for learning
- **Task**: Instruction following / Chat completion
- **Dataset**: Alpaca-style instruction dataset

## Video Recording Checklist
- [ ] Explain what full fine-tuning means
- [ ] Show model architecture and parameter count
- [ ] Walk through dataset format
- [ ] Explain training hyperparameters
- [ ] Show training progress and metrics
- [ ] Demonstrate inference before/after
- [ ] Export to Ollama

## Step 1: Install Unsloth

In [1]:
# Install Unsloth and dependencies
!pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-8qw9u93r/unsloth_389f6492d0894aabbcfb1f9906daaa27
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-8qw9u93r/unsloth_389f6492d0894aabbcfb1f9906daaa27
  Resolved https://github.com/unslothai/unsloth.git to commit 1c0ad844f170f67c7cdf6f7a9465bafb0f9627df
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.11.3 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.11.3-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.gi

## Step 2: Import Libraries

In [2]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Libraries imported successfully!
PyTorch version: 2.8.0+cu126
CUDA available: True
GPU: Tesla T4


## Step 3: Load Model with Full Fine-Tuning Configuration

In [3]:
# Model configuration
max_seq_length = 2048  # Maximum sequence length (in tokens) used for training
dtype = None  # Auto-detect: float16 on Tesla T4/V100, bfloat16 on Ampere or newer
load_in_4bit = False  # No 4-bit quantization; keep the weights in 16-bit

# Load SmolLM2-135M model
# Alternative: "unsloth/gemma-3-1b-it-unsloth-bnb-4bit" for slightly larger model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/SmolLM2-135M-Instruct",  # 135M parameters - very small!
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print(f"Model loaded: {model.config._name_or_path}")
print(f"Model parameters: {model.num_parameters():,}")
print(f"Tokenizer vocab size: {len(tokenizer)}")

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/269M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/158 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/423 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Model loaded: unsloth/SmolLM2-135M-Instruct
Model parameters: 134,515,584
Tokenizer vocab size: 49153


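## Step 4: Configure High-Rank LoRA (an Approximation of Full Fine-Tuning)

**Important**: `FastLanguageModel.get_peft_model` attaches LoRA adapters, so this cell does not perform true full fine-tuning. Using rank 256 on every attention and MLP projection makes a large share of the weights trainable (the training log in Step 8 reports 78,151,680 of 212,667,264 parameters, 36.75%), but the base weights stay frozen. LoRA also scales the adapter update by `lora_alpha / r`, so `lora_alpha = 0` multiplies it by zero; a nonzero value (commonly equal to `r` or larger) is needed for the adapters to affect the output. `use_gradient_checkpointing="unsloth"` only reduces memory use and does not select the fine-tuning mode. For true full fine-tuning, recent Unsloth releases document a `full_finetuning = True` option on `from_pretrained`, used instead of calling `get_peft_model`.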

In [4]:
# Attach rank-256 LoRA adapters to every attention and MLP projection
# (approximates full fine-tuning; the base weights stay frozen)
model = FastLanguageModel.get_peft_model(
    model,
    r = 256,  # High rank for near-full fine-tuning
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
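    lora_alpha = 0,  # NOTE: LoRA scaling is lora_alpha / r, so 0 zeroes the adapter update; use a nonzero value (e.g. 256)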
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Memory efficient
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

print("Model configured for FULL FINE-TUNING")
print("All parameters will be updated during training!")

Unsloth 2025.11.2 patched 30 layers with 30 QKV layers, 30 O layers and 30 MLP layers.


Model configured for FULL FINE-TUNING
All parameters will be updated during training!
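
As a sanity check on what this cell actually makes trainable, the adapter size can be worked out by hand. The sketch below assumes the layer dimensions from SmolLM2-135M's published config (hidden size 576, MLP size 1536, key/value projection width 192, 30 layers), none of which are printed in this notebook. It relies on two standard LoRA facts: a rank-`r` adapter on a `d_in x d_out` projection adds `r * (d_in + d_out)` parameters, and its update is scaled by `lora_alpha / r` (when rsLoRA is off). The helpers `lora_param_count` and `lora_scaling` are illustrative names, not Unsloth functions.

```python
# Layer dimensions ASSUMED from the SmolLM2-135M config (not shown in this notebook).
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS = 576, 1536, 192, 30

# (in_features, out_features) for each projection listed in target_modules.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Total LoRA parameters: each adapter adds rank * (in + out) weights."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * N_LAYERS


def lora_scaling(lora_alpha: float, rank: int) -> float:
    """Standard (non-rsLoRA) factor that multiplies the adapter update."""
    return lora_alpha / rank


print(f"rank 256 adapters: {lora_param_count(256):,} parameters")
print(f"rank 16 adapters:  {lora_param_count(16):,} parameters")
print(f"scaling with lora_alpha=0:   {lora_scaling(0, 256)}")
print(f"scaling with lora_alpha=256: {lora_scaling(256, 256)}")
```

Under these assumptions the rank-256 count equals the "Trainable parameters" figure in the Step 8 training log, and a scaling factor of 0.0 means the adapter output is multiplied away regardless of how the adapter weights change.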


## Step 5: Load and Prepare Dataset

We'll use a small instruction-following dataset. Format:
```
{
  "instruction": "What is the capital of France?",
  "input": "",
  "output": "The capital of France is Paris."
}
```

In [5]:
# Load Alpaca dataset (cleaned version with 52k instructions)
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Let's look at a few examples
print("Dataset size:", len(dataset))
print("\nFirst example:")
print(dataset[0])
print("\nDataset columns:", dataset.column_names)

README.md: 0.00B [00:00, ?B/s]

alpaca_data_cleaned.json:   0%|          | 0.00/44.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/51760 [00:00<?, ? examples/s]

Dataset size: 51760

First example:
{'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.', 'input': '', 'instruction': 'Give three tips for staying healthy.'}

Dataset columns: ['output', 'input', 'instruction']


## Step 6: Create Chat Template

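SmolLM2-135M-Instruct ships with its own chat template, but this notebook formats each example with the Alpaca prompt template instead, which matches the dataset's `instruction`, `input`, and `output` columns.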

In [6]:
# Alpaca-style prompt template (instruction, optional input, response)
chat_template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # End of sequence token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []

    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Fill the template; the Input section stays empty when there is no input
        text = chat_template.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

# Apply formatting to dataset
dataset = dataset.map(formatting_prompts_func, batched=True)

# Show a formatted example
print("Formatted example:")
print(dataset[0]["text"])

Map:   0%|          | 0/51760 [00:00<?, ? examples/s]

Formatted example:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Input:


### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sle
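
The formatted example above keeps an empty `### Input:` section when a record has no input. One common alternative, sketched here, is to switch to a shorter template for such records. The wording of the no-input template and the helper name `format_alpaca` are assumptions for illustration, and the `"<eos>"` string is a placeholder standing in for `tokenizer.eos_token`.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)

# Hypothetical shorter template used when the record has no input.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)


def format_alpaca(instruction: str, input_text: str, output: str, eos_token: str = "") -> str:
    """Pick the template based on whether the record carries an input."""
    template = PROMPT_WITH_INPUT if input_text.strip() else PROMPT_NO_INPUT
    # str.format ignores the unused `input` keyword for the no-input template.
    text = template.format(instruction=instruction, input=input_text, output=output)
    return text + eos_token


example = format_alpaca(
    "Give three tips for staying healthy.", "", "1. Eat a balanced diet.", "<eos>"
)
print(example)
```

Whichever variant is used for training, the inference prompt in Step 9 would need to follow the same layout.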

## Step 7: Configure Training Arguments

Full fine-tuning typically uses a smaller learning rate than LoRA: here 5e-5, versus the 2e-4 commonly used for LoRA.

In [7]:
training_args = TrainingArguments(
    per_device_train_batch_size = 4,  # Batch size per GPU
    gradient_accumulation_steps = 4,  # Effective batch size = 4 * 4 = 16
    warmup_steps = 100,
    num_train_epochs = 1,  # Ignored while max_steps is set (max_steps takes precedence)
    max_steps = 500,  # Cap training at 500 optimizer steps for a quick demo
    learning_rate = 5e-5,  # Lower LR for full fine-tuning vs 2e-4 for LoRA
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 10,
    optim = "adamw_8bit",  # Memory efficient optimizer
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    output_dir = "outputs/smollm2_full_finetuned",
    report_to = "none",  # Disable wandb/tensorboard for now
)

print("Training configuration:")
print(f"Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Total steps: {training_args.max_steps}")

Training configuration:
Effective batch size: 16
Learning rate: 5e-05
Total steps: 500
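
To put these numbers in context, the following sketch works out how much of the dataset 500 steps cover and what learning rate the schedule yields at a few steps. It assumes a single GPU and the usual linear-with-warmup shape (linear ramp to the peak, then linear decay to zero at `max_steps`); `linear_schedule_lr` is an illustrative helper, not a call into the trainer.

```python
import math

dataset_size = 51_760   # from the Step 5 output
per_device_batch = 4
grad_accum = 4
max_steps = 500
warmup_steps = 100
peak_lr = 5e-5

effective_batch = per_device_batch * grad_accum           # single GPU assumed
examples_seen = effective_batch * max_steps
steps_per_epoch = math.ceil(dataset_size / effective_batch)


def linear_schedule_lr(step: int) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0, max_steps - step) / (max_steps - warmup_steps)


print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"steps for one full epoch: {steps_per_epoch}")
print(f"fraction of one epoch covered: {examples_seen / dataset_size:.1%}")
for step in (10, 50, 100, 300, 500):
    print(f"step {step:>3}: lr = {linear_schedule_lr(step):.2e}")
```

With these settings the run stops well short of one epoch, and the first 100 steps are spent ramping the learning rate up.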


## Step 8: Create Trainer and Start Training

In [8]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,  # Set to True to pack short sequences together; can make training 5x faster for short sequences
    args = training_args,
)

print("Trainer created. Starting training...")
print("\n" + "="*50)
print("TRAINING IN PROGRESS")
print("="*50 + "\n")

Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=12):   0%|          | 0/51760 [00:00<?, ? examples/s]

Trainer created. Starting training...

TRAINING IN PROGRESS



In [9]:
# Start training
trainer_stats = trainer.train()

print("\n" + "="*50)
print("TRAINING COMPLETE!")
print("="*50)
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"Loss: {trainer_stats.metrics['train_loss']:.4f}")

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 51,760 | Num Epochs = 1 | Total steps = 500
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 78,151,680 of 212,667,264 (36.75% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,2.004
20,2.0074
30,1.9619
40,1.9496
50,2.0324
60,1.9693
70,1.9595
80,2.052
90,2.0274
100,1.9354



TRAINING COMPLETE!
Training time: 710.29 seconds
Loss: 1.9846
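
A quick way to read these numbers is to compare the average loss over the first and second half of the logged steps and to convert the final loss to perplexity. The sketch below copies the logged values from the table above and assumes the reported loss is the mean per-token cross-entropy in nats, so that perplexity is `exp(loss)`.

```python
import math

# Training losses copied from the log above (steps 10 through 100).
logged = {
    10: 2.004, 20: 2.0074, 30: 1.9619, 40: 1.9496, 50: 2.0324,
    60: 1.9693, 70: 1.9595, 80: 2.052, 90: 2.0274, 100: 1.9354,
}
losses = list(logged.values())

first_half = sum(losses[:5]) / 5
second_half = sum(losses[5:]) / 5

final_train_loss = 1.9846  # "Loss" reported at the end of training
perplexity = math.exp(final_train_loss)

print(f"mean loss, steps 10-50:  {first_half:.4f}")
print(f"mean loss, steps 60-100: {second_half:.4f}")
print(f"change between halves:   {second_half - first_half:+.4f}")
print(f"perplexity at final loss: {perplexity:.2f}")
```

The two halves differ only in the third decimal place, so the loss is essentially flat over the logged steps. That is consistent with these steps falling inside the 100-step warmup, and it would also be consistent with the adapter update being scaled to zero by `lora_alpha = 0` in Step 4.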


## Step 9: Test Inference

Let's test the fine-tuned model!

In [14]:
# Enable fast inference mode
FastLanguageModel.for_inference(model)

# Test prompts
test_prompts = [
    "What is the capital of Japan?",
    "Write a Python function to calculate factorial",
    "Explain machine learning in simple terms"
]

for prompt in test_prompts:
    formatted_prompt = chat_template.format(prompt, "", "")
    inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,  # Required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
        use_cache=True
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("\n" + "="*70)
    print(f"PROMPT: {prompt}")
    print("-"*70)
    print(f"RESPONSE:\n{response}")
    print("="*70)


PROMPT: What is the capital of Japan?
----------------------------------------------------------------------
RESPONSE:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is the capital of Japan?

### Input:


### Response:
The capital of Japan is Tokyo.

### Explanation:

PROMPT: Write a Python function to calculate factorial
----------------------------------------------------------------------
RESPONSE:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Write a Python function to calculate factorial

### Input:


### Response:
```python
def factorial(n):
    # Your code here
```

### Explanation:
The `factorial` function takes an integer `n` as input and returns its factorial. The factorial of a number `n` is the product of all posi
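
Because the decoded output repeats the whole prompt and the model tends to continue with extra sections such as `### Explanation:`, it can help to keep only the answer. The following is a minimal sketch; `extract_response` and its default stop markers are illustrative choices, not part of the notebook's code.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, stop_markers=("### Explanation:", "### Instruction:")) -> str:
    """Return the text after the response marker, cut at the first stop marker.

    Returns an empty string if the response marker is missing.
    """
    _, _, answer = decoded.partition(RESPONSE_MARKER)
    for marker in stop_markers:
        answer = answer.split(marker, 1)[0]
    return answer.strip()


decoded = (
    "### Instruction:\nWhat is the capital of Japan?\n\n"
    "### Input:\n\n\n"
    "### Response:\nThe capital of Japan is Tokyo.\n\n"
    "### Explanation:\n"
)
print(extract_response(decoded))
```

In the loop above, the same helper could be applied to `response` before printing.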

## Step 10: Save Model

In [15]:
# Save the full fine-tuned model
model.save_pretrained("smollm2_full_finetuned")
tokenizer.save_pretrained("smollm2_full_finetuned")

print("Model saved to: smollm2_full_finetuned/")

Model saved to: smollm2_full_finetuned/


## Step 11: Export to Different Formats

In [16]:
# Export to GGUF format for llama.cpp
model.save_pretrained_gguf(
    "smollm2_full_finetuned_gguf",
    tokenizer,
    quantization_method = "q4_k_m"  # 4-bit quantization
)

print("Exported to GGUF format!")

# Export to float16 for Ollama
model.save_pretrained_gguf(
    "smollm2_full_finetuned_ollama",
    tokenizer,
    quantization_method = "f16"  # Float16 for Ollama
)

print("Exported to Ollama-compatible format!")

Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `smollm2_full_finetuned_gguf`: 100%|██████████| 1/1 [00:00<00:00,  3.57it/s]


Successfully copied all 1 files from cache to `smollm2_full_finetuned_gguf`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 9664.29it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:01<00:00,  1.21s/it]


Unsloth: Merge process complete. Saved to `/content/smollm2_full_finetuned_gguf`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: llama.cpp found in the system. Skipping installation.
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['SmolLM2-135M-Instruct.F16.gguf']
Unsloth: [2] Converting GGUF f16 into q4_k_m. This might take 10 minutes...
Unsloth: Model files cleanup...
Unsloth: All GGUF conversions completed successfully!
Generated files: ['SmolLM2-135M-Instruct.Q4_K_M.gguf']
Unsloth: No Ollama template mapping found for model 'unsloth/Sm

Unsloth: Copying 1 files from cache to `smollm2_full_finetuned_ollama`: 100%|██████████| 1/1 [00:00<00:00,  3.28it/s]


Successfully copied all 1 files from cache to `smollm2_full_finetuned_ollama`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 10131.17it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:01<00:00,  1.61s/it]


Unsloth: Merge process complete. Saved to `/content/smollm2_full_finetuned_ollama`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: llama.cpp found in the system. Skipping installation.
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['SmolLM2-135M-Instruct.F16.gguf']
Unsloth: Model files cleanup...
Unsloth: All GGUF conversions completed successfully!
Generated files: ['SmolLM2-135M-Instruct.F16.gguf']
Unsloth: No Ollama template mapping found for model 'unsloth/SmolLM2-135M-Instruct'. Skipping Ollama Modelfile
Unsloth: example usage for text 
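
The log above reports that no Ollama template mapping was found, so no Modelfile was generated. One way to fill that gap is to write a Modelfile by hand that mirrors the Alpaca prompt used in training. The sketch below only builds the file's text. It uses the Modelfile instructions `FROM`, `TEMPLATE`, and `PARAMETER stop` with Ollama's `{{ .Prompt }}` template variable; the GGUF path and the choice of stop strings are assumptions, and `build_modelfile` is a hypothetical helper.

```python
def build_modelfile(gguf_path: str) -> str:
    """Build Modelfile text that wraps the user prompt in the Alpaca layout."""
    # Same layout as chat_template.format(prompt, "", "") in Step 9.
    template = (
        "Below is an instruction that describes a task, paired with an input that "
        "provides further context. Write a response that appropriately completes "
        "the request.\n\n"
        "### Instruction:\n{{ .Prompt }}\n\n### Input:\n\n\n### Response:\n"
    )
    return (
        f"FROM {gguf_path}\n"
        f'TEMPLATE """{template}"""\n'
        # Stop strings are an assumption, chosen from the sections seen in Step 9.
        'PARAMETER stop "### Instruction:"\n'
        'PARAMETER stop "### Explanation:"\n'
    )


# ASSUMED location; the log names the file but not where it ends up.
modelfile = build_modelfile("./SmolLM2-135M-Instruct.F16.gguf")
print(modelfile)
```

Saved as `Modelfile`, this text could then be registered with `ollama create <name> -f Modelfile`, where `<name>` is whatever model name you choose.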

## Step 12: Upload to HuggingFace (Optional)

In [13]:
# Uncomment and fill in your HuggingFace username
# model.push_to_hub("your_username/smollm2-135m-alpaca-full-finetuned", token="YOUR_HF_TOKEN")
# tokenizer.push_to_hub("your_username/smollm2-135m-alpaca-full-finetuned", token="YOUR_HF_TOKEN")

print("To upload to HuggingFace, uncomment the code above and add your token")

To upload to HuggingFace, uncomment the code above and add your token


## Summary

### What we accomplished:
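1. Loaded SmolLM2-135M-Instruct (134,515,584 parameters)
2. Configured the model for near-full fine-tuning with rank-256 LoRA adapters on all attention and MLP projections (78,151,680 trainable parameters per the training log)
3. Trained for up to 500 steps at an effective batch size of 16 on the Alpaca-cleaned instruction dataset (51,760 examples)
4. Tested inference with custom prompts
5. Exported to GGUF in two quantizations (`q4_k_m` and `f16`) for llama.cpp and Ollama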

### Key Differences from LoRA:
- **Full Fine-Tuning**: Updates ALL model parameters (135M)
- **LoRA**: Only updates adapter parameters; the count grows linearly with the rank `r` (78,151,680 at the rank 256 used in this notebook)
- **Memory**: Full FT requires more VRAM
- **Speed**: LoRA is faster to train
- **Quality**: Full FT can achieve better results but risks overfitting
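
The memory point can be made concrete with a rough estimate of weights, gradients, and optimizer state. This is a sketch under stated assumptions, not a measurement: 2 bytes per weight and per gradient (16-bit), 8 bytes per trainable parameter for two fp32 Adam moments (the `adamw_8bit` optimizer used in this notebook would need less), and activations ignored entirely. `training_memory_gb` is an illustrative helper.

```python
def training_memory_gb(total_params: int, trainable_params: int,
                       weight_bytes: int = 2, grad_bytes: int = 2,
                       optim_bytes: int = 8) -> float:
    """Rough lower bound in GB: weights + gradients + optimizer state."""
    weights = total_params * weight_bytes
    grads = trainable_params * grad_bytes
    optim = trainable_params * optim_bytes
    return (weights + grads + optim) / 1e9


BASE_PARAMS = 134_515_584          # from the Step 3 output
ADAPTER_R256 = 78_151_680          # from the Step 8 training log
ADAPTER_R16 = ADAPTER_R256 // 16   # LoRA size is linear in the rank

full_ft = training_memory_gb(BASE_PARAMS, BASE_PARAMS)
lora_r256 = training_memory_gb(BASE_PARAMS + ADAPTER_R256, ADAPTER_R256)
lora_r16 = training_memory_gb(BASE_PARAMS + ADAPTER_R16, ADAPTER_R16)

print(f"16-bit base weights: {BASE_PARAMS * 2 / 1e6:.0f} MB")
print(f"full fine-tuning: ~{full_ft:.2f} GB")
print(f"LoRA, rank 256:   ~{lora_r256:.2f} GB")
print(f"LoRA, rank 16:    ~{lora_r16:.2f} GB")
```

For a model this small every option fits comfortably on a T4; the gap matters more as the parameter count grows.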

### Next Steps:
- Compare with LoRA results in Notebook 2
- Try with larger models (Gemma-3-1B)
- Experiment with different learning rates
- Test on domain-specific datasets