# Colab 2: LoRA Fine-tuning with SmolLM2-135M

## Overview
This notebook demonstrates **LoRA (Low-Rank Adaptation)** fine-tuning using Unsloth with SmolLM2-135M.

### What is LoRA?
- **Parameter Efficient**: Trains only a small fraction of the model parameters (about 3.5% in this notebook)
- **Adapter Layers**: Adds small trainable matrices to frozen model
- **Memory Efficient**: Uses much less memory than full fine-tuning
- **Fast**: Trains faster with fewer parameters to update
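
As a rough illustration of the adapter idea above, the following NumPy sketch adds a low-rank update to a frozen weight matrix. The toy shapes, the random initialisation, and the names `W`, `A`, and `B` are illustrative assumptions; they are not taken from SmolLM2 or from Unsloth's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen only for illustration (not SmolLM2's real dimensions)
d_in, d_out, r, alpha = 64, 64, 4, 4

W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight, never updated
A = rng.normal(size=(r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))            # trainable up-projection, zero-initialised


def lora_forward(x):
    """Frozen path plus the scaled low-rank update: W x + (alpha / r) * B (A x)."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=d_in)

# With B initialised to zero, the adapter starts out as a no-op.
starts_as_noop = np.allclose(lora_forward(x), W @ x)

full_params = W.size              # weights in the frozen matrix
adapter_params = A.size + B.size  # weights that would actually be trained
print(starts_as_noop, full_params, adapter_params)
```

Only `A` and `B` would receive gradients during training, which is where the parameter and memory savings come from.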

### LoRA vs Full Fine-tuning:
| Aspect | Full Fine-tuning | LoRA |
|--------|-----------------|------|
| Parameters Updated | 100% (~135M) | ~3.5% (~4.9M) |
| Memory Usage | High | Low |
| Training Speed | Slower | Faster |
| Deployment | Full model | Base model + adapter |

### Same Dataset as Colab 1
We'll use the same Alpaca dataset for direct comparison.

In [1]:
%%capture
# Install Unsloth (the %%capture cell magic must be the first line of the cell)
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [2]:
# Import libraries
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Step 1: Load SmolLM2-135M Model

Same model as Colab 1, but we'll configure it differently for LoRA.

In [3]:
# Model configuration
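max_seq_length = 2048  # Maximum sequence length used for training
dtype = None           # None lets Unsloth auto-detect (float16 on a T4, bfloat16 on newer GPUs)
load_in_4bit = True    # Load the base weights with 4-bit quantization to save memory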

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/SmolLM2-135M-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print(f"✓ Base model loaded: {model.__class__.__name__}")
print("✓ Total parameters: ~135M")

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


✓ Base model loaded: LlamaForCausalLM
✓ Total parameters: ~135M


## Step 2: Configure LoRA

### LoRA Parameters Explained:
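- **r (rank)**: Inner dimension of the two LoRA matrices. For a layer whose weight has shape `d_out x d_in`, LoRA adds `A` (`r x d_in`) and `B` (`d_out x r`)
  - Higher r = more capacity but more parameters
  - Typical values: 8, 16, 32, 64

- **lora_alpha**: Scaling factor; the update is scaled by `lora_alpha / r` (commonly set to r or 2x r)
  - Controls impact of LoRA updates

- **lora_dropout**: Dropout applied inside the adapter for regularization
  - Can help reduce overfitting

- **target_modules**: Which layers to adapt
  - q_proj, k_proj, v_proj = attention query, key, and value projections
  - o_proj = attention output projection
  - gate_proj, up_proj, down_proj = MLP layers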
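
To make the effect of `r` and `target_modules` concrete, here is a small sketch that counts the adapter parameters for this setup. The layer dimensions are assumptions taken from the published SmolLM2-135M config (hidden size 576, MLP width 1536, 3 key/value heads of size 64, 30 layers); check them against `model.config` before relying on them. Under those assumptions the result matches the trainable-parameter count that Unsloth reports in this notebook.

```python
# Assumed SmolLM2-135M dimensions (verify against model.config)
HIDDEN = 576         # hidden_size
INTERMEDIATE = 1536  # intermediate_size (MLP width)
KV_DIM = 192         # num_key_value_heads (3) * head_dim (64)
NUM_LAYERS = 30
R = 16               # LoRA rank used in this notebook

# (in_features, out_features) of every adapted projection in one decoder layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """A has shape (r, in_features) and B has shape (out_features, r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
total_adapter_params = per_layer * NUM_LAYERS
print(f"{total_adapter_params:,}")  # 4,884,480
```

Because the count is linear in `r`, doubling the rank doubles the number of trainable parameters.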

In [4]:
# Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (size of adapter matrices)
    target_modules=[
        "q_proj",    # Query projection
        "k_proj",    # Key projection
        "v_proj",    # Value projection
        "o_proj",    # Output projection
        "gate_proj", # MLP gate
        "up_proj",   # MLP up
        "down_proj", # MLP down
    ],
    lora_alpha=16,  # Scaling factor; the update is scaled by lora_alpha / r (= 1.0 here)
    lora_dropout=0,  # No dropout (0 is the optimized setting in Unsloth)
    bias="none",
    use_gradient_checkpointing="unsloth",  # Memory efficient
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Count trainable parameters
# Note: with load_in_4bit=True the quantized linear weights are stored packed,
# so total_params undercounts the real model size and the percentage below is
# inflated. Unsloth's training banner reports the unpacked total.
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
trainable_percent = 100 * trainable_params / total_params

print("\n" + "="*60)
print("LoRA Configuration:")
print("="*60)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Trainable %: {trainable_percent:.2f}%")
# This is the share of counted parameters that stay frozen, not a measured memory saving
print(f"\nFrozen parameters: ~{100-trainable_percent:.1f}% of the counted total")
print("="*60)

Unsloth 2025.11.2 patched 30 layers with 30 QKV layers, 30 O layers and 30 MLP layers.



LoRA Configuration:
Total parameters: 86,315,904
Trainable parameters: 4,884,480
Trainable %: 5.66%

Frozen parameters: ~94.3% of the counted total
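
The 5.66% printed above differs from the 3.50% that Unsloth reports when training starts. A likely explanation is how 4-bit weights are counted: the sketch below reproduces both figures under the assumption that the quantized linear weights are stored with two 4-bit values per element, so each contributes half its true size to `p.numel()`. The layer dimensions are the same assumed SmolLM2-135M config values (hidden size 576, MLP width 1536, key/value width 192, 30 layers).

```python
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 576, 1536, 192, 30  # assumed config

# True (unpacked) weight count of the quantized linear layers in one block
linear_per_layer = (
    2 * HIDDEN * HIDDEN            # q_proj, o_proj
    + 2 * HIDDEN * KV_DIM          # k_proj, v_proj
    + 3 * HIDDEN * INTERMEDIATE    # gate_proj, up_proj, down_proj
)
quantized_weights = linear_per_layer * NUM_LAYERS

true_total = 139_400_064  # total reported by Unsloth's training banner
trainable = 4_884_480     # LoRA parameters

# Assumption: packed 4-bit storage halves the element count of each quantized weight
counted_total = true_total - quantized_weights // 2

pct_of_counted = f"{100 * trainable / counted_total:.2f}"
pct_of_true = f"{100 * trainable / true_total:.2f}"
print(f"{counted_total:,}")  # 86,315,904
print(pct_of_counted)        # 5.66
print(pct_of_true)           # 3.50
```

When comparing against full fine-tuning, the 3.50% figure is the more meaningful one because it is relative to the real parameter count.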


## Step 3: Load Same Dataset

Using identical dataset as Colab 1 for fair comparison.

In [5]:
# Load Alpaca dataset (100 samples)
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:100]")

print(f"Dataset: {len(dataset)} examples")
print("\nSample:")
print(f"Instruction: {dataset[0]['instruction']}")
print(f"Output: {dataset[0]['output'][:100]}...")

Dataset: 100 examples

Sample:
Instruction: Give three tips for staying healthy.
Output: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and...


In [6]:
# Same chat template as Colab 1
chat_template = """<|im_start|>user
{}<|im_end|>
<|im_start|>assistant
{}<|im_end|>"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []

    for instruction, input_text, output in zip(instructions, inputs, outputs):
        user_message = instruction
        if input_text:
            user_message += f"\n\n{input_text}"

        text = chat_template.format(user_message, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

print("\n✓ Dataset formatted with chat template")

✓ Dataset formatted with chat template
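
To see what a single formatted training example looks like, the sketch below applies the same template to one made-up record without needing the tokenizer. The record contents are hypothetical, and `EOS_TOKEN` is hard-coded to `<|im_end|>` as an assumption; in the notebook it comes from `tokenizer.eos_token`.

```python
chat_template = """<|im_start|>user
{}<|im_end|>
<|im_start|>assistant
{}<|im_end|>"""

# Assumption: stands in for tokenizer.eos_token
EOS_TOKEN = "<|im_end|>"


def format_example(instruction, input_text, output):
    """Build one training string the same way formatting_prompts_func does."""
    user_message = instruction
    if input_text:
        user_message += f"\n\n{input_text}"
    return chat_template.format(user_message, output) + EOS_TOKEN


# Hypothetical record in the Alpaca column layout
text = format_example("Translate to French.", "Good morning", "Bonjour")
print(text)
```

If the tokenizer's EOS token really is `<|im_end|>`, each example ends with two consecutive end markers, because the template already closes the assistant turn. Printing `tokenizer.eos_token` is a quick way to check whether the extra `+ EOS_TOKEN` is needed.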


## Step 4: Train with LoRA

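Training is faster than full fine-tuning because:
- Fewer parameters to update (~4.9M adapter weights vs ~135M)
- Less memory needed, since gradients and optimizer state are kept only for the adapter weights
- No weight gradients or optimizer updates are computed for the frozen base weights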
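
A back-of-the-envelope estimate shows why the memory difference is large. The sketch assumes fp32 gradients and two fp32 Adam moment buffers per trainable parameter; the notebook itself uses `adamw_8bit`, which shrinks the optimizer state further, so treat these as rough upper bounds rather than measurements. The base parameter count is derived from the totals Unsloth reports in this notebook.

```python
def training_state_mib(trainable_params, bytes_per_value=4, adam_states=2):
    """Memory for gradients plus Adam moment estimates, in MiB."""
    values_per_param = 1 + adam_states  # one gradient + two Adam moments
    return trainable_params * values_per_param * bytes_per_value / 2**20


base_params = 139_400_064 - 4_884_480  # reported total minus LoRA parameters
lora_params = 4_884_480

full_ft_mib = training_state_mib(base_params)
lora_mib = training_state_mib(lora_params)

print(f"Full fine-tuning: ~{full_ft_mib:.0f} MiB of gradient/optimizer state")
print(f"LoRA:             ~{lora_mib:.0f} MiB of gradient/optimizer state")
```

Activations and the model weights themselves are not included here; they are needed in both setups.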

In [7]:
# Training configuration
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=10,  # Quick demo
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs_lora",
    ),
)

print("\n🚀 Starting LoRA training...\n")

Unsloth: We found double BOS tokens - we shall remove one automatically.


🚀 Starting LoRA training...
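
The training arguments above imply an effective batch size and a learning-rate curve that are easy to work out by hand. The sketch below does so in plain Python; the schedule function follows my understanding of a linear schedule with warmup (linear ramp from 0, then linear decay to 0) and is an illustration, not the trainer's own code.

```python
per_device_batch, grad_accum, num_gpus = 2, 4, 1
max_steps, warmup_steps, base_lr = 10, 5, 2e-4
dataset_size = 100

# Examples consumed per optimizer step, and over the whole run
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps


def linear_schedule_lr(step):
    """Learning rate at optimizer step `step` (0-indexed)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))


schedule = [linear_schedule_lr(s) for s in range(max_steps)]

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen: {examples_seen} of {dataset_size}")
for step, lr in enumerate(schedule):
    print(f"step {step}: lr = {lr:.1e}")
```

With 10 steps of 8 examples each, the run covers 80 of the 100 examples, and half of the steps are spent in warmup, so this configuration is a quick demo rather than a full training run.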



In [8]:
# Train
trainer_stats = trainer.train()

print("\n" + "="*60)
print("✓ LoRA Training Completed!")
print("="*60)
# training_loss is the mean loss over all training steps, not the last step's loss
print(f"Mean training loss: {trainer_stats.training_loss:.4f}")
print(f"Steps: {trainer_stats.global_step}")
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f}s")
print("="*60)

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100 | Num Epochs = 1 | Total steps = 10
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 4,884,480 of 139,400,064 (3.50% trained)

| Step | Training Loss |
|------|---------------|
| 1 | 1.989 |
| 2 | 2.0233 |
| 3 | 1.8221 |
| 4 | 1.978 |
| 5 | 1.6593 |
| 6 | 1.8458 |
| 7 | 1.934 |
| 8 | 1.6946 |
| 9 | 2.1092 |
| 10 | 2.0364 |



✓ LoRA Training Completed!
Mean training loss: 1.9092
Steps: 10
Training time: 55.13s
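
The loss value reported in the summary (1.9092) is the mean of the ten per-step losses logged above, not the loss at step 10. The short check below recomputes it from the logged values.

```python
from statistics import mean

# Per-step training losses copied from the log above
step_losses = [1.989, 2.0233, 1.8221, 1.978, 1.6593,
               1.8458, 1.934, 1.6946, 2.1092, 2.0364]

mean_loss = mean(step_losses)
last_step_loss = step_losses[-1]

print(f"Mean loss over {len(step_losses)} steps: {mean_loss:.4f}")
print(f"Loss at the last step: {last_step_loss:.4f}")
```

Over only 10 steps the per-step losses fluctuate without a clear downward trend, so this run shows that the pipeline works rather than that the model has improved.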


## Step 5: Test LoRA Model

At inference time, the frozen base model plus the LoRA adapters together act as the fine-tuned model.
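
The test cell wraps each prompt in the chat template and then cuts the assistant's reply out of the decoded text. The sketch below isolates those two string operations so they can be tried without a model; the decoded strings are hypothetical stand-ins for `tokenizer.decode(...)` output.

```python
def build_prompt(user_message):
    """Wrap a user message in the chat template, leaving the assistant turn open."""
    return f"<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant\n"


def extract_assistant_response(decoded):
    """Return the text between the last assistant marker and the next end marker."""
    return decoded.split("<|im_start|>assistant")[-1].split("<|im_end|>")[0].strip()


prompt = build_prompt("What is Python?")

# Hypothetical decoded outputs: one finished reply, one cut off by max_new_tokens
finished = prompt + "Python is a programming language.<|im_end|>"
truncated = prompt + "Python is a programming"

print(extract_assistant_response(finished))
print(extract_assistant_response(truncated))
```

When generation stops at `max_new_tokens` before the model emits `<|im_end|>`, the extraction still works and simply returns the partial reply.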

In [9]:
# Enable fast inference
FastLanguageModel.for_inference(model)

# Test prompts
test_prompts = [
    "What is Python?",
    "Write a function to calculate factorial.",
    "Explain LoRA in simple terms."
]

print("Testing LoRA fine-tuned model:\n")
print("="*60)

for prompt in test_prompts:
    formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

    inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=False)
    assistant_response = response.split('<|im_start|>assistant')[-1].split('<|im_end|>')[0].strip()

    print(f"\n📝 Prompt: {prompt}")
    print(f"🤖 Response: {assistant_response}")
    print("-"*60)

Testing LoRA fine-tuned model:


📝 Prompt: What is Python?
🤖 Response: Python is a popular programming language that is used for developing applications and data analysis, machine learning, web development, and more. It's one of the most widely used programming languages in the world, and it's used by a wide range of industries and sectors, including but not limited to:

- Web development: It's used to create websites, web applications, and other web content, and to develop applications and tools for various industries.
- Data analysis: It's used to analyze and
------------------------------------------------------------

📝 Prompt: Write a function to calculate factorial.
🤖 Response: To calculate factorial, we can use the recursive function in Python. This function takes an integer `n` as input and returns the result of `n` factorial using recursion.

```python
def factorial(n):
    # Base case
    if n == 0:
        return 1
    # Recursive case
    elif n == 1:
        re
```

[output truncated]

## Step 6: Save LoRA Adapters

LoRA adapters are small (tens of MB here) compared with the full model (hundreds of MB).

In [10]:
# Save LoRA adapters only
model.save_pretrained("smollm2_135m_lora")
tokenizer.save_pretrained("smollm2_135m_lora")

print("✓ LoRA adapters saved to: smollm2_135m_lora/")
print("\nTo use later:")
print("1. Load base model")
print("2. Load LoRA adapters from this folder")
print("3. Merge or use together")

# Check file sizes (this sums every file in the folder, including the tokenizer files saved above)
import os
adapter_size = sum(
    os.path.getsize(os.path.join("smollm2_135m_lora", f))
    for f in os.listdir("smollm2_135m_lora")
    if os.path.isfile(os.path.join("smollm2_135m_lora", f))
) / (1024 * 1024)  # Convert to MB

print(f"\nAdapter size: ~{adapter_size:.1f} MB")
print(f"Full model would be: ~500 MB")
print(f"Size savings: ~{(1 - adapter_size/500)*100:.1f}%")

✓ LoRA adapters saved to: smollm2_135m_lora/

To use later:
1. Load base model
2. Load LoRA adapters from this folder
3. Merge or use together

Adapter size: ~23.3 MB
Full model would be: ~500 MB
Size savings: ~95.3%
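
The "to use later" steps can be sketched as a small helper. This assumes, as in Unsloth's own example notebooks, that passing the adapter folder as `model_name` lets `FastLanguageModel.from_pretrained` resolve the base model from the saved adapter config; the function name `load_lora_model` is hypothetical, and the code needs a GPU runtime with Unsloth installed to actually run.

```python
def load_lora_model(adapter_dir="smollm2_135m_lora", max_seq_length=2048):
    """Reload the base model together with the saved LoRA adapters."""
    # Imported lazily so the helper can be defined without Unsloth installed
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_dir,        # folder written by model.save_pretrained(...)
        max_seq_length=max_seq_length,
        dtype=None,                    # auto-detect, as in Step 1
        load_in_4bit=True,             # match the setting used for training
    )
    FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode
    return model, tokenizer
```

A typical call would be `model, tokenizer = load_lora_model()`, after which the generation loop from Step 5 can be reused unchanged.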


## Step 7: Merge LoRA to Base Model (Optional)

You can merge LoRA adapters into the base model for deployment.

In [11]:
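# Merge LoRA weights into base model
# Note: the base model was loaded in 4-bit, and merging into quantized weights
# can introduce small rounding errors compared with running base + adapter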
model_merged = model.merge_and_unload()

# Save merged model
model_merged.save_pretrained("smollm2_135m_merged")
tokenizer.save_pretrained("smollm2_135m_merged")

print("✓ Merged model saved to: smollm2_135m_merged/")
print("\nThis is a standalone model (no adapters needed)")



✓ Merged model saved to: smollm2_135m_merged/

This is a standalone model (no adapters needed)
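
Conceptually, merging folds the low-rank update into the base weight so that a single matrix gives the same output as base plus adapter. The NumPy sketch below checks this on toy matrices; the shapes and random values are illustrative assumptions, and it works in full precision, so it does not model the rounding that can occur when the base weights are quantized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen only for illustration
d_in, d_out, r, alpha = 64, 64, 4, 4
scale = alpha / r

W = rng.normal(size=(d_out, d_in))  # frozen base weight
A = rng.normal(size=(r, d_in))      # trained adapter matrices (random stand-ins)
B = rng.normal(size=(d_out, r))

# Merge: fold the scaled low-rank product into the base weight
W_merged = W + scale * (B @ A)

x = rng.normal(size=d_in)
adapter_output = W @ x + scale * (B @ (A @ x))  # base + adapter, kept separate
merged_output = W_merged @ x                    # single merged matrix

outputs_match = np.allclose(adapter_output, merged_output)
print(outputs_match)
```

After merging, the adapter matrices are no longer needed at inference time, at the cost of storing a full copy of the model for every fine-tuned variant.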


## Summary: LoRA vs Full Fine-tuning

### Comparison with Colab 1:

| Metric | Full Fine-tuning | LoRA |
|--------|------------------|------|
| Trainable Params | ~135M (100%) | ~4.9M (~3.5%) |
| Memory Usage | High | Low |
| Training Speed | Slower | Faster |
| Model Size | ~500 MB | Base + ~23 MB adapter folder |
| Quality | Typically highest | Often close to full fine-tuning (not measured here) |
| Use Case | Small models | Large models |

### LoRA Advantages:
✅ **Memory Efficient**: Train large models on consumer GPUs  
✅ **Fast Training**: Update fewer parameters  
✅ **Small Adapters**: Easy to store/share multiple versions  
✅ **Modular**: Swap adapters for different tasks  

### LoRA Hyperparameters:
```python
r=16              # Rank (8, 16, 32, 64)
lora_alpha=16     # Scaling (typically = r)
lora_dropout=0    # Dropout rate
target_modules=[  # Which layers to adapt
    "q_proj", "k_proj", "v_proj",  # Attention
    "o_proj",                       # Output
    "gate_proj", "up_proj", "down_proj"  # MLP
]
```
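
How `r` and `lora_alpha` interact is easiest to see from the scale applied to the adapter update. The helper below is a sketch based on the standard LoRA scaling (`lora_alpha / r`) and the rank-stabilized variant (`lora_alpha / sqrt(r)`) that the `use_rslora` flag refers to; the function name `lora_scale` is hypothetical.

```python
import math


def lora_scale(lora_alpha, r, use_rslora=False):
    """Multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA scaling
    return lora_alpha / r                 # standard LoRA scaling


for rank in (8, 16, 32, 64):
    print(f"r={rank:>2}: standard={lora_scale(16, rank):.3f}, "
          f"rslora={lora_scale(16, rank, use_rslora=True):.3f}")
```

With `lora_alpha` fixed, raising `r` shrinks the standard scale, which is one reason `lora_alpha` is often increased together with the rank.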

### Chat Template:
```
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
{answer}<|im_end|>
```

### When to Use LoRA:
- 🎯 Training large models (7B+)
- 💾 Limited GPU memory
- ⚡ Need fast iteration
- 📦 Multiple task-specific versions

### When to Use Full Fine-tuning:
- 🔬 Small models (< 1B)
- 💪 Plenty of compute
- 🎯 Maximum quality needed
- 📊 Large datasets available