<a href="https://colab.research.google.com/github/michael-borck/loco-llm/blob/main/notebooks/train_math_adapter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LocoLLM: Train a Math Adapter on Colab T4

This notebook trains a QLoRA math adapter on **Qwen3-4B** using [Unsloth](https://github.com/unslothai/unsloth), then exports a merged **Q4_K_M GGUF** you can download and run locally via Ollama.

**What you need:** A free Google Colab account with a T4 GPU runtime (16GB VRAM).

**What you get:** A ~2.5GB GGUF file that is a complete, standalone math-specialized model.

**Time:** ~20-30 minutes end to end (including install).

---

## How to use this notebook

1. Open in Colab (click the badge above or use File > Open in Colab)
2. Go to **Runtime > Change runtime type** and select **T4 GPU**
3. Run all cells in order (**Runtime > Run all**)
4. Download the GGUF file from the last cell
5. Load it into Ollama locally with `ollama create locollm-math -f Modelfile` (Step 10 shows how to create the `Modelfile`)

## Step 0: Verify GPU

Make sure you have a T4 (or better) GPU assigned.

In [None]:
!nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
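To make this check fail fast instead of relying on reading the output, the CSV line can be parsed in Python. This is a minimal sketch, assuming the command prints a line such as `Tesla T4, 15360 MiB`; the helper names `parse_gpu_line` and `check_gpu` and the 14,000 MiB threshold are illustrative choices, not part of the notebook.

```python
import subprocess


def parse_gpu_line(line: str) -> tuple[str, int]:
    """Split one 'name, memory' CSV line into (name, memory in MiB)."""
    name, memory = (part.strip() for part in line.rsplit(",", 1))
    return name, int(memory.split()[0])


# Hypothetical threshold: a T4 reports roughly 15,000 MiB of total memory.
MIN_MEMORY_MIB = 14_000


def check_gpu() -> None:
    """Query nvidia-smi and raise if the first GPU looks too small."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    name, mib = parse_gpu_line(result.stdout.strip().splitlines()[0])
    if mib < MIN_MEMORY_MIB:
        raise RuntimeError(f"{name} has only {mib} MiB; select a T4 or better")
    print(f"OK: {name} ({mib} MiB)")


# In Colab, call check_gpu() after selecting the GPU runtime.
```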

## Step 1: Install Unsloth

This takes 2-3 minutes. Unsloth pulls in the right versions of transformers, peft, trl, etc.

In [None]:
!pip install --upgrade --no-cache-dir unsloth unsloth_zoo 2>&1 | tail -5

## Step 2: Prepare Training Data (GSM8K)

We download 200 examples from [GSM8K](https://huggingface.co/datasets/openai/gsm8k) and format them for Qwen3's chat template. Each example has step-by-step reasoning ending with "The answer is N".

200 examples is enough for a proof-of-concept. For a production adapter, use 500-1000+.

In [None]:
import re

import requests

NUM_EXAMPLES = 200

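# Download from the Hugging Face datasets-server API.
# The /rows endpoint caps `length` at 100 rows per request, so fetch in pages.
PAGE_SIZE = 100
rows = []
for offset in range(0, NUM_EXAMPLES, PAGE_SIZE):
    length = min(PAGE_SIZE, NUM_EXAMPLES - offset)
    url = (
        "https://datasets-server.huggingface.co/rows"
        f"?dataset=openai/gsm8k&config=main&split=train&offset={offset}&length={length}"
    )
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    rows.extend(r["row"] for r in resp.json()["rows"])
print(f"Downloaded {len(rows)} examples from GSM8K")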


def format_answer(answer_text: str) -> str:
    """Convert GSM8K answer format to clean step-by-step reasoning."""
    parts = answer_text.split("####")
    reasoning = re.sub(r"<<.*?>>", "", parts[0]).strip()
    final_answer = parts[1].strip() if len(parts) > 1 else ""
    return f"{reasoning}\nThe answer is {final_answer}"


# Format for Qwen3 chat template
training_data = []
for ex in rows:
    training_data.append(
        {
            "conversations": [
                {"role": "user", "content": ex["question"].strip()},
                {"role": "assistant", "content": format_answer(ex["answer"])},
            ]
        }
    )

print(f"Formatted {len(training_data)} training examples")
print(f"\nSample question: {training_data[0]['conversations'][0]['content'][:100]}...")
print(f"Sample answer:   {training_data[0]['conversations'][1]['content'][:100]}...")
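To see what `format_answer` does to a single record, here is a small worked example. The function is repeated so the snippet runs on its own, and the input string is made up in GSM8K's raw style (calculator annotations in `<<...>>`, final answer after `####`); it is not an actual dataset row.

```python
import re


def format_answer(answer_text: str) -> str:
    """Same logic as the notebook cell above, repeated so this runs alone."""
    parts = answer_text.split("####")
    reasoning = re.sub(r"<<.*?>>", "", parts[0]).strip()
    final_answer = parts[1].strip() if len(parts) > 1 else ""
    return f"{reasoning}\nThe answer is {final_answer}"


# Hypothetical raw answer in GSM8K style.
raw = (
    "She sells 16 - 3 = <<16-3=13>>13 eggs.\n"
    "At $2 each she makes 13 * 2 = $<<13*2=26>>26.\n"
    "#### 26"
)
formatted = format_answer(raw)
print(formatted)
# She sells 16 - 3 = 13 eggs.
# At $2 each she makes 13 * 2 = $26.
# The answer is 26
```

Note that a record without a `####` marker would end in a bare "The answer is "; filtering such rows out before training may be worthwhile.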

## Step 3: Load Qwen3-4B with Unsloth

Unsloth loads the model in 4-bit quantization. On a T4 this uses ~5GB VRAM, leaving plenty of room for training.

In [None]:
from unsloth import FastModel

MODEL_NAME = "unsloth/Qwen3-4B-unsloth-bnb-4bit"
MAX_SEQ_LENGTH = 1024

model, tokenizer = FastModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=True,
    full_finetuning=False,
)

print(f"Model loaded: {MODEL_NAME}")

## Step 4: Apply LoRA Adapters

We apply LoRA to the attention layers (q/k/v/o projections). Rank 16 is a reasonable middle ground for math reasoning: enough capacity to learn patterns, while keeping the number of trainable parameters small enough to limit overfitting on 200 examples.

In [None]:
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05

model = FastModel.get_peft_model(
    model,
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
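The trainable-parameter count printed above can be cross-checked by hand: LoRA adds two matrices of shape `r x d_in` and `d_out x r` next to each targeted linear layer, so each layer contributes `r * (d_in + d_out)` parameters. The sketch below applies that formula; the model dimensions are assumptions about Qwen3-4B and should be confirmed against `model.config` before relying on the totals.

```python
def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """Parameters LoRA adds to one linear layer: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


# ASSUMED Qwen3-4B dimensions (not taken from the notebook); verify with model.config.
HIDDEN = 2560
Q_DIM = 32 * 128   # query heads x head dimension
KV_DIM = 8 * 128   # key/value heads x head dimension
LAYERS = 36
RANK = 16

per_layer = (
    lora_params(RANK, HIDDEN, Q_DIM)     # q_proj
    + lora_params(RANK, HIDDEN, KV_DIM)  # k_proj
    + lora_params(RANK, HIDDEN, KV_DIM)  # v_proj
    + lora_params(RANK, Q_DIM, HIDDEN)   # o_proj
)
total_lora = per_layer * LAYERS
print(f"{per_layer:,} LoRA parameters per layer, {total_lora:,} in total")
```

If the number reported by the notebook differs, the assumed dimensions are the first thing to check.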

## Step 5: Prepare Dataset

Convert our training data to a HuggingFace Dataset and apply the chat template.

In [None]:
from datasets import Dataset

dataset = Dataset.from_list(training_data)


def format_conversations(examples):
    """Apply Qwen3 chat template to conversations."""
    texts = []
    for convos in examples["conversations"]:
        text = tokenizer.apply_chat_template(
            convos,
            tokenize=False,
            add_generation_prompt=False,
        )
        texts.append(text)
    return {"text": texts}


dataset = dataset.map(format_conversations, batched=True)
print(f"Dataset ready: {len(dataset)} examples")
print(f"\nFormatted sample (first 200 chars):\n{dataset[0]['text'][:200]}...")
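For intuition about what the chat template produces, the sketch below renders a conversation in a simplified ChatML-style layout, which is the general format Qwen chat models use. It is an approximation only: the real Qwen3 template may add further tokens (for example around reasoning blocks), so `tokenizer.apply_chat_template` remains the authoritative source. The helper name `to_chatml` is hypothetical.

```python
def to_chatml(conversation: list[dict]) -> str:
    """Render messages in a simplified ChatML layout (approximation only)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in conversation
    )


example = [
    {"role": "user", "content": "What is 2 + 3?"},
    {"role": "assistant", "content": "2 + 3 = 5\nThe answer is 5"},
]
rendered = to_chatml(example)
print(rendered)
```

Comparing this against the formatted sample printed by the cell above is a quick way to spot anything unexpected in the training text.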

## Step 6: Train

Fine-tune for 3 epochs with SFTTrainer. On a T4, this takes ~10-15 minutes for 200 examples.

Watch the training loss — it should decrease steadily. If it plateaus early, the model has learned what it can from this data.

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

NUM_EPOCHS = 3
BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 4  # effective batch size = 8
LEARNING_RATE = 2e-4

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./checkpoints",
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION,
        num_train_epochs=NUM_EPOCHS,
        learning_rate=LEARNING_RATE,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        fp16=True,
        logging_steps=5,
        save_strategy="epoch",
        seed=42,
        optim="adamw_8bit",
        report_to="none",
    ),
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    packing=False,
)

print(f"Training: {NUM_EPOCHS} epochs, effective batch size {BATCH_SIZE * GRADIENT_ACCUMULATION}")
trainer.train()
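It helps to know how many optimizer steps to expect before reading the loss log. The arithmetic below uses the hyperparameters from the cell above; the exact rounding of warmup steps can differ between library versions, so treat the warmup figure as approximate.

```python
import math

NUM_EXAMPLES = 200
BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 4
NUM_EPOCHS = 3
WARMUP_RATIO = 0.1
LOGGING_STEPS = 5

effective_batch = BATCH_SIZE * GRADIENT_ACCUMULATION          # 8
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)   # 25
total_steps = steps_per_epoch * NUM_EPOCHS                    # 75
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)          # 8
log_lines = total_steps // LOGGING_STEPS                      # 15

print(f"{total_steps} optimizer steps, ~{warmup_steps} warmup, {log_lines} loss log lines")
```

With only about 15 logged loss values, a single noisy reading is not a reason for concern; look at the overall trend.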

## Step 7: Quick Sanity Check

Before exporting, test the fine-tuned model on a couple of math problems to make sure it's producing reasonable output.

In [None]:
from unsloth import FastModel

# Switch to inference mode
FastModel.for_inference(model)

test_questions = [
    "What is 15 + 27?",
    (
        "A store has 120 apples. They sell 45 in the morning and 30 in the"
        " afternoon. How many apples are left?"
    ),
    "If a shirt costs $40 and is on sale for 25% off, what is the sale price?",
]

for q in test_questions:
    messages = [{"role": "user", "content": q}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Greedy decoding: use do_sample=False (temperature=0.0 is not a valid sampling value)
    outputs = model.generate(inputs, max_new_tokens=256, do_sample=False, use_cache=True)
    response = tokenizer.decode(outputs[0][inputs.shape[-1] :], skip_special_tokens=True)
    print(f"Q: {q}")
    print(f"A: {response}")
    print("-" * 60)
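Because the training data ends every answer with "The answer is N", the sanity check can also be automated by extracting that number and comparing it with the expected value. This is a minimal sketch; the helper `extract_answer` and the sample responses are hypothetical, and the regular expression only covers plain integers and decimals with optional `$` and thousands separators.

```python
import re
from typing import Optional

ANSWER_RE = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")


def extract_answer(response: str) -> Optional[float]:
    """Return the last 'The answer is N' number in a response, or None."""
    matches = ANSWER_RE.findall(response)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


# Hypothetical model responses paired with the expected answers
# for the three test questions above.
samples = [
    ("15 + 27 = 42\nThe answer is 42", 42.0),
    ("120 - 45 - 30 = 45\nThe answer is 45", 45.0),
    ("25% of 40 is 10, so 40 - 10 = 30\nThe answer is $30.", 30.0),
]
for text, expected in samples:
    got = extract_answer(text)
    print(f"expected={expected} got={got} ok={got == expected}")
```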

## Step 8: Export Merged GGUF

Merge the LoRA weights into the base model and export as a Q4_K_M GGUF. This creates a standalone model file (~2.5GB) that Ollama can load directly.

**Why merge?** Ollama doesn't support Qwen3 LoRA adapters via the `ADAPTER` directive (only Llama/Mistral/Gemma). Merging produces a complete, self-contained model.

In [None]:
import os

OUTPUT_DIR = "./gguf_output"
QUANTIZATION_METHOD = "q4_k_m"

os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Exporting merged GGUF ({QUANTIZATION_METHOD})...")
print("This takes a few minutes — merging weights and quantizing.")

model.save_pretrained_gguf(
    OUTPUT_DIR,
    tokenizer,
    quantization_method=QUANTIZATION_METHOD,
)

# Show the exported file
for f in os.listdir(OUTPUT_DIR):
    if f.endswith(".gguf"):
        size_mb = os.path.getsize(os.path.join(OUTPUT_DIR, f)) / (1024 * 1024)
        print(f"\nExported: {f} ({size_mb:.0f} MB)")
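A cheap way to confirm that the export produced a real GGUF file, rather than a truncated or placeholder file, is to check the header: GGUF files begin with the four ASCII bytes `GGUF`. The sketch below demonstrates the check on throwaway files so it runs without a model; the helper name `looks_like_gguf` is hypothetical.

```python
import os
import tempfile


def looks_like_gguf(path: str) -> bool:
    """True if the file starts with the 4-byte GGUF magic."""
    with open(path, "rb") as fh:
        return fh.read(4) == b"GGUF"


# Demo on temporary files; in the notebook, pass the exported file's path instead.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.gguf")
    bad = os.path.join(tmp, "bad.gguf")
    with open(good, "wb") as fh:
        fh.write(b"GGUF" + b"\x03\x00\x00\x00")  # magic followed by a version field
    with open(bad, "wb") as fh:
        fh.write(b"<html>")
    good_ok = looks_like_gguf(good)
    bad_ok = looks_like_gguf(bad)

print(good_ok, bad_ok)  # True False
```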

## Step 9: Save to Google Drive (Optional)

Save the GGUF to your Google Drive so it persists after the Colab session ends. You can also download it directly from the next cell.

In [None]:
import shutil

from google.colab import drive

drive.mount("/content/drive")

DRIVE_DIR = "/content/drive/MyDrive/locollm-adapters/math"
os.makedirs(DRIVE_DIR, exist_ok=True)

for f in os.listdir(OUTPUT_DIR):
    if f.endswith(".gguf"):
        src = os.path.join(OUTPUT_DIR, f)
        dst = os.path.join(DRIVE_DIR, f)
        print(f"Copying {f} to Google Drive...")
        shutil.copy2(src, dst)
        print(f"Saved to: {dst}")
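For a file of this size it can be worth confirming that the copy on Drive matches the original. One option is to compare SHA-256 digests, as sketched below on small temporary files; the helper `sha256_of` is hypothetical, and in the notebook it would be called on `src` and `dst` from the loop above.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large models do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with a small stand-in file instead of a multi-gigabyte model.
with tempfile.TemporaryDirectory() as tmp:
    src_demo = os.path.join(tmp, "model.gguf")
    dst_demo = os.path.join(tmp, "copy.gguf")
    with open(src_demo, "wb") as fh:
        fh.write(b"GGUF" + os.urandom(1024))
    shutil.copy2(src_demo, dst_demo)
    copy_matches = sha256_of(src_demo) == sha256_of(dst_demo)

print(copy_matches)  # True
```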

## Step 10: Download the GGUF

Download the GGUF file to your local machine. Then load it into Ollama, replacing `unsloth.Q4_K_M.gguf` below with the filename printed in Step 8 if it differs:

```bash
# On your local machine:
cd loco-llm
mkdir -p adapters/math/gguf
mv ~/Downloads/unsloth.Q4_K_M.gguf adapters/math/gguf/

# Create Ollama model
echo 'FROM ./adapters/math/gguf/unsloth.Q4_K_M.gguf' > adapters/math/Modelfile
ollama create locollm-math -f adapters/math/Modelfile

# Test it
ollama run locollm-math "What is 15 + 27?"

# Or use the LocoLLM CLI
uv run loco setup
uv run loco eval math
```

In [None]:
from google.colab import files

for f in os.listdir(OUTPUT_DIR):
    if f.endswith(".gguf"):
        print(f"Downloading {f}...")
        files.download(os.path.join(OUTPUT_DIR, f))
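Once the model has been registered with `ollama create` as in the bash steps above, it can also be queried from Python through Ollama's local HTTP API. This is a sketch under the assumption that Ollama is running on its default port (11434) and that the model was named `locollm-math`; the helper names `build_request` and `ask` are hypothetical.

```python
import json
import urllib.request

# Assumes Ollama's default local endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "locollm-math") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to a running Ollama server and return its text reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]


# Run on your local machine while Ollama is running:
# print(ask("What is 15 + 27?"))
```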

---

## Training Summary

| Parameter | Value |
|-----------|-------|
| Base model | Qwen3-4B (4-bit via Unsloth) |
| Method | QLoRA |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Dataset | GSM8K (200 examples) |
| Epochs | 3 |
| Effective batch size | 8 |
| Learning rate | 2e-4 |
| Export format | Q4_K_M GGUF (~2.5GB) |
| Hardware | Colab T4 (16GB VRAM) |

## What Next?

- **Evaluate**: Run `uv run loco eval math` to compare the adapter against the base model
- **More data**: Try 500 or 1000 GSM8K examples (change `NUM_EXAMPLES` in Step 2)
- **Different domain**: Fork this notebook and swap GSM8K for your own dataset
- **Iterate**: See [adapter-guide.md](https://github.com/michael-borck/loco-llm/blob/main/docs/adapter-guide.md) for the full development cycle