# KMP Planner Fine-Tune → GGUF Export
**Model:** `unsloth/Qwen3-1.7B-unsloth-bnb-4bit`  
**Target:** CPU inference via GGUF Q4_K_M  
**Runtime:** Google Colab T4 GPU

---
### Workflow
1. Install dependencies
2. Load the JSONL dataset
3. Check token lengths across the dataset
4. Load model + tokenizer and apply LoRA
5. Prepare & format dataset
6. Train with SFTTrainer
7. Save LoRA adapter checkpoint
8. Merge adapter → 16-bit (required for clean GGUF conversion)
9. Export GGUF Q4_K_M for CPU inference
10. Quick sanity-check inference

In [None]:
import os
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

SAVE_DIR = '/content/drive/MyDrive/kmp-planner'
os.makedirs(SAVE_DIR, exist_ok=True)
print(f'✅ Save directory: {SAVE_DIR}')

✅ Save directory: /content/drive/MyDrive/kmp-planner


## 1. Install Dependencies

In [None]:
%%capture
# Unsloth install for Colab T4
!pip install unsloth
!pip install --upgrade --no-cache-dir unsloth unsloth-zoo
# TRL for SFTTrainer
!pip install trl datasets

## 2. Load Your JSONL Dataset

In [None]:
import json


JSONL_PATH = f"{SAVE_DIR}/planner_training_data.jsonl"

raw_data = []
with open(JSONL_PATH, "r") as f:
    for line in f:
        line = line.strip()
        if line:
            raw_data.append(json.loads(line))

print(f"Loaded {len(raw_data)} samples")
print("Keys in first sample:", raw_data[0].keys())

# Quick sanity check: confirm expected keys exist
assert all('instruction' in d and 'output' in d for d in raw_data), \
    "All samples must have 'instruction' and 'output' keys"

Loaded 155 samples
Keys in first sample: dict_keys(['instruction', 'output'])
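The assertion above only confirms that the two top-level fields exist. It may also be worth checking that each `output` string is itself a well-formed plan. The sketch below is one way to do that, assuming every step must carry the five fields named in the system prompt used later in this notebook; `validate_plan` and `REQUIRED_STEP_KEYS` are hypothetical names, not part of the original pipeline.

```python
import json

# Step fields listed in the system prompt (assumed to be required on every step).
REQUIRED_STEP_KEYS = {"action", "path", "source_set", "description", "depends_on"}

def validate_plan(output_text):
    """Return a list of problems found in one sample's `output` string."""
    try:
        steps = json.loads(output_text)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e.msg}"]
    if not isinstance(steps, list):
        return ["top-level value is not a JSON array"]
    problems = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {i} is not an object")
            continue
        missing = REQUIRED_STEP_KEYS - step.keys()
        if missing:
            problems.append(f"step {i} missing keys: {sorted(missing)}")
    return problems

# Toy records for illustration only (not taken from the real dataset).
good = json.dumps([{"action": "create_gradle", "path": "shared/build.gradle.kts",
                    "source_set": "gradle", "description": "demo", "depends_on": []}])
bad = json.dumps([{"action": "create_file", "path": "shared/src/commonMain/kotlin/A.kt"}])

print(validate_plan(good))
print(validate_plan(bad))
print(validate_plan("[{"))
```

In the notebook, something like `[i for i, d in enumerate(raw_data) if validate_plan(d["output"])]` would list the indices of samples that need attention before training.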


## 3. Token Length Analysis
Critical check before training — silent truncation is hard to debug.

In [None]:
from unsloth import FastLanguageModel
import torch
from transformers import AutoTokenizer
import numpy as np

MAX_SEQ_LENGTH = 8192  # our training cap

SYSTEM_MSG = """/no_think You are a Kotlin Multiplatform project planner. Given a user request,
create a step-by-step plan as a JSON array.
Each step: action, path, source_set, description, depends_on.
Order: gradle → common expect → platform actuals → tests.
Respond with ONLY valid JSON."""

# Use the tokenizer standalone just for length analysis
tok_check = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

def estimate_token_length(item):
    # Rough estimate using system + instruction + output
    full_text = SYSTEM_MSG + "\n" + item["instruction"] + "\n" + item["output"]
    return len(tok_check.encode(full_text))

lengths = [estimate_token_length(d) for d in raw_data]

print(f"Token length stats across {len(lengths)} samples:")
print(f"  Min:    {np.min(lengths)}")
print(f"  Max:    {np.max(lengths)}")
print(f"  Mean:   {np.mean(lengths):.0f}")
print(f"  p90:    {np.percentile(lengths, 90):.0f}")
print(f"  p99:    {np.percentile(lengths, 99):.0f}")

over_limit = sum(1 for l in lengths if l > MAX_SEQ_LENGTH)
if over_limit:
    print(f"\n⚠️  WARNING: {over_limit} sample(s) exceed {MAX_SEQ_LENGTH} tokens and will be TRUNCATED.")
    print("   Consider increasing max_seq_length or trimming those samples.")
else:
    print(f"\n✅ All samples fit within {MAX_SEQ_LENGTH} token limit.")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Token length stats across 155 samples:
  Min:    1670
  Max:    7338
  Mean:   3145
  p90:    4324
  p99:    5990

✅ All samples fit within 8192 token limit.
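The estimate above joins the raw fields with newlines, so it does not count the chat-template markers (role tags, `<|im_start|>` / `<|im_end|>`, the empty `<think>` block) that are added at training time. A minimal sketch of a margin check follows. The `template_overhead` value of 32 is a placeholder assumption, not a measured figure, and `truncation_report` is a hypothetical helper.

```python
def truncation_report(lengths, cap, template_overhead=0):
    """Flag samples whose estimated length plus overhead exceeds the cap."""
    adjusted = [n + template_overhead for n in lengths]
    over = [i for i, n in enumerate(adjusted) if n > cap]
    return {"over_limit_indices": over, "headroom": cap - max(adjusted)}

# Min, mean and max from the stats above stand in for the full list of lengths.
demo_lengths = [1670, 3145, 7338]

report = truncation_report(demo_lengths, cap=8192, template_overhead=32)
print(report)
```

A positive `headroom` means even the longest sample fits with the assumed overhead; a negative value shows how many tokens the longest sample would lose.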


## 4. Load Model + Apply LoRA

In [None]:
# -------------------------------------------------------
# Model config
# -------------------------------------------------------
MODEL_NAME    = "unsloth/Qwen3-1.7B-unsloth-bnb-4bit"
LORA_RANK     = 8   # Small dataset (139 training samples) → rank 8 is enough, with less overfitting risk

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=True,         # QLoRA — T4 friendly
    fast_inference=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=LORA_RANK,      # alpha == rank: conservative updates, good for small datasets
    lora_dropout=0.05,         # small dropout helps generalisation with 139 training samples;
                               # note: Unsloth's fast LoRA patching only applies at dropout = 0 (see log below)
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

model.print_trainable_parameters()  # prints the summary itself and returns None, so no print() wrapper

==((====))==  Unsloth 2026.2.1: Fast Qwen3 patching. Transformers: 4.57.6.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.563 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.35. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2026.2.1 patched 28 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


trainable params: 8,716,288 || all params: 1,729,291,264 || trainable%: 0.5040
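The trainable-parameter count in the log can be reproduced by hand, which is a useful check that LoRA was attached to all seven target modules. The sketch below assumes the Qwen3-1.7B layer dimensions from the published model config (hidden size 2048, MLP size 6144, 28 layers, 16 query heads and 8 key/value heads of dimension 128); none of these values are printed in this notebook, so treat them as assumptions.

```python
# Assumed Qwen3-1.7B dimensions (from the public model config, not from this notebook).
HIDDEN = 2048
INTERMEDIATE = 6144
NUM_LAYERS = 28
Q_OUT = 16 * 128    # query heads x head dim
KV_OUT = 8 * 128    # key/value heads x head dim
RANK = 8            # LORA_RANK used above

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * (in_features + out_features)

per_layer = {
    "q_proj":    lora_params(HIDDEN, Q_OUT, RANK),
    "k_proj":    lora_params(HIDDEN, KV_OUT, RANK),
    "v_proj":    lora_params(HIDDEN, KV_OUT, RANK),
    "o_proj":    lora_params(Q_OUT, HIDDEN, RANK),
    "gate_proj": lora_params(HIDDEN, INTERMEDIATE, RANK),
    "up_proj":   lora_params(HIDDEN, INTERMEDIATE, RANK),
    "down_proj": lora_params(INTERMEDIATE, HIDDEN, RANK),
}

total = NUM_LAYERS * sum(per_layer.values())
print(f"{total:,}")  # 8,716,288
```

Under those assumed dimensions the total matches the `trainable params: 8,716,288` line in the log.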


## 5. Prepare Dataset

In [None]:
import random
from datasets import Dataset

# -------------------------------------------------------
# Fix: normalise source_set field to match path
# (your data has some commonTest paths labelled commonMain)
# -------------------------------------------------------
def fix_source_set(item):
    """Correct source_set based on path when they mismatch."""
    try:
        steps = json.loads(item["output"])
        changed = False
        for step in steps:
            path = step.get("path", "")
            ss   = step.get("source_set", "")
            if "commonTest" in path and ss == "commonMain":
                step["source_set"] = "commonTest"
                changed = True
            elif "androidTest" in path and ss == "androidMain":
                step["source_set"] = "androidTest"
                changed = True
            elif "iosTest" in path and ss == "iosMain":
                step["source_set"] = "iosTest"
                changed = True
        if changed:
            item = dict(item)  # don't mutate original
            item["output"] = json.dumps(steps, indent=2)
    except (json.JSONDecodeError, AttributeError, TypeError):
        pass  # malformed or unexpected JSON → left as-is; training will not flag it, so validate it separately
    return item

raw_data = [fix_source_set(d) for d in raw_data]

# -------------------------------------------------------
# Format as chat messages
# -------------------------------------------------------
def to_messages(item):
    return {
        "messages": [
            {"role": "system",    "content": SYSTEM_MSG},
            {"role": "user",      "content": item["instruction"]},
            {"role": "assistant", "content": item["output"]},
        ]
    }

dataset_list = [to_messages(x) for x in raw_data]
random.seed(3407)
random.shuffle(dataset_list)

split_idx   = int(len(dataset_list) * 0.9)
train_data  = dataset_list[:split_idx]
val_data    = dataset_list[split_idx:]

print(f"Train: {len(train_data)} samples | Val: {len(val_data)} samples")

# -------------------------------------------------------
# Apply chat template
# -------------------------------------------------------
def apply_template(batch):
    return {
        "text": tokenizer.apply_chat_template(
            batch["messages"],
            tokenize=False,
            add_generation_prompt=False,
            enable_thinking=False
        )
    }

train_dataset = Dataset.from_list(train_data).map(apply_template).remove_columns(["messages"])
val_dataset   = Dataset.from_list(val_data).map(apply_template).remove_columns(["messages"])

# Preview one formatted sample
print("\n--- Formatted sample preview (first 500 chars) ---")
print(train_dataset[0]["text"][:500])

Train: 139 samples | Val: 16 samples


Map:   0%|          | 0/139 [00:00<?, ? examples/s]

Map:   0%|          | 0/16 [00:00<?, ? examples/s]


--- Formatted sample preview (first 500 chars) ---
<|im_start|>system
/no_think You are a Kotlin Multiplatform project planner. Given a user request,
create a step-by-step plan as a JSON array.
Each step: action, path, source_set, description, depends_on.
Order: gradle → common expect → platform actuals → tests.
Respond with ONLY valid JSON.<|im_end|>
<|im_start|>user
Build a file storage system for documents<|im_end|>
<|im_start|>assistant
<think>

</think>

[
  {
    "action": "create_gradle",
    "path": "shared/build.gradle.kts",
    "source


## 6. Train

In [None]:
from trl import SFTTrainer, SFTConfig
from transformers import EarlyStoppingCallback

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    # resume_from_checkpoint belongs to trainer.train() (see the next cell), not to the constructor
    args=SFTConfig(
        dataset_text_field="text",

        # Batch / accumulation
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        eval_accumulation_steps=4,
        prediction_loss_only=True,
        gradient_accumulation_steps=8,
        dataloader_pin_memory=False,

        # Schedule
        warmup_steps=10,
        num_train_epochs=10,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",

        # Regularisation
        weight_decay=0.01,

        # Memory
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        max_grad_norm=1.0,

        # Eval & logging
        eval_strategy="steps",
        eval_steps=25,
        logging_steps=5,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        save_strategy="steps",
        save_steps=25,       # must be a round multiple of eval_steps when load_best_model_at_end=True
        save_total_limit=2,

        # Optimizer
        optim="adamw_8bit",

        max_seq_length=MAX_SEQ_LENGTH,
        seed=3407,
        output_dir=f"{SAVE_DIR}/kmp-planner",
        report_to="none",
      ),
)


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/139 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/16 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


In [None]:
from transformers.trainer_utils import get_last_checkpoint
output_dir = f"{SAVE_DIR}/kmp-planner"
last_checkpoint = None
if os.path.isdir(output_dir):
    last_checkpoint = get_last_checkpoint(output_dir)
    if last_checkpoint:
        print(f"Resuming from checkpoint: {last_checkpoint}")
    else:
        print("No checkpoint found, starting fresh")

# Pass it to train(), not to the constructor
trainer_stats = trainer.train(resume_from_checkpoint=last_checkpoint)

Resuming from checkpoint: /content/drive/MyDrive/kmp-planner/kmp-planner/checkpoint-180


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 139 | Num Epochs = 10 | Total steps = 180
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 8,716,288 of 1,729,291,264 (0.50% trained)
Could not locate the best model at /content/drive/MyDrive/kmp-planner/kmp-planner/checkpoint-175/pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.


Step,Training Loss,Validation Loss
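The `Total steps = 180` figure in the log follows from the batch settings, and comparing it with the checkpoint number shows how much training is left after a resume. The sketch below assumes the trainer rounds a partial final accumulation window up to a full optimizer step, which is the rounding that reproduces the logged value; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_samples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Optimizer steps for a run, rounding the last partial batch of each epoch up."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs

# Values from this notebook: 139 training samples, batch size 1, accumulation 8, 10 epochs.
total = total_optimizer_steps(num_samples=139, per_device_batch=1, grad_accum=8, epochs=10)
checkpoint_step = 180  # from "checkpoint-180" in the resume message
remaining = total - checkpoint_step

print(f"total steps: {total}, remaining after resume: {remaining}")
```

With zero steps remaining, the resumed run has nothing left to train, which is consistent with the empty loss table above. To retrain from scratch, point `output_dir` at a fresh directory or pass `resume_from_checkpoint=None`. Separately, with `eval_steps=25` and `early_stopping_patience=3`, early stopping needs three consecutive evaluations (75 optimizer steps) without an `eval_loss` improvement before it triggers.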


In [None]:
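best = trainer.state.best_metric
# best_metric can be None (for example, if no evaluation has been recorded), so guard the format call
if best is not None:
    print(f"\nTraining complete. Best eval loss: {best:.4f}")
else:
    print("\nTraining complete. No best eval loss recorded.")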


## 7-9. Save Adapter, Merge to 16-bit, Export GGUF

**Q4_K_M** is the recommended choice for CPU:  
- Uses Q6_K for half of the attention value and feed-forward down tensors (the more sensitive layers keep higher precision)  
- Q4_K for everything else  
- ~65-70% size reduction vs f16, with minimal quality loss  
- Loads directly into llama.cpp / Ollama / LM Studio

Unsloth will auto-install llama.cpp (~3 min), then run the conversion (~10 min on T4).
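The size claim can be sanity-checked with simple arithmetic: file size is roughly parameter count times bits per weight. The sketch below takes the base parameter count from the training log in this notebook (all params minus LoRA params) and assumes an average of about 4.85 bits per weight for Q4_K_M. That average is an assumption; the real figure depends on the tensor mix and can be higher for small models with large embedding tables, so treat the result as a ballpark.

```python
def model_size_gb(num_params, bits_per_weight):
    """Approximate on-disk size in GB (10^9 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e9

# From the log above: all params minus the LoRA adapter params.
BASE_PARAMS = 1_729_291_264 - 8_716_288
Q4_K_M_BPW = 4.85  # assumed average bits per weight, not measured here

f16_gb = model_size_gb(BASE_PARAMS, 16)
q4_gb = model_size_gb(BASE_PARAMS, Q4_K_M_BPW)
reduction = 1 - q4_gb / f16_gb

print(f"f16:    {f16_gb:.2f} GB")
print(f"Q4_K_M: {q4_gb:.2f} GB (estimate)")
print(f"size reduction: {reduction:.0%} (estimate)")
```

The f16 estimate lines up with the 3.44G merged `model.safetensors` reported in the export log below, which suggests the parameter count is a reasonable basis for the estimate.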

In [None]:
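# Relative paths resolve under /content, which is cleared when the Colab runtime resets;
# copy anything worth keeping to SAVE_DIR on Drive.
ADAPTER_DIR = "adapter"       # raw LoRA weights
MERGED_DIR  = "merged-16bit"  # merged full model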

# Step 1: Save adapter (if not already done)
model.save_pretrained(ADAPTER_DIR)
tokenizer.save_pretrained(ADAPTER_DIR)

# Step 2: Merge to 16bit into its own directory
model.save_pretrained_merged(
    MERGED_DIR,
    tokenizer,
    save_method="merged_16bit",
)


In [None]:
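GGUF_DIR    = "gguf"           # final GGUF output (relative path → /content/gguf)

# Step 3: Create the GGUF dir, then export directly (no reload needed).
# save_pretrained_gguf runs its own 16-bit merge (see the log below), so it does not read MERGED_DIR.
os.makedirs(GGUF_DIR, exist_ok=True)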

model.save_pretrained_gguf(
    GGUF_DIR,
    tokenizer,
    quantization_method="q4_k_m",
)

Unsloth: Merging model weights to 16-bit format...


config.json:   0%|          | 0.00/752 [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:49<00:00, 49.76s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:44<00:00, 44.45s/it]


Unsloth: Merge process complete. Saved to `/content/gguf`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...




Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['gguf_gguf/qwen3-1.7b.F16.gguf']
Unsloth: [2] Converting GGUF f16 into q4_k_m. This might take 10 minutes...
Unsloth: Model files cleanup...
Unsloth: All GGUF conversions completed successfully!
Generated files: ['gguf_gguf/qwen3-1.7b.Q4_K_M.gguf']
Unsloth: example usage for text only LLMs: llama.cpp/llama-cli --model gguf_gguf/qwen3-1.7b.Q4_K_M.gguf -p "why is the sky blue?"
Unsloth: Saved Ollama Modelfile to gguf_gguf/Modelfile
Unsloth: convert model to ollama format by running - ollama create model_name -f gguf_gguf/Modelfile


{'save_directory': 'gguf',
 'gguf_directory': 'gguf_gguf',
 'gguf_files': ['gguf_gguf/qwen3-1.7b.Q4_K_M.gguf'],
 'modelfile_location': 'gguf_gguf/Modelfile',
 'want_full_precision': False,
 'is_vlm': False,
 'fix_bos_token': False}
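The returned paths are relative (`gguf_gguf/...`), so the GGUF file lives on the Colab VM's local disk rather than on Drive. A minimal sketch for copying the export to persistent storage follows; `copy_gguf_files` is a hypothetical helper, and the demo runs against temporary directories rather than the real export.

```python
import shutil
import tempfile
from pathlib import Path

def copy_gguf_files(src_dir, dst_dir):
    """Copy every *.gguf file (and a Modelfile, if present) from src_dir to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(src.glob("*.gguf")) + sorted(src.glob("Modelfile")):
        shutil.copy2(f, dst / f.name)
        copied.append(f.name)
    return copied

# Demo with throwaway directories and dummy files.
with tempfile.TemporaryDirectory() as tmp:
    demo_src = Path(tmp) / "gguf_gguf"
    demo_src.mkdir()
    (demo_src / "demo.Q4_K_M.gguf").write_bytes(b"dummy")
    (demo_src / "Modelfile").write_text("FROM ./demo.Q4_K_M.gguf\n")
    demo_copied = copy_gguf_files(demo_src, Path(tmp) / "drive" / "gguf")
    print(demo_copied)
```

In this notebook the call would look roughly like `copy_gguf_files("gguf_gguf", f"{SAVE_DIR}/gguf")`, which places the quantized model next to the training checkpoints on Drive.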

## 10. Sanity Check — Inference with the Fine-tuned Model
Run a quick test on the GPU model before downloading the GGUF.

In [None]:
from unsloth import FastLanguageModel

# Re-use the already-loaded model (or reload if memory was cleared)
FastLanguageModel.for_inference(model)

TEST_PROMPT = "Build a shared authentication module with JWT support"

messages = [
    {"role": "system",    "content": SYSTEM_MSG},
    {"role": "user",      "content": TEST_PROMPT},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=2048,
        do_sample=False,         # greedy — deterministic JSON output
        temperature=1.0,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens
generated = outputs[0][inputs.shape[1]:]
response  = tokenizer.decode(generated, skip_special_tokens=True)

print("=" * 60)
print("PROMPT:", TEST_PROMPT)
print("=" * 60)
print(response)

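# Validate it is parseable JSON.
# The decoded text can start with an empty <think></think> block (see the output below),
# so keep only what follows it before parsing.
json_text = response.split("</think>")[-1].strip()
try:
    parsed = json.loads(json_text)
    print(f"\n✅ Valid JSON: {len(parsed)} steps generated")
except json.JSONDecodeError as e:
    print(f"\n⚠️  JSON parse error: {e}")
    print("The model may need more training epochs or the output was truncated.")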

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


PROMPT: Build a shared authentication module with JWT support
<think>

</think>

[
  {
    "action": "create_gradle",
    "path": "build.gradle.kts",
    "source_set": "gradle",
    "description": "Root build.gradle.kts for KMP project. Apply plugins: kotlin-multiplatform and com.android.library. Configure repositories (mavenCentral, google). Define versions for Kotlin (1.9.x), coroutines (1.7.x), serialization (1.6.x), kotlinx-serialization-json (1.6.x), and commonMain dependencies.",
    "depends_on": []
  },
  {
    "action": "create_gradle",
    "path": "shared/build.gradle.kts",
    "source_set": "gradle",
    "description": "Shared module build.gradle.kts. Apply kotlin('multiplatform') and androidTarget(). Configure targets: androidTarget() with compileSdk=34, minSdk=21; iosX64(), iosArm64(), iosSimulatorArm64(); commonMain() with kotlin multiplatform plugin. Add commonMain dependencies: kotlinx-coroutines-core, kotlinx-serialization-json, kotlinx-datetime, okio, io.ktor-client-c