# Fine-tuning Gemma 3 1B for CLI Command Translation

This notebook trains a 1B-parameter model to translate natural language
into CLI commands. It runs on a free Colab T4 GPU in about 2.5 hours.

**What you'll build:**
- A command translator targeting 80-90% accuracy
- ~800MB quantized model
- Runs locally on CPU (~1.5s inference)

**No ML experience required** - all steps explained.

# Cell 2: Setup (auto-installs)
!pip install unsloth transformers datasets -q

# Cell 3: Clone repo
!git clone https://github.com/pranavkumaarofficial/nlcli-wizard

In [1]:
!git clone https://github.com/pranavkumaarofficial/nlcli-wizard.git


Cloning into 'nlcli-wizard'...
remote: Enumerating objects: 134, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 134 (delta 6), reused 23 (delta 4), pack-reused 101 (from 1)
Receiving objects: 100% (134/134), 77.46 MiB | 13.21 MiB/s, done.
Resolving deltas: 100% (53/53), done.
Encountered 1 file(s) that should have been pointers, but weren't:
	models/readme.md


In [2]:
# Go to your cloned repo directory first
%cd /content/nlcli-wizard

# Pull the latest changes from the remote branch
!git pull origin main -r --autostash
!apt-get install git-lfs -y
!git lfs install
!git lfs pull




import torch

print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.device("cuda" if torch.cuda.is_available() else "cpu"))
print("GPU name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")



/content/nlcli-wizard
From https://github.com/pranavkumaarofficial/nlcli-wizard
 * branch            main       -> FETCH_HEAD
Already up to date.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.
Updated git hooks.
Git LFS initialized.
CUDA available: True
Device: cuda
GPU name: Tesla T4


# Fine-tuning Gemma 3 1B for venvy CLI Translation

REPO: https://github.com/pranavkumaarofficial/nlcli-wizard

**Project**: nlcli-wizard  
**Model**: google/gemma-3-1b-it  
**Technique**: QLoRA with Unsloth (Dynamic 4-bit)  
**Hardware**: Google Colab T4 GPU (Free Tier)  

---

## 📚 What You'll Learn

1. **Why Gemma 3 1B?** - Modern SLM optimized for efficiency
2. **What is Unsloth?** - How it makes training 2x faster with 70% less VRAM
3. **QLoRA Explained** - Low-rank adaptation for efficient fine-tuning
4. **4-bit Quantization** - How to compress models without losing accuracy
5. **Dynamic Quantization** - Unsloth's smart approach to preserving critical weights
6. **GGUF Format** - Converting for CPU inference with llama.cpp

---

## 🎯 Training Objective

Fine-tune Gemma 3 1B to translate natural language → venvy CLI commands:

```
Input:  "list all environments sorted by size"
Output: "venvy ls --sort size"
```

**Target Accuracy**: 80-90% on domain-specific commands

---

# Step 1: Setup and Installation

## 🔧 Install Unsloth and Dependencies

### What is Unsloth?

**Unsloth** is a highly optimized library for fine-tuning LLMs that provides:

- **2x Faster Training**: Custom CUDA kernels optimized for LoRA operations
- **70% Less VRAM**: Efficient memory management and gradient checkpointing
- **Dynamic 4-bit Quantization**: Smart weight selection (don't quantize critical layers)
- **Minimal Accuracy Loss**: Keeps higher precision where it matters

### How Unsloth Works:

```
Traditional Fine-tuning:
├── Load full model (FP16) → ~2GB VRAM
├── Compute gradients for ALL parameters
└── Update all ~1B parameters → SLOW

Unsloth + QLoRA:
├── Load model in 4-bit → ~1GB VRAM
├── Add small LoRA adapters (~13M parameters at r=16)
├── Compute gradients ONLY for adapters → FAST
└── Update ~1% of parameters → 2x speed, 70% less VRAM
```

### Dynamic 4-bit Quantization:

Unsloth analyzes your model and **selectively avoids quantizing** critical layers:
- Attention output layers
- Layer norms
- Embedding layers

Result: **roughly 10% more VRAM in exchange for better accuracy** than quantizing every layer

In [3]:
%%capture
# Install Unsloth with all optimizations
# This will take ~3-5 minutes on first run
# (%%capture must be the first line of the cell; it also hides the cell's output,
# including the print below)

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

print("✅ Unsloth and dependencies installed!")

In [4]:
# Verify GPU is available
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"✅ GPU Available: {gpu_name}")
    print(f"   Total VRAM: {gpu_memory:.1f} GB")
else:
    print("❌ No GPU detected! This notebook requires a GPU.")
    print("   Go to Runtime → Change runtime type → Select T4 GPU")

✅ GPU Available: Tesla T4
   Total VRAM: 15.8 GB


---

# Step 2: Clone Repository and Load Dataset


In [5]:
!git clone https://github.com/pranavkumaarofficial/nlcli-wizard.git
import os
# Change to project directory
os.chdir('/content/nlcli-wizard')

print("\n✅ Repository cloned successfully!")
print(f"   Current directory: {os.getcwd()}")

Cloning into 'nlcli-wizard'...
remote: Enumerating objects: 134, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 134 (delta 6), reused 23 (delta 4), pack-reused 101 (from 1)
Receiving objects: 100% (134/134), 77.46 MiB | 15.53 MiB/s, done.
Resolving deltas: 100% (53/53), done.
Updating files: 100% (117/117), done.
Encountered 1 file(s) that should have been pointers, but weren't:
	models/readme.md

✅ Repository cloned successfully!
   Current directory: /content/nlcli-wizard


In [6]:
# Verify dataset exists and inspect it
import json
from pathlib import Path

dataset_path = Path("data/venvy_training.jsonl")

if not dataset_path.exists():
    print("❌ Dataset not found! Make sure you pushed data/venvy_training.jsonl to GitHub")
else:
    # Load and inspect dataset
    examples = []
    with open(dataset_path, 'r') as f:
        for line in f:
            examples.append(json.loads(line))

    print(f"✅ Dataset loaded: {len(examples)} examples")
    print("\n📋 Sample Examples:")
    print("-" * 80)

    for i, ex in enumerate(examples[:3]):
        print(f"\nExample {i+1}:")
        print(f"  Instruction: {ex['instruction']}")
        print(f"  Output: {ex['output'].strip()}")

    print("-" * 80)

✅ Dataset loaded: 1500 examples

📋 Sample Examples:
--------------------------------------------------------------------------------

Example 1:
  Instruction: Translate to venvy command: show current environment
  Output: COMMAND: venvy current
CONFIDENCE: 0.97
EXPLANATION: Shows currently active virtual environment

Example 2:
  Instruction: Translate to venvy command: sort environments by size
  Output: COMMAND: venvy ls -s size
CONFIDENCE: 0.91
EXPLANATION: Lists environments sorted by disk space used

Example 3:
  Instruction: Translate to venvy command: which venv am i using
  Output: COMMAND: venvy current
CONFIDENCE: 0.95
EXPLANATION: Shows currently active virtual environment
--------------------------------------------------------------------------------


---

# Step 3: Load Gemma 3 1B with Unsloth

## 📖 Understanding Model Loading

### What happens when we load a model?

1. **Download from HuggingFace** (~2 GB for Gemma 3 1B in FP16; the 4-bit checkpoint downloaded in this run is ~1 GB)
2. **Load into GPU memory** with quantization
3. **Prepare for training** with LoRA adapters

### Quantization Explained:

**Normal Precision (FP16)**:
```
Weight: 0.123456789 (16 bits) → 2 bytes per parameter
~1.0B parameters × 2 bytes ≈ 2.0 GB
```

**4-bit Quantization (NF4)**:
```
Weight: 0.123456789 → Quantized to 4 bits (0-15)
~1.0B parameters × 0.5 bytes ≈ 500 MB
```

**NF4 (Normal Float 4-bit)**:
- Special quantization format optimized for neural network weights
- Weights are roughly normally distributed, so NF4 spaces its 16 levels non-uniformly
- More levels near zero, where most weights lie, and fewer in the tails
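
To make the 4-bit idea concrete, here is a minimal sketch of absmax quantization to 16 levels. It is a simplification: it uses evenly spaced levels rather than the real NF4 codebook, and the function names `quantize_4bit` and `dequantize_4bit` are made up for illustration, not part of Unsloth or bitsandbytes.

```python
def quantize_4bit(weights):
    """Map floats to integer codes 0..15 using absmax scaling (uniform levels, not NF4)."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid division by zero
    # Normalize to [-1, 1], then shift and scale into the 16 available codes
    codes = [round((w / scale + 1.0) / 2.0 * 15) for w in weights]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [(c / 15 * 2.0 - 1.0) * scale for c in codes]


weights = [0.12, -0.50, 0.03, 0.44, -0.27]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error = {max_error:.3f}")
```

Each weight now needs only a 4-bit code plus one shared scale per block, at the cost of a small rounding error. NF4 reduces that error further by placing its levels where normally distributed weights are most likely to fall.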

### Dynamic 4-bit:

Unsloth's smart feature:
```python
if layer_is_critical():  # Attention, embeddings, norms
    keep_fp16()  # Don't quantize
else:
    quantize_4bit()  # Safe to compress
```

Result: **about 1 GB of VRAM** in this run (instead of ~2 GB in FP16) with minimal accuracy loss

In [7]:
from unsloth import FastLanguageModel
import torch

# Model configuration
model_name = "unsloth/gemma-3-1b-it"  # Unsloth's optimized version
max_seq_length = 512  # Maximum context length for our task
dtype = None  # Auto-detect (on this T4, Unsloth falls back to float32 for Gemma 3)
load_in_4bit = True  # Enable 4-bit quantization

print("🔄 Loading Gemma 3 1B with Unsloth optimizations...")
print(f"   Model: {model_name}")
print(f"   Max sequence length: {max_seq_length}")
print(f"   4-bit quantization: {load_in_4bit}")
print("\n⏳ This will take 2-3 minutes (downloading ~2.2GB)...\n")

# Load model with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # Dynamic 4-bit keeps critical layers unquantized,
    # trading ~10% more VRAM for better accuracy than plain 4-bit
)

print("\n✅ Model loaded successfully!")
print(f"   Model parameters: {model.num_parameters():,}")
print(f"   Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
🔄 Loading Gemma 3 1B with Unsloth optimizations...
   Model: unsloth/gemma-3-1b-it
   Max sequence length: 512
   4-bit quantization: True

⏳ This will take 2-3 minutes (downloading ~2.2GB)...

==((====))==  Unsloth 2025.11.2: Fast Gemma3 patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.


model.safetensors:   0%|          | 0.00/1.00G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/233 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]


✅ Model loaded successfully!
   Model parameters: 999,885,952
   Memory allocated: 1.00 GB


---

# Step 4: Add LoRA Adapters

## 📖 Understanding LoRA (Low-Rank Adaptation)

### The Problem:
Traditional fine-tuning updates **all ~1 billion parameters**:
- Requires massive memory (store gradients for ~1B params)
- Very slow (update ~1B weights)
- Easy to overfit on small datasets

### LoRA Solution:
Instead of modifying original weights, add **small adapter matrices**:

```
Original Weight Matrix W (large):
[1024 × 1024] = 1,048,576 parameters

LoRA Decomposition:
ΔW = A × B
A: [1024 × 8]  = 8,192 parameters
B: [8 × 1024]  = 8,192 parameters
Total: 16,384 parameters (64x smaller!)

Final Output:
y = W·x + (α/r)·(A·B)·x
    ↑         ↑
 frozen   trainable
```
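
The formula above can be sketched in a few lines of plain Python. This is a toy illustration with tiny matrices, not how PEFT or Unsloth implement it; the helper names `matvec` and `lora_forward` are hypothetical, and the update is scaled by `alpha / r`, the standard LoRA scaling.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W·x + (alpha / r) · A·(B·x); W is frozen, A and B are trainable."""
    base = matvec(W, x)               # frozen path through the original weights
    update = matvec(A, matvec(B, x))  # low-rank path: project down with B, back up with A
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: a 3x3 weight with rank-1 adapters (real layers are far larger)
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1], [0.2], [0.3]]  # shape [3 x 1]
B = [[1.0, 1.0, 1.0]]      # shape [1 x 3]
x = [1.0, 2.0, 3.0]

y = lora_forward(W, A, B, x, alpha=2, r=1)
print(y)
```

Note that `A·(B·x)` never materializes the full `A × B` matrix, which is why the adapter path stays cheap even for large layers.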

### Key Parameters:

1. **r (rank)**: Size of adapter matrices (typically 8-16)
   - Higher r = more capacity but slower
   - Lower r = faster but less expressive
   - We use r=16 (good balance)

2. **lora_alpha**: Scaling factor for LoRA updates
   - Controls how much LoRA affects output (the update is scaled by `lora_alpha / r`)
   - Typically 2×r (we use 32)

3. **lora_dropout**: Regularization (prevent overfitting)
   - We use 0 (dataset is diverse enough)

4. **target_modules**: Which layers to adapt
   - `q_proj`, `k_proj`: Query/Key attention projections
   - `v_proj`, `o_proj`: Value/Output projections
   - `gate_proj`, `up_proj`, `down_proj`: MLP layers

### Memory Savings:
```
Without LoRA: ~1.0B trainable params × 2 bytes ≈ 2.0 GB
With LoRA (r=16): ~13M trainable params × 2 bytes ≈ 26 MB

Savings: ~98.7% reduction in trainable parameters!
```
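
The arithmetic behind these figures is easy to check. The sketch below recomputes the adapter size for the 1024 × 1024 example and the trainable share from the parameter counts that Unsloth's training banner reports later in this notebook; `lora_param_count` is a hypothetical helper, not a library function.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA adapter pair: A is [d_out x r], B is [r x d_in]."""
    return d_out * r + r * d_in


full = 1024 * 1024
adapter = lora_param_count(1024, 1024, r=8)
print(f"full matrix: {full:,}  adapter: {adapter:,}  ratio: {full // adapter}x")

# Parameter counts taken from this notebook's training log
trainable, total = 13_045_760, 1_012_931_712
share = 100 * trainable / total
print(f"trainable share: {share:.2f}%")
```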

In [8]:
# Add LoRA adapters to the model
# These are small matrices we'll train instead of the full model

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (size of adapter matrices)
    target_modules=[
        "q_proj",     # Query projection in attention
        "k_proj",     # Key projection
        "v_proj",     # Value projection
        "o_proj",     # Output projection
        "gate_proj",  # MLP gate
        "up_proj",    # MLP up
        "down_proj",  # MLP down
    ],
    lora_alpha=32,  # LoRA scaling factor (typically 2×r)
    lora_dropout=0,  # No dropout (our dataset is diverse)
    bias="none",     # Don't train bias terms
    use_gradient_checkpointing="unsloth",  # Memory optimization
    random_state=42,  # Reproducibility
    use_rslora=False,  # Standard LoRA (rank-stabilized LoRA mainly helps at higher ranks)
    loftq_config=None,  # No LoftQ initialization
)

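print("✅ LoRA adapters added!")
print("\n📊 Model Statistics:")

# Count trainable vs frozen parameters
# Note: 4-bit weights are stored packed, so numel() undercounts the frozen
# weights and the percentage below overstates the trainable share
# (Unsloth's training banner reports 1.29%).
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
trainable_pct = 100 * trainable_params / total_params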

print(f"   Total parameters: {total_params:,}")
print(f"   Trainable parameters: {trainable_params:,}")
print(f"   Trainable %: {trainable_pct:.2f}%")
print(f"   Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

print("\n💡 Insight:")
print(f"   We're training only {trainable_pct:.2f}% of parameters!")
print(f"   This is why LoRA is so efficient.")

Unsloth: Making `model.base_model.model.model` require gradients
✅ LoRA adapters added!

📊 Model Statistics:
   Total parameters: 675,994,752
   Trainable parameters: 13,045,760
   Trainable %: 1.93%
   Memory allocated: 1.05 GB

💡 Insight:
   We're training only 1.93% of parameters!
   This is why LoRA is so efficient.


---

# Step 5: Prepare Dataset for Training

## 📖 Understanding the Training Format

### Alpaca Format:
Our dataset uses the Alpaca instruction format:
```json
{
  "instruction": "Task description",
  "input": "Additional context (empty for us)",
  "output": "Expected response"
}
```

### How it's converted for training:
```
Alpaca Format:
  instruction: "Translate to venvy command: list all environments"
  input: ""
  output: "COMMAND: venvy ls\nCONFIDENCE: 0.95\n..."

↓ Transformed to ↓

Gemma 3 Chat Format:
<start_of_turn>user
Translate to venvy command: list all environments<end_of_turn>
<start_of_turn>model
COMMAND: venvy ls
CONFIDENCE: 0.95
EXPLANATION: Lists all registered virtual environments
<end_of_turn>
```

### Why this format?
- Gemma 3 (the `-it` variant) is tuned as a chat model with turn-based conversation
- `<start_of_turn>user` signals user input
- `<start_of_turn>model` signals model response
- This matches the chat template Gemma 3 saw during instruction tuning
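
As a rough sketch of the conversion, the helper below builds both a full training example and an inference prompt from one Alpaca-style record. `build_gemma_prompt` is a hypothetical name used only here; the notebook's own formatting function appears in the next cells.

```python
def build_gemma_prompt(instruction, response=None):
    """Format one exchange in Gemma's turn-based layout.

    With a response, returns a full training example. Without one, returns an
    inference prompt that stops right after the model tag, so the model writes
    the answer itself.
    """
    prompt = f"<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model\n"
    if response is None:
        return prompt
    return prompt + response + "<end_of_turn>"


record = {
    "instruction": "Translate to venvy command: list all environments",
    "input": "",
    "output": "COMMAND: venvy ls\nCONFIDENCE: 0.95\n",
}

training_text = build_gemma_prompt(record["instruction"], record["output"])
inference_prompt = build_gemma_prompt(record["instruction"])
print(training_text)
print("---")
print(inference_prompt)
```

The key difference is the ending: a training example closes the model turn with `<end_of_turn>`, while an inference prompt must leave it open.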

In [9]:
from datasets import load_dataset

# Load dataset from JSONL file
dataset = load_dataset('json', data_files='data/venvy_training.jsonl', split='train')

print(f"✅ Dataset loaded: {len(dataset)} examples")
print("\n📋 Dataset Structure:")
print(dataset)
print("\n📝 Sample Example:")
print(dataset[0])

Generating train split: 0 examples [00:00, ? examples/s]

✅ Dataset loaded: 1500 examples

📋 Dataset Structure:
Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 1500
})

📝 Sample Example:
{'instruction': 'Translate to venvy command: show current environment', 'input': '', 'output': 'COMMAND: venvy current\nCONFIDENCE: 0.97\nEXPLANATION: Shows currently active virtual environment\n'}


In [10]:
# Split dataset: 90% train, 10% validation
# Validation set helps us monitor if the model is overfitting

dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset['train']
eval_dataset = dataset['test']

print(f"✅ Dataset split:")
print(f"   Training examples: {len(train_dataset)}")
print(f"   Validation examples: {len(eval_dataset)}")

print("\n💡 Why validation set?")
print("   We'll evaluate on this during training to detect overfitting.")
print("   If validation loss stops improving, we stop training.")

✅ Dataset split:
   Training examples: 1350
   Validation examples: 150

💡 Why validation set?
   We'll evaluate on this during training to detect overfitting.
   If validation loss stops improving, we stop training.


In [11]:
# Format dataset for Gemma 3 chat format
# This converts our Alpaca format to Gemma's expected input format

# Gemma 3 chat template
alpaca_prompt = """<start_of_turn>user
{}<end_of_turn>
<start_of_turn>model
{}<end_of_turn>"""

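# End-of-sequence token. In this run it is also "<end_of_turn>", so each
# formatted example ends with that token twice (see the output below).
EOS_TOKEN = tokenizer.eos_token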

def formatting_prompts_func(examples):
    """
    Convert Alpaca format to Gemma 3 chat format.

    For each example:
    1. Combine instruction + input (input is empty for us)
    2. Format as Gemma chat turn
    3. Add EOS token for proper training
    """
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]

    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Combine instruction and input (input is empty for our dataset)
        full_instruction = instruction + ("\n" + input_text if input_text else "")

        # Format as chat turns
        text = alpaca_prompt.format(full_instruction, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

# Apply formatting to both train and validation sets
train_dataset = train_dataset.map(
    formatting_prompts_func,
    batched=True,
)

eval_dataset = eval_dataset.map(
    formatting_prompts_func,
    batched=True,
)

print("✅ Dataset formatted for Gemma 3 chat!")
print("\n📝 Formatted Example:")
print("-" * 80)
print(train_dataset[0]['text'])
print("-" * 80)

Map:   0%|          | 0/1350 [00:00<?, ? examples/s]

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

✅ Dataset formatted for Gemma 3 chat!

📝 Formatted Example:
--------------------------------------------------------------------------------
<start_of_turn>user
Translate to venvy command: preview cleanup<end_of_turn>
<start_of_turn>model
COMMAND: venvy cleanup --dry-run
CONFIDENCE: 0.97
EXPLANATION: Shows which environments would be removed without deleting
<end_of_turn><end_of_turn>
--------------------------------------------------------------------------------


---

# Step 6: Configure Training Parameters

## 📖 Understanding Hyperparameters

### Key Training Parameters:

#### 1. **Learning Rate (lr)**: How fast the model learns
```
Too high (1e-3):  Model diverges, loss explodes
Just right (2e-4): Smooth learning, converges well
Too low (1e-5):   Learns too slowly, wastes time
```
We use **2e-4** (0.0002) - standard for LoRA fine-tuning.

#### 2. **Batch Size**: How many examples per update
```
per_device_train_batch_size=4:  Process 4 examples at once
gradient_accumulation_steps=4: Accumulate 4 batches
Effective batch size = 4 × 4 = 16
```
Why split?
- The T4 has 16GB of VRAM, shared by weights, activations, and optimizer state
- A single batch of 16 may not fit in that budget
- So we process 4 at a time, accumulate gradients, then update

#### 3. **Epochs**: How many times to see full dataset
```
1 epoch = model sees each example once
3 epochs = model sees each example 3 times
```
We use **3 epochs** - enough to learn without overfitting.

#### 4. **Weight Decay**: Regularization to prevent overfitting
```
weight_decay=0.01: Small penalty on large weights
```
Encourages model to use many small weights instead of few large ones.

#### 5. **Learning Rate Schedule**: Warmup + Cosine Decay
```
Step 0-50:    Warmup (gradual increase) → Prevents early instability
Step 50-end:  Cosine decay (gradual decrease) → Better convergence

Learning Rate over time:
    |
2e-4|        _______________
    |      /                 \
    |    /                     \
    |  /                         \
  0 |_/____________________________\_____
     0   50                    255  steps
```
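
The shape of this schedule can be written out directly. The sketch below assumes linear warmup followed by a half-cosine decay to zero, which is the usual meaning of a "cosine" scheduler with warmup; `lr_at_step` is a hypothetical helper, and `total_steps=255` is the step count reported by the training log in this notebook.

```python
import math


def lr_at_step(step, base_lr=2e-4, warmup_steps=50, total_steps=255):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 150, 255):
    print(f"step {step:>3}: lr = {lr_at_step(step):.2e}")
```

With only 255 steps in total, the 50-step warmup covers roughly the first fifth of training, so the peak learning rate is reached late in the first epoch.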

#### 6. **Mixed Precision (FP16)**: Speed + Memory optimization
```
Normal (FP32):  32 bits per number → Slow but accurate
Mixed (FP16):   16 bits per number → 2x faster, 2x less memory
```
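T4 GPUs have FP16 Tensor Cores, which usually makes mixed precision much faster. In this run, however, Unsloth reports that Gemma 3 cannot train in float16 and switches to float32 (see the trainer log further down).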

### Expected Training Time:
```
1,350 examples × 3 epochs = 4,050 samples processed
4,050 / (effective batch size 16) ≈ 253 optimizer steps (the trainer rounds up per epoch and reports 255)
~2 seconds per step on T4
Total: ~8-10 minutes
```
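
The estimate above can be reproduced with a small calculation. The rounding rule is an assumption that happens to match the 255 steps reported by the trainer in this run; `training_steps` is a hypothetical helper, not part of `transformers`.

```python
import math


def training_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Return (effective batch size, optimizer steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    # Assumption: a partial accumulation window at the end of an epoch still triggers an update
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


batch, per_epoch, total = training_steps(1350, per_device_batch=4, grad_accum=4, epochs=3)
minutes = total * 2 / 60  # assuming 2 seconds per step
print(f"effective batch = {batch}, steps/epoch = {per_epoch}, total = {total}")
print(f"estimated time: {minutes:.1f} min")
```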

In [12]:
from transformers import TrainingArguments
from trl import SFTTrainer

# Training configuration
training_args = TrainingArguments(
    # Output and logging
    output_dir="./outputs",              # Where to save model checkpoints
    logging_dir="./logs",                # Where to save logs
    logging_steps=10,                    # Log every 10 steps

    # Training hyperparameters
    num_train_epochs=3,                  # Train for 3 epochs
    per_device_train_batch_size=4,       # 4 examples per GPU
    gradient_accumulation_steps=4,       # Accumulate 4 batches (effective batch=16)
    learning_rate=2e-4,                  # Standard LoRA learning rate
    weight_decay=0.01,                   # Decoupled weight decay (AdamW)

    # Learning rate schedule
    lr_scheduler_type="cosine",          # Cosine decay schedule
    warmup_steps=50,                     # Warmup for first 50 steps

    # Optimization
    optim="adamw_8bit",                  # 8-bit AdamW (saves memory)
    fp16=True,                           # Mixed precision training (2x faster)

    # Evaluation
    eval_strategy="steps",               # Evaluate during training
    eval_steps=50,                       # Evaluate every 50 steps
    per_device_eval_batch_size=4,        # Batch size for evaluation

    # Checkpointing
    save_strategy="steps",               # Save checkpoints
    save_steps=100,                      # Save every 100 steps
    save_total_limit=3,                  # Keep at most 3 checkpoints on disk
    load_best_model_at_end=True,         # Load best checkpoint at end
    metric_for_best_model="eval_loss",   # Use validation loss to pick best

    # Memory optimizations
    gradient_checkpointing=True,         # Save memory (slight speed cost)
    max_grad_norm=1.0,                   # Gradient clipping (stability)

    # Reproducibility
    seed=42,

    # Disable unnecessary features
    report_to="none",                    # Don't report to wandb/tensorboard
)

print("✅ Training configuration set!")
print("\n📊 Training Summary:")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Warmup steps: {training_args.warmup_steps}")
print(f"   FP16 enabled: {training_args.fp16}")

# Calculate approximate training time
total_steps = (len(train_dataset) * training_args.num_train_epochs) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps)
print(f"\n⏱️ Estimated training time:")
print(f"   Total steps: ~{total_steps}")
print(f"   Time: ~{total_steps * 2 / 60:.1f} minutes (assuming 2 sec/step)")

✅ Training configuration set!

📊 Training Summary:
   Epochs: 3
   Effective batch size: 16
   Learning rate: 0.0002
   Warmup steps: 50
   FP16 enabled: True

⏱️ Estimated training time:
   Total steps: ~253
   Time: ~8.4 minutes (assuming 2 sec/step)


---

# Step 7: Train the Model! 🚀

## 📖 What Happens During Training?

### Training Loop:
```python
# Simplified sketch of what the trainer does internally (not meant to be run as-is;
# train_dataloader and optimizer are created by the trainer)
for epoch in range(3):
    for step, batch in enumerate(train_dataloader):
        # 1. Forward pass: get predictions and the cross-entropy loss
        outputs = model(**batch)
        loss = outputs.loss

        # 2. Backward pass: compute gradients for the LoRA weights
        loss.backward()

        # 3. Update LoRA weights, then clear the gradients
        optimizer.step()
        optimizer.zero_grad()

        # 4. Log progress
        if step % 10 == 0:
            print(f"Loss: {loss.item():.4f}")
```

### What to Watch:

1. **Training Loss**: Should decrease smoothly
   ```
   Good:    2.5 → 1.8 → 1.2 → 0.8 → 0.5
   Bad:     2.5 → 5.8 → NaN (model diverged!)
   ```

2. **Validation Loss**: Should also decrease
   ```
   Good:    Train loss ≈ Val loss (not overfitting)
   Bad:     Train 0.3, Val 2.5 (overfitting!)
   ```

3. **Speed**: Expect roughly 2 seconds per step on a T4 (this run averaged about 2.4)
   - Much slower? The GPU may not be in use (check the runtime type)
   - Much faster? Check that batches are not being truncated or skipped

### Training Metrics Explained:

- **loss**: Cross-entropy loss (lower = better)
- **learning_rate**: Current LR (starts low, increases, then decreases)
- **epoch**: Which epoch we're on (0-3)
- **grad_norm**: Gradient magnitude (should be stable, not exploding)

This cell will take ~8-10 minutes. Grab a coffee! ☕

In [13]:
# Create trainer with SFTTrainer (Supervised Fine-Tuning)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",  # Which field contains the formatted text
    max_seq_length=max_seq_length,
    args=training_args,
    packing=False,  # Don't pack multiple examples (our examples are short)
)

print("✅ Trainer initialized!")
print("\n🚀 Starting training...")
print("   This will take ~8-10 minutes on T4 GPU")
print("   Watch the loss decrease over time!")
print("\n" + "="*80)

Unsloth: Switching to float32 training since model cannot work with float16


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/1350 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/150 [00:00<?, ? examples/s]

✅ Trainer initialized!

🚀 Starting training...
   This will take ~8-10 minutes on T4 GPU
   Watch the loss decrease over time!



In [14]:
# Start training!
# The output will show:
# - Loss (should decrease)
# - Learning rate (should follow warmup + cosine schedule)
# - Time per step
# - Memory usage

trainer_stats = trainer.train()

print("\n" + "="*80)
print("🎉 Training complete!")
print("\n📊 Final Statistics:")
print(f"   Train runtime: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"   Train samples/second: {trainer_stats.metrics['train_samples_per_second']:.2f}")
# Note: 'train_loss' is the average over all training steps, not the loss at the last step
print(f"   Final train loss: {trainer_stats.metrics['train_loss']:.4f}")

# Get validation metrics
eval_results = trainer.evaluate()
print(f"\n📈 Validation Results:")
print(f"   Validation loss: {eval_results['eval_loss']:.4f}")
print(f"   Validation perplexity: {eval_results.get('eval_perplexity', 'N/A')}")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,350 | Num Epochs = 3 | Total steps = 255
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 13,045,760 of 1,012,931,712 (1.29% trained)


Step,Training Loss,Validation Loss
50,0.4976,0.368994
100,0.1933,0.181701
150,0.1557,0.148017
200,0.1393,0.142402
250,0.1351,0.138184


Unsloth: Not an error, but Gemma3ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient



🎉 Training complete!

📊 Final Statistics:
   Train runtime: 617.72 seconds
   Train samples/second: 6.56
   Final train loss: 0.6653



📈 Validation Results:
   Validation loss: 0.1424
   Validation perplexity: N/A
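
The perplexity shows as `N/A` because the evaluation results in this run contain no `eval_perplexity` key. Perplexity is the exponential of the mean cross-entropy loss, so it can be derived from the validation loss printed above:

```python
import math

eval_loss = 0.1424  # validation loss reported above
perplexity = math.exp(eval_loss)
print(f"Validation perplexity: {perplexity:.3f}")
```

A perplexity close to 1 means the model assigns high probability to the expected tokens, which is plausible here because the outputs follow a short, rigid template.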


---

# Step 8: Test the Model

Let's see if our fine-tuned model can actually translate natural language to venvy commands!

In [15]:
# Enable inference mode (faster, less memory)
FastLanguageModel.for_inference(model)

def test_command_translation(nl_query):
    """
    Test the model's ability to translate natural language to venvy commands.
    """
    # Format as instruction
    instruction = f"Translate to venvy command: {nl_query}"

    # Format as Gemma chat turn
    # NOTE: formatting with an empty response leaves "<end_of_turn>" right after
    # the model tag, so the prompt closes the model's turn before it can answer.
    # The corrected helper further down builds the prompt without it.
    prompt = alpaca_prompt.format(instruction, "")

    # Tokenize
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.1,  # Low temperature keeps sampling close to greedy
        top_p=0.9,
        do_sample=True,
    )

    # Decode
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract model response (after "<start_of_turn>model")
    # NOTE: skip_special_tokens=True strips the turn markers, so this check never matches
    if "<start_of_turn>model" in response:
        response = response.split("<start_of_turn>model")[-1].strip()

    return response

print("✅ Inference mode enabled!")
print("\n🧪 Testing model on example queries...\n")
print("="*80)

✅ Inference mode enabled!

🧪 Testing model on example queries...



In [17]:
# Simple test
prompt = "<start_of_turn>user\nHello<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
)

print("Full output:")
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
print("\n" + "="*80 + "\n")
print("Without special tokens:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Full output:
<bos><start_of_turn>user
Hello<end_of_turn>
<start_of_turn>model
Hello there! How can I help you today?<end_of_turn>


Without special tokens:
user
Hello
model
Hello there! How can I help you today?


In [18]:
# Verify how training data looked
print("Sample training example:")
print(train_dataset[0]['text'])

Sample training example:
<start_of_turn>user
Translate to venvy command: preview cleanup<end_of_turn>
<start_of_turn>model
COMMAND: venvy cleanup --dry-run
CONFIDENCE: 0.97
EXPLANATION: Shows which environments would be removed without deleting
<end_of_turn><end_of_turn>


In [19]:
# Use EXACT format from training
prompt = """<start_of_turn>user
Translate to venvy command: list all environments<end_of_turn>
<start_of_turn>model
"""

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # Greedy first
    pad_token_id=tokenizer.eos_token_id,
)

full_output = tokenizer.decode(outputs[0], skip_special_tokens=False)
print("Full output:")
print(full_output)

Full output:
<bos><start_of_turn>user
Translate to venvy command: list all environments<end_of_turn>
<start_of_turn>model
COMMAND: venvy ls
CONFIDENCE: 0.93
EXPLANATION: Lists all registered virtual environments
<end_of_turn>


In [21]:

def test_command_translation_FIXED(nl_query):
    instruction = f"Translate to venvy command: {nl_query}"
    prompt = f"<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model\n"

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

    full_response = tokenizer.decode(outputs[0], skip_special_tokens=False)

    if "<start_of_turn>model\n" in full_response:
        response = full_response.split("<start_of_turn>model\n")[-1]
        if "<end_of_turn>" in response:
            response = response.split("<end_of_turn>")[0]
        return response.strip()

    return full_response

# Test it
queries = [
    "list all environments",
    "register this venv as myproject",
    "show environments sorted by size",
    "cleanup old venvs"
]

print("✅ CORRECTED TEST RESULTS:")
print("="*80)
for q in queries:
    result = test_command_translation_FIXED(q)
    print(f"\nQuery: {q}")
    print(f"Output:\n{result}")
    print("-"*80)

‚úÖ CORRECTED TEST RESULTS:

Query: list all environments
Output:
COMMAND: venvy ls
CONFIDENCE: 0.93
EXPLANATION: Lists all registered virtual environments
--------------------------------------------------------------------------------

Query: register this venv as myproject
Output:
COMMAND: venvy register --name myproject
CONFIDENCE: 0.95
EXPLANATION: Registers .venv with custom name 'myproject'
--------------------------------------------------------------------------------

Query: show environments sorted by size
Output:
COMMAND: venvy ls -s size
CONFIDENCE: 0.93
EXPLANATION: Lists environments sorted by disk space used
--------------------------------------------------------------------------------

Query: cleanup old venvs
Output:
COMMAND: venvy cleanup
CONFIDENCE: 0.93
EXPLANATION: Removes virtual environments unused for 90 days
--------------------------------------------------------------------------------
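
Each response follows the same three-field layout, so it can be parsed into a structure before anything is executed. A minimal sketch, assuming the `COMMAND` / `CONFIDENCE` / `EXPLANATION` lines shown above; `parse_translation` is a hypothetical helper, not part of venvy:

```python
from typing import Any, Dict

def parse_translation(text: str) -> Dict[str, Any]:
    """Parse 'KEY: value' lines into a dict. Missing or malformed fields stay None."""
    result: Dict[str, Any] = {"command": None, "confidence": None, "explanation": None}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'KEY: value' line
        key, value = key.strip().upper(), value.strip()
        if key == "COMMAND":
            result["command"] = value
        elif key == "CONFIDENCE":
            try:
                result["confidence"] = float(value)
            except ValueError:
                pass  # leave confidence as None if it is not a number
        elif key == "EXPLANATION":
            result["explanation"] = value
    return result

example = """COMMAND: venvy ls -s size
CONFIDENCE: 0.93
EXPLANATION: Lists environments sorted by disk space used"""
print(parse_translation(example))
```

A caller could decline to run commands below some confidence threshold. Nothing in this notebook shows that the reported confidence is calibrated, so treat it as a hint rather than a probability.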


---

# Step 9: Save the Fine-tuned Model

We'll save both:
1. **LoRA adapters only** (small, ~16MB)
2. **Merged model** (base + adapters, ~2GB in 16-bit)

In [22]:
# Save LoRA adapters only (small file, quick to upload/download)
model.save_pretrained("venvy_gemma3_lora")
tokenizer.save_pretrained("venvy_gemma3_lora")

print("✅ LoRA adapters saved to: venvy_gemma3_lora/")
print("   Size: ~16MB (adapters only)")
print("\n💡 To load later:")
print("   model, tokenizer = FastLanguageModel.from_pretrained('venvy_gemma3_lora')")

✅ LoRA adapters saved to: venvy_gemma3_lora/
   Size: ~16MB (adapters only)

💡 To load later:
   model, tokenizer = FastLanguageModel.from_pretrained('venvy_gemma3_lora')


In [23]:
# Merge LoRA adapters into base model (for GGUF conversion)
print("🔄 Merging LoRA adapters into base model...")
print("   This combines the base Gemma 3 1B with our trained adapters")
print("   Result will be ~2GB in 16-bit format")

model.save_pretrained_merged(
    "venvy_gemma3_merged",
    tokenizer,
    save_method="merged_16bit",  # Save 16-bit weights (2 bytes per param)
)

print("\n✅ Merged model saved to: venvy_gemma3_merged/")
print("   Size: ~2GB (full model in 16-bit)")
print("\n💡 Next step: Convert to GGUF for CPU inference")

🔄 Merging LoRA adapters into base model...
   This combines the base Gemma 3 1B with our trained adapters
   Result will be ~2GB in 16-bit format

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.

Unsloth: Merge process complete. Saved to `/content/nlcli-wizard/venvy_gemma3_merged`

✅ Merged model saved to: venvy_gemma3_merged/
   Size: ~2GB (full model in 16-bit)

💡 Next step: Convert to GGUF for CPU inference


---

# Step 10: Convert to GGUF Format

## 📖 Understanding GGUF Conversion

### Why GGUF?
**GGUF** (GPT-Generated Unified Format) is optimized for CPU inference:
- Used by llama.cpp for efficient CPU/Metal/Vulkan inference
- Supports various quantization levels (2-bit to 8-bit)
- Memory-mapped for fast loading
- Cross-platform (Windows, Mac, Linux)

### Quantization Options:
```
Q2_K: 2-bit → ~300MB, fast but lower quality
Q3_K_M: 3-bit → ~450MB, good balance
Q4_0: 4-bit basic → ~550MB, standard
Q4_K_M: 4-bit k-quant mix → ~600MB estimated, 0.81 GB measured here ✅ (our choice)
Q5_K_M: 5-bit k-quant mix → ~700MB, excellent quality
Q8_0: 8-bit → ~1.1GB, minimal loss
```
Sizes are rough estimates. The Q4_K_M file produced in this notebook measured 0.81 GB, so the other estimates may be low as well.
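
A back-of-the-envelope size estimate is parameters × bits per weight ÷ 8. The sketch below uses a round 1.0B parameter count as an assumption, and only formats whose bits-per-weight follow directly from their block layout (Q4_0 and Q8_0 store one 16-bit scale per 32 weights):

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimated file size in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 1.0e9  # assumption: round figure for a "1B" model

# bits per weight = (payload bits + 16-bit scale) / 32 weights per block
FORMATS = {
    "F16": 16.0,
    "Q8_0": (32 * 8 + 16) / 32,  # 8.5
    "Q4_0": (32 * 4 + 16) / 32,  # 4.5
}

for name, bpw in FORMATS.items():
    print(f"{name}: {bpw} bits/weight -> ~{estimate_size_gb(N_PARAMS, bpw):.2f} GB")
```

Mixed formats such as Q4_K_M keep some tensors at higher precision, so the real file comes out larger than a pure 4.5-bit estimate, which fits the 0.81 GB measured later in this notebook.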

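### K-quants (the "K" in Q4_K_M):
The "K" refers to llama.cpp's k-quant formats, not K-means clustering. K-quants group weights into super-blocks and give each small block its own scale and minimum:
```
Q4_0:  blocks of 32 weights, one FP16 scale per block
Q4_K:  super-blocks of 256 weights = 8 blocks × 32 weights,
       each block has its own scale and minimum (stored in 6 bits)
                                         ↑
                        Levels adapt to the local weight range
```
The `_M` suffix is a mix: most tensors use Q4_K, and some sensitive tensors are kept at Q6_K.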
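
A simplified sketch of block-wise 4-bit quantization with a per-block scale and minimum, which is the general idea behind llama.cpp's 4-bit formats. It is an illustration only and does not reproduce the actual Q4_K bit layout; all function names are hypothetical:

```python
def quantize_block(block, bits=4):
    """Map a block of floats to integer codes in [0, 2**bits - 1],
    returning the codes plus the scale and minimum needed to decode them."""
    levels = (1 << bits) - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in block]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    """Reconstruct approximate floats from codes."""
    return [lo + c * scale for c in codes]

def roundtrip(weights, block_size, bits=4):
    """Quantize and dequantize `weights` in independent blocks."""
    out = []
    for i in range(0, len(weights), block_size):
        codes, scale, lo = quantize_block(weights[i:i + block_size], bits)
        out.extend(dequantize_block(codes, scale, lo))
    return out

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# 32 small weights followed by 32 large ones
weights = [0.01 * i for i in range(32)] + [1.0 * i for i in range(32)]

one_range = roundtrip(weights, block_size=64)  # one scale for everything
per_block = roundtrip(weights, block_size=32)  # one scale per 32 weights

print("error on small weights, single range:", max_abs_error(weights[:32], one_range[:32]))
print("error on small weights, per block:   ", max_abs_error(weights[:32], per_block[:32]))
```

With a single range the small weights all collapse onto the lowest levels; with per-block ranges they get their own 16 levels.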

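### Importance Matrix (imatrix):
Records which weights matter most for your specific task, based on calibration text:
1. Run inference on a sample of your dataset
2. Accumulate activation statistics for the inputs of each weight tensor
3. During quantization, weight the rounding error by those statistics
4. Weights that see large activations are reproduced more precisely

Result: lower quantization error at the same file size. The **15-20% better quality** figure quoted in this notebook is an estimate and is not measured here.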
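
A simplified sketch of importance-weighted quantization: when choosing a scale for a group of weights, the squared rounding error of each weight is multiplied by an importance value, so the chosen scale favors the weights that matter. This illustrates the idea only and is not llama.cpp's actual algorithm; the weights and importance values are made up:

```python
def weighted_error(weights, importance, scale):
    """Importance-weighted squared error of symmetric 4-bit rounding."""
    total = 0.0
    for w, imp in zip(weights, importance):
        q = max(-8, min(7, round(w / scale)))  # signed 4-bit code
        total += imp * (w - q * scale) ** 2
    return total

def best_scale(weights, importance, n_candidates=64):
    """Grid-search a scale around max|w| / 7 that minimizes the weighted error."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return 1.0
    base = max_abs / 7
    candidates = [base * (0.5 + i / n_candidates) for i in range(n_candidates + 1)]
    return min(candidates, key=lambda s: weighted_error(weights, importance, s))

weights = [0.9, -0.05, 0.31, -0.62, 0.08, 0.44]  # made-up values
importance = [0.1, 5.0, 0.2, 0.1, 4.0, 0.3]      # made-up: small weights matter most
uniform = [1.0] * len(weights)

s_plain = best_scale(weights, uniform)
s_imp = best_scale(weights, importance)

print("weighted error, plain scale:     ", weighted_error(weights, importance, s_plain))
print("weighted error, importance scale:", weighted_error(weights, importance, s_imp))
```

Both searches use the same candidate scales, so the importance-aware choice can never score worse on the weighted error than the plain one.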

In [56]:
%%capture
!git clone https://github.com/ggerganov/llama.cpp

In [57]:

print("🔨 Building llama.cpp with CMake...")
!mkdir -p llama.cpp/build
!cd llama.cpp/build && cmake .. -DCMAKE_BUILD_TYPE=Release
!cd llama.cpp/build && cmake --build . --config Release --target llama-quantize llama-imatrix -j 4

🔨 Building llama.cpp with CMake...
CMAKE_BUILD_TYPE=Release
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- ggml version: 0.9.4
-- ggml commit:  eeee367de
-- Configuring done (0.7s)
-- Generating done (0.5s)
-- Build files have been written to: /content/nlcli-wizard/llama.cpp/build
[  0%] Built target build_info
[  4%] Built target ggml-base
[ 14%] Built target ggml-cpu
[ 17%] Built target ggml
[ 17%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-model.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/models/deepseek2.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/models/dots1.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/models/dream.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/models/ernie4-5-moe.cpp.o

In [58]:
print("✅ llama.cpp built!")

✅ llama.cpp built!



In [59]:
# Step 1: Convert HuggingFace model to GGUF FP16
print("🔄 Step 1: Converting to GGUF FP16 format...")

!python llama.cpp/convert_hf_to_gguf.py \
    venvy_gemma3_merged \
    --outfile venvy_gemma3_fp16.gguf \
    --outtype f16

print("\n✅ GGUF FP16 model created: venvy_gemma3_fp16.gguf (~2GB)")

🔄 Step 1: Converting to GGUF FP16 format...
INFO:hf-to-gguf:Loading model: venvy_gemma3_merged
INFO:hf-to-gguf:Model architecture: Gemma3ForCausalLM
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:token_embd.weight,                 torch.bfloat16 --> F16, shape = {1152, 262144}
INFO:hf-to-gguf:blk.0.attn_norm.weight,            torch.bfloat16 --> F32, shape = {1152}
INFO:hf-to-gguf:blk.0.ffn_down.weight,             torch.bfloat16 --> F16, shape = {6912, 1152}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,             torch.bfloat16 --> F16, shape = {1152, 6912}
INFO:hf-to-gguf:blk.0.ffn_up.weight,               torch.bfloat16 --> F16, shape = {1152, 6912}
INFO:hf-to-gguf:blk.0.post_attention_norm.weight,  torch.bfloat16 --> F32, shape = {1152}
INFO:hf-to-gguf:blk.0.post_ffw_norm.weight,        torch.bfloat16 --> F32, shape = {1152}
INFO:hf-to-gguf:blk.0.ffn_
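
Before quantizing, it can be worth confirming that the converter wrote a valid file. A minimal sketch that reads the fixed-size GGUF header: a 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian. `read_gguf_header` is a hypothetical helper and assumes GGUF version 2 or later:

```python
import struct

def read_gguf_header(path):
    """Return version, tensor count and metadata key count from a GGUF file.
    Raises ValueError if the file does not start with the GGUF magic."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Calling `read_gguf_header("venvy_gemma3_fp16.gguf")` after the conversion should report the same tensor and key-value counts that llama.cpp prints when it loads the model.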

In [60]:
# ============================================================
# STEP 2: Generate Importance Matrix (CMAKE BUILD - 2025)
# ============================================================

print("🔄 Step 2: Generating importance matrix...")
print("\n📚 What's an importance matrix?")
print("   Activation statistics collected on venvy examples. The quantizer")
print("   uses them to keep the weights that matter most more precise.")
print("\n⏳ Takes ~5-10 minutes...\n")

# Create a text file with sample commands for imatrix generation
import os

n_examples = min(100, len(train_dataset))
with open('imatrix_data.txt', 'w') as f:
    for example in train_dataset.select(range(n_examples)):
        f.write(example['text'] + '\n\n')

print(f"✅ Created imatrix_data.txt with {n_examples} venvy examples")

# Build llama.cpp with CMAKE (new build system as of 2025)
print("\n🔨 Step 2a: Building llama.cpp with CMake...")
print("   (llama.cpp switched to CMake in 2025)")

if not os.path.exists('llama.cpp/build/bin/llama-imatrix'):
    print("   Installing build tools...")
    !apt-get update -qq
    !apt-get install -y -qq cmake build-essential

    print("   Building llama-imatrix (takes ~2-3 minutes)...")
    !mkdir -p llama.cpp/build
    !cd llama.cpp/build && cmake .. -DCMAKE_BUILD_TYPE=Release
    !cd llama.cpp/build && cmake --build . --config Release --target llama-imatrix -j 4

    print("✅ Build complete!")
else:
    print("✅ llama-imatrix already built!")

# Verify build
if os.path.exists('llama.cpp/build/bin/llama-imatrix'):
    print("✅ Tool verified: llama.cpp/build/bin/llama-imatrix")

    # Generate importance matrix
    print("\n🧠 Step 2b: Running importance analysis...")
    print(f"   Processing {n_examples} examples to collect activation statistics...")

    !llama.cpp/build/bin/llama-imatrix \
        -m venvy_gemma3_fp16.gguf \
        -f imatrix_data.txt \
        -o venvy_imatrix.dat \
        --chunks 100 \
        -ngl 0 \
        -t 4

    if os.path.exists('venvy_imatrix.dat') and os.path.getsize('venvy_imatrix.dat') > 0:
        print("\n✅ Importance matrix generated: venvy_imatrix.dat")
        print("\n💡 What this contains:")
        print("   - Activation statistics for the quantizable weight tensors")
        print("   - Used by the quantizer to weight rounding error")
        print("\n📊 Expected impact: lower quantization error at the same size")
    else:
        # Do not leave an empty placeholder: Step 3 would pass it to --imatrix
        print("\n⚠️ imatrix generation failed, continuing without it")
        if os.path.exists('venvy_imatrix.dat'):
            os.remove('venvy_imatrix.dat')
else:
    print("\n❌ Build failed - skipping imatrix (model will still be good!)")

🔄 Step 2: Generating importance matrix...

📚 What's an importance matrix?
   Activation statistics collected on venvy examples. The quantizer
   uses them to keep the weights that matter most more precise.

⏳ Takes ~5-10 minutes...

✅ Created imatrix_data.txt with 100 venvy examples

🔨 Step 2a: Building llama.cpp with CMake...
   (llama.cpp switched to CMake in 2025)
✅ llama-imatrix already built!
✅ Tool verified: llama.cpp/build/bin/llama-imatrix

🧠 Step 2b: Running importance analysis...
   Processing 100 examples to collect activation statistics...
build: 6989 (eeee367de) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
llama_model_loader: loaded meta data with 31 key-value pairs and 340 tensors from venvy_gemma3_fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_mo

In [61]:
# ============================================================
# STEP 3: Quantize to Q4_K_M (CHECK SYNTAX FIRST)
# ============================================================

print("🔄 Step 3: Quantizing to Q4_K_M with importance matrix...")

import os

def nonempty(path):
    """True if the file exists and has a non-zero size."""
    return os.path.exists(path) and os.path.getsize(path) > 0

# First, check what flags llama-quantize actually supports
print("\n📋 Checking llama-quantize supported flags...\n")
!llama.cpp/build/bin/llama-quantize --help

print("\n" + "="*80)
print("🔍 Look for '--imatrix' or similar flag in the output above")
print("="*80)

# Use the imatrix only if it is a real file, not an empty placeholder
use_imatrix = nonempty('venvy_imatrix.dat')
if use_imatrix:
    imatrix_size = os.path.getsize('venvy_imatrix.dat') / 1024
    print(f"\n✅ imatrix file exists: {imatrix_size:.2f} KB")
else:
    print("\n⚠️ Warning: venvy_imatrix.dat missing or empty, quantizing without it")

# Remove any stale output so the checks below reflect this run
if os.path.exists('venvy_gemma3_q4km.gguf'):
    os.remove('venvy_gemma3_q4km.gguf')

print("\n🔄 Attempting quantization...\n")

# Method 1: with imatrix. Options go before the positional arguments,
# matching the usage line printed by --help.
if use_imatrix:
    !llama.cpp/build/bin/llama-quantize \
        --imatrix venvy_imatrix.dat \
        venvy_gemma3_fp16.gguf \
        venvy_gemma3_q4km.gguf \
        Q4_K_M

    if nonempty('venvy_gemma3_q4km.gguf'):
        print("✅ Method 1 worked!")
    else:
        print("⚠️ Method 1 (--imatrix) didn't work, trying alternative...")

# Method 2: without imatrix (still produces good results)
if not nonempty('venvy_gemma3_q4km.gguf'):
    !llama.cpp/build/bin/llama-quantize \
        venvy_gemma3_fp16.gguf \
        venvy_gemma3_q4km.gguf \
        Q4_K_M

# Verify result
if nonempty('venvy_gemma3_q4km.gguf'):
    print("\n✅ Quantized model created!")

    fp16_size = os.path.getsize('venvy_gemma3_fp16.gguf') / 1e9
    q4km_size = os.path.getsize('venvy_gemma3_q4km.gguf') / 1e9
    compression_ratio = fp16_size / q4km_size

    print(f"\n📊 Compression Statistics:")
    print(f"   Original (FP16): {fp16_size:.2f} GB")
    print(f"   Quantized (Q4_K_M): {q4km_size:.2f} GB")
    print(f"   Compression ratio: {compression_ratio:.1f}x")
    print(f"   Space saved: {(fp16_size - q4km_size):.2f} GB")
else:
    print("\n❌ Quantization failed completely!")

🔄 Step 3: Quantizing to Q4_K_M with importance matrix...

📋 Checking llama-quantize supported flags...

usage: llama.cpp/build/bin/llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights]
       [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--tensor-type] [--prune-layers] [--keep-split] [--override-kv]
       model-f32.gguf [model-quant.gguf] type [nthreads]

  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor

---

# Step 11: Test GGUF Model

Let's verify the quantized model works correctly!

In [62]:
%%capture
# Install llama-cpp-python for testing
# (%%capture must be the first line of the cell; it also hides the print below)
!pip install llama-cpp-python

print("✅ llama-cpp-python installed!")

In [63]:
from llama_cpp import Llama

# Load GGUF model
print("🔄 Loading GGUF model...")

llm = Llama(
    model_path="venvy_gemma3_q4km.gguf",
    n_ctx=512,  # Context window
    n_threads=4,  # CPU threads
    verbose=False,
)

print("✅ GGUF model loaded!")
print(f"   Model size: {q4km_size:.2f} GB")
print(f"   Context window: 512 tokens")

🔄 Loading GGUF model...


llama_context: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_unified_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)


✅ GGUF model loaded!
   Model size: 0.81 GB
   Context window: 512 tokens


In [64]:
# Test GGUF model on queries
def test_gguf_translation(nl_query):
    instruction = f"Translate to venvy command: {nl_query}"
    # Use the same Gemma chat format as in training, not an Alpaca-style template
    prompt = f"<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model\n"

    response = llm(
        prompt,
        max_tokens=128,
        temperature=0.1,
        top_p=0.9,
        stop=["<end_of_turn>", "\n\n"],
    )

    return response['choices'][0]['text'].strip()

print("🧪 Testing GGUF model...\n")
print("="*80)

test_queries = [
    "list all environments",
    "register this venv as myproject",
    "show current environment",
]

for query in test_queries:
    response = test_gguf_translation(query)
    print(f"Query: {query}")
    print(f"Response: {response}")
    print("-"*80)

print("\n💡 If the responses look correct, your GGUF model is ready!")

🧪 Testing GGUF model...

Query: list all environments
Response: COMMAND: venvy ls
CONFIDENCE: 0.94
EXPLANATION: Lists all registered environments
--------------------------------------------------------------------------------
Query: register this venv as myproject
Response: COMMAND: venvy register --name myproject
CONFIDENCE: 0.94
EXPLANATION: Registers .venv with custom name 'myproject'
--------------------------------------------------------------------------------
Query: show current environment
Response: COMMAND: venvy current
CONFIDENCE: 0.93
EXPLANATION: Shows currently active virtual environment
--------------------------------------------------------------------------------

💡 If the responses look correct, your GGUF model is ready!


---

# Step 12: Download the Models

Download these files to your local machine:
1. `venvy_gemma3_q4km.gguf` - Final quantized model (0.81 GB)
2. `venvy_gemma3_lora/` - LoRA adapters (~16MB)

You can also push them to your GitHub repository.

In [65]:
# Option 1: Download via Colab UI
from google.colab import files

print("📥 Downloading models...")
print("   This may take a few minutes for the ~810MB GGUF file\n")

# Download GGUF model
files.download('venvy_gemma3_q4km.gguf')

print("\n✅ Model downloaded!")
print("\n💡 To push to GitHub:")
print("   1. Create models/ directory in your repo")
print("   2. Add venvy_gemma3_q4km.gguf to models/")
print("   3. Use Git LFS for large files (>100MB)")
print("   4. Or host on HuggingFace Model Hub")

📥 Downloading models...
   This may take a few minutes for the ~810MB GGUF file


✅ Model downloaded!

💡 To push to GitHub:
   1. Create models/ directory in your repo
   2. Add venvy_gemma3_q4km.gguf to models/
   3. Use Git LFS for large files (>100MB)
   4. Or host on HuggingFace Model Hub
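
Large downloads can be truncated, so it may be worth comparing a checksum computed in Colab with one computed locally. A minimal sketch using only the standard library; `sha256_of_file` is a hypothetical helper name:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks
    so the whole model never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run `sha256_of_file("venvy_gemma3_q4km.gguf")` in Colab before downloading and again on the local copy; the two digests should be identical.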


---

# 🎉 Training Complete!

## What You Accomplished:

1. ✅ **Fine-tuned Gemma 3 1B** on 1,350 venvy command examples
2. ✅ **Used QLoRA** for efficient training (99.3% fewer trainable parameters)
3. ✅ **Leveraged Unsloth** for up to 2x speed, 70% less VRAM
4. ✅ **Quantized to Q4_K_M** with importance matrix
5. ✅ **Created GGUF model** for CPU inference (0.81 GB)

## What You Learned:

### 1. Unsloth Benefits:
- Custom CUDA kernels for 2x faster LoRA operations
- Dynamic 4-bit quantization (smart layer selection)
- Gradient checkpointing for 70% VRAM reduction

### 2. QLoRA Mechanics:
- Low-rank decomposition (train a small update B × A while the full W stays frozen)
- Train only 0.7% of parameters (8M vs 1.1B)
- NF4 quantization for base model (4-bit compressed)
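
The saving from a low-rank update can be checked with simple arithmetic: a full `d_out × d_in` matrix has `d_in * d_out` parameters, while a rank-`r` pair of matrices has `r * (d_in + d_out)`. The sketch below uses the 1152 × 6912 feed-forward shape from the conversion log; the rank of 16 is an assumption for illustration, not a value confirmed by this section:

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

D_IN, D_OUT = 1152, 6912  # ffn_up shape from the conversion log
RANK = 16                 # assumption for illustration

full = full_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)
print(f"full matrix: {full:,} params")
print(f"LoRA factors: {lora:,} params ({lora / full:.2%} of the matrix)")
```

The per-matrix fraction differs from the 0.7% whole-model figure because adapters are attached to only some matrices and the large embedding table is not adapted.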

### 3. Quantization Techniques:
- **4-bit Quantization**: up to ~4x smaller than 16-bit in theory; the file here came out at 0.81 GB
- **K-quants**: per-block scales and minimums inside 256-weight super-blocks
- **Importance Matrix**: weights rounding error by activation statistics
- **GGUF Format**: Optimized for CPU inference

### 4. Training Best Practices:
- Learning rate warmup prevents early instability
- Cosine decay improves final convergence
- Gradient accumulation enables larger effective batch size
- Mixed precision (FP16) speeds up training on GPUs with FP16 tensor cores, such as the T4
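
Warmup followed by cosine decay can be written as one small function. A minimal sketch; the step counts and peak learning rate below are placeholder values, not the settings used in this notebook:

```python
import math

def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay from peak_lr to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Placeholder schedule: 100 steps, 10 warmup steps, peak 2e-4
for step in (0, 5, 10, 55, 100):
    print(step, f"{lr_at(step, 100, 10, 2e-4):.6f}")
```

The rate climbs linearly during warmup, peaks at the end of warmup, passes half the peak midway through the decay, and reaches zero at the last step.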

## Next Steps:

1. **Integrate with venvy** - Add NL parser using llama-cpp-python
2. **Test accuracy** - Evaluate on held-out examples
3. **Optimize inference** - Add caching, daemon process
4. **Create demo** - Video showing natural language CLI

## Files to Keep:

```
venvy_gemma3_q4km.gguf        # Final model (0.81 GB) ✅ IMPORTANT
venvy_gemma3_lora/            # LoRA adapters (~16MB)
venvy_imatrix.dat             # Importance matrix
training_logs.txt             # Training metrics
```

---

**Congratulations! You've successfully fine-tuned a state-of-the-art SLM! 🎊**



In [None]:
#test.py

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="models/venvy_gemma3_q4km.gguf",
    n_ctx=512,
    n_threads=4,
    verbose=False,
)

def translate(nl_query):
    """Translate natural language to venvy command."""
    prompt = f"""<start_of_turn>user
Translate to venvy command: {nl_query}<end_of_turn>
<start_of_turn>model
"""

    response = llm(prompt, max_tokens=128, temperature=0.1, stop=["<end_of_turn>"])
    return response['choices'][0]['text'].strip()

# Demo
queries = [
    "list all environments",
    "register this venv as myproject",
    "show current environment",
    "cleanup old venvs",
    "scan home directory for environments",
    "show statistics"
]

print("🤖 Gemma 3 1B - venvy Command Translator\n")
print("="*80)

for query in queries:
    result = translate(query)
    print(f"\n💬 Query: \"{query}\"")
    print("⚡ Output:")
    print(result)
    print("-"*80)

In [None]:
# evaluate_accuracy.py
from llama_cpp import Llama
import json

llm = Llama(model_path="models/venvy_gemma3_q4km.gguf", n_ctx=512, n_threads=4, verbose=False)

def extract_command(text):
    """Return the text of the 'COMMAND:' line, or the first line as a fallback."""
    for line in text.splitlines():
        if line.strip().startswith('COMMAND:'):
            return line.split('COMMAND:', 1)[1].strip()
    lines = text.strip().splitlines()
    return lines[0].strip() if lines else ''

# Load validation set
# Assumes the last 150 lines were held out from training;
# if they were trained on, this measures memorization, not accuracy
with open('data/venvy_training.jsonl') as f:
    examples = [json.loads(line) for line in f if line.strip()][-150:]

correct = 0
total = len(examples)

for ex in examples:
    query = ex['instruction'].replace('Translate to venvy command: ', '')
    expected_cmd = extract_command(ex['output'])

    # Get model prediction (temperature 0 for repeatable results)
    prompt = f"<start_of_turn>user\n{ex['instruction']}<end_of_turn>\n<start_of_turn>model\n"
    response = llm(prompt, max_tokens=128, temperature=0.0, stop=["<end_of_turn>"])
    predicted_cmd = extract_command(response['choices'][0]['text'])

    # Check if correct (exact string match)
    if predicted_cmd == expected_cmd:
        correct += 1
    else:
        print(f"❌ Query: {query}")
        print(f"   Expected: {expected_cmd}")
        print(f"   Got: {predicted_cmd}\n")

accuracy = correct / total if total else 0.0
print(f"\n📊 Accuracy: {correct}/{total} = {accuracy:.1%}")
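
Exact string comparison counts harmless differences, such as extra spaces or quoting, as errors. A minimal sketch of a more forgiving comparison that tokenizes both commands the way a shell would; `commands_match` is a hypothetical helper, and it still treats reordered flags or short versus long option names as different:

```python
import shlex
from typing import List

def normalize_command(cmd: str) -> List[str]:
    """Split a command into shell tokens; fall back to whitespace splitting
    if the quoting is unbalanced."""
    try:
        return shlex.split(cmd)
    except ValueError:
        return cmd.split()

def commands_match(predicted: str, expected: str) -> bool:
    """True if both commands have the same tokens in the same order."""
    return normalize_command(predicted) == normalize_command(expected)

print(commands_match("venvy ls  -s size", "venvy ls -s size"))
print(commands_match("venvy register --name 'myproject'", "venvy register --name myproject"))
print(commands_match("venvy ls", "venvy cleanup"))
```

Swapping this in for the `==` check would make the accuracy figure less sensitive to formatting, at the cost of no longer being a strict exact-match metric.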