# 🇱🇰 Production-Grade LLM Fine-Tuning: Sri Lankan AI Tour Guide

**Model:** Meta-Llama-3.1-8B-Instruct (4-bit Quantized)  
**Framework:** Unsloth + LoRA (Low-Rank Adaptation)  
**Environment:** Google Colab Free Tier (T4 GPU)  
**Use Case:** Fine-tuning for culturally-aware Sri Lankan tourism assistance

---

## 📋 Workflow Overview

1. **Environment Setup** - Install dependencies and verify GPU
2. **Model Loading** - Load quantized Llama 3.1 8B with LoRA adapters
3. **Data Preparation** - Format JSONL dataset for instruction tuning
4. **Training** - Fine-tune with optimized hyperparameters
5. **Inference Testing** - Compare before/after performance
6. **Model Export** - Save as GGUF for local deployment

---

## Step 1: Environment Setup & Dependency Installation

The cell below installs the Unsloth optimization framework along with the libraries it relies on:
- **unsloth**: Optimized fine-tuning kernels (the project reports roughly 2-5x faster training and up to 80% less memory use)
- **xformers**: Memory-efficient attention mechanisms
- **trl**: Transformer Reinforcement Learning (provides `SFTTrainer`)
- **peft**: Parameter-Efficient Fine-Tuning (LoRA)
- **accelerate**: Device placement, mixed precision, and distributed training utilities
- **bitsandbytes**: 4-bit quantization support

In [None]:
%%capture
# Install Unsloth and dependencies (suppress output for cleaner notebook)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

### Verify GPU Availability

Ensuring we have access to a CUDA-capable GPU (T4 expected on Colab Free Tier).

In [None]:
import torch

# Check GPU availability and specifications
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9  # Convert to GB
    print(f"✅ GPU Detected: {gpu_name}")
    print(f"📊 Total GPU Memory: {gpu_memory:.2f} GB")
    print(f"🔢 CUDA Version: {torch.version.cuda}")
else:
    print("❌ No GPU detected. Please enable GPU in Runtime > Change runtime type > Hardware accelerator > GPU")
    raise RuntimeError("GPU is required for this notebook")

## Step 2: Model Loading with Unsloth Optimization

Loading the **Meta-Llama-3.1-8B-Instruct** model in 4-bit quantization mode with LoRA adapters.
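
As a rough back-of-the-envelope check on why 4-bit loading matters on a T4, the sketch below estimates the memory taken by the weights alone. It assumes about 8.03 billion parameters and ignores quantization constants, activations, and optimizer state; `weight_memory_gb` is a hypothetical helper, not part of Unsloth.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8.03e9  # assumed parameter count for Llama 3.1 8B

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the float16 weights alone come to about 16 GB, which leaves no headroom on a 16 GB T4, while the 4-bit weights need about 4 GB.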

### LoRA Configuration:
- **r=16**: Rank of the low-rank matrices (higher = more capacity but slower)
- **lora_alpha=16**: Scaling factor for LoRA updates
- **Target Modules**: All linear projection layers in the transformer architecture
  - `q_proj`, `k_proj`, `v_proj`: Query, Key, Value projections (attention)
  - `o_proj`: Output projection (attention)
  - `gate_proj`, `up_proj`, `down_proj`: Feed-forward network layers

This configuration maximizes model expressiveness while remaining memory-efficient.
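
To make the cost of this configuration concrete, the following sketch counts the trainable weights that rank-16 adapters add across the seven target modules. The layer shapes are assumptions about Llama 3.1 8B (hidden size 4096, feed-forward size 14336, 32 layers, key/value width 1024), and `lora_param_count` is a hypothetical helper.

```python
# Assumed Llama 3.1 8B shapes (hidden size 4096, FFN size 14336,
# 32 layers, key/value projection width 1024 from grouped-query attention).
HIDDEN, FFN, KV, LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) for each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}

def lora_param_count(r: int) -> int:
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values())
    return per_layer * LAYERS

R, ALPHA = 16, 16
print(f"Trainable LoRA parameters at r={R}: {lora_param_count(R):,}")
print(f"Update scale (lora_alpha / r): {ALPHA / R}")
```

With these shapes the count is 41,943,040, roughly 0.5% of the base model, and doubling `r` doubles it.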

In [None]:
from unsloth import FastLanguageModel
import torch

# Model configuration
MODEL_NAME = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
MAX_SEQ_LENGTH = 2048  # Maximum sequence length for training
LOAD_IN_4BIT = True    # Use 4-bit quantization

# Load the model and tokenizer with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect optimal dtype (float16 for T4)
    load_in_4bit=LOAD_IN_4BIT,
)

print(f"✅ Model loaded: {MODEL_NAME}")
print(f"📏 Max Sequence Length: {MAX_SEQ_LENGTH}")
print(f"🔧 Quantization: 4-bit")

### Configure LoRA Adapters

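Applying LoRA (Low-Rank Adaptation) to enable efficient fine-tuning. Instead of updating all 8 billion parameters, we freeze the base weights and train only small adapter matrices (well under 1% of the parameter count), which sharply reduces the memory needed for gradients and optimizer state.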
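
A toy sketch of what an adapted layer computes may help here. It uses plain Python lists and made-up numbers, and it assumes the usual LoRA convention that the up-projection `B` starts at zero; `matvec` and `lora_forward` are hypothetical helpers, not Unsloth or PEFT functions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: 3 inputs, 2 outputs, rank 1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight (2 x 3)
A = [[0.5, -0.5, 1.0]]                   # down-projection (r x 3)
B_init = [[0.0], [0.0]]                  # up-projection starts at zero (2 x r)
B_trained = [[0.1], [-0.2]]              # illustrative values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_init, x, alpha=1, r=1))     # same as the base layer
print(lora_forward(W, A, B_trained, x, alpha=1, r=1))  # base output plus low-rank update
```

Because the update is zero until `B` is trained, a model with freshly attached adapters behaves like the base model, which is why a baseline test can be run right after the adapters are applied.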

In [None]:
# Apply LoRA adapters to the model
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank - higher values add capacity but use more memory and train slower
    lora_alpha=16,  # LoRA scaling factor (updates are scaled by lora_alpha / r)
    lora_dropout=0,  # Dropout for LoRA layers (0 = no dropout; the setting Unsloth optimizes for)
    target_modules=[
        "q_proj",    # Query projection (attention)
        "k_proj",    # Key projection (attention)
        "v_proj",    # Value projection (attention)
        "o_proj",    # Output projection (attention)
        "gate_proj", # Gate projection (FFN)
        "up_proj",   # Up projection (FFN)
        "down_proj", # Down projection (FFN)
    ],
    bias="none",  # Leave bias terms frozen (no bias parameters are trained)
    use_gradient_checkpointing="unsloth",  # Enable gradient checkpointing for memory efficiency
    random_state=42,  # Seed for reproducibility
    use_rslora=False,  # Rank-stabilized LoRA (optional, can improve stability)
    loftq_config=None,  # LoftQ quantization config (advanced)
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print("✅ LoRA adapters applied successfully")
print(f"📊 Trainable parameters: {trainable:,}")
# Note: 4-bit weights are stored packed, so this total can read lower than 8 billion
print(f"📊 Total parameters (as stored): {total:,}")
print(f"📈 Trainable % (of stored count): {100 * trainable / total:.2f}%")

## Step 3: Data Preparation & Formatting

### Dataset Structure

Each line of our JSONL dataset is a JSON object with `instruction`, `input`, and `output` fields. We'll format these records into the **Alpaca instruction format** with a custom system prompt that defines the AI's personality and behavior.
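
As an illustration of the record shape, the sketch below parses one JSONL line and checks for the `instruction`, `input`, and `output` fields that the formatting code reads. The sample text is invented for the example and `validate_record` is a hypothetical helper.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")

def validate_record(line: str) -> dict:
    """Parse one JSONL line and check it has the three expected string fields."""
    record = json.loads(line)
    missing = [k for k in REQUIRED_KEYS if k not in record]
    if missing:
        raise ValueError(f"Missing keys: {missing}")
    for key in REQUIRED_KEYS:
        if not isinstance(record[key], str):
            raise ValueError(f"Field {key!r} must be a string")
    return record

# Hypothetical example line; the real dataset's wording will differ
sample_line = json.dumps({
    "instruction": "Answer the traveller's question.",
    "input": "What is the best time of day to climb Sigiriya?",
    "output": "Ayubowan! Early morning is usually the most comfortable time to climb.",
})

record = validate_record(sample_line)
print(sorted(record))
```

Running a check like this over every line before training surfaces malformed records early, instead of as a `KeyError` during formatting.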

### System Prompt Design

The system prompt establishes:
- **Identity**: Expert Sri Lankan Tour Guide
- **Tone**: Warm, hospitable (using "Ayubowan" greeting)
- **Capabilities**: Accurate travel advice, logistical planning
- **Technical Requirements**: Strict JSON formatting for tool calls

In [None]:
# Define the custom system prompt for the Sri Lankan Tour Guide
SYSTEM_PROMPT = """You are an expert Sri Lankan Tour Guide. You speak with a warm, hospitable tone ('Ayubowan'). You provide accurate travel advice, check for logistical constraints, and format tool calls strictly as JSON."""

# Alpaca-style instruction template
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

print("✅ Prompt templates defined")
print(f"\n📝 System Prompt:\n{SYSTEM_PROMPT}")

### Upload Dataset

Upload your `finetune_dataset.jsonl` file using the file upload widget below.

**Note**: If running locally or in a different environment, modify the file path accordingly.

In [None]:
from google.colab import files
import os

# Upload the dataset file
print("📤 Please upload your finetune_dataset.jsonl file:")
uploaded = files.upload()

# Verify the file was uploaded
if 'finetune_dataset.jsonl' in uploaded:
    print("✅ Dataset uploaded successfully")
    DATASET_PATH = "finetune_dataset.jsonl"
elif uploaded:
    # Fall back to the first uploaded file
    DATASET_PATH = list(uploaded.keys())[0]
    print("❌ Expected file 'finetune_dataset.jsonl' not found")
    print(f"Available files: {list(uploaded.keys())}")
    print(f"⚠️ Using {DATASET_PATH} instead")
else:
    # Nothing was uploaded, so later cells would fail on an undefined DATASET_PATH
    raise FileNotFoundError("No file uploaded; re-run this cell and select finetune_dataset.jsonl")

### Load and Format Dataset

Loading the JSONL file and converting each record into the Alpaca instruction format. The system prompt is integrated into the instruction field to condition the model's behavior.

In [None]:
import json
from datasets import Dataset

def load_jsonl(file_path):
    """Load JSONL file into a list of dictionaries, skipping blank lines."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # A trailing newline or empty line is not a record
                data.append(json.loads(line))
    return data

def format_alpaca_prompt(instruction, input_text, output_text):
    """Format a single example into Alpaca instruction format."""
    return ALPACA_PROMPT.format(instruction, input_text, output_text)

# End-of-sequence token; without it the model never learns where a response stops
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """
    Format dataset examples for instruction tuning.

    This function:
    1. Combines the system prompt with the task instruction
    2. Uses the input field as context
    3. Formats everything into the Alpaca template
    4. Appends the EOS token to mark the end of the response
    """
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]

    texts = []
    for instruction, input_text, output_text in zip(instructions, inputs, outputs):
        # Combine system prompt with the specific instruction
        full_instruction = f"{SYSTEM_PROMPT}\n\n{instruction}"

        # Format using Alpaca template, then terminate with EOS
        text = format_alpaca_prompt(full_instruction, input_text, output_text) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

# Load the JSONL dataset
print(f"📂 Loading dataset from: {DATASET_PATH}")
raw_data = load_jsonl(DATASET_PATH)
print(f"✅ Loaded {len(raw_data)} examples")

# Convert to Hugging Face Dataset format
dataset = Dataset.from_list(raw_data)
print(f"✅ Dataset converted to Hugging Face format")
print(f"📊 Dataset features: {dataset.features}")
print(f"📊 Dataset size: {len(dataset)} examples")

### Debug: Inspect Formatted Examples

Before training, let's verify the data formatting is correct by examining a sample.

In [None]:
# Format the dataset and inspect the first example
formatted_dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
    num_proc=2,  # Use 2 processes for faster formatting
    remove_columns=dataset.column_names,  # Remove original columns, keep only 'text'
)

print("=" * 80)
print("📋 FORMATTED EXAMPLE (First Training Sample)")
print("=" * 80)
print(formatted_dataset[0]['text'])
print("=" * 80)
print(f"\n✅ Total formatted examples: {len(formatted_dataset)}")
print(f"📏 Average text length: {sum(len(x['text']) for x in formatted_dataset) / len(formatted_dataset):.0f} characters")

## Step 4: Training Configuration & Fine-Tuning

### Hyperparameter Rationale

- **max_seq_length=2048**: Supports longer conversations while fitting in T4 GPU memory
- **batch_size=2 + grad_accum=4**: Effective batch size of 8 (balances speed vs. memory)
- **learning_rate=2e-4**: Standard for LoRA fine-tuning (higher than full fine-tuning)
- **adamw_8bit**: Memory-efficient optimizer (8-bit quantized Adam)
- **max_steps**: Configurable training duration (100 steps ≈ quick prototype)
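
The batch-size arithmetic above, and what a fixed step budget means in passes over the data, can be sketched as follows. The dataset size of 400 is a placeholder assumption, and both helper functions are hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices

def epochs_covered(max_steps: int, batch: int, dataset_size: int) -> float:
    """How many passes over the data a fixed step budget amounts to."""
    return max_steps * batch / dataset_size

batch = effective_batch_size(per_device=2, grad_accum=4)
DATASET_SIZE = 400  # hypothetical; substitute len(dataset)

print(f"Effective batch size: {batch}")
print(f"Examples seen in 100 steps: {100 * batch}")
print(f"Approximate epochs: {epochs_covered(100, batch, DATASET_SIZE):.1f}")
```

If the epoch figure comes out well above a handful of passes for your dataset, that is a hint to lower `max_steps` or add data, since many repeats of a small dataset tend to overfit.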

### Training Strategy

Using **Supervised Fine-Tuning (SFT)** via the TRL library's `SFTTrainer`, optimized for instruction-following tasks.
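
For intuition about the loss value reported during training, here is a minimal sketch of the next-token cross-entropy that supervised fine-tuning of a causal language model minimizes. The logits are made up, the vocabulary has only three tokens, and the helpers are hypothetical rather than TRL internals.

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log softmax(logits)[target_index]."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target_index]

def sequence_loss(per_position_logits, target_indices):
    """Average the per-token losses over the sequence."""
    losses = [next_token_loss(l, t) for l, t in zip(per_position_logits, target_indices)]
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens, two positions
logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(f"{sequence_loss(logits, targets):.4f}")
```

A position where the model is undecided across the whole vocabulary costs `log(vocab_size)`, and the loss falls toward zero as the model puts more probability on the correct next token.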

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

# ============================================================================
# CONFIGURABLE HYPERPARAMETERS
# ============================================================================

MAX_STEPS = 100  # 🔧 CHANGE THIS VALUE to train longer/shorter
                  # Typical values: 100-200 (quick test), 500-1000 (good results), 2000+ (production)

# ============================================================================

training_args = TrainingArguments(
    # Output & Logging
    output_dir="./outputs",
    run_name="sri-lankan-tour-guide-llama-3.1-8b",
    logging_dir="./logs",
    logging_steps=10,
    
    # Training Duration
    max_steps=MAX_STEPS,
    
    # Batch Size & Gradient Accumulation
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # Effective batch size = 2 * 4 = 8
    
    # Optimization
    learning_rate=2e-4,
    optim="adamw_8bit",  # 8-bit Adam optimizer for memory efficiency
    weight_decay=0.01,   # L2 regularization
    warmup_steps=10,     # Gradual learning rate warmup
    
    # Precision & Performance
    fp16=torch.cuda.get_device_capability()[0] < 8,   # FP16 on pre-Ampere GPUs such as the T4
    bf16=torch.cuda.get_device_capability()[0] >= 8,  # BF16 only with native support (compute capability 8.0+)
    
    # Checkpointing
    save_strategy="steps",
    save_steps=50,
    save_total_limit=2,  # Keep only last 2 checkpoints to save disk space
    
    # Misc
    seed=42,
    report_to="none",  # Disable wandb/tensorboard for cleaner output
)

print("✅ Training arguments configured")
print(f"\n📊 Training Configuration:")
print(f"   Max Steps: {MAX_STEPS}")
print(f"   Batch Size (per device): {training_args.per_device_train_batch_size}")
print(f"   Gradient Accumulation: {training_args.gradient_accumulation_steps}")
print(f"   Effective Batch Size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   Learning Rate: {training_args.learning_rate}")
print(f"   Optimizer: {training_args.optim}")
print(f"   Precision: {'BF16' if training_args.bf16 else 'FP16'}")

### Initialize SFT Trainer

Creating the trainer with our model, dataset, and hyperparameters.

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,
    args=training_args,
    dataset_text_field="text",  # Column containing the formatted prompts
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,  # Number of processes for dataset preprocessing (tokenization)
    packing=False,  # Don't pack multiple examples into one sequence (cleaner for instruction tuning)
)

print("✅ SFTTrainer initialized successfully")
print(f"📊 Training dataset size: {len(formatted_dataset)} examples")
print(f"📊 Estimated training time: ~{MAX_STEPS * 2 / 60:.1f} minutes (approximate)")

## Step 5a: Inference Testing - BEFORE Fine-Tuning

Testing the base model's performance on our target query to establish a baseline. This allows us to compare the improvements after fine-tuning.
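
The generation call in this step samples with `temperature=0.7` and `top_p=0.9`, so repeated runs can produce different text. The following conceptual sketch shows how those two settings narrow the set of candidate tokens; it is not the `transformers` implementation, the logits are made up, and `nucleus_candidates` is a hypothetical helper.

```python
import math

def nucleus_candidates(logits, temperature=0.7, top_p=0.9):
    """Return (token_index, probability) pairs kept by temperature + top-p filtering."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break

    norm = sum(p for _, p in kept)
    return [(i, p / norm) for i, p in kept]  # renormalise before sampling

toy_logits = [3.0, 2.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
for index, prob in nucleus_candidates(toy_logits):
    print(index, round(prob, 3))
```

Because the output is sampled, a single before/after pair is only a rough signal; generating a few samples per query, or fixing the random seed, makes the comparison steadier.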

In [None]:
# Test query
TEST_QUERY = "I want to visit Sigiriya but I am on a budget. Any advice?"

# Prepare the prompt in Alpaca format
test_instruction = f"{SYSTEM_PROMPT}\n\nYou are Travion, a friendly and knowledgeable Sri Lankan tour guide AI assistant. Help tourists plan their trips, provide cultural insights, and share local knowledge with warmth and authenticity."
test_prompt = ALPACA_PROMPT.format(test_instruction, TEST_QUERY, "")

print("=" * 80)
print("🧪 BASELINE INFERENCE - BEFORE FINE-TUNING")
print("=" * 80)
print(f"Query: {TEST_QUERY}\n")

# Set model to inference mode
FastLanguageModel.for_inference(model)

# Tokenize and generate
inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    use_cache=True,
)

# Decode and display
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# Extract only the response part (after "### Response:")
if "### Response:" in generated_text:
    response = generated_text.split("### Response:")[1].strip()
else:
    response = generated_text

print(f"Response (BEFORE):\n{response}")
print("=" * 80)

## Step 5b: Start Fine-Tuning

Beginning the training process. This cell will display real-time loss metrics and progress.

**Expected behavior:**
- Loss should decrease over time
- Training will take approximately 3-10 minutes for 100 steps on T4 GPU
- Checkpoints will be saved every 50 steps
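
One way to check the first expectation without reading the log by eye is to compare early and late logged losses. The sketch below works on a list shaped like `trainer.state.log_history`; the entries here are invented for illustration and `loss_trend` is a hypothetical helper.

```python
def loss_trend(log_history, window=3):
    """Compare the mean of the first and last `window` logged training losses."""
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    if len(losses) < 2 * window:
        raise ValueError("Not enough logged losses to compare")
    first = sum(losses[:window]) / window
    last = sum(losses[-window:]) / window
    return first, last, last < first

# Made-up log entries shaped like trainer.state.log_history
fake_history = [
    {"loss": 2.1, "step": 10}, {"loss": 1.8, "step": 20}, {"loss": 1.6, "step": 30},
    {"loss": 1.4, "step": 40}, {"loss": 1.3, "step": 50}, {"loss": 1.2, "step": 60},
    {"train_runtime": 300.0, "step": 60},  # final summary entry has no "loss" key
]
first, last, decreased = loss_trend(fake_history)
print(f"first={first:.2f} last={last:.2f} decreased={decreased}")
```

After training, the same function could be called as `loss_trend(trainer.state.log_history)`. With `logging_steps=10` and 100 steps there should be enough entries for the default window.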

In [None]:
import time

print("🚀 Starting fine-tuning...\n")
start_time = time.time()

# The baseline test switched the model to inference mode; switch back before training
FastLanguageModel.for_training(model)

# Train the model
trainer_stats = trainer.train()

end_time = time.time()
training_duration = end_time - start_time

print(f"\n✅ Training completed!")
print(f"⏱️  Total training time: {training_duration / 60:.2f} minutes")
print(f"📊 Final training loss: {trainer_stats.training_loss:.4f}")
print(f"📈 Steps completed: {trainer_stats.global_step}")

## Step 5c: Inference Testing - AFTER Fine-Tuning

Testing the fine-tuned model on the same query to evaluate improvements. We expect:
- More culturally appropriate responses ("Ayubowan" greetings)
- Better knowledge of Sri Lankan tourism
- More helpful budget-specific advice

In [None]:
print("=" * 80)
print("🧪 INFERENCE - AFTER FINE-TUNING")
print("=" * 80)
print(f"Query: {TEST_QUERY}\n")

# Set model to inference mode (IMPORTANT: re-apply for optimized inference)
FastLanguageModel.for_inference(model)

# Tokenize and generate with the same prompt
inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    use_cache=True,
)

# Decode and display
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
if "### Response:" in generated_text:
    response = generated_text.split("### Response:")[1].strip()
else:
    response = generated_text

print(f"Response (AFTER):\n{response}")
print("=" * 80)
print("\n💡 Compare the before/after responses to evaluate fine-tuning effectiveness!")

## Step 6: Model Export & Saving

### Export Options

We'll save the fine-tuned model in multiple formats:

1. **GGUF (q4_k_m)** - For local deployment with Ollama/llama.cpp
2. **Hugging Face Format** - For deployment with transformers library
3. **LoRA Adapters Only** - Lightweight format (just the fine-tuned weights)

### 6.1: Save as GGUF (Quantized for Ollama)

GGUF format enables running the model locally with tools like Ollama. The `q4_k_m` quantization provides a good balance between size and quality.
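
Since the system prompt contains single quotes, the Modelfile for Ollama is easiest to get right with triple-quoted `SYSTEM` text. The sketch below builds such a file as a string; the GGUF file name is a placeholder, the sampling parameters mirror the notebook's generation settings, and `build_modelfile` is a hypothetical helper.

```python
def build_modelfile(gguf_path: str, system_prompt: str,
                    temperature: float = 0.7, top_p: float = 0.9) -> str:
    """Return Modelfile text pointing Ollama at a local GGUF file."""
    if '"""' in system_prompt:
        raise ValueError("System prompt must not contain triple double quotes")
    return (
        f"FROM {gguf_path}\n"
        f'SYSTEM """{system_prompt}"""\n'
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER top_p {top_p}\n"
    )

# Hypothetical file name; use whatever .gguf file the export actually produced
text = build_modelfile(
    "./model.gguf",
    "You are an expert Sri Lankan Tour Guide. You speak with a warm, hospitable tone ('Ayubowan').",
)
print(text)
```

Saving the returned text to a file named `Modelfile` next to the `.gguf` file gives `ollama create` what it needs.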

In [None]:
# Save the model in GGUF format with q4_k_m quantization
print("💾 Saving model in GGUF format (q4_k_m quantization)...")
print("⚠️  This may take 5-10 minutes...\n")

model.save_pretrained_gguf(
    "sri_lankan_tour_guide_gguf",  # Output directory
    tokenizer,
    quantization_method="q4_k_m",  # Quantization: q4_k_m (good balance of size/quality)
)

print("\n✅ GGUF model saved to: ./sri_lankan_tour_guide_gguf/")
print("\n📦 To use with Ollama:")
print("   1. Download the .gguf file from the output directory")
print("   2. Create a Modelfile (replace model.gguf with the exported file's name):")
print("      FROM ./model.gguf")
print(f'      SYSTEM """{SYSTEM_PROMPT}"""')
print("   3. Run: ollama create sri-lankan-guide -f Modelfile")
print("   4. Run: ollama run sri-lankan-guide")

### 6.2: Save in Hugging Face Format

This format is compatible with the `transformers` library for deployment in Python applications.

In [None]:
# Merge the LoRA weights into the base model and save in Hugging Face format (16-bit)
print("💾 Saving model in Hugging Face format (16-bit)...\n")

# Unsloth's save_pretrained_merged performs the merge; plain save_pretrained
# would write only the adapter weights
model.save_pretrained_merged(
    "sri_lankan_tour_guide_hf",
    tokenizer,
    save_method="merged_16bit",  # Merge LoRA weights and save in 16-bit
)

print("✅ Hugging Face model saved to: ./sri_lankan_tour_guide_hf/")
print("\n📦 To use this model:")
print("   from transformers import AutoModelForCausalLM, AutoTokenizer")
print("   model = AutoModelForCausalLM.from_pretrained('./sri_lankan_tour_guide_hf')")
print("   tokenizer = AutoTokenizer.from_pretrained('./sri_lankan_tour_guide_hf')")

### 6.3: Save LoRA Adapters Only (Lightweight)

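If you want to save only the fine-tuned weights, use this option. With this configuration the adapters hold roughly 42 million parameters, so the saved files are on the order of 100-200 MB depending on the stored precision, far smaller than the full model. You'll need to load them on top of the base model later.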

In [None]:
# Save only the LoRA adapter weights (lightweight)
print("💾 Saving LoRA adapters only...\n")

model.save_pretrained("sri_lankan_tour_guide_lora")
tokenizer.save_pretrained("sri_lankan_tour_guide_lora")

print("✅ LoRA adapters saved to: ./sri_lankan_tour_guide_lora/")
print("\n📦 To use LoRA adapters (Unsloth loads the base model named in the adapter config):")
print("   from unsloth import FastLanguageModel")
print("   model, tokenizer = FastLanguageModel.from_pretrained(")
print("       model_name='./sri_lankan_tour_guide_lora',")
print("       max_seq_length=2048,")
print("       load_in_4bit=True,")
print("   )")

### 6.4: Download Files to Local Machine

Zip and download the exported models for local use.

In [None]:
import os
import shutil
from google.colab import files

# Create zip files for easy download
print("📦 Creating zip archives...\n")

archives = []
for folder in ("sri_lankan_tour_guide_gguf", "sri_lankan_tour_guide_lora"):
    if os.path.exists(folder):
        # make_archive returns the path of the zip it created
        archives.append(shutil.make_archive(folder, 'zip', folder))
        print(f"✅ Created: {folder}.zip")

print("\n📥 Starting downloads:")
print("   - For Ollama: sri_lankan_tour_guide_gguf.zip")
print("   - For lightweight deployment: sri_lankan_tour_guide_lora.zip")
for archive in archives:
    files.download(archive)  # Triggers a browser download in Colab
print("\n💡 Tip: The GGUF file is recommended for local deployment with Ollama")

---

## 🎉 Training Complete!

### Summary

You've fine-tuned a Llama 3.1 8B model for Sri Lankan tourism assistance. If training went well, the model should now:

✅ Respond with culturally appropriate greetings ("Ayubowan")  
✅ Provide Sri Lankan travel advice grounded in the training data  
✅ Maintain a warm, hospitable tone  
✅ Handle budget-conscious queries more helpfully  

### Next Steps

1. **Evaluate Performance**: Test the model with diverse queries to ensure quality
2. **Deploy Locally**: Use the GGUF file with Ollama for local testing
3. **Iterate**: If needed, adjust hyperparameters and retrain:
   - Increase `MAX_STEPS` for better convergence (try 500-1000)
   - Adjust `learning_rate` if loss plateaus
   - Expand your dataset with more diverse examples

### Production Considerations

- **Monitoring**: Track model performance with real user queries
- **Safety**: Implement content filtering for inappropriate outputs
- **Versioning**: Tag your models with version numbers and training metadata
- **A/B Testing**: Compare fine-tuned vs. base model in production
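
For the versioning point, one lightweight approach is to write the key settings next to each exported model. The sketch below records the values used in this notebook; the version label is a placeholder and `build_model_card` is a hypothetical helper, not part of any library.

```python
import json
from datetime import datetime, timezone

def build_model_card(version: str) -> dict:
    """Collect the notebook's key settings into a JSON-serialisable record."""
    return {
        "version": version,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "base_model": "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
        "lora": {"r": 16, "lora_alpha": 16, "lora_dropout": 0},
        "training": {
            "max_steps": 100,
            "learning_rate": 2e-4,
            "per_device_train_batch_size": 2,
            "gradient_accumulation_steps": 4,
            "seed": 42,
        },
    }

card = build_model_card("0.1.0")  # hypothetical version label
print(json.dumps(card, indent=2))
```

Saving this JSON alongside the GGUF or adapter files makes it possible to tell later which settings produced which model.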

---

**Questions or issues?** Check:
- Unsloth documentation: https://github.com/unslothai/unsloth
- Hugging Face TRL: https://github.com/huggingface/trl
- Ollama documentation: https://ollama.ai/docs

Happy fine-tuning! 🚀🇱🇰