To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (eg for Llama.cpp).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

Features in the notebook:
1. Uses Maxime Labonne's [FineTome 100K](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset.
2. Converts ShareGPT to HuggingFace format via `standardize_sharegpt`.
3. Trains on completions / assistant responses only via `train_on_responses_only`.
4. Unsloth now supports Torch 2.4, all TRL & Xformers versions & Python 3.12!

In [None]:
!pip install -q "huggingface-hub>=0.34.0,<1.0" "transformers>=4.36,<4.58"

!pip install -q unsloth wandb python-dotenv


In [None]:
import os
import wandb
from huggingface_hub import HfFolder, login

try:
    from google.colab import drive, userdata
    IN_COLAB = True
    IN_KAGGLE = False
except ImportError:
    IN_COLAB = False
    # Check if running on Kaggle
    IN_KAGGLE = os.path.exists('/kaggle')

if IN_COLAB:
    print("Running in Google Colab")
    from IPython.display import clear_output
    !pip install unsloth
    !pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git@nightly git+https://github.com/unslothai/unsloth-zoo.git
    !pip install wandb
    clear_output()
    
    drive.mount('/content/drive')
    
    HF_TOKEN = userdata.get('hf_token')
    WANDB_API_KEY = userdata.get('wandb_token')
    
elif IN_KAGGLE:
    print("Running on Kaggle")
    from kaggle_secrets import UserSecretsClient
    secrets = UserSecretsClient()
    
    HF_TOKEN = secrets.get_secret('hf_token')
    WANDB_API_KEY = secrets.get_secret('wandb_token')
    
else:
    print("Running locally")
    from dotenv import load_dotenv
    load_dotenv()
    
    HF_TOKEN = os.getenv("HF_TOKEN")
    WANDB_API_KEY = os.getenv("WANDB_API_KEY")

# Login
if HF_TOKEN:
    login(token=HF_TOKEN)
    HfFolder.save_token(HF_TOKEN)

if WANDB_API_KEY:
    wandb.login(key=WANDB_API_KEY)


In [None]:
# WandB Project Configuration
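# Import the class directly: a later cell also runs `from datetime import datetime`,
# so using the same binding everywhere keeps `datetime.now()` valid in every cell.
from datetime import datetime

# Project name for WandB tracking
WANDB_PROJECT = "uncategorized"

# Optional: Add timestamp to run names for easy identification
TIMESTAMP = datetime.now().strftime("%Y%m%d-%H%M")

print(f"📊 WandB Project: {WANDB_PROJECT}")
print(f"🕐 Timestamp: {TIMESTAMP}")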


In [None]:
import os, torch
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.cuda.empty_cache()


In [None]:
if IN_COLAB:
    output_dir = "/content/drive/MyDrive/my-model-checkpoints"
elif IN_KAGGLE:
    output_dir = "/kaggle/working/my-model-checkpoints"
else:
    output_dir = "my-model-checkpoints"
    
# Create the checkpoint directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)


* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 768 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-0528-Qwen3-8B-unsloth-bnb-4bit", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
    device_map={"": "cuda:0"},
)

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

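To get a feel for how small the adapter is, the trainable-parameter count can be estimated by hand. LoRA adds two low-rank matrices of shapes `(d_in, r)` and `(r, d_out)` to each targeted projection, so each one contributes `r * (d_in + d_out)` weights. The sketch below is only an estimate: the layer shapes and the layer count are assumed values for an 8B Qwen3-style model, not read from the checkpoint.

```python
# Rough LoRA size estimate. All shapes below are ASSUMED for illustration
# (a Qwen3-8B-like layout); check model.config for the real values.
ASSUMED_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 12288),
    "up_proj": (4096, 12288),
    "down_proj": (12288, 4096),
}
ASSUMED_NUM_LAYERS = 36

def lora_params_per_layer(shapes, r):
    """Each adapted matrix adds A (d_in x r) and B (r x d_out)."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes.values())

per_layer = lora_params_per_layer(ASSUMED_SHAPES, r=16)
total = per_layer * ASSUMED_NUM_LAYERS
print(f"{per_layer:,} adapter weights per layer, {total:,} in total")
```

Doubling `r` doubles the adapter size, which is one reason the suggested ranks go up in powers of two.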
<a name="Data"></a>
### Data Prep
We use a chat template for conversation style finetunes; the code below selects `qwen-3` to match the Qwen3-based model loaded above. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style, but convert it to HuggingFace's normal multiturn format `("role", "content")` instead of `("from", "value")`. For reference, Llama-3 renders multi-turn conversations like this:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
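
The layout above can be reproduced with a few lines of string formatting. This is an illustrative sketch only: `render_llama3` is a hypothetical helper, not the real template, and in the notebook the rendering is done by `tokenizer.apply_chat_template` with whichever template `get_chat_template` installed.

```python
def render_llama3(messages):
    """Illustrative re-implementation of the Llama-3 layout (not the real template)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text

demo = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
print(render_llama3(demo))
```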

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [None]:
from datasets import load_dataset
data = load_dataset("mlabonne/FineTome-100k", split = "train")

# For hyperparameter sweeps on Colab T4 free tier:
# Use 15% of data (~12k samples) to fit multiple runs in 12-hour session
# For final training with best params, use full dataset
USE_FULL_DATASET = False  # Set to True for final run with best hyperparameters

if not USE_FULL_DATASET:
    print("Using 15% of dataset for hyperparameter sweep (Colab T4 optimized)")
    data = data.train_test_split(test_size=0.85, seed=42)["train"]
else:
    print("Using full dataset")

splits = data.train_test_split(test_size=0.1, seed=42)
train_valid = splits["train"].train_test_split(test_size=0.1, seed=42)
dataset = train_valid["train"]
valid_dataset = train_valid["test"]
test_dataset = splits["test"]

print(f"Training samples: {len(dataset)}")

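The split sizes quoted in this notebook (about 12k training samples for sweeps, 81k for the final run) follow from the three `train_test_split` calls above. Here is a quick arithmetic check, assuming the dataset has 100,000 rows as its name suggests; the library's own rounding may differ by a row, so treat the figures as approximate.

```python
def split_sizes(n_rows, use_full_dataset):
    """Mirror the three train_test_split calls with plain arithmetic."""
    if not use_full_dataset:
        n_rows -= round(n_rows * 0.85)        # keep the 15% "train" side
    n_test = round(n_rows * 0.1)              # held-out test split
    n_valid = round((n_rows - n_test) * 0.1)  # validation split
    n_train = n_rows - n_test - n_valid
    return n_train, n_valid, n_test

ASSUMED_ROWS = 100_000  # assumption based on the dataset name
print(split_sizes(ASSUMED_ROWS, use_full_dataset=False))
print(split_sizes(ASSUMED_ROWS, use_full_dataset=True))
```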

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-3",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = []
    for convo in convos:
        txt = tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False
        )
        # Tokenize with truncation to max_seq_length, then decode back to text
        ids = tokenizer(txt, truncation=True, max_length=max_seq_length)["input_ids"]
        texts.append(tokenizer.decode(ids))
    return { "text": texts }


We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
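
Conceptually, the conversion is a key rename plus a role mapping. The sketch below shows the idea in plain Python; `to_role_content` and `ROLE_MAP` are hypothetical names, and this is not Unsloth's implementation, which may handle more role aliases and edge cases.

```python
# Assumed mapping from ShareGPT speaker names to generic chat roles
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(conversation):
    """Rename from/value to role/content and normalize the speaker names."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]

sharegpt = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
print(to_role_content(sharegpt))
```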

In [None]:
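from unsloth.chat_templates import standardize_sharegpt

# Apply the same preprocessing to every split: the trainer reads the "text" column of the eval split too
dataset = standardize_sharegpt(dataset).map(formatting_prompts_func, batched = True,)
valid_dataset = standardize_sharegpt(valid_dataset).map(formatting_prompts_func, batched = True,)
test_dataset = standardize_sharegpt(test_dataset).map(formatting_prompts_func, batched = True,)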

We look at how the conversations are structured for item 5:

In [None]:
dataset[5]["conversations"]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`, so do not be alarmed!

In [None]:
dataset[5]["text"]

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The sweep runs below cap training at `max_steps = 1500` to keep each run short; for a full run, set `num_train_epochs = 1` and remove `max_steps`. We also support TRL's `DPOTrainer`!
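
When a run is capped by `max_steps`, the amount of data it sees depends on the effective batch size, `per_device_train_batch_size * gradient_accumulation_steps`. The sketch below estimates how much of the training split a 1500-step run covers. It assumes a single GPU and a 15% training split of about 12,150 samples; both are assumptions, so substitute your own numbers.

```python
def epochs_covered(max_steps, per_device_batch, grad_accum, n_train):
    """Fraction of the training set consumed after max_steps optimizer steps."""
    samples_seen = max_steps * per_device_batch * grad_accum
    return samples_seen / n_train

N_TRAIN_SUBSAMPLE = 12_150  # assumed size of the 15% training split
for grad_accum in (4, 8):
    frac = epochs_covered(1500, 1, grad_accum, N_TRAIN_SUBSAMPLE)
    print(f"grad_accum={grad_accum}: about {frac:.2f} epochs")
```

Under these assumptions, runs with different accumulation settings see different amounts of data in the same number of steps, which is worth keeping in mind when comparing their `eval/loss`.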

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only
import wandb
import torch
import gc
import time
from datetime import datetime

# Updated sweep config with overfitting prevention
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'eval/loss', 'goal': 'minimize'},  # Changed to eval/loss!
    'parameters': {
        'learning_rate': {'min': 5e-6, 'max': 2e-4},  # Lower LR range
        'gradient_accumulation_steps': {'values': [4, 8]},
        'lora_dropout': {'values': [0.05, 0.1]},  # Add dropout
        'weight_decay': {'values': [0.01, 0.05]},  # Add weight decay
    }
}

def train_func():
    # Clean up memory before starting
    gc.collect()
    torch.cuda.empty_cache()
    
    # Record start time for duration tracking
    start_time = datetime.now()
    
    # Initialize wandb (config is automatically provided by sweep)
    # The agent automatically connects this to the sweep, but we specify project/entity for clarity
    wandb.init(
        project=WANDB_PROJECT,
        entity="hayleyc-kth-royal-institute-of-technology"
    )
    cfg = wandb.config
    
    # Reload model to ensure fresh weights for each run
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/DeepSeek-R1-0528-Qwen3-8B-unsloth-bnb-4bit",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 16,
        lora_dropout = cfg.lora_dropout,
        bias = "none",
        use_gradient_checkpointing = "unsloth",
        random_state = 3407,
        use_rslora = False,
        loftq_config = None,
    )

    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        eval_dataset = valid_dataset,  # Add validation dataset!
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer),
        args = TrainingArguments(
            per_device_train_batch_size = 1,
            per_device_eval_batch_size = 1,
            gradient_accumulation_steps = cfg.gradient_accumulation_steps,
            warmup_steps = 10,
            max_steps = 1500,  # Stop before overfitting (around step 2000)
            learning_rate = cfg.learning_rate,
            weight_decay = cfg.weight_decay,  # Regularization
            fp16 = not is_bfloat16_supported(),
            bf16 = is_bfloat16_supported(),
            logging_steps = 10,
            eval_strategy = "steps",  # Enable evaluation
            eval_steps = 100,  # Evaluate every 100 steps
            optim = "adamw_8bit",
            output_dir = output_dir,
            save_strategy = "steps",
            save_steps = 500,
            save_total_limit = 2,
            load_best_model_at_end = True,  # Load best checkpoint
            metric_for_best_model = "eval_loss",  # Use eval loss
            greater_is_better = False,
            report_to = "wandb",
            seed = 3407
        ),
    )
        
    trainer = train_on_responses_only(
        trainer,
        instruction_part = "<|im_start|>user\n",
        response_part = "<|im_start|>assistant\n",
    )
    
    trainer.train()
    
    # Clean up
    del model
    del trainer
    del tokenizer
    gc.collect()
    torch.cuda.empty_cache()
    end_time = datetime.now()
    duration = end_time - start_time
    print(f"\n⏱️ Sweep run completed!")
    print(f"   Start:    {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"   End:      {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"   Duration: {duration}")

# Create the sweep (only when running on the 15% subsample)
if not USE_FULL_DATASET:
    
    sweep_id = wandb.sweep(
        sweep_config, 
        project=WANDB_PROJECT,
        entity="hayleyc-kth-royal-institute-of-technology"
    )

    # Run 3 sweeps (optimized for 12-hour Colab session)
    wandb.agent(
        sweep_id, 
        function=train_func, 
        count=3, 
        project=WANDB_PROJECT,
        entity="hayleyc-kth-royal-institute-of-technology"
    )


---

# 🎯 Final Training with Best Hyperparameters

## When to run this section:
1. ✅ After all 3 sweep runs complete
2. ✅ After reviewing WandB to find best hyperparameters
3. ✅ Set `USE_FULL_DATASET = True` above

## What this does:
- Trains on **full 81k samples** (not 15% subsample)
- Uses **best hyperparameters** from sweep
- **Early stopping** to prevent overfitting
- Automatically **uploads to Hugging Face** when done
- Takes ~20-25 hours on T4 GPU (may stop earlier if converged)

## 🛑 Early Stopping:
Training will automatically stop if:
- No improvement in eval/loss for 3 consecutive evaluations (1500 steps)
- Minimum improvement threshold: 0.01
- Prevents wasting compute on overfitting

## ⚠️ Before running:
1. Check WandB sweep results
2. Best hyperparameters will be auto-fetched
3. Ensure you have enough GPU time remaining

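The early-stopping rule described above (patience 3, threshold 0.01) can be modelled in a few lines. This is a simplified sketch of the idea, not the `EarlyStoppingCallback` source: `first_stop_index` is a hypothetical helper that only tracks a running best loss.

```python
def first_stop_index(eval_losses, patience=3, threshold=0.01):
    """Return the index of the evaluation that triggers a stop, or None."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or best - loss > threshold:
            best = loss        # counts as a real improvement
            bad_evals = 0
        else:
            bad_evals += 1     # improvement too small, or loss got worse
            if bad_evals >= patience:
                return i
    return None

losses = [2.0, 1.5, 1.498, 1.496, 1.497]
stop = first_stop_index(losses)
print(stop, (stop + 1) * 500)  # evaluation index and the matching step at eval_steps = 500
```

Note that tiny improvements (here 0.002 to 0.004) still count against patience, because they fall below the threshold.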

In [None]:
# Only run this after sweeps complete and USE_FULL_DATASET = True
if USE_FULL_DATASET:
    from datetime import datetime
    from transformers import EarlyStoppingCallback  # needed for the trainer callbacks below
    
    final_training_start = datetime.now()
    print(f"🕐 Final training started at: {final_training_start.strftime('%Y-%m-%d %H:%M:%S')}")
    
    print("🚀 Starting final training with full dataset...")
    
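    # Check for existing checkpoint to resume from.
    # Note: the sweep runs save checkpoints to this same output_dir, so clear those out first
    # or the final run will try to resume from a sweep checkpoint.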
    import glob
    checkpoints = glob.glob(f"{output_dir}/checkpoint-*")
    resume_from_checkpoint = None
    
    if checkpoints:
        # Sort by step number and get the latest
        checkpoints.sort(key=lambda x: int(x.split('-')[-1]))
        resume_from_checkpoint = checkpoints[-1]
        print(f"📂 Found checkpoint: {resume_from_checkpoint}")
        print("   Resuming training from this checkpoint...")
    else:
        print("📂 No checkpoint found. Starting fresh training.")
    
    # Automatically get best hyperparameters from WandB sweep
    print("\n📊 Fetching best hyperparameters from WandB sweep...")
    
    api = wandb.Api()
    sweeps = api.project(
        WANDB_PROJECT,
        entity="hayleyc-kth-royal-institute-of-technology"
    ).sweeps()
    
    latest_sweep = list(sweeps)[0]
    print(f"Found sweep: {latest_sweep.name} ({latest_sweep.id})")
    
    best_run = latest_sweep.best_run()
    
    if best_run:
        BEST_LEARNING_RATE = best_run.config.get('learning_rate', 1e-4)
        BEST_GRAD_ACCUM = best_run.config.get('gradient_accumulation_steps', 8)
        BEST_LORA_DROPOUT = best_run.config.get('lora_dropout', 0.05)
        BEST_WEIGHT_DECAY = best_run.config.get('weight_decay', 0.01)
        
        best_eval_loss = best_run.summary.get('eval/loss', 'N/A')
        
        print(f"\n✅ Best hyperparameters from run '{best_run.name}':")
        print(f"   Learning Rate: {BEST_LEARNING_RATE}")
        print(f"   Gradient Accumulation: {BEST_GRAD_ACCUM}")
        print(f"   LoRA Dropout: {BEST_LORA_DROPOUT}")
        print(f"   Weight Decay: {BEST_WEIGHT_DECAY}")
        print(f"   Best eval/loss: {best_eval_loss}")
    else:
        print("⚠️ No best run found. Using default values.")
        BEST_LEARNING_RATE = 1e-4
        BEST_GRAD_ACCUM = 8
        BEST_LORA_DROPOUT = 0.05
        BEST_WEIGHT_DECAY = 0.01
    
    # Initialize WandB
    wandb.init(
        project=WANDB_PROJECT,
        entity="hayleyc-kth-royal-institute-of-technology",
        name=f"final-training-{TIMESTAMP}",
        resume="allow",  # Allow resuming if run exists
        config={
            'learning_rate': BEST_LEARNING_RATE,
            'gradient_accumulation_steps': BEST_GRAD_ACCUM,
            'lora_dropout': BEST_LORA_DROPOUT,
            'weight_decay': BEST_WEIGHT_DECAY,
            'dataset_size': len(dataset),
            'source': 'auto-selected from sweep',
            'resumed_from_checkpoint': resume_from_checkpoint is not None,
        }
    )
    
    # Load model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/DeepSeek-R1-0528-Qwen3-8B-unsloth-bnb-4bit",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    
    # Add LoRA
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 16,
        lora_dropout = BEST_LORA_DROPOUT,
        bias = "none",
        use_gradient_checkpointing = "unsloth",
        random_state = 3407,
        use_rslora = False,
        loftq_config = None,
    )
    
    # Create trainer with checkpoint settings
    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset,
        eval_dataset = valid_dataset,
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer),
        callbacks = [EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            per_device_eval_batch_size = 2,
            gradient_accumulation_steps = BEST_GRAD_ACCUM,
            warmup_steps = 10,
            num_train_epochs = 1,
            learning_rate = BEST_LEARNING_RATE,
            weight_decay = BEST_WEIGHT_DECAY,
            fp16 = not is_bfloat16_supported(),
            bf16 = is_bfloat16_supported(),
            logging_steps = 10,
            eval_strategy = "steps",
            eval_steps = 500,
            # Early stopping is configured by the EarlyStoppingCallback passed in `callbacks`;
            # patience and threshold are not TrainingArguments fields, so they are not set here.
            optim = "adamw_8bit",
            output_dir = output_dir,
            # Checkpoint settings
            save_strategy = "steps",
            save_steps = 500,  # Save every 500 steps
            save_total_limit = 3,  # Keep last 3 checkpoints
            # Resume settings
            load_best_model_at_end = True,
            metric_for_best_model = "eval_loss",
            greater_is_better = False,
            report_to = "wandb",
            seed = 3407
        ),
    )
    
    trainer = train_on_responses_only(
        trainer,
        instruction_part = "<|im_start|>user\n",
        response_part = "<|im_start|>assistant\n",
    )
    
    # Train (will resume from checkpoint if found)
    print("\n🎯 Training started...")
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    
    # Get final metrics
    final_metrics = trainer.evaluate()
    final_eval_loss = final_metrics['eval_loss']
    print(f"\n✅ Training complete! Final eval_loss: {final_eval_loss:.4f}")
    
    # Check if model is good enough to upload
    if final_eval_loss < 1.5:  # Reasonable threshold
        print("\n📤 Model quality is good! Uploading to Hugging Face...")
        print("   This will take ~45-60 minutes for all formats.\n")
        
        # 1. Upload LoRA adapters (fastest, for experimentation)
        print("[1/3] Uploading LoRA adapters...")
        model.push_to_hub("hayitsmaddy/mamamadal", token=HF_TOKEN)
        tokenizer.push_to_hub("hayitsmaddy/mamamadal", token=HF_TOKEN)
        print("      ✅ LoRA adapters uploaded (~200MB)")
        
        # 2. Upload merged 4bit (best for web service deployment)
        print("\n[2/3] Creating and uploading merged 4bit model (recommended for web service)...")
        model.push_to_hub_merged(
            "hayitsmaddy/mamamadal",
            tokenizer,
            save_method="merged_4bit",
            token=HF_TOKEN
        )
        print("      ✅ Merged 4bit model uploaded (~4GB)")
        print("      💡 This is a standalone model, no base model needed!")
        
        # 3. Upload GGUF formats (for Ollama, LM Studio, local deployment)
        print("\n[3/3] Creating and uploading GGUF formats (for local deployment)...")
        model.push_to_hub_gguf(
            "hayitsmaddy/mamamadal",
            tokenizer,
            quantization_method=["q4_k_m", "q5_k_m"],  # Good balance
            token=HF_TOKEN
        )
        print("      ✅ GGUF formats uploaded (Q4_K_M ~4GB, Q5_K_M ~5GB)")
        print("      💡 Use these for Ollama, LM Studio, or llama.cpp")
        
        print("\n" + "="*60)
        print("🎉 ALL UPLOADS COMPLETE!")
        print("="*60)
        print("\n📦 Your model repository now contains:")
        print("   1. LoRA adapters (for experimentation)")
        print("   2. Merged 4bit model (for web service deployment)")
        print("   3. GGUF Q4_K_M & Q5_K_M (for local deployment)")
        print("\n🔗 View at: https://huggingface.co/hayitsmaddy/mamamadal")
        print("\n💡 For your web chatbot service:")
        print("   - Use the merged 4bit model")
        print("   - Load with: AutoModelForCausalLM.from_pretrained('hayitsmaddy/mamamadal')")
        print("   - No need to load base model separately!")
        
    else:
        print(f"\n⚠️ Model quality not ideal (eval_loss: {final_eval_loss:.4f})")
        print("   Recommended threshold: < 1.5")
        print("   Uploading only LoRA adapters for review...")
        model.push_to_hub("hayitsmaddy/mamamadal", token=HF_TOKEN)
        tokenizer.push_to_hub("hayitsmaddy/mamamadal", token=HF_TOKEN)
        print("\n💡 Consider:")
        print("   - Adjusting hyperparameters")
        print("   - Training for more steps")
        print("   - Checking for data quality issues")
    
    
    # Uploads (if any) are done at this point; close the WandB run
    wandb.finish()
    
    final_training_end = datetime.now()
    final_duration = final_training_end - final_training_start
    
    print("\n🎉 Final training complete!")
    print(f"\n⏱️ Training Time:")
    print(f"   Start:    {final_training_start.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"   End:      {final_training_end.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"   Duration: {final_duration}")
    hours = final_duration.total_seconds() / 3600
    print(f"   ({hours:.2f} hours)")
    
    # Model and tokenizer are now available globally for inference
    print("\n💡 Model and tokenizer are ready for testing!")
    print("   You can now run the inference cells below.")
    
    # Now trainer exists and you can run verification cells below
    print("\n💡 Tip: You can now run the verification cells below to check training masking.")
else:
    print("⚠️ Skipping final training. Set USE_FULL_DATASET = True to run.")


In [None]:
# Verify model and tokenizer are available
if 'model' in globals() and 'tokenizer' in globals():
    print("✅ Model and tokenizer are loaded and ready!")
    print(f"   Model type: {type(model).__name__}")
    print(f"   Tokenizer type: {type(tokenizer).__name__}")
    print("\n💡 You can now run inference cells to test the model.")
else:
    print("⚠️ Model or tokenizer not found.")
    print("   Make sure final training completed successfully.")


## 📊 How to Verify Model Performance

### During Sweeps (15% dataset):

**Check WandB Dashboard:**
1. Go to your sweep page
2. Look at **Parallel Coordinates** plot
3. Compare `eval/loss` across runs
4. **Good signs:**
   - `eval/loss` decreasing steadily
   - `train/loss` and `eval/loss` stay close
   - No sudden spikes or divergence

**Best run criteria:**
- ✅ Lowest `eval/loss`
- ✅ Stable training (no spikes)
- ✅ Small gap between train/eval loss

### After Final Training (Full dataset):

**Quantitative checks:**
1. **Final `eval/loss`**: Should be < 1.0 (lower is better)
2. **Train vs Eval gap**: Should be < 0.2 (not overfitting)
3. **Loss curve**: Smooth downward trend

**Qualitative checks (run inference below):**
1. Test with sample questions
2. Check if responses are coherent
3. Verify it follows instructions
4. Compare to base model

### 🎯 Decision Guide:

**Upload to HF if:**
- ✅ `eval/loss` < 1.0
- ✅ Responses look good in inference tests
- ✅ No overfitting (train/eval gap small)

**Don't upload if:**
- ❌ `eval/loss` > 1.5
- ❌ Responses are incoherent
- ❌ Large train/eval gap (overfitting)
- ❌ Loss didn't decrease much


We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs.

We verify masking is actually done:

## 🔍 Verifying Training Data Masking

**What's happening here:**
These cells verify that `train_on_responses_only` is working correctly.

**Why this matters:**
- We only want to train on the **assistant's responses**, not the user's questions
- This prevents the model from learning to generate questions instead of answers
- The system/instruction prompts should be **masked** (ignored during training)

**What to look for:**
1. **First cell**: Shows the full conversation with special tokens
2. **Second cell**: Shows what parts are actually trained on (masked parts appear as spaces)
3. **Result**: Only the assistant's responses should be visible in the second output

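The idea behind response-only training can be sketched on toy token ids: every label outside an assistant turn is set to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The helper names and token ids below are hypothetical, and this is a minimal illustration rather than the actual `train_on_responses_only` implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def find_all(seq, pattern):
    """Return every start index where `pattern` occurs in `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow an assistant marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)            # first token after the assistant marker
        later = [i for i in instruction_starts if i > start]
        end = later[0] if later else len(input_ids)  # stop at the next user turn, or the end
        labels[begin:end] = input_ids[begin:end]
    return labels

# Toy ids: [1, 2] stands in for the user marker, [1, 3] for the assistant marker
USER, ASSISTANT = [1, 2], [1, 3]
ids = USER + [10, 11] + ASSISTANT + [20, 21, 9] + USER + [12] + ASSISTANT + [22, 9]
print(mask_non_responses(ids, USER, ASSISTANT))
```

In the printed labels, only the ids that belong to assistant replies survive; everything else, including both markers, is `-100`.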

## ⚠️ Training Verification Cells (Currently Disabled)

**Note**: These cells are guarded by a `'trainer' in globals()` check because, during sweeps, `trainer` only exists inside the sweep function.

**To verify training masking:**
1. After sweeps complete, set `USE_FULL_DATASET = True`
2. The final training section will create a `trainer` object
3. Then you can run these verification cells

**What these cells do:**
- Show the full conversation with tokens
- Show which parts are masked (not trained on)
- Verify only assistant responses are being trained


In [None]:
# Verify training data format (run after final training)
# Shows what the model sees during training
if 'trainer' in globals():
    print("Sample training example:")
    print(tokenizer.decode(trainer.train_dataset[5]["input_ids"]))
else:
    print("⚠️ Run this after final training completes (trainer needs to exist)")


We can see the System and Instruction prompts are successfully masked!

## 💻 GPU Memory Check

Run this cell to check GPU memory usage.
Useful for:
- Verifying you have enough memory
- Debugging OOM errors
- Optimizing batch sizes


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
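
To make the two settings concrete: temperature rescales the logits before the softmax, and min-p then drops every token whose probability is below `min_p` times the probability of the most likely token. The sketch below applies both to toy logits; `min_p_filter` is a hypothetical helper for illustration, not the sampler used by `generate`.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities of the tokens that survive min-p filtering."""
    scaled = [x / temperature for x in logits]   # higher temperature flattens the distribution
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    cutoff = min_p * max(probs)                  # threshold scales with the top token
    kept = {i: p for i, p in enumerate(probs) if p >= cutoff}
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}

toy_logits = [2.0, 1.0, 0.0, -3.0]
print(min_p_filter(toy_logits))
```

Because the cutoff is relative to the top token, a high temperature can add variety among plausible tokens while very unlikely ones are still removed.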

---

# 🧪 Model Testing & Inference

## ⚠️ Run these cells AFTER final training completes!

These cells test your fine-tuned model with sample prompts.

### What to check:
1. **Response quality**: Are answers coherent and relevant?
2. **Instruction following**: Does it follow the prompt?
3. **Comparison**: How does it compare to base model?

### Testing workflow:
1. Run inference examples below
2. Try your own prompts
3. If quality is good → Model is ready for deployment!
4. If quality is poor → Check eval/loss, consider retraining

---


In [None]:
# Run this after final training completes
if 'model' not in globals():
    print("⚠️ Model not loaded yet!")
    print("   Set USE_FULL_DATASET = True and run final training first.")
else:
    # Test 1: Basic inference (no streaming)
    from unsloth.chat_templates import get_chat_template
    
    tokenizer = get_chat_template(tokenizer, chat_template="qwen-3")
    FastLanguageModel.for_inference(model)  # Enable 2x faster inference
    
    # Test prompt
    messages = [
        {"role": "user", "content": "Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
    ]
    
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")
    
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=64,
        use_cache=True,
        temperature=1.5,
        min_p=0.1
    )
    
    response = tokenizer.batch_decode(outputs)
    print("Model response:")
    print(response[0])


You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole response!

In [None]:
# Run this after final training completes
if 'model' not in globals():
    print("⚠️ Model not loaded yet!")
    print("   Set USE_FULL_DATASET = True and run final training first.")
else:
    # Test 2: Streaming inference (see tokens as they generate)
    from transformers import TextStreamer
    
    FastLanguageModel.for_inference(model)
    
    messages = [
        {"role": "user", "content": "Explain what fine-tuning is in simple terms."},
    ]
    
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")
    
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    print("Model response (streaming):")
    _ = model.generate(
        input_ids=inputs,
        streamer=text_streamer,
        max_new_tokens=128,
        use_cache=True,
        temperature=1.5,
        min_p=0.1
    )


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
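
As a rough sketch of the two save paths just described: the helper name `save_lora_adapters` and the default directory `lora_model` are assumptions made for illustration, and `model` and `tokenizer` are expected to be the objects produced by the training cells.

```python
def save_lora_adapters(model, tokenizer, local_dir="lora_model", repo_id=None, token=None):
    """Save the LoRA adapters locally and, optionally, push them to the Hub."""
    # Local save: writes the adapter weights and tokenizer files to local_dir.
    model.save_pretrained(local_dir)
    tokenizer.save_pretrained(local_dir)

    # Online save: only runs when a repository ID is given.
    if repo_id is not None:
        model.push_to_hub(repo_id, token=token)
        tokenizer.push_to_hub(repo_id, token=token)
    return local_dir


# Example call (commented out so nothing is uploaded by accident):
# save_lora_adapters(model, tokenizer, repo_id="hayitsmaddy/mamamadal", token=HF_TOKEN)
```

Leaving `repo_id` as `None` keeps the save purely local, which is the safer default while experiments are still running.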

## 📤 When to Upload to Hugging Face Hub

### ⚠️ IMPORTANT: Do NOT run this cell during hyperparameter sweeps!

**Upload timing guide:**

### ❌ DON'T Upload:
- During the 3 sweep runs (these are experiments, not final models)
- After each individual sweep run
- Before you've identified the best hyperparameters

### ✅ DO Upload:
1. **After sweep completes** and you've identified best hyperparameters
2. **After final training** with `USE_FULL_DATASET = True`
3. **When eval/loss is good** and you're satisfied with results

### 📋 Upload Workflow:
```
1. Run 3 sweep experiments → Don't upload
2. Check WandB, find best hyperparameters → Don't upload
3. Set USE_FULL_DATASET = True → Don't upload
4. Train with best hyperparameters on full data → Don't upload
5. Verify final model performance → NOW UPLOAD! ✅
```
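
The workflow above can be expressed as a small guard that is checked before any upload cell runs. This is only a sketch: `should_upload` and the `max_eval_loss` threshold are hypothetical, since the notebook does not define what counts as a "good" eval/loss.

```python
def should_upload(sweep_finished, use_full_dataset, eval_loss, max_eval_loss):
    """Return (ok, reason) following the upload workflow steps."""
    if not sweep_finished:
        return False, "sweep runs are experiments, not final models"
    if not use_full_dataset:
        return False, "final training on the full dataset has not run yet"
    if eval_loss is None:
        return False, "final model has not been evaluated yet"
    if eval_loss > max_eval_loss:
        return False, "eval/loss is above the chosen threshold"
    return True, "final model verified"


# Example with made-up numbers: sweep done, full-data training done, loss acceptable.
ok, reason = should_upload(True, True, eval_loss=0.9, max_eval_loss=1.0)
print(ok, reason)
```

The threshold is a judgment call per project; the point of the guard is only that every earlier step must be complete before the upload happens.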

### 🎯 What this cell does:
- **LoRA adapters only**: Uploads just the trained adapter weights (~100-200MB)
- **Not the full model**: Base model stays on HF, adapters are loaded on top
- **Repository**: `hayitsmaddy/mamamadal`
- **Private**: You can change this in repo settings

### 💡 To use this cell:
1. Make sure you're happy with the model's performance
2. Uncomment the lines (remove `# ` at the start)
3. Run the cell
4. Check https://huggingface.co/hayitsmaddy/mamamadal


---

# 💾 Model Saving Options

## ⚠️ IMPORTANT: Only run these AFTER final training completes!

The final training cell above already uploads LoRA adapters automatically.
These cells are for:
1. **Local backups** (save to disk)
2. **Different formats** (merged models, GGUF for deployment)
3. **Manual control** (if you want to save specific versions)

---

## 📦 Option 1: LoRA Adapters (Recommended)

**What it saves:**
- Only the trained adapter weights (~100-200MB)
- Requires base model to use

**When to use:**
- ✅ Sharing your fine-tune
- ✅ Quick uploads
- ✅ Experimenting with different adapters

**Note:** Final training already does this automatically!


Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
# Run this after final training completes
if 'model' not in globals():
    print("⚠️ Model not loaded yet!")
    print("   Set USE_FULL_DATASET = True and run final training first.")
else:
    # Test 2: Streaming inference (see tokens as they generate)
    from transformers import TextStreamer
    
    FastLanguageModel.for_inference(model)
    
    messages = [
        {"role": "user", "content": "Explain what fine-tuning is in simple terms."},
    ]
    
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")
    
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    print("Model response (streaming):")
    _ = model.generate(
        input_ids=inputs,
        streamer=text_streamer,
        max_new_tokens=128,
        use_cache=True,
        temperature=1.5,
        min_p=0.1
    )
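
The cell above reuses the model that is already in memory. To reload saved adapters in a fresh session, something like the following sketch could be used. The helper name `load_lora_for_inference`, the `lora_model` path, and the default values are assumptions; they should match whatever was used for training and saving.

```python
def load_lora_for_inference(adapter_path="lora_model", max_seq_length=2048, load_in_4bit=True):
    """Load saved LoRA adapters (and their base model) and switch to inference mode."""
    # Imported lazily so that defining this helper does not require a GPU.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_path,        # local folder or Hub repo containing the adapters
        max_seq_length=max_seq_length,  # assumed value; use the training setting
        load_in_4bit=load_in_4bit,
    )
    FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference path
    return model, tokenizer


# Example call (needs a GPU runtime with Unsloth installed):
# model, tokenizer = load_lora_for_inference("lora_model")
```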


## 🔄 Advanced: Merging and Converting Models

**⚠️ Only run these after final training, not during sweeps!**

### What are these options?

**LoRA Adapters (Default - Recommended)**
- ✅ Small file size (~100-200MB)
- ✅ Fast upload
- ✅ Easy to share
- ❌ Requires base model to use

**Merged 16bit (Full Model)**
- ✅ Standalone model, no base model needed
- ✅ Full precision
- ❌ Large file size (~16GB)
- ❌ Slow upload

**Merged 4bit (Quantized)**
- ✅ Smaller than 16bit (~4GB)
- ✅ Faster inference
- ❌ Slight quality loss

**GGUF (For llama.cpp)**
- ✅ Works with Ollama, LM Studio, llama.cpp
- ✅ Multiple quantization options
- ✅ Best for local deployment

### 💡 Recommendation:
1. **For sharing**: Upload LoRA adapters (lightest, easiest)
2. **For deployment**: Convert to GGUF q4_k_m (good balance)
3. **For production**: Merge to 16bit (best quality)

### 🎯 To use:
Change `if False:` to `if True:` for the option you want
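
The approximate sizes quoted above follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below assumes roughly 8 billion parameters, which is inferred from the ~16GB figure at 16bit rather than stated anywhere in this notebook, and `estimate_size_gb` is a hypothetical helper.

```python
def estimate_size_gb(num_parameters, bits_per_weight):
    """Rough weight-only size in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


ASSUMED_PARAMETERS = 8e9  # assumption inferred from the ~16GB merged 16bit figure

for label, bits in [("merged 16bit", 16), ("8-bit", 8), ("merged 4bit", 4)]:
    print(f"{label}: ~{estimate_size_gb(ASSUMED_PARAMETERS, bits):.0f} GB")
```

Real files tend to come out somewhat larger, since quantized formats also store scaling data and some layers may be kept at higher precision.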


# OPTIONAL: Additional merged model formats
The merged 4bit model is already uploaded automatically!
Only run these if you specifically need 16bit precision

# Merge to 16bit (Full precision, ~16GB) - OPTIONAL
if False:
    print("Uploading merged 16bit model...")
    model.push_to_hub_merged("hayitsmaddy/mamamadal", tokenizer, save_method="merged_16bit", token=HF_TOKEN)
    print("✅ Uploaded 16bit merged model")


In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hayitsmaddy/mamamadal", tokenizer, save_method = "merged_16bit", token = HF_TOKEN)

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hayitsmaddy/mamamadal", tokenizer, save_method = "merged_4bit", token = HF_TOKEN)

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hayitsmaddy/mamamadal", tokenizer, save_method = "lora", token = HF_TOKEN)

# OPTIONAL: Additional GGUF formats
Q4_K_M and Q5_K_M are already uploaded automatically!
Only run these if you specifically need higher quality GGUF

# Q8_0 (8-bit, ~8GB, excellent quality) - OPTIONAL
if False:
    model.push_to_hub_gguf("hayitsmaddy/mamamadal", tokenizer, quantization_method="q8_0", token=HF_TOKEN)
    print("✅ Uploaded Q8_0 GGUF")

# F16 (16-bit, ~16GB, maximum quality) - OPTIONAL
if False:
    model.push_to_hub_gguf("hayitsmaddy/mamamadal", tokenizer, quantization_method="f16", token=HF_TOKEN)
    print("✅ Uploaded F16 GGUF")


In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change the repo ID below to your own username/repo!
if False: model.push_to_hub_gguf("hayitsmaddy/mamamadal", tokenizer, token = HF_TOKEN)

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hayitsmaddy/mamamadal", tokenizer, quantization_method = "f16", token = HF_TOKEN)

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hayitsmaddy/mamamadal", tokenizer, quantization_method = "q4_k_m", token = HF_TOKEN)

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hayitsmaddy/mamamadal", # Change this to your own username/repo!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = HF_TOKEN, # Get a token at https://huggingface.co/settings/tokens
    )

In [None]:
# Optional: Save GGUF locally (only run after final training)
# This is already uploaded to HuggingFace automatically!
# Only run this if you want a local copy

if 'model' in globals() and USE_FULL_DATASET:
    print("Saving GGUF format locally...")
    model.save_pretrained_gguf(
        "model-gguf",
        tokenizer,
        quantization_method="q4_k_m"
    )
    print("✅ Saved to: model-gguf/")
else:
    print("⚠️ Skipping local GGUF save (model not trained yet or not using full dataset)")


---

## ✅ What Gets Uploaded Automatically

After final training completes successfully, the notebook automatically uploads:

### 1. LoRA Adapters (~200MB)
- Lightweight adapter weights
- For experimentation and sharing

### 2. Merged 4bit Model (~4GB) ⭐
- **Use this for your web chatbot service!**
- Standalone model (no base model needed)
- Good quality, fast inference
- Perfect for CPU inference on Streamlit/Gradio

### 3. GGUF Formats (~4-5GB)
- Q4_K_M and Q5_K_M
- For Ollama, LM Studio, llama.cpp
- Optimized for CPU inference

---

## 🚀 Using Your Model

### In Streamlit/Gradio (CPU inference):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    'hayitsmaddy/mamamadal',
    device_map='cpu'  # CPU inference
)
tokenizer = AutoTokenizer.from_pretrained('hayitsmaddy/mamamadal')
```

### With GGUF (llama.cpp):
```bash
# Download GGUF from HuggingFace
# Use with Ollama or llama.cpp for fast CPU inference
```
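
For the Ollama route, one possible approach is to generate a minimal `Modelfile` that points at the downloaded GGUF file. The sketch below is an illustration under assumptions: `build_modelfile` is a hypothetical helper, the GGUF filename is taken from the note that follows this section, and a real deployment may also need a chat template entry that matches the model.

```python
def build_modelfile(gguf_path, temperature=None):
    """Return the text of a minimal Ollama Modelfile for a local GGUF file."""
    lines = [f"FROM {gguf_path}"]
    if temperature is not None:
        # Optional sampling default stored with the Ollama model.
        lines.append(f"PARAMETER temperature {temperature}")
    return "\n".join(lines) + "\n"


modelfile_text = build_modelfile("./model-unsloth-Q4_K_M.gguf", temperature=0.7)
print(modelfile_text)
```

After saving the text to a file named `Modelfile`, running `ollama create my-finetune -f Modelfile` followed by `ollama run my-finetune` should register and start the model, where `my-finetune` is a placeholder name.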

---


Now, use the `model-unsloth.gguf` file or the `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>