To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (eg for Llama.cpp).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

Features in the notebook:
1. Uses Maxime Labonne's [FineTome 100K](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset.
2. Converts ShareGPT to HuggingFace format via `standardize_sharegpt`
3. Trains on Completions / Assistant only via `train_on_responses_only`
4. Unsloth now supports Torch 2.4, all TRL & Xformers versions & Python 3.12!

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git@nightly git+https://github.com/unslothai/unsloth-zoo.git
!pip install -U "transformers" "huggingface_hub"

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

## Configuration
Define all tunable parameters in one place so you can quickly reproduce or tweak experiments without hunting through the notebook.


In [None]:
from dataclasses import dataclass, asdict

# Checkpoints are saved to Google Drive (see output_dir below)

@dataclass
class TrainConfig:
    base_model: str = "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length: int = 2048
    dtype: str | None = None
    load_in_4bit: bool = True
    dataset_name: str = "mlabonne/FineTome-100k"
    dataset_split: str = "train"
    subset_size: int | None = None  # Set to None to use the full split
    eval_subset_size: int = 512
    #output_dir: str = "outputs"
    output_dir: str = "/content/drive/MyDrive/unsloth_lora_model"
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 8
    learning_rate: float = 3e-4
    warmup_steps: int = 5
    warmup_ratio: float = 0.03
    lr_scheduler_type: str = "linear"
    weight_decay: float = 0.01
    max_steps: int = -1
    num_train_epochs: int = 1
    logging_steps: int = 1
    save_strategy: str = "steps"
    save_steps: int = 50
    save_total_limit: int = 2
    optim: str = "adamw_8bit"
    seed: int = 0
    lora_r: int = 32
    lora_alpha: int = 64
    lora_dropout: float = 0.0
    use_gradient_checkpointing: str | bool = "unsloth"

CONFIG = TrainConfig()

cfg = CONFIG
print(f"Plan: Training on {CONFIG.subset_size}")
print("Using configuration:", asdict(CONFIG))


Plan: Training on None
Using configuration: {'base_model': 'unsloth/Llama-3.2-1B-Instruct', 'max_seq_length': 2048, 'dtype': None, 'load_in_4bit': True, 'dataset_name': 'mlabonne/FineTome-100k', 'dataset_split': 'train', 'subset_size': None, 'eval_subset_size': 512, 'output_dir': '/content/drive/MyDrive/unsloth_lora_model', 'per_device_train_batch_size': 2, 'gradient_accumulation_steps': 8, 'learning_rate': 0.0003, 'warmup_steps': 5, 'warmup_ratio': 0.03, 'lr_scheduler_type': 'linear', 'weight_decay': 0.01, 'max_steps': -1, 'num_train_epochs': 1, 'logging_steps': 1, 'save_strategy': 'steps', 'save_steps': 50, 'save_total_limit': 2, 'optim': 'adamw_8bit', 'seed': 0, 'lora_r': 32, 'lora_alpha': 64, 'lora_dropout': 0.0, 'use_gradient_checkpointing': 'unsloth'}
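
As a rough sanity check on these defaults, the sketch below works out the effective batch size and the number of optimizer steps in one epoch. The helper name `steps_per_epoch` is hypothetical, and the example count assumes the full 100,000-row dataset minus the 512-example evaluation split on a single GPU.

```python
import math

def steps_per_epoch(num_examples, per_device_batch, grad_accum, num_gpus=1):
    """Return (effective batch size, optimizer steps in one epoch)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return effective_batch, math.ceil(num_examples / effective_batch)

# 100,000 rows minus eval_subset_size = 512 held out for evaluation
effective_batch, steps = steps_per_epoch(100_000 - 512, per_device_batch=2, grad_accum=8)
print(effective_batch, steps)  # 16 6218
```

These figures agree with the `Total batch size (2 x 8 x 1) = 16` and `Total steps = 6,218` lines in the training banner further down.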


In [None]:
from unsloth import FastLanguageModel
import torch

cfg = CONFIG

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit"  # NEW! Llama 3.3 70B!
]  # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = cfg.base_model,
    max_seq_length = cfg.max_seq_length,
    dtype = cfg.dtype,
    load_in_4bit = cfg.load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.10G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = cfg.lora_r,  # Suggested: 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = cfg.lora_alpha,
    lora_dropout = cfg.lora_dropout,  # 0 is fastest but configurable
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = cfg.use_gradient_checkpointing,
    random_state = cfg.seed,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.11.6 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
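
To see where the trainable-parameter count comes from, here is a minimal sketch that counts LoRA weights by hand. Each adapted linear layer gains two small matrices, `A` (r x in) and `B` (out x r), so it adds `r * (in + out)` parameters. The layer shapes below are assumptions taken from the published Llama-3.2-1B config (hidden size 2048, MLP size 8192, key/value projection size 512, 16 layers); the helper names are hypothetical.

```python
def lora_params(in_features, out_features, r):
    # A has shape (r, in_features) and B has shape (out_features, r)
    return r * (in_features + out_features)

# Assumed Llama-3.2-1B shapes, not read from the loaded model
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 2048, 8192, 512, 16

MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def total_lora_params(r, num_layers=NUM_LAYERS):
    """Sum the LoRA weights over all target modules and layers."""
    per_layer = sum(lora_params(i, o, r) for i, o in MODULE_SHAPES.values())
    return per_layer * num_layers

print(total_lora_params(32))  # 22544384
```

With `r = 32` this reproduces the `Trainable parameters = 22,544,384` figure printed in the training banner later in the notebook. The count grows linearly with `r`, so halving the rank halves the adapter size.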


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset, which is in ShareGPT style, but we convert it to HuggingFace's normal multiturn format `("role", "content")` instead of `("from", "value")`. Llama-3 renders multi turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
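
The layout above can be reproduced with a few lines of plain Python. This is only an illustrative sketch: `render_llama3` is a hypothetical helper, not the tokenizer's real chat template, and it leaves out extras such as the default system block that the Llama 3.1 template adds.

```python
def render_llama3(messages):
    """Render a list of {"role", "content"} dicts in the Llama-3 layout."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    return text

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
print(render_llama3(conversation))
```

In the notebook itself the rendering is done by `tokenizer.apply_chat_template`, which should be preferred over any hand-written version.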

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [None]:
# Override configuration for quick debugging
# Run on a smaller subset with fewer training steps
#cfg.subset_size = 1_000  # smaller subset for quick tests
#cfg.max_steps = 50       # small number of training steps



In [None]:
print(f"Dataset Size: {cfg.subset_size}")
print(f"Max Steps: {cfg.max_steps}")
# Prints None and -1 unless the debug overrides above are enabled

Dataset Size: None
Max Steps: -1


In [None]:
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize = False,
            add_generation_prompt = False,
        )
        for convo in convos
    ]
    return {"text": texts}


dataset = load_dataset(cfg.dataset_name, split = cfg.dataset_split)
if cfg.subset_size:
    dataset = dataset.select(range(min(cfg.subset_size, len(dataset))))

# Re-format
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True)

print(f"✅ Ready! Dataset size is now: {len(dataset)}")

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

✅ Ready! Dataset size is now: 100000


We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
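
The conversion amounts to renaming two keys and mapping the speaker names. A minimal sketch of the idea, assuming only the three ShareGPT roles shown above; `sharegpt_to_hf` and `ROLE_MAP` are hypothetical names, and the real `standardize_sharegpt` may handle more cases:

```python
# Assumed mapping from ShareGPT speaker names to HuggingFace role names
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_hf(turns):
    """Convert [{"from", "value"}, ...] turns into [{"role", "content"}, ...]."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in turns]

sharegpt_turns = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
for message in sharegpt_to_hf(sharegpt_turns):
    print(message)
```

An unknown speaker name raises a `KeyError` in this sketch, which makes unexpected dataset formats visible early.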

In [None]:
"""
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)
"""

'\nfrom unsloth.chat_templates import standardize_sharegpt\ndataset = standardize_sharegpt(dataset)\ndataset = dataset.map(formatting_prompts_func, batched = True,)\n'

We look at how the conversations are structured for item 5:

In [None]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`, so do not be alarmed!

In [None]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

## Hyperparameter Optimization (Successive Halving)
Run this section only if you want to change the hyperparameters.

This optional section performs a multi-fidelity hyperparameter search using Successive Halving (SHA). It trains a few candidate configurations for a handful of steps, scores each on a held-out 10% split, keeps the better half, and repeats with twice as many steps until one configuration remains. Run these cells before the main training cell: the winning values are written back into `CONFIG` and the model is reloaded with them.
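
The promotion rule is easier to follow on a toy problem. The sketch below mirrors the loop used in this section (`eta = 2`, `min_steps = 10`, `max_steps = 30`) but swaps the real training run for a hypothetical `toy_loss` function, so it runs instantly and needs no GPU. All names here are illustrative.

```python
def successive_halving(candidates, evaluate, eta=2, min_steps=10, max_steps=30):
    """Return (best candidate, list of (steps, names evaluated)) for one SHA run."""
    active = list(candidates)
    steps = min_steps
    history = []
    while len(active) > 1 and steps <= max_steps:
        history.append((steps, [c["name"] for c in active]))
        # Lower loss is better; keep the top 1/eta of the round
        scored = sorted(active, key=lambda c: evaluate(c, steps))
        active = scored[: max(1, len(scored) // eta)]
        steps *= eta
    return active[0], history

# Hypothetical stand-in for a short training run: loss shrinks with more
# steps and is lowest for a learning rate of 1e-4.
def toy_loss(candidate, steps):
    return abs(candidate["learning_rate"] - 1e-4) + 1.0 / steps

toy_candidates = [
    {"name": "a", "learning_rate": 5e-5},
    {"name": "b", "learning_rate": 1e-4},
    {"name": "c", "learning_rate": 3e-4},
    {"name": "d", "learning_rate": 1e-3},
]
best, history = successive_halving(toy_candidates, toy_loss)
print(best["name"])  # b
print(history)       # [(10, ['a', 'b', 'c', 'd']), (20, ['b', 'a'])]
```

With four candidates the search therefore costs 4 x 10 + 2 x 20 = 80 training steps in total, and the 30-step cap is never reached because only one candidate is left after the second round.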

In [None]:
####### optimization, run only if you want to change the hyperparameters
####

do_hyper_search = True


from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch, gc
from unsloth.chat_templates import train_on_responses_only

# Helper to drop labels column
def drop_labels_column(ds):
    if ds is None: return ds
    if hasattr(ds, "column_names") and "labels" in ds.column_names:
        return ds.remove_columns("labels")
    return ds

def train_once_hpo(config, max_steps, train_dataset, eval_dataset, base_cfg):
    """
    Trains a model for a short duration to evaluate hyperparams.
    """
    # 1. Extract params (merging specific config with global defaults)
    lr = config.get("learning_rate", base_cfg.learning_rate)
    lora_r = config.get("lora_r", base_cfg.lora_r)
    lora_alpha = config.get("lora_alpha", base_cfg.lora_alpha)
    lora_dropout = config.get("lora_dropout", base_cfg.lora_dropout)
    weight_decay = config.get("weight_decay", base_cfg.weight_decay)
    warmup_ratio = config.get("warmup_ratio", base_cfg.warmup_ratio)
    batch_size = config.get("per_device_train_batch_size", base_cfg.per_device_train_batch_size)
    grad_acc = config.get("gradient_accumulation_steps", base_cfg.gradient_accumulation_steps)
    use_rslora = config.get("use_rslora", False) # Support for rslora param

    print(f"--> Training {config['name']} | LR: {lr} | R: {lora_r} | Alpha: {lora_alpha}")

    # 2. Load fresh model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = base_cfg.base_model,
        max_seq_length = base_cfg.max_seq_length,
        dtype = base_cfg.dtype,
        load_in_4bit = base_cfg.load_in_4bit,
    )

    # 3. Add LoRA
    model = FastLanguageModel.get_peft_model(
        model,
        r = lora_r,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj"],
        lora_alpha = lora_alpha,
        lora_dropout = lora_dropout,
        bias = "none",
        use_gradient_checkpointing = base_cfg.use_gradient_checkpointing,
        random_state = base_cfg.seed,
        use_rslora = use_rslora,
    )

    # 4. Define Args (Optimized for speed: No intermediate eval)
    args = TrainingArguments(
        output_dir = f"{base_cfg.output_dir}/hpo_tmp_{config['name']}",
        per_device_train_batch_size = batch_size,
        gradient_accumulation_steps = grad_acc,
        max_steps = max_steps,
        learning_rate = lr,
        warmup_ratio = warmup_ratio,
        lr_scheduler_type = base_cfg.lr_scheduler_type,
        weight_decay = weight_decay,
        logging_steps = max(1, max_steps // 5),
        # evaluation_strategy = "no", # Removed as it causes TypeError in current transformers version
        save_strategy = "no",
        optim = base_cfg.optim,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        report_to = "none",
    )

    # 5. Initialize Trainer
    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = train_dataset,
        eval_dataset = eval_dataset,
        dataset_text_field = "text",
        max_seq_length = base_cfg.max_seq_length,
        args = args,
        packing = False, #True
    )

    # 6. Apply masking
    trainer = train_on_responses_only(
        trainer,
        instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
        response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
    )

    # 7. Train & Eval
    trainer.train()

    # Explicit final evaluation on the validation set
    metrics = trainer.evaluate()
    final_loss = metrics.get("eval_loss", float("inf"))

    # 8. Cleanup
    del model, trainer, tokenizer
    gc.collect()
    torch.cuda.empty_cache()

    return final_loss

## **Part 2: optimization**

In [None]:
####### Part 2: optimization, run only if you want to change the hyperparameters

if do_hyper_search:
    # 1. Prepare Data Splits
    splits = dataset.train_test_split(test_size=0.1, seed=cfg.seed)
    hpo_train = splits["train"]
    hpo_valid = splits["test"]

    # 2. Define Candidates
    candidates = [
        # YOUR CONFIGURATION
        {
            "name": "cfg_user_default",
            "learning_rate": 3e-4,
            "lora_r": 16,
            "lora_alpha": 16,
            "lora_dropout": 0.0,
            "weight_decay": 0.01,
            "warmup_ratio": 0.03,
            "per_device_train_batch_size": 8,
            "gradient_accumulation_steps": 2,
            "use_rslora": False
        },
        # Experimental Config 1: Lower LR, Safe
        {
            "name": "cfg_safe_low",
            "learning_rate": 5e-5,
            "lora_r": 8,
            "lora_alpha": 16,
            "lora_dropout": 0.0,
            "weight_decay": 0.01,
            "warmup_ratio": 0.03,
            "per_device_train_batch_size": 2,
            "gradient_accumulation_steps": 4,
            "use_rslora": False
        },
        # Experimental Config 2: High R, High Alpha (Aggressive)
        {
            "name": "cfg_aggressive",
            "learning_rate": 3e-4,
            "lora_r": 32,
            "lora_alpha": 64,
            "lora_dropout": 0.0,
            "weight_decay": 0.01,
            "warmup_ratio": 0.03,
            "per_device_train_batch_size": 2,
            "gradient_accumulation_steps": 8,
            "use_rslora": True # Testing rsLoRA here
        },
        {
            "name": "cfg_recommended",
            "learning_rate": 1e-4,          # lower than 3e-4 → more stable
            "lora_r": 32,                   # higher rank than your default 16
            "lora_alpha": 64,               # scaled with r
            "lora_dropout": 0.05,           # add a bit of regularization
            "weight_decay": 0.01,           # same as before
            "warmup_ratio": 0.10,           # more warmup for smoother start
            "per_device_train_batch_size": 2,
            "gradient_accumulation_steps": 8,  # effective batch size = 16
            "use_rslora": True,             # enable rsLoRA
        },
    ]

    # 3. Run Successive Halving
    eta = 2
    min_steps = 10
    max_steps = 30
    current_steps = min_steps
    active_candidates = candidates.copy()
    round_idx = 0

    print(f"Starting HPO with {len(active_candidates)} configs...")

    while len(active_candidates) > 1 and current_steps <= max_steps:
        round_idx += 1
        print(f"\n=== Round {round_idx} (Steps: {current_steps}) ===")

        results = []
        for cand in active_candidates:
            loss = train_once_hpo(cand, current_steps, hpo_train, hpo_valid, cfg)
            print(f"   >>> {cand['name']} Loss: {loss:.4f}")
            results.append((loss, cand))

        # Sort by loss (lower is better)
        results.sort(key=lambda x: x[0])

        # Keep top 1/eta
        n_keep = max(1, len(results) // eta)
        active_candidates = [c for _, c in results[:n_keep]]
        print(f"   >>> Promoting top {n_keep}: {[c['name'] for c in active_candidates]}")

        current_steps *= eta

    best_config = active_candidates[0]
    print(f"\n🏆 Best Configuration: {best_config['name']}")
    print(best_config)
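
The candidates differ in `lora_r`, `lora_alpha`, and `use_rslora`, which together decide how strongly the adapter update is scaled. A minimal sketch, assuming the usual definitions: standard LoRA multiplies the update by `alpha / r`, while rank-stabilized LoRA uses `alpha / sqrt(r)`. `lora_scale` is a hypothetical helper, not part of Unsloth.

```python
import math

def lora_scale(r, alpha, use_rslora=False):
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

# (name, r, alpha, use_rslora) for the four candidates in this section
candidate_settings = [
    ("cfg_user_default", 16, 16, False),
    ("cfg_safe_low", 8, 16, False),
    ("cfg_aggressive", 32, 64, True),
    ("cfg_recommended", 32, 64, True),
]
for name, r, alpha, use_rslora in candidate_settings:
    print(name, round(lora_scale(r, alpha, use_rslora), 2))
```

Because the rsLoRA candidates end up with a much larger multiplier than the plain LoRA ones at the same `alpha`, a difference in loss between them reflects the scaling as well as the rank and learning rate.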

In [None]:
if do_hyper_search: print(cfg)

### **Overwriting the configuration**

In [None]:
####### optimization, run only if you want to change the hyperparameters

if do_hyper_search:
    print("Applying best hyperparameters to main config...")

    # Overwrite global config with the winner's values
    CONFIG.learning_rate = best_config.get("learning_rate", CONFIG.learning_rate)
    CONFIG.lora_r = best_config.get("lora_r", CONFIG.lora_r)
    CONFIG.lora_alpha = best_config.get("lora_alpha", CONFIG.lora_alpha)
    CONFIG.lora_dropout = best_config.get("lora_dropout", CONFIG.lora_dropout)
    CONFIG.weight_decay = best_config.get("weight_decay", CONFIG.weight_decay)
    CONFIG.per_device_train_batch_size = best_config.get("per_device_train_batch_size", CONFIG.per_device_train_batch_size)
    CONFIG.gradient_accumulation_steps = best_config.get("gradient_accumulation_steps", CONFIG.gradient_accumulation_steps)

    # Determine if we should use rsLoRA (default to False if not in config)
    use_rslora_final = best_config.get("use_rslora", False)

    # RELOAD MODEL FOR FINAL TRAINING
    # We must do this to reset the LoRA adapters with the winning settings
    print("Reloading model for final training run...")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = CONFIG.base_model,
        max_seq_length = CONFIG.max_seq_length,
        dtype = CONFIG.dtype,
        load_in_4bit = CONFIG.load_in_4bit,
    )

    model = FastLanguageModel.get_peft_model(
        model,
        r = CONFIG.lora_r,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha = CONFIG.lora_alpha,
        lora_dropout = CONFIG.lora_dropout,
        bias = "none",
        use_gradient_checkpointing = CONFIG.use_gradient_checkpointing,
        random_state = CONFIG.seed,
        use_rslora = use_rslora_final,
    )

    print("✅ Model reloaded and ready for Trainer!")

In [None]:
if do_hyper_search:
    print(CONFIG)
    print(cfg)
    print(f"Are they the same object? {cfg is CONFIG}")
    print(len(dataset))
    # Output should be: True

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). With the configuration above (`max_steps = -1`, `num_train_epochs = 1`) this trains for one full epoch. For a quick test, set `cfg.max_steps` to a small positive number such as 50, which takes precedence over `num_train_epochs`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from google.colab import drive
from transformers.trainer_utils import get_last_checkpoint
import os

# 1. Mount Drive
drive.mount('/content/drive')

# 2. Set your permanent storage path
# This will hold your checkpoints. If you switch accounts, point this to the same shared folder.
drive_output_dir = "/content/drive/MyDrive/unsloth_lora_model"

# 3. Update Config to save directly to Drive
# We force this here to ensure it overrides any previous settings
cfg.output_dir = drive_output_dir



effective_test_size = min(cfg.eval_subset_size, len(dataset) - 1)
splits = dataset.train_test_split(test_size=effective_test_size, seed=cfg.seed)
train_ds = splits["train"]
eval_ds = splits["test"]

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    #train_dataset = dataset,
    train_dataset = train_ds, # Uses the training split
    eval_dataset = eval_ds,

    dataset_text_field = "text",
    max_seq_length = cfg.max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 1,  # Avoid multiprocessing issues on Colab
    packing = False,  # Packing is turned off to avoid data format issues with the response-only masking below.
    args = TrainingArguments(
        per_device_train_batch_size = cfg.per_device_train_batch_size,
        gradient_accumulation_steps = cfg.gradient_accumulation_steps,
        warmup_steps = cfg.warmup_steps,
        max_steps = cfg.max_steps,  # -1 disables the step cap, so num_train_epochs decides the run length.
        num_train_epochs = cfg.num_train_epochs,
        learning_rate = cfg.learning_rate,
        warmup_ratio = cfg.warmup_ratio,  # first X% of steps increase gradually
        lr_scheduler_type = cfg.lr_scheduler_type,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = cfg.logging_steps,
        save_strategy = cfg.save_strategy,

        save_steps = cfg.save_steps,
        save_total_limit = cfg.save_total_limit,
        optim = cfg.optim,
        weight_decay = cfg.weight_decay,
        seed = cfg.seed,
        output_dir = cfg.output_dir,
        report_to = "none",  # Use this for WandB etc
    ),
)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/99488 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/512 [00:00<?, ? examples/s]
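
The trainer above receives both `warmup_steps` and `warmup_ratio`. The sketch below shows how the two are expected to interact, assuming the Transformers 4.x rule that an explicit positive `warmup_steps` takes precedence and the ratio is only used otherwise. `resolve_warmup_steps` is a hypothetical helper written for illustration, not a library function.

```python
import math

def resolve_warmup_steps(total_steps, warmup_steps=0, warmup_ratio=0.0):
    """An explicit warmup_steps > 0 wins; otherwise derive the count from the ratio."""
    if warmup_steps > 0:
        return warmup_steps
    return math.ceil(total_steps * warmup_ratio)

print(resolve_warmup_steps(6218, warmup_steps=5, warmup_ratio=0.03))  # 5
print(resolve_warmup_steps(6218, warmup_steps=0, warmup_ratio=0.03))  # 187
```

Under that assumption the default config warms up for only 5 steps. Setting `cfg.warmup_steps = 0` would let the 3% ratio apply instead.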

We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs.

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
    num_proc = 1, # Set num_proc to 1 to avoid multiprocessing issues with .tolist()
)



Map (num_proc=1):   0%|          | 0/99488 [00:00<?, ? examples/s]

Map (num_proc=1):   0%|          | 0/512 [00:00<?, ? examples/s]

We verify masking is actually done:

In [None]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow can I create a function to check whether a given input is an even number or not?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou can use the following code to create a function that determines whether the given input is even or not:\n\n```python\ndef isEven(input):\n    if (input % 2 == 0):\n        return True\n    else:\n        return False\n```\n\nHere's how the code works:\n\n1. The function `isEven` takes an `input` as a parameter.\n2. Inside the function, it checks if the `input` modulo 2 is equal to 0. The modulo operator `%` returns the remainder of the division.\n3. If the remainder is 0, it means that the `input` is perfectly divisible by 2, indicating that it is an even number. In this case, the function returns `True`.\n4. If the remainder is not 0, it mean

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

"                                                       You can use the following code to create a function that determines whether the given input is even or not:\n\n```python\ndef isEven(input):\n    if (input % 2 == 0):\n        return True\n    else:\n        return False\n```\n\nHere's how the code works:\n\n1. The function `isEven` takes an `input` as a parameter.\n2. Inside the function, it checks if the `input` modulo 2 is equal to 0. The modulo operator `%` returns the remainder of the division.\n3. If the remainder is 0, it means that the `input` is perfectly divisible by 2, indicating that it is an even number. In this case, the function returns `True`.\n4. If the remainder is not 0, it means that the `input` is not divisible by 2, indicating that it is an odd number. In this case, the function returns `False`.\n\nBy using this function, you can easily determine whether a given input is an even number or not.<|eot_id|>"

We can see the System and Instruction prompts are successfully masked!
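
Conceptually, the masking copies `input_ids` into `labels` and replaces every token outside an assistant reply with `-100`, the value PyTorch's cross-entropy loss ignores by default. Here is a toy sketch of that idea on plain integer lists. It is an assumption-laden illustration, not Unsloth's implementation, and `mask_non_response` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def mask_non_response(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    in_response = False
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + len(response_ids)] == response_ids:
            in_response = True           # assistant turn starts after the marker
            i += len(response_ids)       # the marker itself stays masked
            continue
        if input_ids[i:i + len(instruction_ids)] == instruction_ids:
            in_response = False          # user turn: stop training on tokens
            i += len(instruction_ids)
            continue
        if in_response:
            labels[i] = input_ids[i]
        i += 1
    return labels

# Toy ids: [10, 11] marks a user turn, [20, 21] marks an assistant turn
toy_ids = [1, 10, 11, 5, 6, 9, 20, 21, 7, 8, 9, 10, 11, 3, 9]
print(mask_non_response(toy_ids, [10, 11], [20, 21]))
```

Only the three tokens `7, 8, 9` after the assistant marker keep their labels. Everything else becomes `-100`, which is what the decoded blanks in the output above correspond to.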

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
1.203 GB of memory reserved.


In [None]:
from transformers.trainer_utils import get_last_checkpoint
import os

# 1. Check if we have a valid checkpoint in the output directory
last_checkpoint = None
if os.path.isdir(cfg.output_dir):
    last_checkpoint = get_last_checkpoint(cfg.output_dir)

# 2. Run training (resume if checkpoint found, otherwise start fresh)
if last_checkpoint:
    print(f"Resuming training from checkpoint: {last_checkpoint}")
    trainer_stats = trainer.train(resume_from_checkpoint=True)
else:
    print("No checkpoint found. Starting training from scratch.")
    trainer_stats = trainer.train()

# Report the metrics whether the run was resumed or started fresh
print(trainer_stats.metrics)


The model is already on multiple devices. Skipping the move to device specified in `args`.


Resuming training from checkpoint: /content/drive/MyDrive/unsloth_lora_model/checkpoint-6218


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 99,488 | Num Epochs = 1 | Total steps = 6,218
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 22,544,384 of 1,258,358,784 (1.79% trained)


Step,Training Loss
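
Resuming relies on finding the highest-numbered `checkpoint-N` folder in the output directory. The sketch below imitates that lookup with the standard library only. `find_last_checkpoint` is a hypothetical stand-in for illustration and may differ in details from `get_last_checkpoint`.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def find_last_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    steps = [
        int(m.group(1))
        for name in os.listdir(output_dir)
        if (m := CHECKPOINT_RE.match(name))
        and os.path.isdir(os.path.join(output_dir, name))
    ]
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")

# Demo on a throwaway directory with made-up checkpoint folders
with tempfile.TemporaryDirectory() as tmp:
    for step in (50, 6200, 6218):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    last = find_last_checkpoint(tmp)
    print(os.path.basename(last))  # checkpoint-6218
```

In the recorded run above, the resumed checkpoint is `checkpoint-6218` and the banner reports `Total steps = 6,218`, so there was nothing left to train. That would explain why the timing cell below shows a runtime of a fraction of a second.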


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

0.0235 seconds used for training.
0.0 minutes used for training.
Peak reserved memory = 1.215 GB.
Peak reserved memory for training = 0.012 GB.
Peak reserved memory % of max memory = 8.242 %.
Peak reserved memory for training % of max memory = 0.081 %.


### Quick evaluation & sanity checks
Define a helper you can call *after* `trainer.train()` to sanity-check loss/metrics on a small subset before scaling up.


In [None]:
####### Run this only after training, in order to evaluate the trained model


def run_quick_eval():

    # Create a copy of the processed evaluation dataset from the trainer.
    # This dataset should already have 'input_ids' and 'labels' after train_on_responses_only.
    # Use .select(range(len(dataset))) to create a proper copy for datasets.Dataset objects.
    eval_dataset_for_eval = trainer.eval_dataset.select(range(len(trainer.eval_dataset)))

    # Remove columns that cause conflict with remove_unused_columns=True in TrainingArguments.
    # The ValueError explicitly mentioned these columns being ignored/causing issues.
    columns_to_remove = ["conversations", "score", "source", "text"]
    for col in columns_to_remove:
        if col in eval_dataset_for_eval.column_names:
            eval_dataset_for_eval = eval_dataset_for_eval.remove_columns(col)

    metrics = trainer.evaluate(eval_dataset = eval_dataset_for_eval, metric_key_prefix = "eval")
    print(metrics)
    return metrics

# Example usage once training has finished:
run_quick_eval()
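
A raw loss value is hard to read on its own, so it can help to convert it to perplexity. The helper below is a small sketch, not part of the notebook's pipeline: it assumes the metrics dictionary contains an `eval_loss` key holding the mean cross-entropy in nats, and the numbers in the example are made up for illustration.

```python
import math

def add_perplexity(metrics, loss_key="eval_loss"):
    # Copy so the original metrics dict is left untouched.
    out = dict(metrics)
    if loss_key in out:
        # Perplexity is the exponential of the mean cross-entropy loss.
        out["eval_perplexity"] = math.exp(out[loss_key])
    return out

# Illustrative values only, not results from this run.
example = add_perplexity({"eval_loss": 1.0, "eval_runtime": 12.3})
print(example["eval_perplexity"])
```

In the notebook you could call `add_perplexity(run_quick_eval())` once training has finished.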


<a name="Inference"></a>
### Inference
Let's run the model! You can change the messages you send; `add_generation_prompt = True` leaves the assistant turn open so the model fills it in.
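
To see what "leaving the assistant turn open" means, here is a simplified approximation of the Llama 3 chat layout. `render_llama3_prompt` is a hypothetical helper written for illustration; the real `llama-3.1` template applied by `get_chat_template` does more (for example it injects a system block with date lines), so use `tokenizer.apply_chat_template` for actual inference.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    # Simplified sketch of the Llama 3 chat layout, not the full template.
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header with no content: the model writes the reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = render_llama3_prompt([{"role": "user", "content": "Hello!"}])
print(prompt)
```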

We sample with `min_p = 0.1` and a relatively high temperature (up to `1.5`; the helper below defaults to `0.9`). Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
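
As a rough illustration of what `min_p` does: tokens whose probability falls below `min_p` times the probability of the most likely token are dropped, and the rest are renormalised. The function below is a plain-Python sketch of that idea, not the implementation used by `transformers`; `min_p_filter` is a hypothetical name, and applying the temperature before the filter is a choice made for this sketch.

```python
import math

def min_p_filter(logits, min_p=0.1, temperature=1.0):
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only tokens at least min_p times as likely as the best token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalise the survivors so they sum to 1 again.
    norm = sum(kept)
    return [p / norm for p in kept]

filtered = min_p_filter([2.0, 1.0, 0.0, -3.0], min_p=0.1, temperature=1.5)
print(sum(1 for p in filtered if p > 0))  # 3
```

Because the threshold scales with the top probability, a higher temperature can flatten the distribution without letting very unlikely tokens back in.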

TrainConfig(base_model='unsloth/Llama-3.2-1B-Instruct', max_seq_length=2048, dtype=None, load_in_4bit=True, dataset_name='mlabonne/FineTome-100k', dataset_split='train', subset_size=10000, eval_subset_size=512, output_dir='/content/drive/MyDrive/unsloth_lora_model', per_device_train_batch_size=2, gradient_accumulation_steps=8, learning_rate=0.0003, warmup_steps=5, warmup_ratio=0.03, lr_scheduler_type='linear', weight_decay=0.01, max_steps=-1, num_train_epochs=1, logging_steps=1, save_strategy='steps', save_steps=50, save_total_limit=2, optim='adamw_8bit', seed=0, lora_r=32, lora_alpha=64, lora_dropout=0.0, use_gradient_checkpointing='unsloth')
Are they the same object? True

In [None]:
from typing import Dict, List
from unsloth.chat_templates import get_chat_template


tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference


def generate_response(
    messages: List[Dict[str, str]],
    *,
    max_new_tokens: int = 128,
    temperature: float = 0.9,
    min_p: float = 0.1,
    stream: bool = False,
):
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True,
        return_tensors = "pt",
    ).to("cuda")

    generation_kwargs = dict(
        input_ids = inputs,
        max_new_tokens = max_new_tokens,
        use_cache = True,
        temperature = temperature,
        min_p = min_p,
    )

    if stream:
        from transformers import TextStreamer
        generation_kwargs["streamer"] = TextStreamer(tokenizer, skip_prompt = True)
        model.generate(**generation_kwargs)
        return None

    outputs = model.generate(**generation_kwargs)
    return tokenizer.batch_decode(outputs, skip_special_tokens = True)[0]


# Example usage
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
demo_reply = generate_response(messages, max_new_tokens = 64)
print(demo_reply)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


system

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

user

Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,assistant

The Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones. Here is the continued sequence:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233
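
The attention-mask warning printed before this output is worth understanding. When the pad token and the EOS token share an id, a mask cannot be guessed from the token ids alone, because a genuine EOS looks exactly like padding. The sketch below shows this with made-up token ids; `pad_batch` is a hypothetical helper, not a tokenizer API.

```python
def pad_batch(sequences, pad_id, side="left"):
    # Pad every sequence to the longest one and record which positions are real.
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        gap = width - len(seq)
        if side == "left":
            input_ids.append([pad_id] * gap + list(seq))
            attention_mask.append([0] * gap + [1] * len(seq))
        else:
            input_ids.append(list(seq) + [pad_id] * gap)
            attention_mask.append([1] * len(seq) + [0] * gap)
    return input_ids, attention_mask

EOS_ID = 9  # made-up id that doubles as the pad id
ids, mask = pad_batch([[5, EOS_ID], [1, 2, 3]], pad_id=EOS_ID)
naive = [[int(tok != EOS_ID) for tok in row] for row in ids]

print(ids[0])    # [9, 5, 9]
print(mask[0])   # [0, 1, 1]  the real EOS stays visible
print(naive[0])  # [0, 1, 0]  guessing from ids hides the real EOS
```

For the single, unpadded prompt used in `generate_response`, every position is real, so passing `attention_mask = torch.ones_like(inputs)` to `model.generate` should be enough to silence the warning.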


You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full reply!

In [None]:
_ = generate_response(
    [{"role": "user", "content": "Give me 3 creative use cases for LoRA adapters."}],
    stream = True,
    max_new_tokens = 128,
    temperature = 1.1,
    min_p = 0.1,
)
# Streaming returns None because tokens are printed incrementally.

1. Smart Lighting: LoRA adapters can be used to create smart lighting systems by integrating them with LED strips. The LoRA adapter is used to power and control the LED strips, which can change brightness or color based on the application. This creates an energy-efficient and adaptive lighting system.

2. IoT Devices: LoRA adapters can be used as a bridge between IoT devices and the internet, making it easier to connect these devices to the cloud. The LoRA adapter is used to transmit data between the devices and the cloud, ensuring that the data is received correctly and that the devices are able to communicate effectively.

3. Wearable Technology


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
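
A quick back-of-the-envelope calculation shows why the adapter-only save is so much smaller than a merged model. The sketch assumes the adapters are stored in float32 (4 bytes per weight), the merged model in 16-bit (2 bytes per weight), and that the total parameter count in the training banner includes the adapter weights; file headers are ignored. `checkpoint_size_mb` is an illustrative helper.

```python
def checkpoint_size_mb(num_params, bytes_per_param):
    # Decimal megabytes, ignoring file-format overhead.
    return num_params * bytes_per_param / 1e6

adapter_params = 22_544_384                     # trainable parameters from the training banner
base_params = 1_258_358_784 - adapter_params    # ASSUMPTION: banner total includes the adapters

print(round(checkpoint_size_mb(adapter_params, 4), 1))      # 90.2  (MB, float32 adapters)
print(round(checkpoint_size_mb(base_params, 2) / 1000, 2))  # 2.47  (GB, 16-bit weights)
```

These estimates line up with the `adapter_model.safetensors` and `model.safetensors` sizes that appear in the upload and download logs of this notebook.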

In [None]:
HF_USER = "oloflil"  # Change this to your Hugging Face handle
HF_TOKEN = None  #TODO add your access token here

model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
model.push_to_hub(HF_USER + "/model", token = HF_TOKEN) # Online saving
tokenizer.push_to_hub(HF_USER+ "/model", token = HF_TOKEN) # Online saving

README.md:   0%|          | 0.00/616 [00:00<?, ?B/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...adapter_model.safetensors:   1%|          |  553kB / 90.2MB            

Saved model to https://huggingface.co/oloflil/model


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...mpgc81mzif/tokenizer.json: 100%|##########| 17.2MB / 17.2MB            

No files have been modified since last commit. Skipping to prevent empty commit.


Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = cfg.max_seq_length,
        dtype = cfg.dtype,
        load_in_4bit = cfg.load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

A tall tower stands proudly at the center of a square city square in Paris. Its towering height is almost 600 meters, which is considered a national landmark. The tower is a marvel to the architecture and engineering of its time and stands as a testament to the creativity and innovation of the French people during the Industrial Revolution era.

The tower is adorned with intricate carvings and intricate details of French architecture. It is clad in stone and is built in an architectural style that is influenced by the neoclassical style. The tower has been a popular tourist destination since its construction in the 17th and 18th centuries. Today,


### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
import os

def push_merged_model(save_method: str = "merged_16bit", repo_name: str = "model"):
    if not HF_TOKEN:
        raise ValueError("Set HF_TOKEN to your Hugging Face access token before pushing.")
    repo_id = f"{HF_USER}/{repo_name}"
    print(f"Uploading {save_method} weights to {repo_id}...")
    model.push_to_hub_merged(repo_id, tokenizer, save_method = save_method, token = HF_TOKEN)


# Examples (uncomment the ones you need):
#model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
#push_merged_model(save_method = "merged_16bit", repo_name = "my-llama-run")
# model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit")
# push_merged_model(save_method = "lora", repo_name = "my-llama-run-lora")

In [None]:

import os

merged_model_dir = "model"
# Output path on the mounted Drive (assumes `drive_output_dir` was set earlier); change it to your own folder.
gdrive_gguf_output_file = os.path.join(drive_output_dir, "model-1b-Q8_0.gguf")

# Save the merged 16-bit model together with its tokenizer
model.save_pretrained_merged(merged_model_dir, tokenizer, save_method = "merged_16bit")
tokenizer.save_pretrained(merged_model_dir)

# --- Prepare llama.cpp ---
if not os.path.isdir("llama.cpp"):
    !git clone https://github.com/ggerganov/llama.cpp.git
else:
    print("llama.cpp already present.")
!pip install -U "transformers" "huggingface_hub"
# No `make` step here: llama.cpp replaced its Makefile build with CMake, and the
# conversion script below is a Python script that runs without compiling anything.

# --- GGUF conversion ---
!python llama.cpp/convert_hf_to_gguf.py {merged_model_dir} --outfile {gdrive_gguf_output_file} --outtype q8_0

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [01:33<00:00, 93.95s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [01:02<00:00, 62.67s/it]


Unsloth: Merge process complete. Saved to `/content/model`
llama.cpp already present.
Collecting huggingface_hub
  Using cached huggingface_hub-1.2.1-py3-none-any.whl.metadata (13 kB)
Makefile:6: *** Build system changed:
 The Makefile build has been replaced by CMake.

 For build instructions see:
 https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

.  Stop.
INFO:hf-to-gguf:Loading model: model
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {32}
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> Q8_0, shape = {2048, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> Q8_0, shape = {8192, 2048}
INFO:hf-to-gguf:blk.

Unsloth: Converting model to GGUF format...
Unsloth: Merging model weights to 16-bit format...


config.json:   0%|          | 0.00/894 [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:   0%|          | 0/1 [02:36<?, ?it/s]


KeyboardInterrupt: 

Now use the exported GGUF file (`model-1b-Q8_0.gguf` in this notebook) in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).
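
For intuition about the `q8_0` output type used in the conversion command: weights are grouped into blocks of 32, and each block is stored as 32 signed 8-bit integers plus one scale, which works out to 8.5 bits per weight. The functions below are a simplified sketch of that scheme with hypothetical names. They keep the scale as a Python float, whereas the real format stores it in 16-bit floating point, and they use Python's `round`, which rounds ties differently from the C implementation.

```python
BLOCK_SIZE = 32

def quantize_q8_0_block(values):
    # One scale per block, chosen so the largest magnitude maps to 127.
    assert len(values) == BLOCK_SIZE
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax else 0.0
    inv = 1.0 / scale if scale else 0.0
    quants = [round(v * inv) for v in values]
    return scale, quants

def dequantize_q8_0_block(scale, quants):
    # Reconstruction is a single multiply per weight.
    return [scale * q for q in quants]

block = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
scale, quants = quantize_q8_0_block(block)
restored = dequantize_q8_0_block(scale, quants)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(worst <= scale / 2 + 1e-12)  # True
```

The reconstruction error per weight is at most half a quantisation step, which is why `q8_0` is usually treated as a near-lossless export.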

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>

### Human evaluation
Rate the finetuned model against the base model (the same weights with the LoRA adapter disabled) on a small set of prompts.
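
One possible refinement, sketched here rather than wired into the notebook: hide which answer came from which model so the rater is not biased, then report a win rate over the decided votes. All names below (`blind_pair`, `record_vote`, `win_rate`) are hypothetical helpers.

```python
import random

def blind_pair(base_resp, ft_resp, rng):
    # Shuffle so the rater sees "A" and "B" without knowing the source.
    pair = [("base", base_resp), ("finetuned", ft_resp)]
    rng.shuffle(pair)
    return pair

def record_vote(stats, pair, choice):
    # choice is "A", "B" or "tie"; map it back to the hidden label.
    if choice == "tie":
        stats["tie"] += 1
    else:
        label = pair[0][0] if choice == "A" else pair[1][0]
        stats[label] += 1

def win_rate(stats, label="finetuned"):
    # Ties are excluded from the denominator.
    decided = stats["base"] + stats["finetuned"]
    return stats[label] / decided if decided else None

rng = random.Random(0)
stats = {"base": 0, "finetuned": 0, "tie": 0}
pair = blind_pair("base answer", "finetuned answer", rng)
record_vote(stats, pair, "A")
print(win_rate({"base": 1, "finetuned": 3, "tie": 1}))  # 0.75
```

With only five prompts the win rate is a rough signal at best, so a larger prompt file gives a more trustworthy comparison.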

In [None]:
# Create the prompt file directly in Colab
prompts_content = """Explain the concept of quantum entanglement to a 5-year-old.
Write a python function for the Fibonacci sequence.
Give me a recipe for a healthy breakfast.
Write a short story about a robot who discovers emotions.
Explain the difference between SQL and NoSQL.
"""

with open("eval_prompts.txt", "w") as f:
    f.write(prompts_content)

print("✅ eval_prompts.txt created successfully!")



✅ eval_prompts.txt created successfully!


In [None]:
import torch
import os
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# --- Configuration ---
# IF YOUR MODEL IS ON HUGGING FACE: Change this to "your-username/model-name"
# IF YOUR MODEL IS LOCAL (Colab): Keep it as "lora_model" (or path to Drive)
FINETUNED_MODEL_PATH = "lora_model"
PROMPT_FILE_PATH = "eval_prompts.txt"
MAX_SEQ_LENGTH = 2048
DTYPE = None
LOAD_IN_4BIT = True

def load_prompts(file_path):
    if not os.path.exists(file_path):
        print(f"Error: {file_path} not found.")
        return []
    with open(file_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

def load_model_for_eval(model_path):
    print(f"Loading model from: {model_path}...")
    try:
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_path,
            max_seq_length=MAX_SEQ_LENGTH,
            dtype=DTYPE,
            load_in_4bit=LOAD_IN_4BIT,
        )
    except OSError:
        print(f"❌ Error: Could not find folder '{model_path}'.")
        print("If you just restarted Colab, your local files are gone.")
        print("Please mount Google Drive or use a HuggingFace model ID.")
        raise

    # Set up the tokenizer for chat
    tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
    FastLanguageModel.for_inference(model)
    return model, tokenizer

def generate_response(model, tokenizer, prompt):
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=256,
        use_cache=True,
        temperature=0.7,
    )

    # Slice off the input tokens so we only see the new answer
    generated_tokens = outputs[0][inputs.shape[1]:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True)

# --- Main Logic ---
def run_evaluation():
    prompts = load_prompts(PROMPT_FILE_PATH)
    if not prompts: return

    # Load the fine-tuned model
    # (We will disable the adapter later to simulate the base model)
    model, tokenizer = load_model_for_eval(FINETUNED_MODEL_PATH)

    stats = {"base": 0, "finetuned": 0, "tie": 0}

    print(f"\n🚀 Starting evaluation on {len(prompts)} prompts...")

    for i, prompt in enumerate(prompts):
        print(f"\n{'='*60}")
        print(f"📝 PROMPT {i+1}/{len(prompts)}: {prompt}")
        print(f"{'='*60}")

        # 1. Generate BASE response (Disable LoRA)
        print("⏳ Generating Base response...", end="\r")
        with model.disable_adapter():
            base_resp = generate_response(model, tokenizer, prompt)

        # 2. Generate FINETUNED response (LoRA Active)
        print("⏳ Generating Fine-tuned response...", end="\r")
        ft_resp = generate_response(model, tokenizer, prompt)

        # 3. Print Results for Comparison
        print("\n" + "--- [1] BASE MODEL ---".center(60))
        print(base_resp.strip())
        print("\n" + "--- [2] FINE-TUNED ---".center(60))
        print(ft_resp.strip())
        print("-" * 60)

        # 4. User Vote
        while True:
            choice = input("🏆 Vote (1=Base, 2=Finetuned, 3=Tie): ").strip()
            if choice == '1':
                stats["base"] += 1; break
            elif choice == '2':
                stats["finetuned"] += 1; break
            elif choice == '3':
                stats["tie"] += 1; break
            else:
                print("Invalid choice.")

    # Final Score
    print("\n" + "#"*30)
    print("📊 FINAL RESULTS")
    print("#"*30)
    print(f"Base Model Wins:   {stats['base']}")
    print(f"Fine-tuned Wins:   {stats['finetuned']}")
    print(f"Ties:              {stats['tie']}")

# Run it
run_evaluation()

Loading model from: lora_model...
==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

🚀 Starting evaluation on 5 prompts...

📝 PROMPT 1/5: Explain the concept of quantum entanglement to a 5-year-old.

                   --- [1] BASE MODEL ---                   
Imagine you have two toy cars, one red and one blue. You play with them separately, but then you connect them to a special machine that makes them talk to each other.

If you do something to the red toy car, like make it go "vroom", the blue toy car will be affected too. Even if they're on opposite sides of the world, they'r