To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed under [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News


Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


Let's take a look at the dataset and check what the first example shows.

What is unique about GPT-OSS is that it uses the OpenAI [Harmony](https://github.com/openai/harmony) format, which supports conversation structures, reasoning output, and tool calling.
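
As a rough illustration of the structure Harmony encodes, the sketch below renders a short conversation into a Harmony-style string. The helper `render_harmony` and the `channel` key on each message are hypothetical, and the output is deliberately simplified (no system header, no tool calls). The tokenizer's own chat template remains the source of truth for what the model actually sees.

```python
def render_harmony(messages):
    """Render messages into a simplified Harmony-style string (illustrative only)."""
    parts = []
    last = len(messages) - 1
    for i, msg in enumerate(messages):
        header = msg["role"]
        channel = None
        if msg["role"] == "assistant":
            # Assistant turns carry a channel, e.g. "analysis" (reasoning) or "final".
            channel = msg.get("channel", "final")
            header += f"<|channel|>{channel}"
        # The closing final answer ends with <|return|>; other messages end with <|end|>.
        closes_turn = i == last and channel == "final"
        terminator = "<|return|>" if closes_turn else "<|end|>"
        parts.append(f"<|start|>{header}<|message|>{msg['content']}{terminator}")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "channel": "analysis", "content": "Simple arithmetic."},
    {"role": "assistant", "channel": "final", "content": "2 + 2 = 4."},
]
rendered = render_harmony(conversation)
print(rendered)
```

The separate `analysis` and `final` channels are what let the format carry reasoning output alongside the answer shown to the user.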

In [1]:
# -*- coding: utf-8 -*-
# --- Complete LoRA Fine-tuning Script for gpt-oss-20b with Unsloth ---

# --- 1. Installation ---
# Ensure necessary libraries are installed (run this cell first in Colab)
# %%capture
import os, importlib.util
# !pip install --upgrade -qqq uv # Already done if running previous cells
if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):
    try: import numpy, PIL; get_numpy = f"numpy=={numpy.__version__}"; get_pil = f"pillow=={PIL.__version__}"
    except: get_numpy = "numpy"; get_pil = "pillow"
    # Note: transformers==4.56.2 is specified in the original code, ensure compatibility
    # If issues arise, check Unsloth docs for recommended versions.
    print("Installing necessary packages...")
    os.system(f"""
    uv pip install -qqq \\
        "torch>=2.8.0" "triton>=3.4.0" {get_numpy} {get_pil} torchvision bitsandbytes "transformers==4.56.2" \\
        datasets "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \\
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \\
        git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels \\
        "trl>=0.22.0" # Ensure TRL version supports needed features
    """)
    print("Installation complete.")
elif importlib.util.find_spec("unsloth") is None:
    print("Installing Unsloth...")
    os.system("uv pip install -qqq unsloth[base]")
    print("Installation complete.")


Installing necessary packages...
Installation complete.


In [2]:
# --- 2. Imports ---
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from transformers import TextStreamer
import gc # For garbage collection


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
# --- 3. Configuration ---
model_name = "unsloth/gpt-oss-20b" # Or "unsloth/gpt-oss-20b-unsloth-bnb-4bit"
max_seq_length = 2048  # Choose based on your VRAM and data, e.g., 1024, 2048
load_in_4bit = True   # Use 4-bit quantization
dtype = None          # Auto detection
lora_r = 8            # LoRA rank
lora_alpha = 16       # LoRA alpha
per_device_train_batch_size = 1 # Keep low for 20B on T4/free Colab
gradient_accumulation_steps = 8 # Effective batch size = batch_size * grad_accum
num_train_epochs = 2  # Number of passes over the dataset (1-3 recommended)
learning_rate = 2e-4
output_dir = "outputs"
data_file = "finetune_data.jsonl" # Your exported JSONL file
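
The batch settings above determine how many optimizer steps a run takes. The sketch below works through that arithmetic for the 950-row dataset this notebook loads later. The helper name `total_optimizer_steps` is hypothetical, and the rounding rule (a partial final batch still counts as one step) is an assumption, although it reproduces the "Total steps = 238" figure in the training banner.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective batch size, total optimizer steps) for a training run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    # Assumption: a partial final batch in each epoch still triggers an optimizer step.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = total_optimizer_steps(
    num_examples=950, per_device_batch=1, grad_accum=8, epochs=2
)
print(f"Effective batch size = {effective_batch}, total steps = {total_steps}")
```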


In [4]:

# --- 4. Load Model and Tokenizer ---
print("Loading model and tokenizer...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    dtype = dtype,
    max_seq_length = max_seq_length,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # Use if loading gated models
)
print("Model and tokenizer loaded.")


Loading model and tokenizer...
==((====))==  Unsloth 2025.10.8: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gpt_oss won't work! Using float32.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Model and tokenizer loaded.


In [5]:

# --- 5. Add LoRA Adapters ---
print("Adding LoRA adapters...")
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_r,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",], # Standard targets for many models
    lora_alpha = lora_alpha,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth", # Crucial for memory saving
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
print("LoRA adapters added.")

Adding LoRA adapters...
Unsloth: Making `model.base_model.model.model` require gradients
LoRA adapters added.
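
To make the size of the adapters concrete, the sketch below counts LoRA parameters by hand. Each adapted linear layer gains two matrices, `A` (r x in_features) and `B` (out_features x r). The layer dimensions are assumptions taken from the published gpt-oss-20b configuration rather than from this notebook, and only the four attention projections are counted. The result matches the 3,981,312 trainable parameters reported in the training banner, which is consistent with adapters being attached to the attention projections only in this run.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed gpt-oss-20b dimensions (not stated in this notebook).
hidden_size = 2880
num_heads, num_kv_heads, head_dim = 64, 8, 64
num_layers = 24
r = 8  # lora_r from the configuration cell

q_out = num_heads * head_dim       # 4096
kv_out = num_kv_heads * head_dim   # 512

per_layer = (
    lora_param_count(hidden_size, q_out, r)     # q_proj
    + lora_param_count(hidden_size, kv_out, r)  # k_proj
    + lora_param_count(hidden_size, kv_out, r)  # v_proj
    + lora_param_count(q_out, hidden_size, r)   # o_proj
)
trainable = per_layer * num_layers
print(f"Trainable LoRA parameters = {trainable:,}")
```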


In [6]:
# --- 6. Load and Prepare Dataset ---
print("Loading and preparing dataset...")

# Define the formatting function
def formatting_prompts_func(examples):
    messages = examples["messages"]
    # Apply chat template to each conversation in the batch
    texts = [tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False) for msgs in messages]
    return { "text": texts, }

# Load your dataset from the JSONL file
try:
    dataset = load_dataset("json", data_files=data_file, split="train")
except Exception as e:
    print(f"Error loading dataset '{data_file}': {e}")
    print("Please ensure the file exists and is valid JSONL.")
    raise  # Re-raise to stop the run; exit() would shut down the notebook kernel

# Apply the formatting function to create the "text" column
formatted_dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
    desc="Formatting prompts",
)
print(f"Dataset structure after formatting: {formatted_dataset}")

Loading and preparing dataset...


Generating train split: 0 examples [00:00, ? examples/s]

Formatting prompts:   0%|          | 0/950 [00:00<?, ? examples/s]

Dataset structure after formatting: Dataset({
    features: ['messages', 'text'],
    num_rows: 950
})
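
The formatting function expects every JSONL line to hold a `messages` list of role/content dictionaries. The sketch below shows one such record and a small check you could run on a line before training. The validator `validate_record`, the sample reply, and the set of allowed roles are hypothetical and only illustrate the expected shape.

```python
import json

# Assumed role names; adjust to whatever your exported data actually uses.
ALLOWED_ROLES = {"system", "developer", "user", "assistant"}


def validate_record(line):
    """Parse one JSONL line and check that it has a usable "messages" list."""
    record = json.loads(line)
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError('record needs a non-empty "messages" list')
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"message {i} has an unexpected role: {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError(f"message {i} needs string content")
    return messages


sample_line = json.dumps({
    "messages": [
        {"role": "user", "content": "Hi, I need a camera for wildlife, not too heavy."},
        {"role": "assistant", "content": "A lightweight mirrorless body is a good start."},
    ]
})
messages = validate_record(sample_line)
print(f"{len(messages)} valid messages")
```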


In [7]:
# --- 7. Configure Trainer ---
print("Configuring SFTTrainer...")
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_dataset,  # Dataset that contains the "text" column
    dataset_text_field = "text",        # Column holding the formatted text
    max_seq_length = max_seq_length,
    # packing = True, # Left disabled: packing often requires pre-tokenization,
                      # so start without it when using assistant_only_loss.
    dataset_num_proc = 2,
    args = SFTConfig(
        per_device_train_batch_size = per_device_train_batch_size,
        gradient_accumulation_steps = gradient_accumulation_steps,
        warmup_steps = 10,
        num_train_epochs = num_train_epochs,
        # max_steps = 30, # Use max_steps for quick testing instead of epochs
        learning_rate = learning_rate,
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = output_dir,
        report_to = "none",
        save_strategy = "epoch",
        # save_steps = 100,
        assistant_only_loss = True, # Compute the loss on assistant tokens only
        # bf16 = torch.cuda.is_bf16_supported(),
        # tf32 = torch.cuda.is_tf32_supported(),
    ),
)
print("Trainer configured.")

Configuring SFTTrainer...
Unsloth: Switching to float32 training since model cannot work with float16


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/950 [00:00<?, ? examples/s]

Trainer configured.
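
With `warmup_steps = 10` and `lr_scheduler_type = "linear"`, the learning rate ramps up to its peak and then decays linearly to zero. The sketch below reproduces that shape in plain Python. The function `linear_warmup_lr` is hypothetical and follows the usual definition of a linear schedule with warmup; it does not read values back from the trainer. The total of 238 steps comes from the training banner.

```python
def linear_warmup_lr(step, base_lr=2e-4, warmup_steps=10, total_steps=238):
    """Learning rate at a given optimizer step for linear warmup plus linear decay."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Decay from base_lr at the end of warmup down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 5, 10, 124, 238):
    print(f"step {step:>3}: lr = {linear_warmup_lr(step):.2e}")
```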


In [8]:

# --- 8. Memory & GPU Check ---
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved before training.")


GPU = Tesla T4. Max memory = 14.741 GB.
12.877 GB of memory reserved before training.
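
The two figures above show how little headroom is left on a T4 once the model is loaded. The sketch below repeats the byte-to-GB conversion used in the cell and computes the remaining headroom from the printed values. The helper `bytes_to_gb` is hypothetical, and the numbers are copied from the output above rather than queried from a GPU.

```python
def bytes_to_gb(num_bytes):
    """Convert bytes to GB (1024**3 bytes each), rounded as in the cell above."""
    return round(num_bytes / 1024 / 1024 / 1024, 3)


# Values copied from the printed output above.
max_memory_gb = 14.741
reserved_gb = 12.877

headroom_gb = round(max_memory_gb - reserved_gb, 3)
reserved_pct = reserved_gb / max_memory_gb * 100

print(f"Headroom before training = {headroom_gb} GB")
print(f"Reserved before training = {reserved_pct:.1f} % of max memory")
```

With so little memory free, the small batch size and gradient checkpointing chosen earlier matter for fitting the run on this GPU.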


In [None]:

# --- 9. Train ---
print("Starting training...")
trainer_stats = trainer.train()
print("Training finished.")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998, 'pad_token_id': 200017}.


Starting training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 950 | Num Epochs = 2 | Total steps = 238
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 3,981,312 of 20,918,738,496 (0.02% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
5,6.3477
10,4.5337


In [None]:
# --- 10. Post-Training Stats ---
try:
    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
    used_percentage = round(used_memory / max_memory * 100, 3)
    lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
    print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
    print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
    print(f"Peak reserved memory = {used_memory} GB.")
    print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
    print(f"Peak reserved memory % of max memory = {used_percentage} %.")
    print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
except KeyError:
    print("Could not retrieve training runtime metrics.")
except Exception as e:
    print(f"An error occurred while printing memory stats: {e}")


In [None]:
# --- 11. Save the Fine-tuned Model (LoRA Adapters) ---
print("Saving LoRA adapters...")
save_directory = "finetuned_model_adapters"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory) # Save tokenizer too
print(f"LoRA adapters saved to {save_directory}")

# --- Optional: Clean up memory ---
del model
del trainer
del dataset
del formatted_dataset
gc.collect()
torch.cuda.empty_cache()
gc.collect()
print("Cleaned up memory.")

In [None]:
# --- 12. (Optional) Load and Run Inference ---
run_inference = False # Set to True to run a quick test after saving

if run_inference:
    print("\n--- Running Inference Test ---")
    try:
        from unsloth import FastLanguageModel # Re-import if needed

        # Load the base model together with the saved LoRA adapters
        print("Loading fine-tuned model for inference...")
        inf_model, inf_tokenizer = FastLanguageModel.from_pretrained(
            model_name = save_directory, # Load from where you saved adapters
            max_seq_length = max_seq_length,
            dtype = dtype,
            load_in_4bit = load_in_4bit,
        )
        FastLanguageModel.for_inference(inf_model) # Optimize for inference
        print("Model loaded for inference.")

        # Example inference prompt (modify as needed)
        messages = [
            # IMPORTANT: For inference, DON'T include the developer prompt
            # if you baked it in during training.
            {"role": "user", "content": "Hi, I need a camera for wildlife, not too heavy."},
            # Add more history if needed for context testing
        ]

        inputs = inf_tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Add assistant prompt marker
            return_tensors = "pt",
            return_dict = True,
            reasoning_effort = "medium", # Choose reasoning effort
        ).to("cuda")

        print("\nGenerating response...")
        streamer = TextStreamer(inf_tokenizer)
        _ = inf_model.generate(
            **inputs,
            max_new_tokens = 128, # Adjust as needed
            streamer = streamer,
            use_cache = True,
            # Common generation parameters (optional)
            # temperature=0.7,
            # top_p=0.9,
            # do_sample=True,
            )
        print("\nInference complete.")

    except Exception as e:
        print(f"Error during inference test: {e}")
        print("Skipping inference.")

print("\n--- Script Finished ---")