To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News


Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [21]:
%%capture
import os, importlib.util
!pip install --upgrade -qqq uv
if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):
    try: import numpy, PIL; get_numpy = f"numpy=={numpy.__version__}"; get_pil = f"pillow=={PIL.__version__}"
    except: get_numpy = "numpy"; get_pil = "pillow"
    !uv pip install -qqq \
        "torch>=2.8.0" "triton>=3.4.0" {get_numpy} {get_pil} torchvision bitsandbytes "transformers==4.56.2" \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
        git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels
elif importlib.util.find_spec("unsloth") is None:
    !uv pip install -qqq unsloth
# Ensure unsloth and unsloth_zoo are updated to the latest version
!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo

### Unsloth

We're about to demonstrate the power of the new OpenAI GPT-OSS 20B model through a finetuning example. To use our `MXFP4` inference example, use this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/GPT_OSS_MXFP4_(20B)-Inference.ipynb) instead.

In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024
dtype = None

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    dtype = dtype, # None for auto detection
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.3: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gpt_oss won't work! Using float32.


model.safetensors.index.json: 0.00B [00:00, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.37G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.16G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/165 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/27.9M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/446 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

We now add LoRA adapters for parameter-efficient finetuning. This lets us train only a small fraction of the parameters (about 0.02% of the model in this run, as the training log reports).

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Making `model.base_model.model.model` require gradients
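
To see why the LoRA step above trains so few parameters, here is a minimal arithmetic sketch. It assumes the standard LoRA construction, where a `d_in x d_out` linear layer gets two low-rank factors of rank `r`. The layer size `4096 x 4096` is hypothetical and chosen only for illustration; the two totals at the end are the counts printed in this notebook's training log.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA learns two low-rank factors, A (r x d_in) and B (d_out x r),
    while the original weight matrix stays frozen.
    """
    return r * (d_in + d_out)


# Hypothetical layer size, used only to show the scale of the saving.
d_in, d_out, r = 4096, 4096, 8
full = d_in * d_out
adapter = lora_params(d_in, d_out, r)
print(f"full layer: {full:,} weights, LoRA adapter: {adapter:,} ({adapter / full:.2%})")

# Counts reported in this notebook's training log.
trainable, total = 3_981_312, 20_918_738_496
trainable_pct = trainable / total * 100
print(f"trainable share of the whole model: {trainable_pct:.2f}%")
```

Raising `r` grows the adapter linearly, which is why the suggested ranks (8, 16, 32, 64, 128) trade memory for capacity in even steps.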


### Reasoning Effort
The `gpt-oss` models from OpenAI let users adjust the model's "reasoning effort". This controls the trade-off between the model's performance and its response speed (latency) by changing the number of tokens the model uses to think.

----

The `gpt-oss` models offer three distinct levels of reasoning effort you can choose from:

* **Low**: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
* **Medium**: A balance between performance and speed.
* **High**: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.

In [6]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low", # **NEW!** Set reasoning effort to low, medium or high
).to("cuda")

_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-16

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>Equation: x^5 + 3x^4 - 10 = 3 => x^5 + 3x^4 -13=0. Solve for real roots. Maybe factor? try integer roots: try x=1:1+3-13=-9 no. x=2


Changing `reasoning_effort` to `medium` makes the model think longer. You will usually need to raise `max_new_tokens` to make room for the extra generated tokens, but the answer tends to be better and more accurate.

In [7]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium", # **NEW!** Set reasoning effort to low, medium or high
).to("cuda")

_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-16

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>The user asks: "Solve x^5 + 3x^4 - 10 = 3." Probably solve for x? The equation is x^5 + 3x^4 - 10 = 3. Move 3 to left: x^5 + 3x^4


Lastly, we test with `reasoning_effort` set to `high`.

In [8]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high", # **NEW!** Set reasoning effort to low, medium or high
).to("cuda")

_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-16

Reasoning: high

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>The prompt: "Solve x^5 + 3x^4 - 10 = 3." Solve the equation. So we set x^5 + 3x^4 - 10 = 3. That simplifies: x^5 + 3x^4 - 10 = 


<a name="Data"></a>
### Data Prep

The original version of this notebook uses the `HuggingFaceH4/Multilingual-Thinking` dataset, which contains chain-of-thought reasoning examples derived from user questions that have been translated from English into four other languages. It is the same dataset referenced in OpenAI's [cookbook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) for fine-tuning. In this version we instead load a custom dataset from two local JSONL files, where each row holds a `messages` list.

In [9]:
def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

from datasets import load_dataset

dataset = load_dataset("json", data_files={"train": ["/content/dataset_openai_20251115_122647.jsonl", "/content/dataset_openai_20251115_141330.jsonl"]}, split="train")
dataset

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['messages'],
    num_rows: 1677
})

To format our dataset, we apply our version of the GPT-OSS chat template, tokenize each conversation, and mask every token that is not part of an assistant response.

In [24]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)

# Custom function to tokenize and mask labels, replacing unsloth's train_on_responses_only functionality

# Encode the role markers once, instead of once per example or per token
assistant_start_tokens = tokenizer.encode("<|start|>assistant", add_special_tokens=False)
user_start_tokens = tokenizer.encode("<|start|>user", add_special_tokens=False)

def generate_tokenized_and_masked_examples(examples):
    input_ids_batch = []
    labels_batch = []

    for messages_list in examples["messages"]:
        # Apply chat template and tokenize to get the full sequence
        tokenized_output = tokenizer.apply_chat_template(
            messages_list,
            add_generation_prompt=False, # We want the full conversation, not just prompt
            return_tensors="pt",
            max_length=max_seq_length,
            truncation=True,
            padding=False, # Padding will be handled by the DataCollator later
        )
        input_ids = tokenized_output[0].tolist()

        # Initialize labels to -100 (ignore loss by default)
        labels = [-100] * len(input_ids)

        current_idx = 0
        while current_idx < len(input_ids):
            # Check whether "<|start|>assistant" begins at the current position
            if input_ids[current_idx:current_idx + len(assistant_start_tokens)] == assistant_start_tokens:
                response_start_idx = current_idx + len(assistant_start_tokens)

                # The response ends at the tokenizer's EOS token or at the next
                # "<|start|>user" marker, whichever comes first
                response_end_idx = len(input_ids)
                for i in range(response_start_idx, len(input_ids)):
                    if input_ids[i] == tokenizer.eos_token_id:
                        response_end_idx = i
                        break
                    if input_ids[i:i + len(user_start_tokens)] == user_start_tokens:
                        response_end_idx = i
                        break

                # Copy input_ids to labels for the identified response segment
                labels[response_start_idx:response_end_idx] = input_ids[response_start_idx:response_end_idx]

                current_idx = response_end_idx # Continue searching from the end of this response
            else:
                current_idx += 1

        input_ids_batch.append(input_ids)
        labels_batch.append(labels)

    return {"input_ids": input_ids_batch, "labels": labels_batch}

# Apply this new mapping function to create input_ids and labels
dataset = dataset.map(
    generate_tokenized_and_masked_examples,
    batched=True,
    remove_columns=dataset.column_names # Remove original columns, keep only 'input_ids', 'labels'
)

Map:   0%|          | 0/1677 [00:00<?, ? examples/s]
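
The masking idea used in the mapping function above can be tried on plain integers, without a tokenizer. This is a simplified sketch, not the notebook's exact function: the token ids are made up, and unlike the cell above it also keeps the end marker as a training target.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def mask_non_assistant(input_ids, assistant_header, end_id):
    """Return labels that keep only assistant turns and ignore everything else."""
    labels = [IGNORE_INDEX] * len(input_ids)
    n = len(assistant_header)
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + n] == assistant_header:
            j = i + n
            # Copy the response tokens up to the end marker.
            while j < len(input_ids) and input_ids[j] != end_id:
                labels[j] = input_ids[j]
                j += 1
            # Keep the end marker too, so the model learns when to stop.
            if j < len(input_ids):
                labels[j] = input_ids[j]
                j += 1
            i = j
        else:
            i += 1
    return labels


# Made-up token ids standing in for the special tokens.
START, END, MESSAGE, USER, ASSISTANT = 1, 2, 3, 10, 11
toy_ids = [START, USER, MESSAGE, 100, 101, END,
           START, ASSISTANT, MESSAGE, 200, 201, END]
toy_labels = mask_non_assistant(toy_ids, [START, ASSISTANT], END)
print(toy_labels)
```

Only the positions after the assistant header carry real token ids; the user turn is all `-100`, so it adds nothing to the loss.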

Let's take a look at the dataset, and check what the 1st example shows

In [11]:
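# The mapped dataset only keeps 'input_ids' and 'labels', so decode the ids back to text
print(tokenizer.decode(dataset[0]["input_ids"]))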

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-16

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>what is the purpose of the network safeworking rules and procedures?<|end|><|start|>assistant<|message|>the network safeworking rules and procedures are designed to ensure the safety and efficiency of rail operations within the arc infrastructure network. they provide a structured framework for developing, managing, and implementing rules to maintain safe rail practices.<|end|><|start|>user<|message|>who authorizes changes to these rules?<|end|><|start|>assistant<|message|>changes and updates to the network safeworking rules and procedures are authorized by the arc infrastructure rule book committee.<|return|>


What is unique about GPT-OSS is that it uses the OpenAI [Harmony](https://github.com/openai/harmony) format, which supports conversation structures, reasoning output, and tool calling.
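
The turn layout visible in the printed example can be sketched in a few lines. This is a deliberately simplified illustration, not the real chat template: the function name `render_harmony_like` is hypothetical, and the actual template also adds the system message, the reasoning level, and channels.

```python
def render_harmony_like(messages):
    """Render chat messages in the simplified turn layout seen in the example above."""
    parts = []
    last = len(messages) - 1
    for i, msg in enumerate(messages):
        # In the printed example the final assistant turn closes with <|return|>,
        # while every earlier turn closes with <|end|>.
        closer = "<|return|>" if i == last and msg["role"] == "assistant" else "<|end|>"
        parts.append(f"<|start|>{msg['role']}<|message|>{msg['content']}{closer}")
    return "".join(parts)


example = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
rendered = render_harmony_like(example)
print(rendered)
```

For real training and inference, keep using `tokenizer.apply_chat_template`, which produces the full format.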

<a name="Train"></a>
### Train the model
Now let's train our model. We do 30 steps to speed things up, but for a full run you can set `num_train_epochs = 1` and `max_steps = None`.

In [36]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 1e-5, # Adjusted learning rate to a lower value
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use TrackIO/WandB etc
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


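The stock notebook uses Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs, which helps increase the accuracy of finetunes and lowers the loss. In this version that call raised an `AttributeError`, so the custom masking function from the data prep step does the same job.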

In [25]:
# The train_on_responses_only function was causing an AttributeError and has been replaced by custom masking.
# The dataset now directly contains 'input_ids' and 'labels' with appropriate masking.

Let's verify that the instruction part is masked. First, print the example at index 100.

In [28]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

"<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-11-16\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.\nCalls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>when should a handsignal be maintained for an all clear signal?<|end|><|start|>assistant<|message|>an all clear handsignal must be continued until it is acknowledged by the rail traffic crew. This ensures that the crew is aware of the signal and can proceed safely.<|end|><|start|>user<|message|>what if the signal is a stop handsignal?<|end|><|start|>assistant<|message|>for a stop handsignal, it must be maintained until the rail traffic has come to a complete stop or until the handsignaller displays another handsignal.<|return|>"

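Now let's print the labels with the masked positions blanked out. You should see only the assistant's answers; an entirely blank string means every label in that example is `-100`, so it contributes nothing to the loss.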

In [29]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                                                                                                                                                  '

In [19]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
12.816 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [37]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,677 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 3,981,312 of 20,918,738,496 (0.02% trained)


Step,Training Loss
1,15.6509
2,15.1378
3,14.729
4,14.0001
5,14.6562
6,14.9887
7,13.796
8,14.3435
9,13.0505
10,13.1698


Unsloth: Will smartly offload gradients to save VRAM!


In [38]:
print(trainer_stats.metrics)

{'train_runtime': 488.7294, 'train_samples_per_second': 0.246, 'train_steps_per_second': 0.061, 'total_flos': 3780237637287936.0, 'train_loss': 12.27776606877645, 'epoch': 0.07155635062611806}
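
The metrics above can be cross-checked by hand. The sketch below uses only numbers from this notebook (the `SFTConfig` values, the dataset size, and the reported runtime) to recompute the effective batch size, the fraction of an epoch covered, and the throughput.

```python
# Values taken from the SFTConfig cell, the dataset summary, and the metrics above.
per_device_batch = 1
grad_accum = 4
num_gpus = 1
max_steps = 30
num_examples = 1677
train_runtime_s = 488.7294

# Each optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
epochs = examples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} of {num_examples} ({epochs:.4f} epochs)")
print(f"samples per second: {examples_seen / train_runtime_s:.3f}")
print(f"steps per second: {max_steps / train_runtime_s:.3f}")
```

With 30 steps the run covers about 7% of the dataset, so a full pass would need roughly 420 steps at this batch size.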


In [39]:
import numpy as np

found_nan_inf = False
for i, item in enumerate(trainer.train_dataset):
    # Check input_ids
    input_ids_np = np.array(item['input_ids'], dtype=float)
    if np.isnan(input_ids_np).any() or np.isinf(input_ids_np).any():
        print(f"NaN or Inf found in input_ids at index {i}")
        found_nan_inf = True

    # Check labels
    labels_np = np.array(item['labels'], dtype=float)
    if np.isnan(labels_np).any() or np.isinf(labels_np).any():
        print(f"NaN or Inf found in labels at index {i}")
        found_nan_inf = True

    if found_nan_inf and i < 5: # Limit output for brevity if many are found
        # print(f"  Input IDs: {item['input_ids']}")
        # print(f"  Labels: {item['labels']}")
        pass

if not found_nan_inf:
    print("No NaN or Inf values found in input_ids or labels across the dataset.")
else:
    print("Further investigation needed for samples with NaN/Inf values.")

No NaN or Inf values found in input_ids or labels across the dataset.


In [40]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

488.7294 seconds used for training.
8.15 minutes used for training.
Peak reserved memory = 12.816 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 86.941 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the user message below.

In [43]:
messages = [
    {"role": "user", "content": "YOu work for Arc Infrastructure, Western Australia. what's the longest time lookout can be on duty according to Arc's Rules?"},
    {"role": "assistant", "content": ""},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",
).to("cuda")
from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-16

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>YOu work for Arc Infrastructure, Western Australia. what's the longest time lookout can be on duty according to Arc's Rules?<|end|><|start|>assistant<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>The user asks: "You work for Arc Infrastructure, Western Australia. what's the longest time lookout can be on duty according to Arc's Rules?"

They are presumably referring to Arc Infrastructure rules in Western Australia about how long a lookout (someone on duty, likely road traffic?) can be on duty. Possibly about



<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** Currently, finetunes can only be loaded via Unsloth. We're working on vLLM and GGUF exporting!

In [44]:
model.save_pretrained("finetuned_model")
# model.push_to_hub("hf_username/finetuned_model", token = "hf_...") # Save to HF

To run the finetuned model, you can do the below after setting `if False` to `if True` in a new instance.

In [45]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "finetuned_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 1024,
        dtype = None,
        load_in_4bit = True,
    )

messages = [
    {"role": "assistant", "content": " You are an expert in Australian rail safe working, an incident investigator, and a senior safe working planner for Arc Infrastructure."},
    {"role": "user", "content": "what should the network controller do if a train order is to be fulfilled?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",
).to("cuda")
from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 64, streamer = TextStreamer(tokenizer))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-16

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>assistant<|message|> You are an expert in Australian rail safe working, an incident investigator, and a senior safe working planner for Arc Infrastructure.<|end|><|start|>user<|message|>what should the network controller do if a train order is to be fulfilled?<|end|><|start|>assistant<|channel|>analysis<|message|>The user: "what should the network controller do if a train order is to be fulfilled?"

It seems like a question about a procedure. The user states they're looking for guidance or steps for a network controller (likely in context of rail operations in Australia) when a train order is to be fulfilled. They


### Saving to float16 for vLLM or mxfp4

We also support saving to `float16` or `mxfp4` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [46]:
# Merge and push to hub in mxfp4 4bit format
if True:
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "mxfp4")
if False: model.push_to_hub_merged("repo_id/repo_name", tokenizer, token = "hf...", save_method = "mxfp4")

# Merge and push to hub in 16bit
if False:
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_16bit")
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/gpt-oss-finetune", tokenizer, save_method = "merged_16bit", token = "")

config.json: 0.00B [00:00, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Checking cache directory for required files...
Cache check failed: model-00000-of-00002.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00000-of-00002.safetensors:   0%|          | 0.00/4.79G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  33%|███▎      | 1/3 [09:30<19:00, 570.09s/it]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.80G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  67%|██████▋   | 2/3 [12:21<05:35, 335.70s/it]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.17G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 3/3 [14:47<00:00, 295.94s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into mxfp4: 100%|██████████| 3/3 [01:38<00:00, 32.99s/it]


Unsloth: Merge process complete. Saved to `/content/finetuned_model`


In [None]:
!zip -r /content/finetuned_model.zip /content/finetuned_model
from google.colab import files
files.download('/content/finetuned_model.zip')

  adding: content/finetuned_model/ (stored 0%)
  adding: content/finetuned_model/README.md (deflated 65%)
  adding: content/finetuned_model/.cache/ (stored 0%)
  adding: content/finetuned_model/.cache/huggingface/ (stored 0%)
  adding: content/finetuned_model/.cache/huggingface/download/ (stored 0%)
  adding: content/finetuned_model/.cache/huggingface/download/model-00001-of-00002.safetensors.lock (stored 0%)
  adding: content/finetuned_model/.cache/huggingface/download/model-00002-of-00002.safetensors.lock (stored 0%)
  adding: content/finetuned_model/.cache/huggingface/download/tokenizer.model.lock (stored 0%)
  adding: content/finetuned_model/.cache/huggingface/download/model.safetensors.index.json.lock (stored 0%)
  adding: content/finetuned_model/.cache/huggingface/download/model-00000-of-00002.safetensors.metadata (deflated 29%)
  adding: content/finetuned_model/.cache/huggingface/download/model-00000-of-00002.safetensors.lock (stored 0%)
  adding: content/finetuned_model/.cache/

In [48]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!


  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).


# Task
Modify the code to load a custom dataset from the files `/content/dataset_openai_20251115_122647.jsonl` and `/content/dataset_openai_20251115_141330.jsonl`, then verify that the loaded dataset contains a 'messages' column.

## Load Custom Dataset

### Subtask:
Modify the code cell to load the custom dataset from the specified local .jsonl files.


**Reasoning**:
The subtask requires modifying the `load_dataset` call in the data prep cell to load data from the specified local JSONL files. The call is updated to use the `json` builder with the paths to the local `.jsonl` files.



In [22]:
def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

from datasets import load_dataset

dataset = load_dataset("json", data_files={"train": ["/content/dataset_openai_20251115_122647.jsonl", "/content/dataset_openai_20251115_141330.jsonl"]}, split="train")
dataset

Dataset({
    features: ['messages'],
    num_rows: 1677
})

## Verify Dataset Structure

### Subtask:
Inspect the structure of the loaded dataset to ensure it contains a 'messages' column.
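
One way to carry out this inspection is to validate the JSONL files directly, before they reach `load_dataset`. The sketch below is an assumption about what a reasonable check looks like, not part of the original notebook: the helper name `check_messages_file` is hypothetical, and the demo writes a small temporary file instead of reading the real dataset paths.

```python
import json
import os
import tempfile


def check_messages_file(path):
    """Return the number of rows, raising ValueError on the first malformed row."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            messages = row.get("messages") if isinstance(row, dict) else None
            if not isinstance(messages, list) or not messages:
                raise ValueError(f"line {line_no}: missing or empty 'messages'")
            for msg in messages:
                # Every message needs at least a role and some content.
                if not isinstance(msg, dict) or not {"role", "content"} <= msg.keys():
                    raise ValueError(f"line {line_no}: message lacks 'role' or 'content'")
            count += 1
    return count


# Demo on a throwaway file with two well-formed rows.
sample_rows = [
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}]},
    {"messages": [{"role": "user", "content": "bye"},
                  {"role": "assistant", "content": "goodbye"}]},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for row in sample_rows:
        tmp.write(json.dumps(row) + "\n")
demo_path = tmp.name

n_rows = check_messages_file(demo_path)
print(f"{n_rows} valid rows")
os.remove(demo_path)
```

Once the dataset is loaded, the column itself can be confirmed with `assert "messages" in dataset.column_names`.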


## Summary:

### Data Analysis Key Findings
*   A custom dataset was successfully loaded from two local JSONL files (`/content/dataset_openai_20251115_122647.jsonl` and `/content/dataset_openai_20251115_141330.jsonl`) into a "train" split.
*   The loaded dataset contains `1677` rows and includes the expected `messages` column.

### Insights or Next Steps
*   The successful loading and verification of the `messages` column confirm that the dataset is ready for subsequent processing steps, such as formatting or tokenization.
