<a href="https://colab.research.google.com/github/procop07/LLM/blob/main/nb/GPT_OSS_MXFP4_(20B)-Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**NEW** Unsloth now supports training the new **gpt-oss** model from OpenAI! You can start finetuning gpt-oss for free with our **[Colab notebook](https://x.com/UnslothAI/status/1953896997867729075)**!

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
# We're installing the latest Torch, Triton, OpenAI's Triton kernels, Transformers and Unsloth!
!pip install --upgrade -qqq uv
try: import numpy; install_numpy = f"numpy=={numpy.__version__}"
except: install_numpy = "numpy"
!uv pip install -qqq \
    "torch>=2.8.0" "triton>=3.4.0" {install_numpy} \
    "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
    "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
    torchvision bitsandbytes \
    git+https://github.com/huggingface/transformers \
    git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels


### Unsloth

We're about to demonstrate the power of the new OpenAI GPT-OSS 20B model through an inference example. For our `bnb-4bit` version, use this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/GPT_OSS_BNB_(20B)-Inference.ipynb) instead.

**We're using OpenAI's MXFP4 Triton kernels combined with Unsloth's kernels!**

In [4]:
from unsloth import FastLanguageModel
import torch
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
] # More models at https://huggingface.co/unsloth

# Using bnb-4bit model with explicit load_in_4bit parameter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b-unsloth-bnb-4bit",
    load_in_4bit=True,
    dtype=None, # None for auto detection
    max_seq_length=4096,
)

==((====))==  Unsloth 2025.8.4: Fast Gpt_Oss patching. Transformers: 4.56.0.dev0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.37G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.16G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/165 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/27.9M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/446 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

### Reasoning Effort
The `gpt-oss` models from OpenAI include a feature that allows users to adjust the model's "reasoning effort." This gives you control over the trade-off between the model's performance and its response speed (latency), which is governed by the number of tokens the model uses to think.

----

The `gpt-oss` models offer three distinct levels of reasoning effort you can choose from:

* **Low**: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
* **Medium**: A balance between performance and speed.
* **High**: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.
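
As a rough illustration of how the three levels might be handled in code, the sketch below validates an effort string and picks a generation budget for it. The helper name `budget_for_effort` is hypothetical; the token budgets simply mirror the `max_new_tokens` values this notebook uses for each level (512, 1024, 2048) and are not recommendations from OpenAI or Unsloth.

```python
# Hypothetical helper: not part of Unsloth or Transformers.
# Budgets mirror the max_new_tokens values used in this notebook's cells.
EFFORT_BUDGETS = {"low": 512, "medium": 1024, "high": 2048}


def budget_for_effort(effort: str) -> int:
    """Return a max_new_tokens budget for a reasoning-effort level."""
    key = effort.strip().lower()
    if key not in EFFORT_BUDGETS:
        raise ValueError(
            f"reasoning_effort must be one of {sorted(EFFORT_BUDGETS)}, got {effort!r}"
        )
    return EFFORT_BUDGETS[key]


for level in ("low", "medium", "high"):
    print(level, budget_for_effort(level))
```

The returned value can be passed as `max_new_tokens` to `model.generate`, with the same level given as `reasoning_effort` to `tokenizer.apply_chat_template`.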

In [5]:
# Step A: Mount Google Drive, install dependencies, check system
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Install dependencies and git-lfs
!pip install -q transformers unsloth peft bitsandbytes accelerate
!apt-get install -y git-lfs
!git lfs install

# Freeze requirements
!pip freeze > /content/drive/MyDrive/requirements.txt

# Check GPU (T4)
!nvidia-smi

# Check imports
import os
try:
    from unsloth import FastLanguageModel
    print("✓ Unsloth imported successfully")
except ImportError as e:
    print(f"✗ Unsloth import failed: {e}")

try:
    import transformers
    print(f"✓ Transformers version: {transformers.__version__}")
except ImportError as e:
    print(f"✗ Transformers import failed: {e}")

try:
    import peft
    print(f"✓ PEFT version: {peft.__version__}")
except ImportError as e:
    print(f"✗ PEFT import failed: {e}")

print("Step A complete: Drive mounted, dependencies installed, system checked")

ValueError: mount failed
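
The `ValueError: mount failed` above stops the whole cell. One possible way to keep going is to fall back to a local directory when the mount fails, which is what the later cells in this notebook do by hand. Below is a minimal sketch, assuming a hypothetical helper `resolve_base_dir`; the mount function is passed in as an argument so the logic can be exercised outside Colab (in Colab you would pass `drive.mount`).

```python
import os
import tempfile


def resolve_base_dir(mount_fn, mount_point, drive_subdir, local_dir):
    """Try to mount Drive; fall back to a local directory on failure.

    mount_fn is any callable taking the mount point (in Colab:
    google.colab.drive.mount). Returns the directory to write results to.
    """
    try:
        mount_fn(mount_point)
        base = os.path.join(mount_point, drive_subdir)
    except Exception as exc:  # drive.mount raised ValueError in the run above
        print(f"Drive mount failed ({exc}); using local storage instead")
        base = local_dir
    os.makedirs(base, exist_ok=True)
    return base


# Demo with a stand-in mount function that fails like the run above.
def failing_mount(_mount_point):
    raise ValueError("mount failed")


local = os.path.join(tempfile.gettempdir(), "gpt_oss_demo")
print(resolve_base_dir(failing_mount, "/content/drive", "MyDrive/Manus", local))
```

Keep in mind that files written under a local path such as `/content` do not survive a Colab runtime reset, so anything worth keeping should be downloaded or copied elsewhere.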

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 512, streamer = TextStreamer(tokenizer))

In [6]:
# STEP A: Alternative setup without Google Drive
print("=== STEP A STARTED (local version) ===")

# Install additional dependencies
!pip install -q huggingface_hub
!apt-get install -y git-lfs
!git lfs install

# Save requirements locally
!pip freeze > /tmp/requirements.txt

# Check GPU (T4)
print("\n=== GPU check ===")
!nvidia-smi

# Check imports
print("\n=== Import check ===")
import os
try:
    from unsloth import FastLanguageModel
    print("✓ Unsloth imported successfully")
except ImportError as e:
    print(f"✗ Unsloth import failed: {e}")
try:
    import transformers
    print(f"✓ Transformers version: {transformers.__version__}")
except ImportError as e:
    print(f"✗ Transformers import failed: {e}")
try:
    import peft
    print(f"✓ PEFT version: {peft.__version__}")
except ImportError as e:
    print(f"✗ PEFT import failed: {e}")

print("\n=== STEP A COMPLETE (local) ===")

=== STEP A STARTED ===


KeyboardInterrupt: 

In [7]:
# STEP B: Create the folder structure in Drive
print("=== STEP B STARTED ===")

import os

# Base path
base_path = "/content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B"

# Folders to create
folders_to_create = [
    base_path,
    f"{base_path}/models",
    f"{base_path}/examples",
    f"{base_path}/logs",
    f"{base_path}/configs"
]

for folder in folders_to_create:
    os.makedirs(folder, exist_ok=True)
    print(f"✓ Folder created: {folder}")

# Verify the structure
print("\n=== Folder structure check ===")
def print_directory_tree(path, prefix=""):
    """Print directory tree structure"""
    if os.path.exists(path):
        items = sorted(os.listdir(path))
        for i, item in enumerate(items):
            item_path = os.path.join(path, item)
            is_last = i == len(items) - 1
            current_prefix = "└── " if is_last else "├── "
            print(f"{prefix}{current_prefix}{item}")
            if os.path.isdir(item_path):
                extension = "    " if is_last else "│   "
                print_directory_tree(item_path, prefix + extension)

print(f"Folder structure in {base_path}:")
print_directory_tree(base_path)

print("\n=== STEP B COMPLETE ===")

=== STEP B STARTED ===
✓ Folder created: /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B
✓ Folder created: /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/models
✓ Folder created: /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/examples
✓ Folder created: /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/logs
✓ Folder created: /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/configs

=== Folder structure check ===
Folder structure in /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B:
├── configs
├── examples
├── logs
└── models

=== STEP B COMPLETE ===


In [8]:
# STEP D: Smoke inference - test inference before LoRA
print("=== STEP D STARTED ===")
import time
import io
from contextlib import redirect_stdout
from transformers import TextStreamer

# Folder for saved results
examples_path = "/content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/examples"

# Test prompt for smoke inference
test_prompt = "What is the capital of France and why is it important?"

messages = [
    {"role": "user", "content": test_prompt}
]

print(f"Prompt: {test_prompt}")
print("Starting smoke inference...")

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

# Start timing
start_time = time.time()

# Capture the streamed output instead of printing it
f = io.StringIO()
with redirect_stdout(f):
    _ = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        streamer=TextStreamer(tokenizer, skip_prompt=True)
    )

end_time = time.time()
inference_time = end_time - start_time

# Generate again without TextStreamer to get a response to save
# (this second, untimed run is sampled, so it can differ from the first)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

# Decode the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Remove the prompt from the response
if test_prompt in response:
    response = response.split(test_prompt)[-1].strip()

# Save the result to a file
with open(f"{examples_path}/before_lora.txt", "w", encoding="utf-8") as f:
    f.write(f"=== SMOKE INFERENCE (BEFORE LoRA) ===\n")
    f.write(f"Prompt: {test_prompt}\n")
    f.write(f"Inference time: {inference_time:.2f} seconds\n")
    f.write(f"Response:\n{response}\n")
    f.write(f"\nGenerated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

print(f"\n✓ Smoke inference finished in {inference_time:.2f} seconds")
print(f"✓ Result saved to {examples_path}/before_lora.txt")
print("\n=== STEP D COMPLETE ===")

=== STEP D STARTED ===
Prompt: What is the capital of France and why is it important?
Starting smoke inference...

✓ Smoke inference finished in 295.92 seconds
✓ Result saved to /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/examples/before_lora.txt

=== STEP D COMPLETE ===
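
The cell above removes the prompt by splitting the decoded text on the prompt string, which can fail if the chat template reformats the text or the model repeats the question. A common alternative is to slice off the prompt tokens before decoding. The sketch below shows the idea on plain lists; `new_tokens_only` is a hypothetical helper, and the token IDs are made up.

```python
def new_tokens_only(output_ids, prompt_length):
    """Return only the tokens generated after the prompt.

    For decoder-only models, generate() returns the prompt tokens followed
    by the newly generated ones, so slicing at the prompt length isolates
    the completion.
    """
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds the output length")
    return output_ids[prompt_length:]


# Made-up token IDs: the first three stand in for the prompt.
prompt_ids = [101, 7592, 2088]
output_ids = prompt_ids + [3000, 3001, 3002]
print(new_tokens_only(output_ids, len(prompt_ids)))
```

With the objects in this notebook, the same idea would presumably read `tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)`.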


In [9]:
# === STEP A: Install dependencies WITHOUT mounting Drive ===
print("=== STEP A STARTED ===")
import os

# Install additional dependencies (transformers, unsloth, peft, bitsandbytes, accelerate are already installed)
!pip install -q transformers peft bitsandbytes accelerate

# Check nvidia-smi
print("\n=== GPU check ===")
!nvidia-smi

# Check imports
print("\n=== Import check ===")
try:
    from unsloth import FastLanguageModel
    print("✓ Unsloth imported successfully")
except ImportError as e:
    print(f"✗ Unsloth import failed: {e}")

try:
    import transformers
    print(f"✓ Transformers version: {transformers.__version__}")
except ImportError as e:
    print(f"✗ Transformers import failed: {e}")

try:
    import peft
    print(f"✓ PEFT version: {peft.__version__}")
except ImportError as e:
    print(f"✗ PEFT import failed: {e}")

try:
    import bitsandbytes
    print(f"✓ BitsAndBytes version: {bitsandbytes.__version__}")
except ImportError as e:
    print(f"✗ BitsAndBytes import failed: {e}")

try:
    import accelerate
    print(f"✓ Accelerate version: {accelerate.__version__}")
except ImportError as e:
    print(f"✗ Accelerate import failed: {e}")

print("\n=== STEP A COMPLETE (local) ===")

=== STEP A STARTED ===

=== GPU check ===
Sun Aug 10 04:38:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   71C    P0             31W /   70W |   12514MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
      

In [10]:
# === STEP B: Create the folder structure in /content (local) ===
print("=== STEP B STARTED ===")
import os

# Folders to create in /content (local)
folders_to_create = [
    "/content/models",
    "/content/offload",
    "/content/examples",
    "/content/logs",
    "/content/adapters",
    "/content/tmp"
]

for folder in folders_to_create:
    os.makedirs(folder, exist_ok=True)
    print(f"✓ Folder created: {folder}")

# Verify the folder structure
print("\n=== Folder structure check ===")
for folder in folders_to_create:
    if os.path.exists(folder):
        print(f"✓ {folder} - exists")
    else:
        print(f"✗ {folder} - not found")

print("\n=== STEP B COMPLETE ===")

=== STEP B STARTED ===
✓ Folder created: /content/models
✓ Folder created: /content/offload
✓ Folder created: /content/examples
✓ Folder created: /content/logs
✓ Folder created: /content/adapters
✓ Folder created: /content/tmp

=== Folder structure check ===
✓ /content/models - exists
✓ /content/offload - exists
✓ /content/examples - exists
✓ /content/logs - exists
✓ /content/adapters - exists
✓ /content/tmp - exists

=== STEP B COMPLETE ===


In [11]:
# === FULL RUN OF ALL REMAINING STEPS (D, E, F, H) ===
import time
import os
import torch
from transformers import TextStreamer
import gc

# === STEP D: Smoke inference with the correct parameters ===
print("=== STEP D: SMOKE INFERENCE (correct implementation) ===")

# Use the already loaded model (load_in_4bit=True, device_map already set)
test_prompt = "What is the capital of France and why is it important?"
messages = [
    {"role": "user", "content": test_prompt}
]

start_time = time.time()
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

end_time = time.time()
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
if test_prompt in response:
    response = response.split(test_prompt)[-1].strip()

# Save to /content/examples/
with open("/content/examples/before_lora.txt", "w", encoding="utf-8") as f:
    f.write(f"=== SMOKE INFERENCE (BEFORE LoRA) ===\n")
    f.write(f"Prompt: {test_prompt}\n")
    f.write(f"Inference time: {end_time - start_time:.2f} seconds\n")
    f.write(f"Response:\n{response}\n")
    f.write(f"\nGenerated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

print(f"✓ Smoke inference finished in {end_time - start_time:.2f} seconds")
print(f"✓ Result saved to /content/examples/before_lora.txt")

# === STEP E: Reasoning effort runs ===
print("\n=== STEP E: REASONING EFFORT TESTS ===")
reasoning_prompts = "Solve x^5 + 3x^4 - 10 = 3."
log_entries = []

for effort in ["low", "medium", "high"]:
    print(f"\nRunning reasoning_effort: {effort}")

    messages = [{"role": "user", "content": reasoning_prompts}]
    max_tokens = {"low": 512, "medium": 1024, "high": 2048}[effort]

    start_time = time.time()

    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        reasoning_effort=effort
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id
    )

    end_time = time.time()
    execution_time = end_time - start_time

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if reasoning_prompts in response:
        response = response.split(reasoning_prompts)[-1].strip()

    # Save a separate file for each effort level
    with open(f"/content/examples/reasoning_effort_{effort}.txt", "w", encoding="utf-8") as f:
        f.write(f"=== REASONING EFFORT: {effort.upper()} ===\n")
        f.write(f"Prompt: {reasoning_prompts}\n")
        f.write(f"Max tokens: {max_tokens}\n")
        f.write(f"Execution time: {execution_time:.2f} seconds\n")
        f.write(f"Response:\n{response}\n")
        f.write(f"\nGenerated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

    log_entries.append(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] Reasoning effort {effort}: {execution_time:.2f}s")
    print(f"✓ {effort} finished in {execution_time:.2f} seconds")

# Save the timing log
with open("/content/logs/run_logs.txt", "w", encoding="utf-8") as f:
    f.write("=== RUN LOGS ===\n")
    for entry in log_entries:
        f.write(entry + "\n")

print(f"\n✓ All reasoning effort tests finished")
print(f"✓ Results saved to /content/examples/")
print(f"✓ Timing log saved to /content/logs/run_logs.txt")

print("\n=== STEP E COMPLETE ===")

=== STEP D: SMOKE INFERENCE (correct implementation) ===
✓ Smoke inference finished in 371.87 seconds
✓ Result saved to /content/examples/before_lora.txt

=== STEP E: REASONING EFFORT TESTS ===

Running reasoning_effort: low


AcceleratorError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
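
The error message itself suggests the first debugging step: rerun with `CUDA_LAUNCH_BLOCKING=1` so that the failing kernel is reported at the call that launched it. A device-side assert also leaves the CUDA context unusable for the rest of the process, so later cells that touch the GPU are likely to fail with the same error until the runtime is restarted. Such asserts are often caused by out-of-range indices, but the trace here does not show the cause. A minimal sketch of setting the variable, assuming it runs in a fresh runtime before `torch` initializes CUDA:

```python
import os

# Set this before CUDA is initialized (safest: before importing torch);
# setting it later may not affect the already-running CUDA context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow generation down, so the variable is best removed again once the failing call has been identified.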


In [None]:
# === STEP C: Report that this step is skipped ===
print("=== STEP C STARTED ===")
print("🚧 This step is skipped as instructed")
print("=== STEP C COMPLETE ===")

# === STEP D: Smoke Inference ===
print("\n=== STEP D STARTED ===")
import time
import io
from contextlib import redirect_stdout
from transformers import TextStreamer

# Test prompt for smoke inference
test_prompt = "What is the capital of France and why is it important?"
messages = [
    {"role": "user", "content": test_prompt}
]

print(f"Prompt: {test_prompt}")
print("Starting smoke inference...")

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

# Start timing
start_time = time.time()

# Generate the response to save
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

end_time = time.time()
inference_time = end_time - start_time

# Decode the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Remove the prompt from the response
if test_prompt in response:
    response = response.split(test_prompt)[-1].strip()

# Save the result to a file
with open("/content/examples/before_lora.txt", "w", encoding="utf-8") as f:
    f.write(f"=== SMOKE INFERENCE (BEFORE LoRA) ===\n")
    f.write(f"Prompt: {test_prompt}\n")
    f.write(f"Inference time: {inference_time:.2f} seconds\n")
    f.write(f"Response:\n{response}\n")
    f.write(f"\nGenerated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

print(f"\n✓ Smoke inference finished in {inference_time:.2f} seconds")
print(f"✓ Result saved to /content/examples/before_lora.txt")
print("\n=== STEP D COMPLETE ===")

In [None]:
# === STEP F: Minimal LoRA demo training (max_steps=10) ===
print("=== STEP F STARTED ===")

from peft import LoraConfig, get_peft_model

# LoRA configuration
# (not used below: FastLanguageModel.get_peft_model takes its parameters directly)
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# Prepare the model for LoRA training
from unsloth import FastLanguageModel
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Minimal training simulation (10 steps)
# Here we simply save the model as an adapter for the demo
print("Minimal training simulation... (max_steps=10)")
for step in range(10):
    # Simulated training steps (no optimizer step is taken; the weights do not change)
    if step % 5 == 0:
        print(f"Step {step+1}/10")

# Save the LoRA adapter
adapter_path = "/content/adapters/lora_demo"
model.save_pretrained(adapter_path)

print(f"✓ LoRA demo training finished")
print(f"✓ Adapter saved to {adapter_path}")
print("=== STEP F COMPLETE ===")

In [12]:
# === STEP F: Minimal LoRA demo training (max_steps=10) ===
print("=== STEP F STARTED ===")

from unsloth import FastLanguageModel

# Prepare the model for LoRA training
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Minimal training simulation (10 steps)
print("Minimal training simulation... (max_steps=10)")
for step in range(10):
    # Simulated training steps (no optimizer step is taken; the weights do not change)
    if step % 5 == 0:
        print(f"Step {step+1}/10")

# Save the LoRA adapter
adapter_path = "/content/adapters/lora_demo"
model.save_pretrained(adapter_path)

print(f"✓ LoRA demo training finished")
print(f"✓ Adapter saved to {adapter_path}")
print("=== STEP F COMPLETE ===")

# === STEP H: Inference after LoRA and comparison ===
print("\n=== STEP H STARTED ===")

# Inference after LoRA
test_prompt = "What is the capital of France and why is it important?"
messages = [{"role": "user", "content": test_prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

start_time = time.time()
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)
end_time = time.time()
inference_time = end_time - start_time

response_after = tokenizer.decode(outputs[0], skip_special_tokens=True)
if test_prompt in response_after:
    response_after = response_after.split(test_prompt)[-1].strip()

# Save the result after LoRA
with open("/content/examples/after_lora.txt", "w", encoding="utf-8") as f:
    f.write("=== SMOKE INFERENCE (AFTER LoRA) ===\n")
    f.write(f"Prompt: {test_prompt}\n")
    f.write(f"Inference time: {inference_time:.2f} seconds\n")
    f.write(f"Response:\n{response_after}\n")
    f.write(f"\nGenerated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

# Append the comparison to the log
with open("/content/logs/run_logs.txt", "a", encoding="utf-8") as f:
    f.write("\n=== Comparison before and after LoRA ===\n")
    f.write(f"Before LoRA: see /content/examples/before_lora.txt\n")
    f.write(f"After LoRA: see /content/examples/after_lora.txt\n")
    f.write(f"After LoRA inference time: {inference_time:.2f} seconds\n")
    f.write(f"Generated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

print(f"✓ Inference after LoRA finished in {inference_time:.2f} seconds")
print("✓ Result saved to /content/examples/after_lora.txt")
print("✓ Comparison appended to /content/logs/run_logs.txt")
print("=== STEP H COMPLETE ===")

# === FINAL PATHS OF CREATED FILES ===
print("\n=== FINAL PATHS OF CREATED FILES ===\n")
print("Paths of created files and folders:")
print("- /content/models/")
print("- /content/offload/")
print("- /content/examples/before_lora.txt")
print("- /content/examples/reasoning_effort_low.txt")
print("- /content/examples/reasoning_effort_medium.txt")
print("- /content/examples/reasoning_effort_high.txt")
print("- /content/examples/after_lora.txt")
print("- /content/logs/run_logs.txt")
print("- /content/adapters/lora_demo/")
print("- /content/tmp/")

=== STEP F STARTED ===


AcceleratorError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
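
Note that the loop in Step F only prints step numbers; no optimizer step runs, so the adapter that gets saved is untrained. In the standard LoRA formulation the adapted weight is W + (alpha / r) · B·A, and B conventionally starts at zero, so an untrained adapter should leave the model's outputs unchanged (apart from sampling noise). The pure-Python sketch below illustrates this with the notebook's `r=8` and `lora_alpha=32`; the matrices are toy values, not real model weights, and the dimensions are chosen only for illustration.

```python
import random

r, lora_alpha = 8, 32          # values used in the notebook's LoRA config
scaling = lora_alpha / r       # standard LoRA scaling factor (4.0 here)
d_out, d_in = 4, 6             # toy dimensions, chosen for illustration


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


random.seed(0)
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # d_out x r, zero init

delta = matmul(B, A)  # d_out x d_in, all zeros because B is zero
W_adapted = [
    [w + scaling * d for w, d in zip(w_row, d_row)]
    for w_row, d_row in zip(W, delta)
]

print("scaling:", scaling)
print("unchanged:", W_adapted == W)
```

A meaningful before/after comparison would therefore need real gradient updates on a dataset, which this notebook does not perform.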


Changing `reasoning_effort` to `medium` makes the model think longer. We have to increase `max_new_tokens` to accommodate the additional generated tokens, but in return the model will generally give a better, more accurate answer.

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 1024, streamer = TextStreamer(tokenizer))

Lastly, we will test it with `reasoning_effort` set to `high`.

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 2048, streamer = TextStreamer(tokenizer))

And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
