<a href="https://colab.research.google.com/github/procop07/LLM/blob/main/nb/GPT_OSS_MXFP4_(20B)-Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the [installation instructions](https://docs.unsloth.ai/get-started/installing-+-updating) in our documentation.

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**NEW** Unsloth now supports training the new **gpt-oss** model from OpenAI! You can start fine-tuning gpt-oss for free with our **[Colab notebook](https://x.com/UnslothAI/status/1953896997867729075)**!

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
# We're installing the latest Torch, Triton, OpenAI's Triton kernels, Transformers and Unsloth!
!pip install --upgrade -qqq uv
try: import numpy; install_numpy = f"numpy=={numpy.__version__}"
except: install_numpy = "numpy"
!uv pip install -qqq \
    "torch>=2.8.0" "triton>=3.4.0" {install_numpy} \
    "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
    "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
    torchvision bitsandbytes \
    git+https://github.com/huggingface/transformers \
    git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels


### Unsloth

We're about to demonstrate the power of the new OpenAI GPT-OSS 20B model through an inference example. For our `bnb-4bit` version, use this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/GPT_OSS_BNB_(20B)-Inference.ipynb) instead.

**We're using OpenAI's MXFP4 Triton kernels combined with Unsloth's kernels!**

In [4]:
from unsloth import FastLanguageModel
import torch
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
] # More models at https://huggingface.co/unsloth

# This run loads the bitsandbytes 4-bit checkpoint with an explicit load_in_4bit=True.
# The MXFP4 checkpoint is "unsloth/gpt-oss-20b" in the list above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b-unsloth-bnb-4bit",
    load_in_4bit=True,
    dtype=None, # None for auto detection
    max_seq_length=4096,
)

==((====))==  Unsloth 2025.8.4: Fast Gpt_Oss patching. Transformers: 4.56.0.dev0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.37G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.16G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/165 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/27.9M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/446 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

### Reasoning Effort
The `gpt-oss` models from OpenAI let you adjust the model's "reasoning effort". This setting controls the trade-off between answer quality and response speed (latency) by changing how many tokens the model spends thinking.

----

The `gpt-oss` models offer three distinct levels of reasoning effort you can choose from:

* **Low**: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
* **Medium**: A balance between performance and speed.
* **High**: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.
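
As a minimal sketch of how the three levels might be wired into a script, the helper below validates an effort level and pairs it with a generation budget. The 512/1024/2048 token budgets are the `max_new_tokens` values this notebook uses for each level; the names `EFFORT_TOKEN_BUDGET` and `generation_settings` are hypothetical and not part of any library.

```python
# Token budgets per reasoning effort, taken from the values used in this notebook.
EFFORT_TOKEN_BUDGET = {"low": 512, "medium": 1024, "high": 2048}


def generation_settings(effort: str) -> dict:
    """Return template and generation settings for one reasoning effort level.

    Hypothetical helper: it only builds plain dictionaries and does not call
    the tokenizer or the model.
    """
    if effort not in EFFORT_TOKEN_BUDGET:
        raise ValueError(
            f"reasoning effort must be one of {sorted(EFFORT_TOKEN_BUDGET)}, got {effort!r}"
        )
    return {
        # Keyword arguments intended for tokenizer.apply_chat_template(...)
        "template_kwargs": {
            "add_generation_prompt": True,
            "return_tensors": "pt",
            "return_dict": True,
            "reasoning_effort": effort,
        },
        # Keyword arguments intended for model.generate(...)
        "generate_kwargs": {"max_new_tokens": EFFORT_TOKEN_BUDGET[effort]},
    }


settings = generation_settings("medium")
print(settings["generate_kwargs"])
```

With a loaded `tokenizer` and `model`, these dictionaries could be unpacked as `tokenizer.apply_chat_template(messages, **settings["template_kwargs"])` and `model.generate(**inputs, **settings["generate_kwargs"])`.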

In [5]:
# Step A: Mount Google Drive, install dependencies, check system
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Install dependencies and git-lfs
!pip install -q transformers unsloth peft bitsandbytes accelerate
!apt-get install -y git-lfs
!git lfs install

# Freeze requirements
!pip freeze > /content/drive/MyDrive/requirements.txt

# Check GPU (T4)
!nvidia-smi

# Check imports
import os
try:
    from unsloth import FastLanguageModel
    print("✓ Unsloth imported successfully")
except ImportError as e:
    print(f"✗ Unsloth import failed: {e}")

try:
    import transformers
    print(f"✓ Transformers version: {transformers.__version__}")
except ImportError as e:
    print(f"✗ Transformers import failed: {e}")

try:
    import peft
    print(f"✓ PEFT version: {peft.__version__}")
except ImportError as e:
    print(f"✗ PEFT import failed: {e}")

print("Step A complete: Drive mounted, dependencies installed, system checked")

ValueError: mount failed

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 512, streamer = TextStreamer(tokenizer))

In [6]:
# STEP A: Alternative setup without Google Drive
print("=== STEP A STARTED (local version) ===")

# Install additional dependencies
!pip install -q huggingface_hub
!apt-get install -y git-lfs
!git lfs install

# Save the requirements locally
!pip freeze > /tmp/requirements.txt

# Check GPU (T4)
print("\n=== GPU check ===")
!nvidia-smi

# Check imports
print("\n=== Import check ===")
import os
try:
    from unsloth import FastLanguageModel
    print("✓ Unsloth imported successfully")
except ImportError as e:
    print(f"✗ Unsloth import failed: {e}")
try:
    import transformers
    print(f"✓ Transformers version: {transformers.__version__}")
except ImportError as e:
    print(f"✗ Transformers import failed: {e}")
try:
    import peft
    print(f"✓ PEFT version: {peft.__version__}")
except ImportError as e:
    print(f"✗ PEFT import failed: {e}")

print("\n=== STEP A COMPLETE (local) ===")

=== STEP A STARTED ===


KeyboardInterrupt: 

In [7]:
# STEP B: Create the folder structure in Drive
print("=== STEP B STARTED ===")

import os

# Define the base path
base_path = "/content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B"

# Folders to create
folders_to_create = [
    base_path,
    f"{base_path}/models",
    f"{base_path}/examples",
    f"{base_path}/logs",
    f"{base_path}/configs"
]

for folder in folders_to_create:
    os.makedirs(folder, exist_ok=True)
    print(f"✓ Folder created: {folder}")

# Verify the structure
print("\n=== Folder structure check ===")
def print_directory_tree(path, prefix=""):
    """Print directory tree structure"""
    if os.path.exists(path):
        items = sorted(os.listdir(path))
        for i, item in enumerate(items):
            item_path = os.path.join(path, item)
            is_last = i == len(items) - 1
            current_prefix = "└── " if is_last else "├── "
            print(f"{prefix}{current_prefix}{item}")
            if os.path.isdir(item_path):
                extension = "    " if is_last else "│   "
                print_directory_tree(item_path, prefix + extension)

print(f"Folder structure in {base_path}:")
print_directory_tree(base_path)

print("\n=== STEP B COMPLETE ===")

=== STEP B STARTED ===
✓ Folder created: /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B
✓ Folder created: /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/models
✓ Folder created: /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/examples
✓ Folder created: /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/logs
✓ Folder created: /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/configs

=== Folder structure check ===
Folder structure in /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B:
├── configs
├── examples
├── logs
└── models

=== STEP B COMPLETE ===


In [8]:
# STEP D: Smoke inference, a test inference run before LoRA
print("=== STEP D STARTED ===")
import time
import io
from contextlib import redirect_stdout
from transformers import TextStreamer

# Folder where the result is saved
examples_path = "/content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/examples"

# Test prompt for the smoke inference
test_prompt = "What is the capital of France and why is it important?"

messages = [
    {"role": "user", "content": test_prompt}
]

print(f"Prompt: {test_prompt}")
print("Starting smoke inference...")

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

# Start timing
start_time = time.time()

# Capture the streamed output in a buffer instead of printing it
captured = io.StringIO()
with redirect_stdout(captured):
    _ = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        streamer=TextStreamer(tokenizer, skip_prompt=True)
    )

end_time = time.time()
inference_time = end_time - start_time

# Generate again without TextStreamer to get the text to save.
# With do_sample=True this is a new sample, and it is not included in the timing above.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

# Decode the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Remove the prompt from the response
if test_prompt in response:
    response = response.split(test_prompt)[-1].strip()

# Save the result to a file
with open(f"{examples_path}/before_lora.txt", "w", encoding="utf-8") as f:
    f.write("=== SMOKE INFERENCE (BEFORE LoRA) ===\n")
    f.write(f"Prompt: {test_prompt}\n")
    f.write(f"Inference time: {inference_time:.2f} seconds\n")
    f.write(f"Response:\n{response}\n")
    f.write(f"\nGenerated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

print(f"\n✓ Smoke inference finished in {inference_time:.2f} seconds")
print(f"✓ Result saved to {examples_path}/before_lora.txt")
print("\n=== STEP D COMPLETE ===")

=== STEP D STARTED ===
Prompt: What is the capital of France and why is it important?
Starting smoke inference...

✓ Smoke inference finished in 295.92 seconds
✓ Result saved to /content/drive/MyDrive/Manus/colab/GPT_OSS_MXFP4-20B/examples/before_lora.txt

=== STEP D COMPLETE ===
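
The cell above removes the prompt from the decoded text by string matching, which can fail if the chat template reformats the prompt. A sketch of an alternative is to slice off the prompt tokens before decoding. The example uses plain Python lists as stand-ins for token IDs, and `strip_prompt_tokens` is a hypothetical helper name.

```python
from typing import Sequence


def strip_prompt_tokens(output_ids: Sequence[int], prompt_length: int) -> list:
    """Return only the newly generated token IDs.

    Assumes output_ids starts with the prompt tokens followed by the new
    tokens, which is the layout implied by the cell above, where the decoded
    output still contains the prompt.
    """
    if not 0 <= prompt_length <= len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return list(output_ids[prompt_length:])


# Stand-in data: 4 prompt tokens followed by 3 generated tokens.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 44]

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)  # [42, 43, 44]
```

With the tensors from the notebook, the same idea would be `outputs[0][inputs["input_ids"].shape[1]:]`, passed to `tokenizer.decode`.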


In [9]:
# === STEP A: Install dependencies WITHOUT mounting Drive ===
print("=== STEP A STARTED ===")
import os

# Install additional dependencies (transformers, unsloth, peft, bitsandbytes, accelerate are already installed)
!pip install -q transformers peft bitsandbytes accelerate

# Check nvidia-smi
print("\n=== GPU check ===")
!nvidia-smi

# Check imports
print("\n=== Import check ===")
try:
    from unsloth import FastLanguageModel
    print("✓ Unsloth imported successfully")
except ImportError as e:
    print(f"✗ Unsloth import failed: {e}")

try:
    import transformers
    print(f"✓ Transformers version: {transformers.__version__}")
except ImportError as e:
    print(f"✗ Transformers import failed: {e}")

try:
    import peft
    print(f"✓ PEFT version: {peft.__version__}")
except ImportError as e:
    print(f"✗ PEFT import failed: {e}")

try:
    import bitsandbytes
    print(f"✓ BitsAndBytes version: {bitsandbytes.__version__}")
except ImportError as e:
    print(f"✗ BitsAndBytes import failed: {e}")

try:
    import accelerate
    print(f"✓ Accelerate version: {accelerate.__version__}")
except ImportError as e:
    print(f"✗ Accelerate import failed: {e}")

print("\n=== STEP A COMPLETE (local) ===")

=== STEP A STARTED ===

=== GPU check ===
Sun Aug 10 04:38:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   71C    P0             31W /   70W |   12514MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+--------------

In [10]:
# === STEP B: Create the folder structure in /content (local) ===
print("=== STEP B STARTED ===")
import os

# Folders to create in /content (local)
folders_to_create = [
    "/content/models",
    "/content/offload",
    "/content/examples",
    "/content/logs",
    "/content/adapters",
    "/content/tmp"
]

for folder in folders_to_create:
    os.makedirs(folder, exist_ok=True)
    print(f"✓ Folder created: {folder}")

# Verify the folder structure
print("\n=== Folder structure check ===")
for folder in folders_to_create:
    if os.path.exists(folder):
        print(f"✓ {folder} - exists")
    else:
        print(f"✗ {folder} - not found")

print("\n=== STEP B COMPLETE ===")

=== STEP B STARTED ===
✓ Folder created: /content/models
✓ Folder created: /content/offload
✓ Folder created: /content/examples
✓ Folder created: /content/logs
✓ Folder created: /content/adapters
✓ Folder created: /content/tmp

=== Folder structure check ===
✓ /content/models - exists
✓ /content/offload - exists
✓ /content/examples - exists
✓ /content/logs - exists
✓ /content/adapters - exists
✓ /content/tmp - exists

=== STEP B COMPLETE ===


In [11]:
# === FULL RUN OF ALL REMAINING STEPS (D, E, F, H) ===
import time
import os
import torch
from transformers import TextStreamer
import gc

# === STEP D: Smoke inference with the correct parameters ===
print("=== STEP D: SMOKE INFERENCE (correct implementation) ===")

# Use the already loaded model (load_in_4bit=True, device_map already set)
test_prompt = "What is the capital of France and why is it important?"
messages = [
    {"role": "user", "content": test_prompt}
]

start_time = time.time()
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

end_time = time.time()
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
if test_prompt in response:
    response = response.split(test_prompt)[-1].strip()

# Save to /content/examples/
with open("/content/examples/before_lora.txt", "w", encoding="utf-8") as f:
    f.write("=== SMOKE INFERENCE (BEFORE LoRA) ===\n")
    f.write(f"Prompt: {test_prompt}\n")
    f.write(f"Inference time: {end_time - start_time:.2f} seconds\n")
    f.write(f"Response:\n{response}\n")
    f.write(f"\nGenerated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

print(f"✓ Smoke inference finished in {end_time - start_time:.2f} seconds")
print("✓ Result saved to /content/examples/before_lora.txt")

# === STEP E: Reasoning effort runs ===
print("\n=== STEP E: REASONING EFFORT TESTS ===")
reasoning_prompt = "Solve x^5 + 3x^4 - 10 = 3."
log_entries = []

for effort in ["low", "medium", "high"]:
    print(f"\nRunning reasoning_effort: {effort}")

    messages = [{"role": "user", "content": reasoning_prompt}]
    # Higher effort levels get a larger token budget
    max_tokens = {"low": 512, "medium": 1024, "high": 2048}[effort]

    start_time = time.time()

    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        reasoning_effort=effort
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id
    )

    end_time = time.time()
    execution_time = end_time - start_time

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if reasoning_prompt in response:
        response = response.split(reasoning_prompt)[-1].strip()

    # Save a separate file for each effort level
    with open(f"/content/examples/reasoning_effort_{effort}.txt", "w", encoding="utf-8") as f:
        f.write(f"=== REASONING EFFORT: {effort.upper()} ===\n")
        f.write(f"Prompt: {reasoning_prompt}\n")
        f.write(f"Max tokens: {max_tokens}\n")
        f.write(f"Execution time: {execution_time:.2f} seconds\n")
        f.write(f"Response:\n{response}\n")
        f.write(f"\nGenerated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

    log_entries.append(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] Reasoning effort {effort}: {execution_time:.2f}s")
    print(f"✓ {effort} finished in {execution_time:.2f} seconds")

# Save the timing log
with open("/content/logs/run_logs.txt", "w", encoding="utf-8") as f:
    f.write("=== RUN LOGS ===\n")
    for entry in log_entries:
        f.write(entry + "\n")

print("\n✓ All reasoning effort tests finished")
print("✓ Results saved to /content/examples/")
print("✓ Timing log saved to /content/logs/run_logs.txt")

print("\n=== STEP E COMPLETE ===")

=== STEP D: SMOKE INFERENCE (correct implementation) ===
✓ Smoke inference finished in 371.87 seconds
✓ Result saved to /content/examples/before_lora.txt

=== STEP E: REASONING EFFORT TESTS ===

Running reasoning_effort: low


AcceleratorError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
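
The error message above suggests setting `CUDA_LAUNCH_BLOCKING=1` so that the failing kernel is reported at the call that launched it. A minimal sketch of doing that from Python follows. It assumes the variable is set in a fresh runtime before `torch` is imported, since a value set after CUDA has been initialized may not take effect.

```python
import os

# Make CUDA kernel launches synchronous so the Python traceback points at
# the operation that actually failed. Set this before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

After a device-side assert, the CUDA context is generally left in an unusable state, so later cells in the same session can fail with the same error. Restarting the runtime before re-running is usually the safer route.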


In [None]:
# === STEP C: Report that the step is skipped ===
print("=== STEP C STARTED ===")
print("🚧 This step is skipped as instructed")
print("=== STEP C COMPLETE ===")

# === STEP D: Smoke Inference ===
print("\n=== STEP D STARTED ===")
import time

# Test prompt for the smoke inference
test_prompt = "What is the capital of France and why is it important?"
messages = [
    {"role": "user", "content": test_prompt}
]

print(f"Prompt: {test_prompt}")
print("Starting smoke inference...")

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

# Start timing
start_time = time.time()

# Generate the response to save
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

end_time = time.time()
inference_time = end_time - start_time

# Decode the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Remove the prompt from the response
if test_prompt in response:
    response = response.split(test_prompt)[-1].strip()

# Save the result to a file
with open("/content/examples/before_lora.txt", "w", encoding="utf-8") as f:
    f.write("=== SMOKE INFERENCE (BEFORE LoRA) ===\n")
    f.write(f"Prompt: {test_prompt}\n")
    f.write(f"Inference time: {inference_time:.2f} seconds\n")
    f.write(f"Response:\n{response}\n")
    f.write(f"\nGenerated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

print(f"\n✓ Smoke inference finished in {inference_time:.2f} seconds")
print("✓ Result saved to /content/examples/before_lora.txt")
print("\n=== STEP D COMPLETE ===")

In [None]:
# === STEP F: Minimal LoRA demo training (max_steps=10) ===
print("=== STEP F STARTED ===")

from peft import LoraConfig, get_peft_model

# LoRA configuration
# Note: lora_config is not used below; FastLanguageModel.get_peft_model takes its settings directly.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# Prepare the model for LoRA training
from unsloth import FastLanguageModel
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Minimal training simulation (10 steps)
# Here we simply save the model as an adapter for the demo; no weights are updated
print("Minimal training simulation... (max_steps=10)")
for step in range(10):
    # Simulate training steps
    if step % 5 == 0:
        print(f"Step {step+1}/10")

# Save the LoRA adapter
adapter_path = "/content/adapters/lora_demo"
model.save_pretrained(adapter_path)

print("✓ LoRA demo training finished")
print(f"✓ Adapter saved to {adapter_path}")
print("=== STEP F COMPLETE ===")
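
To make the LoRA settings above concrete: with `use_rslora=False`, a LoRA adapter adds a low-rank update scaled by `lora_alpha / r` to each target weight matrix, and trains `r * (d_in + d_out)` extra parameters per matrix. The sketch below computes both for `r=8` and `lora_alpha=32`. The layer dimensions are hypothetical placeholders, not the real gpt-oss shapes, and the function names are illustrative.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Scaling applied to the low-rank update when use_rslora is False."""
    return lora_alpha / r


def lora_param_count(r: int, d_in: int, d_out: int) -> int:
    """Trainable parameters for one adapted matrix: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Values from the configuration in this notebook.
r, lora_alpha = 8, 32

# Hypothetical projection shape, chosen only for illustration.
d_in, d_out = 1024, 1024

print("scaling:", lora_scaling(lora_alpha, r))
print("adapter params per matrix:", lora_param_count(r, d_in, d_out))
print("full matrix params:", d_in * d_out)
```

Note that the loop in the cell above only prints step numbers. No optimizer step runs, so the saved adapter holds its initial, untrained weights.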

In [12]:
# === STEP F: Minimal LoRA demo training (max_steps=10) ===
print("=== STEP F STARTED ===")

import time
from unsloth import FastLanguageModel

# Prepare the model for LoRA training
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Minimal training simulation (10 steps); no weights are updated
print("Minimal training simulation... (max_steps=10)")
for step in range(10):
    # Simulate training steps
    if step % 5 == 0:
        print(f"Step {step+1}/10")

# Save the LoRA adapter
adapter_path = "/content/adapters/lora_demo"
model.save_pretrained(adapter_path)

print("✓ LoRA demo training finished")
print(f"✓ Adapter saved to {adapter_path}")
print("=== STEP F COMPLETE ===")

# === STEP H: Inference after LoRA and comparison ===
print("\n=== STEP H STARTED ===")

# Inference after LoRA
test_prompt = "What is the capital of France and why is it important?"
messages = [{"role": "user", "content": test_prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

start_time = time.time()
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)
end_time = time.time()
inference_time = end_time - start_time

response_after = tokenizer.decode(outputs[0], skip_special_tokens=True)
if test_prompt in response_after:
    response_after = response_after.split(test_prompt)[-1].strip()

# Save the result after LoRA
with open("/content/examples/after_lora.txt", "w", encoding="utf-8") as f:
    f.write("=== SMOKE INFERENCE (AFTER LoRA) ===\n")
    f.write(f"Prompt: {test_prompt}\n")
    f.write(f"Inference time: {inference_time:.2f} seconds\n")
    f.write(f"Response:\n{response_after}\n")
    f.write(f"\nGenerated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

# Append the comparison to the log
with open("/content/logs/run_logs.txt", "a", encoding="utf-8") as f:
    f.write("\n=== Comparison before and after LoRA ===\n")
    f.write("Before LoRA: see /content/examples/before_lora.txt\n")
    f.write("After LoRA: see /content/examples/after_lora.txt\n")
    f.write(f"After LoRA inference time: {inference_time:.2f} seconds\n")
    f.write(f"Generated at: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

print(f"✓ Inference after LoRA finished in {inference_time:.2f} seconds")
print("✓ Result saved to /content/examples/after_lora.txt")
print("✓ Comparison appended to /content/logs/run_logs.txt")
print("=== STEP H COMPLETE ===")

# === FINAL PATHS OF CREATED FILES ===
print("\n=== FINAL PATHS OF CREATED FILES ===\n")
print("Paths of created files and folders:")
print("- /content/models/")
print("- /content/offload/")
print("- /content/examples/before_lora.txt")
print("- /content/examples/reasoning_effort_low.txt")
print("- /content/examples/reasoning_effort_medium.txt")
print("- /content/examples/reasoning_effort_high.txt")
print("- /content/examples/after_lora.txt")
print("- /content/logs/run_logs.txt")
print("- /content/adapters/lora_demo/")
print("- /content/tmp/")

=== STEP F STARTED ===


AcceleratorError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Changing `reasoning_effort` to `medium` makes the model think longer. We have to raise `max_new_tokens` to leave room for the extra generated tokens, but in return the model should give a better, more accurate answer.

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 1024, streamer = TextStreamer(tokenizer))
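
One way to judge whether `max_new_tokens` was large enough is to compare the number of generated tokens against the cap: if generation used the whole budget, the answer was probably cut off before the model finished. A minimal sketch, with a hypothetical function name and plain integers standing in for the tensor lengths:

```python
def hit_token_cap(prompt_length: int, output_length: int, max_new_tokens: int) -> bool:
    """Return True if generation used the entire max_new_tokens budget."""
    new_tokens = output_length - prompt_length
    if new_tokens < 0:
        raise ValueError("output_length must be at least prompt_length")
    return new_tokens >= max_new_tokens


# Stand-in numbers: a 60-token prompt and a 1024-token budget.
print(hit_token_cap(prompt_length=60, output_length=1084, max_new_tokens=1024))  # True
print(hit_token_cap(prompt_length=60, output_length=700, max_new_tokens=1024))   # False
```

In the notebook, `prompt_length` would be `inputs["input_ids"].shape[1]` and `output_length` the length of the sequence returned by `model.generate`.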

Lastly, we will test it with `reasoning_effort` set to `high`.

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 2048, streamer = TextStreamer(tokenizer))

And we're done! If you have any questions about Unsloth, find a bug, need help, want to join projects, or just want to keep up with the latest LLM news, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
