<a href="https://colab.research.google.com/github/kostianlomit/habr_bot/blob/master/qwen3__4b_instruct.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
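
The version-matching branch in the install cell is effectively a small lookup from the installed torch version to an xformers pin. The sketch below restates it as a plain function so the mapping can be checked outside a notebook. The function name `pick_xformers_pin` is hypothetical; the pins themselves are copied from the cell above.

```python
import re

def pick_xformers_pin(torch_version: str) -> str:
    """Return the xformers requirement string the install cell would select."""
    match = re.match(r"[0-9]+\.[0-9]+", torch_version)
    if match is None:
        raise ValueError(f"unrecognized torch version: {torch_version!r}")
    major_minor = match.group(0)
    # Pins copied from the install cell; anything else falls back to the oldest pin.
    pins = {"2.9": "0.0.33.post1", "2.8": "0.0.32.post2"}
    return "xformers==" + pins.get(major_minor, "0.0.29.post3")

print(pick_xformers_pin("2.9.0+cu126"))
```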

### Unsloth


In [2]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit", # Qwen3 4B Instruct, 2x faster
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.12.9: Fast Qwen3 patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

We now add LoRA adapters, so only a small number of parameters needs to be updated!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.12.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
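
To see where the trainable-parameter count comes from, here is a minimal sketch that counts LoRA weights for the seven target modules. Each adapted linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters. The layer shapes below are assumptions about the Qwen3-4B architecture (they are not stated in this notebook); with them, the result agrees with the `Trainable parameters = 66,060,288` line in the training log further down.

```python
# Assumed Qwen3-4B shapes (not given in this notebook): hidden size 2560,
# MLP size 9728, 32 query heads and 8 key/value heads of dimension 128, 36 layers.
HIDDEN, MLP, Q_OUT, KV_OUT, LAYERS = 2560, 9728, 32 * 128, 8 * 128, 36

# (in_features, out_features) for every projection listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * LAYERS

print(lora_param_count(32))
```

Because the count is linear in `r`, halving the rank to 16 would halve the number of trainable parameters under the same assumptions.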


<a name="Data"></a>
### Data Prep
We now use the Qwen-3 format for conversation-style finetuning. The upstream template uses Maxime Labonne's FineTome-100k dataset in ShareGPT style; the code cell below loads a local JSON file (`sample_data/unsloth_gds_dataset.json`) instead. Qwen-3 renders multi-turn conversations as shown below:

```
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hey there!<|im_end|>

```
We use our `get_chat_template` function to get the correct chat template. We support `zephyr`, `chatml`, `mistral`, `llama`, `alpaca`, `vicuna`, `vicuna_old`, `phi3`, `llama3`, `phi4`, `qwen2.5`, `gemma3` and more.

In [6]:
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

# 1. Load your dataset
dataset = load_dataset("json", data_files="sample_data/unsloth_gds_dataset.json", split="train")

# 2. Apply the chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="qwen3-instruct",
)

# 3. Apply the chat template to the dataset records
# NOTE: in the recorded run this step yields empty strings (see the output below),
# presumably because the records are not standardized yet; the template is applied
# again after standardize_data_formats.
dataset = dataset.map(lambda x: {"text": tokenizer.apply_chat_template(x["conversations"], tokenize=False)})

# 4. Check the result
for i in range(3):
    print(f"Example {i+1}:")
    print(dataset[i]["text"])
    print("-" * 80)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/216 [00:00<?, ? examples/s]

Example 1:

--------------------------------------------------------------------------------
Example 2:

--------------------------------------------------------------------------------
Example 3:

--------------------------------------------------------------------------------


We now use `standardize_data_formats` to try converting the dataset to the correct format for finetuning.

In [7]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/216 [00:00<?, ? examples/s]
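
As a rough illustration of what this kind of standardization involves, the sketch below maps ShareGPT-style turns (`from` / `value`) to the `role` / `content` schema that chat templates expect. It is not Unsloth's implementation: the helper name `to_role_content` and the alias table are hypothetical, and the assumption that the raw file uses ShareGPT keys is not confirmed by the notebook.

```python
# Hypothetical alias table: ShareGPT speaker names mapped to chat-template roles.
ROLE_ALIASES = {
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
    "system": "system",
}

def to_role_content(conversation):
    """Convert one conversation to a list of {"role", "content"} dicts."""
    standardized = []
    for turn in conversation:
        # Accept either schema: already-standard keys win over ShareGPT keys.
        role = turn.get("role", turn.get("from"))
        content = turn.get("content", turn.get("value"))
        standardized.append({"role": ROLE_ALIASES.get(role, role), "content": content})
    return standardized

sharegpt = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hey there!"},
]
print(to_role_content(sharegpt))
```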

Let's see what row 100 looks like!

In [8]:
dataset[100]

{'conversations': [{'content': 'Классифицируй запрос оператора GDS и верни последовательность команд\n\nAlfaTravel::Срочные вопросы - до вылета менее 48 часов - расписание вылета',
   'role': 'user'},
  {'content': 'check_info: КИ/АЭР/ВНУКОВО, КИ/АВК/ДОНАВИА, ТМОВСПТ',
   'role': 'assistant'}],
 'text': ''}

The user turn is in Russian: "Classify the GDS operator's request and return the command sequence", followed by the request "Urgent questions - less than 48 hours before departure - departure schedule".

We now have to apply the chat template for Qwen-3 to the conversations and save the result in the `text` field.

In [10]:
def formatting_prompts_func(examples):
   convos = examples["conversations"]
   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
   return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

Map:   0%|          | 0/216 [00:00<?, ? examples/s]

Let's see how the chat template did!


In [11]:
dataset[100]['text']

'<|im_start|>user\nКлассифицируй запрос оператора GDS и верни последовательность команд\n\nAlfaTravel::Срочные вопросы - до вылета менее 48 часов - расписание вылета<|im_end|>\n<|im_start|>assistant\ncheck_info: КИ/АЭР/ВНУКОВО, КИ/АВК/ДОНАВИА, ТМОВСПТ<|im_end|>\n'
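
The rendered string follows a simple pattern: each turn is wrapped in `<|im_start|>{role}\n ... <|im_end|>\n`. The sketch below reproduces that pattern in plain Python for user and assistant turns only. It is a simplified stand-in, not the tokenizer's actual Jinja template (which also handles system prompts, tools and other cases), and the function name `render_chatml` is hypothetical.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Wrap each turn in <|im_start|>role ... <|im_end|> markers."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # For inference: open an assistant turn for the model to complete.
        text += "<|im_start|>assistant\n"
    return text

demo = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_chatml(demo))
```

Training text uses `add_generation_prompt=False` because the assistant reply is already part of the conversation; at inference time the flag is set to `True` so the model continues from an open assistant turn.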

<a name="Train"></a>
### Train the model
Now let's train our model. We run 60 steps to keep things fast, but you can set `num_train_epochs=1` for a full run and turn off the step limit with `max_steps=None`.

In [12]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use TrackIO/WandB etc
    ),
)
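
To make the effect of `warmup_steps = 5`, `lr_scheduler_type = "linear"` and `max_steps = 60` concrete, here is a minimal sketch of a linear warmup followed by linear decay. It mirrors the usual formula for this schedule but is written from scratch; the function name `linear_warmup_decay_lr` is hypothetical and the trainer's own scheduler is what actually runs.

```python
def linear_warmup_decay_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly so the rate reaches 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 3, 5, 30, 60):
    print(step, linear_warmup_decay_lr(step))
```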

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/216 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


We also use Unsloth's `train_on_responses_only` helper to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!

In [13]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n",
)

Map (num_proc=6):   0%|          | 0/216 [00:00<?, ? examples/s]

Let's verify that the instruction part has been masked! We print row 100 again.

In [14]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<|im_start|>user\nКлассифицируй запрос оператора GDS и верни последовательность команд\n\nAlfaTravel::Срочные вопросы - до вылета менее 48 часов - расписание вылета<|im_end|>\n<|im_start|>assistant\ncheck_info: КИ/АЭР/ВНУКОВО, КИ/АВК/ДОНАВИА, ТМОВСПТ<|im_end|>\n'

Now let's print the masked-out example - you should see that only the answer is present:

In [15]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                       check_info: –ö–ò/–ê–≠–†/–í–ù–£–ö–û–í–û, –ö–ò/–ê–í–ö/–î–û–ù–ê–í–ò–ê, –¢–ú–û–í–°–ü–¢<|im_end|>\n'
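
The masking idea can be sketched on plain lists of token ids: every position gets the label `-100` (ignored by the loss) unless it falls between a response marker and the next instruction marker. This is an illustration under assumptions, not Unsloth's code. The helper names and the toy token ids are hypothetical; in the notebook the markers are the tokenized forms of `<|im_start|>user\n` and `<|im_start|>assistant\n`.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def find_sub(seq, sub, start=0):
    """Index of the first occurrence of sub in seq at or after start, else -1."""
    n = len(sub)
    for i in range(start, len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1

def mask_non_response(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

# Toy ids: 1 = <|im_start|>, 2 = <|im_end|>, 10 = "user", 11 = "assistant".
ids = [1, 10, 5, 6, 2, 1, 11, 7, 8, 2]
print(mask_non_response(ids, instruction_ids=[1, 10], response_ids=[1, 11]))
```

Only the assistant tokens (and the closing end-of-turn token) keep their ids as labels, which matches the decoded output above where the user turn is blanked out.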

In [16]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
3.816 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [17]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 216 | Num Epochs = 3 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 66,060,288 of 4,088,528,384 (1.62% trained)


| Step | Training Loss |
|------|---------------|
| 1 | 5.2964 |
| 2 | 6.4091 |
| 3 | 5.9127 |
| 4 | 5.0694 |
| 5 | 4.2079 |
| 6 | 3.8408 |
| 7 | 3.3036 |
| 8 | 3.0837 |
| 9 | 2.7306 |
| 10 | 2.6356 |
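
The banner's `Total batch size = 8` and `Num Epochs = 3` follow from simple arithmetic on the settings above. The sketch below reproduces it; the variable names are illustrative, and the rounding rule (partial epochs counted as whole ones) is an assumption about how the trainer reports the figure.

```python
import math

num_examples = 216      # rows in the dataset
per_device_batch = 2    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
num_gpus = 1
max_steps = 60

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus
# Optimizer steps needed to see the whole dataset once.
steps_per_epoch = math.ceil(num_examples / effective_batch)
# 60 steps span this many (partially completed) epochs.
epochs_touched = math.ceil(max_steps / steps_per_epoch)
# Actual number of passes over the data.
passes = max_steps * effective_batch / num_examples

print(effective_batch, steps_per_epoch, epochs_touched, round(passes, 2))
```

With only 216 examples, 60 steps already cover the data a little more than twice, so `num_train_epochs = 1` would be a shorter run than the default shown here.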


In [18]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

154.3494 seconds used for training.
2.57 minutes used for training.
Peak reserved memory = 4.445 GB.
Peak reserved memory for training = 0.629 GB.
Peak reserved memory % of max memory = 30.154 %.
Peak reserved memory for training % of max memory = 4.267 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth's native inference! According to the Qwen-3 team, the recommended settings for instruct inference are `temperature = 0.7, top_p = 0.8, top_k = 20`.

For reasoning chat-based inference, use `temperature = 0.6, top_p = 0.95, top_k = 20`.

In [27]:
messages = [
    # Russian prompt: "Classify the GDS operator's request and return the command sequence" + "my agent / flight status"
    {"role": "user", "content": "Классифицируй запрос оператора GDS и верни последовательность команд\n\nмой агент /статус рейса"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to("cuda")

from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.1,
    top_p=0.9,
    top_k=50,
    streamer=streamer,
)

check_info: КИ/АЭР/ВНУКОВО, КИ/АВК/ДОНАВИА, ТМОВСПТ<|im_end|>
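
To show how `temperature`, `top_k` and `top_p` interact, the sketch below applies them in sequence to a toy list of logits and returns the distribution that sampling would draw from. It is a from-scratch illustration, not the `generate` implementation; the function name `filtered_distribution` and the toy logits are hypothetical.

```python
import math

def filtered_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    # Top-k: keep the k highest-scoring token ids.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]
print(filtered_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.8))
print(filtered_distribution(toy_logits, temperature=0.1, top_k=50, top_p=0.9))
```

At `temperature = 0.1`, the value used in the cell above, almost all probability mass sits on the top token, so generation is close to greedy decoding. That suits a task where a fixed command string is expected.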


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, not the full model. To save to 16bit or GGUF, scroll down!

In [20]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [24]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

### Saving to float16 for vLLM

We also support saving directly to `float16`. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens to get your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False: # Pushing to HF Hub
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")
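
Conceptually, a merged save folds the adapter back into the base weight: `W_merged = W + (lora_alpha / r) * B @ A`. The sketch below shows that arithmetic on tiny pure-Python matrices. It is only an illustration of the standard LoRA merge formula (with `use_rslora = False`), not what `save_pretrained_merged` does internally; in particular, it ignores dequantizing the 4-bit base weights. All names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha):
    """Return W + (lora_alpha / r) * B @ A, where r is the number of rows of A."""
    rank = len(lora_a)
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)  # shape: out_features x in_features
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

W = [[1, 0, 2], [0, 1, 0]]   # base weight, 2 x 3
A = [[1, 2, 3]]              # LoRA A, r x in_features (r = 1)
B = [[1], [2]]               # LoRA B, out_features x r
print(merge_lora(W, A, B, lora_alpha=2))
```

In this notebook `lora_alpha = 32` and `r = 32`, so the scale factor would be 1.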


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we now support it natively! We clone `llama.cpp` and save to `q8_0` by default. All methods, such as `q4_k_m`, are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

Likewise, to push the GGUF files to your Hugging Face account instead, change `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )
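
For intuition about what an 8-bit block format such as `q8_0` stores, here is a simplified sketch of symmetric block quantization: values are grouped into blocks of 32, and each block keeps one scale plus 32 signed 8-bit codes. This is an illustration of the idea only, not llama.cpp's code; the function names are hypothetical, and details such as storing the scale in half precision are left out.

```python
BLOCK_SIZE = 32

def quantize_block(values):
    """Map one block of floats to (scale, int8 codes) with symmetric rounding."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from the stored scale and codes."""
    return [scale * c for c in codes]

block = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(scale, max_error)
```

The rounding error per value is at most half a quantization step (`scale / 2`), which is why 8-bit block formats stay close to the original weights while 4-bit and 5-bit variants trade more accuracy for size.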

Now use the `model-unsloth.gguf` file or the `model-unsloth-Q4_K_M.gguf` file in llama.cpp.

And we're done! If you have any questions about Unsloth, we have a Discord channel! If you find any bugs, want to keep up with the latest LLM developments, need help, or want to join projects, feel free to join our Discord!

Some other links:

* Train your own reasoning model - Llama GRPO notebook (free Colab)
* Saving finetunes to Ollama (free notebook)
* Llama 3.2 Vision finetuning - radiography use case (free Colab)
* See our documentation for notebooks on DPO, ORPO, continued pretraining, conversational finetuning and more!
