To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News


Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!

Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [2]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
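
The Colab branch above pins xformers according to the installed torch `major.minor` version. The same selection logic can be sketched as a standalone function, which makes the fallback behaviour easier to see. `pick_xformers_pin` is a hypothetical helper name; the version table is copied from the install cell.

```python
import re

# Version table copied from the install cell above; the helper itself is illustrative.
XFORMERS_PINS = {"2.9": "0.0.33.post1", "2.8": "0.0.32.post2"}
DEFAULT_PIN = "0.0.29.post3"  # used for any other torch version


def pick_xformers_pin(torch_version: str) -> str:
    """Return the pip requirement string for a given torch version string."""
    match = re.match(r"[0-9]+\.[0-9]+", torch_version)
    if match is None:
        raise ValueError(f"Unrecognised torch version: {torch_version!r}")
    return "xformers==" + XFORMERS_PINS.get(match.group(0), DEFAULT_PIN)


print(pick_xformers_pin("2.9.0+cu126"))
print(pick_xformers_pin("2.6.0"))
```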

In [3]:
%%capture
# These are mamba kernels and we must have these for faster training
#!pip install --no-build-isolation mamba_ssm==2.2.5
#!pip install --no-build-isolation causal_conv1d==1.5.2

### Unsloth

In [10]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/granite-4.0-micro",
    "unsloth/granite-4.0-h-micro",
    "unsloth/granite-4.0-h-tiny",
    "unsloth/granite-4.0-h-small",

    # Base pretrained Granite 4 models
    "unsloth/granite-4.0-micro-base",
    "unsloth/granite-4.0-h-micro-base",
    "unsloth/granite-4.0-h-tiny-base",
    "unsloth/granite-4.0-h-small-base",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/granite-4.0-350m",
    max_seq_length = 256,   # Further reduced max_seq_length to 256
    load_in_4bit = True,    # 4 bit quantization to reduce memory
    load_in_8bit = False,    # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
)

==((====))==  Unsloth 2025.12.1: Fast Granitemoehybrid patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update a small amount of parameters!

In [8]:
from unsloth import FastLanguageModel # Added import here for robustness

model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Further reduced LoRA rank to 8
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "shared_mlp.input_linear", "shared_mlp.output_linear"],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Making `model.base_model.model.model` require gradients
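
To get a feel for why LoRA trains so few parameters: for a linear layer with input size `d_in` and output size `d_out`, a rank-`r` adapter adds two small matrices with `r * (d_in + d_out)` trainable weights in total, and the adapter update is scaled by `lora_alpha / r` when `use_rslora = False`. The sketch below works through that arithmetic. The layer sizes are hypothetical and are not Granite's actual dimensions.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights added by one rank-r LoRA adapter on a d_in x d_out linear layer."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scale applied to the adapter update: alpha/r, or alpha/sqrt(r) for rank-stabilized LoRA."""
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r


d_in, d_out = 2048, 2048        # hypothetical layer size, for illustration only
r, lora_alpha = 8, 32           # values used in the cell above

full_weights = d_in * d_out
adapter_weights = lora_param_count(d_in, d_out, r)

print(f"Full layer weights:   {full_weights}")
print(f"LoRA adapter weights: {adapter_weights}")
print(f"Fraction trained:     {adapter_weights / full_weights:.4%}")
print(f"Update scaling:       {lora_scaling(lora_alpha, r)}")
```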


In [5]:
from datasets import load_dataset
import json
import os

# ------------------------------------------------------------
# Option 2: the dataset is stored in Google Drive
# ------------------------------------------------------------
from google.colab import drive
drive.mount('/content/drive')  # indentation removed
dataset = load_dataset(
    "json",
    data_files="/content/drive/MyDrive/data/dataset.json",
    split="train"
)

# Sanity check
print("Sample record:")
print(dataset[0])

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Sample record:
{'instruction': 'Что такое капельная трубка и чем она отличается от капельной ленты?', 'input': '', 'output': 'Капельная трубка (drip line) - это гибкая трубка круглого сечения с встроенными или внешними капельницами-эмиттерами, обеспечивающими равномерную подачу воды к корням растений. В отличие от капельной ленты, которая представляет собой тонкостенную плоскую трубку, расправляющуюся под давлением, капельная трубка имеет постоянное круглое сечение и более прочную конструкцию. Основные различия: капельная трубка служит 5-15 лет (в зависимости от толщины стенки), капельная лента рассчитана на 1-3 сезона; трубка устойчива к механическим повреждениям, лента более уязвима; трубка сохраняет форму без давления, лента сплющивается. Капельные трубки используются в стационарных системах орошения, ленты - в сезонных и временных установках. Внутренний д

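We've just loaded the JSON file from Google Drive as a Dataset with `instruction`, `input`, and `output` columns. Chat-style finetuning would format each record into a conversation like the one below and then apply the chat template:

```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```

Here we instead use a helper function `formatting_prompts_func` that builds a plain-text prompt with `### Инструкция:`, `### Ввод:` (only when `input` is non-empty), and `### Ответ:` headers, and stores it in a `text` column.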
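
For reference, a record with `instruction`, `input`, and `output` fields could be mapped into the conversational structure shown above roughly as follows. This is a sketch only: `to_messages` and the default system prompt are hypothetical, and this notebook's own formatting function builds a plain-text prompt rather than a message list.

```python
def to_messages(record, system_prompt="You are an assistant"):
    """Convert one instruction-style record into a list of chat messages."""
    user_content = record["instruction"]
    # Append the optional input field only when it carries text.
    if record.get("input", "").strip():
        user_content += "\n\n" + record["input"]
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]


example = {"instruction": "What is 2+2?", "input": "", "output": "It's 4."}
messages = to_messages(example)
for message in messages:
    print(message)
```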

In [6]:
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]          # may be an empty string
    outputs = examples["output"]

    texts = []
    for instr, inp, out in zip(instructions, inputs, outputs):
        # Build the prompt in Granite / LLaMA style
        if inp.strip():  # if input is not empty
            prompt = f"### Инструкция:\n{instr}\n\n### Ввод:\n{inp}\n\n### Ответ:\n{out}"
        else:
            prompt = f"### Инструкция:\n{instr}\n\n### Ответ:\n{out}"
        texts.append(prompt)

    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
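
With `batched=True`, `map` passes the function a dict of column lists rather than a single record. The sketch below reproduces the same prompt-building logic on a tiny hand-written batch, so both branches (with and without `input`) are visible without loading the dataset. The sample rows and the name `build_prompts` are made up for illustration.

```python
def build_prompts(batch):
    """Same logic as the formatting function above, applied to a dict of column lists."""
    texts = []
    for instr, inp, out in zip(batch["instruction"], batch["input"], batch["output"]):
        if inp.strip():  # include the input section only when it is non-empty
            text = f"### Инструкция:\n{instr}\n\n### Ввод:\n{inp}\n\n### Ответ:\n{out}"
        else:
            text = f"### Инструкция:\n{instr}\n\n### Ответ:\n{out}"
        texts.append(text)
    return {"text": texts}


# Hypothetical two-row batch in the column-oriented layout that map(batched=True) provides.
toy_batch = {
    "instruction": ["Translate to French", "What is 2+2?"],
    "input": ["Good morning", ""],
    "output": ["Bonjour", "It's 4."],
}

formatted = build_prompts(toy_batch)
for text in formatted["text"]:
    print(text)
    print("-" * 20)
```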


<a name="Train"></a>
### Train the model
Now let's train our model. We run 200 steps to keep things fast, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.

In [9]:
from trl import SFTTrainer
from unsloth import is_bfloat16_supported
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # ← unused
    # Instead, format manually:
    formatting_func=lambda example: [
        f"### Инструкция:\n{example['instruction']}\n\n### Ввод:\n{example['input']}\n\n### Ответ:\n{example['output']}"
    ],
    max_seq_length=256,  # Further reduced max_seq_length to 256
    dataset_num_proc=2,
    packing=False,  # recommended for short answers
    args=TrainingArguments(
        per_device_train_batch_size=1, # Reduced batch size
        gradient_accumulation_steps=32, # Increased gradient accumulation steps
        warmup_steps=10,
        max_steps=200,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
        report_to="none",  # disables W&B
    ),
)

Unsloth: We found double BOS tokens - we shall remove one automatically.
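
The batch settings above determine how much data each optimizer step sees. As a rough worked example, using the 203 examples that the training output in this notebook reports and assuming a single GPU: the effective batch size is per-device batch × accumulation steps × GPUs, and the number of passes over the data follows from `max_steps`. The rounding used here is an assumption chosen for illustration; the trainer's own bookkeeping may differ.

```python
import math

per_device_batch = 1
grad_accum_steps = 32
num_gpus = 1
max_steps = 200
num_examples = 203  # taken from the training output in this notebook

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum_steps * num_gpus

# Optimizer steps needed to cover the dataset once (rounded up, by assumption).
steps_per_epoch = math.ceil(num_examples / effective_batch)

# Epochs needed to reach max_steps (again rounded up).
epochs_needed = math.ceil(max_steps / steps_per_epoch)

# Total examples processed, counting repeats across epochs.
examples_seen = max_steps * effective_batch

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch:      {steps_per_epoch}")
print(f"Epochs needed:        {epochs_needed}")
print(f"Examples processed:   {examples_seen}")
```

Under these assumptions the result agrees with the `Num Epochs = 29` and `Total batch size (1 x 32 x 1) = 32` lines in the training output.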


We also use Unsloth's `train_on_completions` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!

Let's verify that the instruction part is masked. First, let's print the 100th row.

In [13]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'### Инструкция:\nWhat materials are used to manufacture drip tape starter valves?\n\n### Ответ:\nDrip tape starter valves are primarily manufactured from polypropylene (PP) material, which provides durability, resistance to aging, and anti-clogging properties. [[18]] The rubber seals used in some valve designs are typically made from specialized elastomers that maintain flexibility and sealing properties under irrigation conditions. [[16]] These materials are chosen for their resistance to UV radiation, chemical exposure from fertilizers, and ability to withstand typical irrigation pressures up to 1.5 bar. [[16]] The construction ensures long-term reliability and minimal maintenance requirements in agricultural and garden irrigation applications.'

Now let's print the masked out example - you should see only the answer is present:

In [14]:
# Get input_ids from the dataset
input_ids_example = trainer.train_dataset[100]["input_ids"]

# Define the start string for the assistant's response
response_start_string = "### Ответ:"

# Tokenize the response start string to find its token representation
# Use add_special_tokens=False to match how content within the full text was likely tokenized.
response_separator_token_ids = tokenizer.encode(response_start_string, add_special_tokens=False)

# Helper function to find a sublist within a list
def find_sublist(haystack, needle):
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1

# Find the start index of the response separator token sequence in the full input_ids
separator_start_token_idx = find_sublist(input_ids_example, response_separator_token_ids)

labels_to_decode = []
if separator_start_token_idx != -1:
    # The actual answer starts immediately after the separator tokens.
    # Mask everything up to and including the separator.
    answer_start_token_idx = separator_start_token_idx + len(response_separator_token_ids)
    labels_to_decode = [-100] * answer_start_token_idx + list(input_ids_example[answer_start_token_idx:])
else:
    # Fallback if the separator is not found, indicating an unexpected format.
    # In this case, we can't perform the masking as intended.
    # For display, we'll just show the full input_ids, or a message.
    # The instruction is to show the masked part, so if not found, we state it.
    print(f"Could not find '{response_start_string}' in the tokenized input to perform masking.")
    labels_to_decode = input_ids_example # Fallback to show something

# Decode the generated labels, replacing -100 with tokenizer.pad_token for visibility.
# Then replace actual pad_token string with a space for cleaner output.
decoded_masked_labels_str = tokenizer.decode(
    [tokenizer.pad_token_id if x == -100 else x for x in labels_to_decode]
).replace(tokenizer.pad_token, " ")

print(decoded_masked_labels_str)

Could not find '### Ответ:' in the tokenized input to perform masking.
### Инструкция:
What materials are used to manufacture drip tape starter valves?

### Ответ:
Drip tape starter valves are primarily manufactured from polypropylene (PP) material, which provides durability, resistance to aging, and anti-clogging properties. [[18]] The rubber seals used in some valve designs are typically made from specialized elastomers that maintain flexibility and sealing properties under irrigation conditions. [[16]] These materials are chosen for their resistance to UV radiation, chemical exposure from fertilizers, and ability to withstand typical irrigation pressures up to 1.5 bar. [[16]] The construction ensures long-term reliability and minimal maintenance requirements in agricultural and garden irrigation applications.
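
The masking idea in the cell above can be checked in isolation on hand-made token IDs, without a tokenizer. In this sketch the IDs are arbitrary integers and `mask_prompt` is a hypothetical helper; `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def find_sublist(haystack, needle):
    """Return the first index where needle occurs in haystack, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def mask_prompt(input_ids, separator_ids):
    """Replace everything up to and including the separator with IGNORE_INDEX.

    Returns None when the separator is not present, so the caller can decide
    how to handle an unexpected format.
    """
    start = find_sublist(input_ids, separator_ids)
    if start == -1:
        return None
    answer_start = start + len(separator_ids)
    return [IGNORE_INDEX] * answer_start + list(input_ids[answer_start:])


# Toy sequence: [BOS, prompt..., separator (90, 91), answer..., EOS]
input_ids = [1, 11, 12, 13, 90, 91, 21, 22, 2]
labels = mask_prompt(input_ids, [90, 91])
missing = mask_prompt(input_ids, [90, 99])

print(labels)
print(missing)
```

A `None` result corresponds to the "Could not find" message in the output above: when the separator's token IDs, encoded on their own, do not appear verbatim in the full sequence, no mask can be built. One possible cause is that a tokenizer may split the same string differently depending on the surrounding text.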


In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

```
Note: you might have to wait ~10 minutes for the Mamba kernels to compile. Please be patient!
```

In [15]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 203 | Num Epochs = 29 | Total steps = 200
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 32
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 32 x 1) = 32
 "-____-"     Trainable parameters = 9,601,024 of 3,200,997,120 (0.30% trained)


Unsloth: Will smartly offload gradients to save VRAM!


OutOfMemoryError: CUDA out of memory. Tried to allocate 3.00 GiB. GPU 0 has a total capacity of 14.74 GiB of which 1.40 GiB is free. Process 292209 has 13.34 GiB memory in use. Of the allocated memory 13.09 GiB is allocated by PyTorch, and 103.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
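
The error message itself suggests one mitigation for fragmentation: setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A minimal sketch of doing this from Python is below. It assumes the variable is set before PyTorch initialises CUDA (in practice, at the top of the notebook after a runtime restart), and it is not guaranteed to resolve an out-of-memory error on its own when the model simply does not fit.

```python
import os

# Must run before torch touches the GPU; setting it after CUDA has been
# initialised in the current process is assumed to have no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Other levers already present in this notebook are `max_seq_length`, `per_device_train_batch_size`, and the LoRA rank `r`.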

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
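
The arithmetic in the two memory-stats cells can be traced with concrete numbers. The byte counts below are hypothetical placeholders standing in for `torch.cuda.max_memory_reserved()` and `total_memory`; only the conversion and percentage steps mirror the cells above.

```python
GIB = 1024 ** 3


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB, rounded to three decimals as in the cells above."""
    return round(num_bytes / GIB, 3)


# Hypothetical readings, chosen only to make the arithmetic easy to follow.
total_bytes = 16 * GIB           # stands in for gpu_stats.total_memory
reserved_before = 2 * GIB        # stands in for max_memory_reserved() before training
reserved_after = 6 * GIB         # stands in for max_memory_reserved() after training

max_memory = to_gib(total_bytes)
start_gpu_memory = to_gib(reserved_before)
used_memory = to_gib(reserved_after)

used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"Peak reserved memory = {used_memory} GB ({used_percentage} %).")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB ({lora_percentage} %).")
```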

<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! We'll use some example snippets not contained in our training data to get a sense of what was learned.

In [11]:
# @title Test Scenarios
# --- Scenario 1: Video-Conferencing Screen-Share Bug (11 turns) ---
scenario_1 = """
User: Everyone in my meeting just sees a black screen when I share.
Agent: Sorry about that—are you sharing a window or your entire screen?
User: Entire screen on macOS Sonoma.
Agent: Thanks. Do you have “Enable hardware acceleration” toggled on in Settings → Video?
User: Yeah, that switch is on.
Agent: Could you try toggling it off and start a quick test share?
User: Did that—still black for attendees.
Agent: Understood. Are you on the desktop app v5.4.2 or the browser client?
User: Desktop v5.4.2—just updated this morning.
"""

# --- Scenario 2: Smart-Lock Low-Battery Loop (9 turns) ---
scenario_2 = """
User: I changed the batteries, but the lock app still says 5 % and won’t auto-lock.
Agent: Let’s check firmware. In the app, go to Settings → Device Info—what version shows?
User: 3.18.0-alpha.
Agent: Latest stable is 3.17.5. Did you enroll in the beta program?
User: I might have months ago.
Agent: Beta builds sometimes misreport battery. Remove one battery, wait ten seconds, reinsert, and watch the LED pattern.
User: LED blinks blue twice, then red once.
Agent: That blink code means “config mismatch.” Do you still have the old batteries handy?
User: Tossed them already.
"""

# --- Scenario 3: Accounting SaaS — Corrupted Invoice Export (10 turns) ---
scenario_3 = """
User: Every invoice I download today opens as a blank PDF.
Agent: Is this happening to historic invoices, new ones, or both?
User: Both. Anything I export is 0 bytes.
Agent: Are you exporting through “Bulk Actions” or individual invoice pages?
User: Individual pages.
Agent: Which browser/OS combo?
User: Chrome on Windows 11, latest update.
Agent: We released a new PDF renderer at 10 a.m. UTC. Could you try Edge quickly, just to rule out a caching quirk?
User: Tried Edge—same zero-byte file.
"""

# --- Scenario 4: Fitness-Tracker App — Stuck Step Count (8 turns) ---
scenario_4 = """
User: My step count has been frozen at 4,237 since last night.
Agent: Which phone are you syncing with?
User: iPhone 15, iOS 17.5.
Agent: In the Health Permissions screen, does “Motion & Fitness” show as ON?
User: Yes, it’s toggled on.
Agent: When you pull down to refresh the dashboard, does the sync spinner appear?
User: Spinner flashes for a second, then nothing changes.
"""

# --- Scenario 5: Online-Course Platform — Quiz Submission Error (12 turns) ---
scenario_5 = """
User: My quiz submits but then shows “Unknown grading error” and resets the answers.
Agent: Which course and quiz name?
User: History 301, Unit 2 Quiz.
Agent: Do you notice a red banner or any code like GR-### in the corner?
User: Banner says “GR-412”.
Agent: That code points to answer-payload size. Were you pasting images or long text into any answers?
User: Maybe a long essay—about 800 words in Question 5.
Agent: Are you on a laptop or mobile?
User: Laptop, Safari on macOS.
"""

In [12]:
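FastLanguageModel.for_inference(model)  # Enables faster inference

# Build the prompt manually (no input field)
# Note: this prompt uses English headers, while the training data used "### Инструкция:" / "### Ответ:"
prompt = f"### Instruction:\n{scenario_1}\n\n### Response:\n"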

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=256, # Aligned with training max_seq_length
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=False)

outputs = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    pad_token_id=tokenizer.pad_token_id,  # important!
    eos_token_id=tokenizer.eos_token_id,  # important!
)

### Instruction:

User: Everyone in my meeting just sees a black screen when I share.
Agent: Sorry about that—are you sharing a window or your entire screen?
User: Entire screen on macOS Sonoma.
Agent: Thanks. Do you have “Enable hardware acceleration” toggled on in Settings → Video?
User: Yeah, that switch is on.
Agent: Could you try toggling it off and start a quick test share?
User: Did that—still black for attendees.
Agent: Understood. Are you on the desktop app v5.4.2 or the browser client?
User: Desktop v5.4.2—just updated this morning.


### Response:
Hello! I'm sorry to hear you're experiencing this issue. Let's try a few troubleshooting steps:

1. **Restart the App**: Sometimes, simply restarting the desktop app can resolve display issues. Close the Zoom app completely and then open it again to see if the problem persists.

2. **Check Display Settings**: Ensure that your Mac's display settings are not causing the issue. Go to **System Settings** > **Displays**, and make sure t
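
Because the training prompts used `### Инструкция:` and `### Ответ:` headers, an inference prompt that reuses the same headers and stops right after the answer header keeps the format consistent between training and generation, which is generally expected to help. A small sketch follows, with `build_inference_prompt` as a hypothetical helper name; stripping surrounding whitespace is a choice made here, not something the training function does.

```python
def build_inference_prompt(instruction: str, input_text: str = "") -> str:
    """Build a prompt in the training format, ending at the answer header."""
    instruction = instruction.strip()
    if input_text.strip():
        return (
            f"### Инструкция:\n{instruction}\n\n"
            f"### Ввод:\n{input_text.strip()}\n\n"
            f"### Ответ:\n"
        )
    return f"### Инструкция:\n{instruction}\n\n### Ответ:\n"


prompt = build_inference_prompt("What is a drip line?")
prompt_with_input = build_inference_prompt("Summarize the text", "Drip lines deliver water to roots.")

print(prompt)
print(prompt_with_input)
```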

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": scenario_2},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    padding = True,
    return_tensors = "pt",
    return_dict=True,
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = False)

_ = model.generate(**inputs,
                   streamer = text_streamer,
                   max_new_tokens = 512, # Increase if tokens are getting cut off
                   use_cache = True,
                   # Greedy decoding: with do_sample=False the sampling params
                   # below are ignored; set do_sample=True to use them
                   do_sample=False,
                   temperature = 0.7, top_p = 0.8, top_k = 20,
)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False: # Pushing to HF Hub
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

Likewise, if you want to push the GGUF to your Hugging Face account instead, change `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!


  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
