### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab and Kaggle notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
    !pip install --no-deps cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) 
Token is valid (permission: write).
The token `reasoning_training` has been saved to /root/.cache/huggingface/stored_tokens
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-

In [None]:
import torch
import gc
torch.cuda.empty_cache()
gc.collect()

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 32768 # Choose any! We auto support RoPE Scaling internally!
dtype = "bfloat16" # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-14B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,  # bfloat16, as set above
    load_in_4bit=False  # Ensure it does not use 4-bit quantization
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.15: Fast Qwen2 patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA H200. Max memory: 139.719 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.0+cu121. CUDA: 9.0. CUDA Toolkit: 12.1. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

We now add LoRA adapters so we only need to update a small fraction of all parameters (68,812,800 trainable parameters in this run, per the training log)!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank, suggested values: 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Optimized value
    bias="none",  # Optimized setting
    use_gradient_checkpointing="unsloth",  # Efficient for long context
    random_state=2503,
    use_rslora=False,  # No Rank-Stabilized LoRA
    loftq_config=None,  # No LoftQ quantization
)


Unsloth 2025.2.15 patched 48 layers with 48 QKV layers, 48 O layers and 48 MLP layers.
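
As a rough check on the adapter size, the number of trainable LoRA parameters can be estimated by hand: each adapted linear layer with `in` inputs and `out` outputs gains `r * (in + out)` weights. The sketch below assumes Qwen2.5-14B's published dimensions (hidden size 5120, MLP size 13824, 8 key/value heads of dimension 128, 48 layers). Those figures are not stated in this notebook, so treat them as assumptions.

```python
# Assumed Qwen2.5-14B dimensions (from the public model config, not from this notebook).
HIDDEN = 5120
INTERMEDIATE = 13824
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128
NUM_LAYERS = 48
RANK = 16  # same as r=16 in get_peft_model

# (in_features, out_features) for each LoRA target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) to every target module."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


total = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the count comes to 68,812,800, which agrees with the "Number of trainable parameters" line printed when training starts.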


In [4]:
from datasets import load_dataset
dataset = load_dataset("simplescaling/s1K-1.1", split = "train")

In [4]:
dataset[0]

{'solution': '128',
 'question': 'Given a rational number, write it as a fraction in lowest terms and calculate the product of the resulting numerator and denominator. For how many rational numbers between 0 and 1 will $20_{}^{}!$ be the resulting product?',
 'cot_type': 'math',
 'source_type': 'qq8933/AIME_1983_2024',
 'metadata': "{'ID': '1991-5', 'Year': 1991, 'Problem Number': 5, 'Part': None}",
 'gemini_thinking_trajectory': '\nThe problem asks for the number of rational numbers between 0 and 1 such that when the rational number is written as a fraction in lowest terms, the product of the numerator and the denominator is $20!$.\n\nLet the rational number be $r$, where $0 < r < 1$.\nWe can write $r$ as a fraction $\\frac{a}{b}$, where $a$ and $b$ are positive integers, and $\\gcd(a, b) = 1$.\nSince $0 < r < 1$, we have $0 < \\frac{a}{b} < 1$, which implies $0 < a < b$.\n\nThe product of the numerator and the denominator of the fraction in lowest terms is $a \\times b$.\nWe are give

In [23]:
dataset['question'][0]

'Given a rational number, write it as a fraction in lowest terms and calculate the product of the resulting numerator and denominator. For how many rational numbers between 0 and 1 will $20_{}^{}!$ be the resulting product?'

In [5]:
def example_to_messages(example):
    # Use the question as the user message; the assistant message combines
    # the DeepSeek reasoning trace with the final attempt
    user_content = example.get("question", "")
    think_content = example.get("deepseek_thinking_trajectory", "")
    attempt_content = example.get("deepseek_attempt", "")

    # The message list for this example
    return [
        {"role": "user", "content": user_content},
        # Assistant turn: reasoning inside <think> tags, followed by the attempt
        {"role": "assistant", "content": f"<think>\n{think_content}</think>\n{attempt_content}"},
    ]

In [6]:
# Add a "messages" column to every row using example_to_messages()
dataset = dataset.map(lambda x: {"messages": example_to_messages(x)})


In [9]:
dataset["messages"][0]

[{'content': 'Given a rational number, write it as a fraction in lowest terms and calculate the product of the resulting numerator and denominator. For how many rational numbers between 0 and 1 will $20_{}^{}!$ be the resulting product?',
  'role': 'user'},
 {'content': '<think>\nAlright, so I need to figure out how many rational numbers between 0 and 1 have a product of the numerator and denominator (when written in lowest terms) equal to 20 factorial. Let me start by understanding the problem.\n\nFirst, a rational number between 0 and 1 can be written as a fraction \\(\\frac{a}{b}\\) where \\(0 < a < b\\) and \\(a\\) and \\(b\\) are coprime positive integers. The product \\(a \\times b\\) is said to be equal to 20! which is a huge number. So I need to find all pairs \\((a, b)\\) such that \\(a \\times b = 20!\\), \\(0 < a < b\\), and \\(\\gcd(a, b) = 1\\). Then count how many such pairs exist.\n\nLet me break down the problem.\n\nGiven that \\(a \\times b = 20!\\) and \\(\\gcd(a, b) 

In [7]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen2.5",
)

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass



In [8]:
from datasets import Dataset

dataset = dataset.map(formatting_prompts_func, batched = True,)

We look at how the conversations are structured for item 5:

In [12]:
dataset[5]["messages"]

[{'content': 'One base of a trapezoid is $100$ units longer than the other base. The segment that joins the midpoints of the legs divides the trapezoid into two regions whose areas are in the ratio $2: 3$ . Let $x$ be the length of the segment joining the legs of the trapezoid that is parallel to the bases and that divides the trapezoid into two regions of equal area. Find the greatest integer that does not exceed $x^2/100$ .',
  'role': 'user'},
 {'content': "<think>\nOkay, let's see. I need to solve this problem about a trapezoid with bases differing by 100 units. The segment connecting the midpoints of the legs divides the trapezoid into two regions with areas in the ratio 2:3. Then, we need to find x, which is the length of the segment that divides the trapezoid into two equal areas, and then find the greatest integer not exceeding x2/100. \n\nFirst, let me recall some trapezoid properties. In a trapezoid, the midline (the segment connecting the midpoints of the legs) has a length 

And we see how the chat template transformed these conversations.

**[Notice]** Qwen2.5's chat template adds the default system prompt `"You are Qwen, created by Alibaba Cloud. You are a helpful assistant."` when a conversation has no system message, so do not be alarmed!

In [13]:
dataset[5]["text"]

"<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nOne base of a trapezoid is $100$ units longer than the other base. The segment that joins the midpoints of the legs divides the trapezoid into two regions whose areas are in the ratio $2: 3$ . Let $x$ be the length of the segment joining the legs of the trapezoid that is parallel to the bases and that divides the trapezoid into two regions of equal area. Find the greatest integer that does not exceed $x^2/100$ .<|im_end|>\n<|im_start|>assistant\n<think>\nOkay, let's see. I need to solve this problem about a trapezoid with bases differing by 100 units. The segment connecting the midpoints of the legs divides the trapezoid into two regions with areas in the ratio 2:3. Then, we need to find x, which is the length of the segment that divides the trapezoid into two equal areas, and then find the greatest integer not exceeding x2/100. \n\nFirst, let me recall some trapezoid 
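
To make the template's effect concrete, here is a minimal sketch that reproduces the ChatML layout visible in the output above. `render_chatml` is a hypothetical helper written for illustration only; it approximates what `tokenizer.apply_chat_template` produces for simple conversations and is not the tokenizer's actual implementation.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=False):
    """Hypothetical stand-in for the Qwen2.5 chat template (simple cases only)."""
    # Prepend the default system prompt when the conversation has none.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # For inference, open an assistant turn for the model to complete.
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


demo = render_chatml([{"role": "user", "content": "Hi"}], add_generation_prompt=True)
print(demo)
```

Training examples use `add_generation_prompt=False` because the assistant turn is already part of the conversation; the generation prompt is only needed at inference time.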

In [11]:
# Token length of a single formatted example
token_length = len(tokenizer(dataset[5]["text"]).input_ids)

# Token lengths of every formatted example, used to find the longest
# sequences and count how many exceed a threshold (e.g. 8192)
lengths = [len(tokenizer(data["text"]).input_ids) for data in dataset]

In [19]:
# get lengths over 16384
lengths_over_16384 = [length for length in lengths if length > 16384]
len(lengths_over_16384)
print(lengths_over_16384)

[17342, 16434, 17261, 18053, 16964, 17433, 17634, 18227, 17192, 16747, 16786, 20225, 17767, 16685, 17029, 17126, 16404, 17611, 16592, 17153, 16759, 16594, 16634, 18943, 17135, 16885, 16602, 16440, 19668, 17158, 16899, 18853, 18537, 16894, 16715, 17097, 16948, 17024, 17578, 16741, 16770, 16541, 16578, 19043, 17344, 17327, 16749, 17641, 16664, 16847, 17302, 18991, 18801, 17831, 17416, 16523, 16656, 17382, 17414, 17050, 17355, 17360, 16452, 17435, 17569, 18039, 23516, 16540, 17138, 18634, 17346, 16964, 17188, 16463, 17563, 17047, 16432, 16971, 16849, 16934, 16818, 17374, 18265, 16409, 17190, 17789, 20792, 21786, 20050, 16499, 16953, 26967, 22449, 19819, 17517, 16394, 26698, 22630, 19584, 16503, 19615, 18629, 21684]
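
The longest sequence in the list above is 26,967 tokens, which still fits within `max_seq_length = 32768`. A small helper along the following lines can summarise the distribution in one call. `summarize_lengths` is a hypothetical name, and the sample list is illustrative rather than the full dataset.

```python
def summarize_lengths(lengths, thresholds):
    """Return the maximum length and how many sequences exceed each threshold."""
    return {
        "max": max(lengths),
        "over": {t: sum(1 for n in lengths if n > t) for t in thresholds},
    }


# Illustrative sample: three values from the output above plus one short sequence.
sample = [17342, 16434, 26967, 8000]
summary = summarize_lengths(sample, thresholds=(8192, 16384, 32768))
print(summary)
```

Running it on the real `lengths` list with the same thresholds shows at a glance whether any example would be truncated at the chosen `max_seq_length`.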


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for 5 epochs (`num_train_epochs=5`), which comes to 310 optimizer steps at an effective batch size of 16. For a quick smoke test, set `max_steps=60` instead. We also support TRL's `DPOTrainer`!

In [10]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,  # Can make training faster for short sequences if needed
    args=TrainingArguments(
        # Requirement: batch size of 16, reached here as
        # 4 per device x 4 gradient accumulation steps
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,

        # Requirement: 5 epochs
        num_train_epochs=5,

        # Requirement: learning_rate = 1e-5
        learning_rate=1e-5,

        # Requirement: warm up over 5% of all steps (about 16 of ~315 steps).
        # Note: the value set here is 32 steps, about 10% of the 310 steps this run takes.
        warmup_steps=32,

        # Requirement: cosine schedule
        lr_scheduler_type="cosine",

        # Requirement: AdamW with betas = (0.9, 0.95) and weight decay = 1e-4.
        # The Hugging Face Trainer takes these as adam_beta1, adam_beta2,
        # and weight_decay.
        optim="adamw_8bit",
        adam_beta1=0.9,
        adam_beta2=0.95,
        weight_decay=1e-4,

        # Precision: use bf16 when bfloat16 is supported, otherwise fp16
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),

        # Log every step
        logging_steps=1,

        # Reproducibility
        seed=2503,

        # Output directory
        output_dir="outputs",

        # Reporting (change to "wandb" to log to Weights & Biases)
        report_to="none",
    ),
)
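
To see what `lr_scheduler_type="cosine"` with `warmup_steps=32` does to the learning rate, the sketch below evaluates the usual linear-warmup-then-cosine-decay formula. It is a plain-Python approximation of the schedule, not the Trainer's own code, and the total of 310 steps is taken from the training log rather than computed here.

```python
import math


def cosine_lr(step, base_lr=1e-5, warmup_steps=32, total_steps=310):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Learning rate at the start, mid-warmup, the peak, the midpoint of decay, and the end.
for s in (0, 16, 32, 171, 310):
    print(s, f"{cosine_lr(s):.2e}")
```

The peak of `1e-5` is reached at step 32, and the rate falls to half of that midway through the decay phase.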




We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs.

In [11]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part=(
        "<|im_start|>user\n"         # start of the user message
    ),
    response_part=(
        "<|im_end|>\n"               # end of the user message
        "<|im_start|>assistant\n"    # start of the assistant message
    )
)


We verify masking is actually done:

In [24]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nOne base of a trapezoid is $100$ units longer than the other base. The segment that joins the midpoints of the legs divides the trapezoid into two regions whose areas are in the ratio $2: 3$ . Let $x$ be the length of the segment joining the legs of the trapezoid that is parallel to the bases and that divides the trapezoid into two regions of equal area. Find the greatest integer that does not exceed $x^2/100$ .<|im_end|>\n<|im_start|>assistant\n<think>\nOkay, let's see. I need to solve this problem about a trapezoid with bases differing by 100 units. The segment connecting the midpoints of the legs divides the trapezoid into two regions with areas in the ratio 2:3. Then, we need to find x, which is the length of the segment that divides the trapezoid into two equal areas, and then find the greatest integer not exceeding x2/100. \n\nFirst, let me recall some trapezoid 

In [25]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

"                                                                                                                                           \n<think>\nOkay, let's see. I need to solve this problem about a trapezoid with bases differing by 100 units. The segment connecting the midpoints of the legs divides the trapezoid into two regions with areas in the ratio 2:3. Then, we need to find x, which is the length of the segment that divides the trapezoid into two equal areas, and then find the greatest integer not exceeding x2/100. \n\nFirst, let me recall some trapezoid properties. In a trapezoid, the midline (the segment connecting the midpoints of the legs) has a length equal to the average of the two bases. Also, if there is a segment parallel to the bases that divides the trapezoid into two regions of equal area, its length x should be such that the square of x is the average of the squares of the two bases. Wait, no, let me check. \n\nI remember that for areas divided by a line parall


We can see the System and Instruction prompts are successfully masked!
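
The idea behind this masking can be shown on a toy token list: every position up to and including the response marker gets the label `-100`, which the loss function ignores, and only the assistant tokens keep their ids. The sketch below is a simplified single-turn illustration with made-up token ids; it is not Unsloth's implementation.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def find_subsequence(seq, sub):
    """Return the index of the first occurrence of sub in seq, or -1."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_before_response(input_ids, response_marker):
    """Keep labels only for tokens that come after the response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = find_subsequence(input_ids, response_marker)
    if pos == -1:
        return labels  # no assistant turn found: nothing contributes to the loss
    start = pos + len(response_marker)
    labels[start:] = input_ids[start:]
    return labels


# Made-up ids: [1, 10, 11] is the prompt, [2, 3] the response marker,
# and [20, 21, 2] the assistant reply.
toy_ids = [1, 10, 11, 2, 3, 20, 21, 2]
toy_labels = mask_before_response(toy_ids, response_marker=[2, 3])
print(toy_labels)
```

Replacing `-100` with a space token before decoding, as in the cell above, is what produces the long run of blanks in front of `<think>`.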

In [12]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA H200. Max memory = 139.719 GB.
27.857 GB of memory reserved.


In [13]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 310
 "-____-"     Number of trainable parameters = 68,812,800


Step,Training Loss
1,1.1253
2,1.2468
3,1.1994
4,1.2156
5,1.2401
6,1.1726
7,1.2407
8,1.0922
9,1.1363
10,1.2182
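
The "Total steps = 310" figure can be reproduced with a little arithmetic. The sketch below assumes the Trainer counts optimizer updates per epoch by floor-dividing the number of batches by the gradient accumulation steps; that is an assumption about the library version used here, but it yields 310, whereas rounding 1000 / 16 up to 63 updates per epoch would give 315.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps, assuming floor division per epoch."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


steps = total_optimizer_steps(num_examples=1000, per_device_batch=4, grad_accum=4, epochs=5)
print(steps, f"warmup fraction = {32 / steps:.1%}")
```

With 310 total steps, the configured `warmup_steps=32` covers roughly 10% of training.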


In [14]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

21464.4445 seconds used for training.
357.74 minutes used for training.
Peak reserved memory = 52.219 GB.
Peak reserved memory for training = 24.362 GB.
Peak reserved memory % of max memory = 37.374 %.
Peak reserved memory for training % of max memory = 17.436 %.
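
For capacity planning it can help to turn the runtime into per-step and per-example figures. The sketch below only reuses numbers printed in this notebook (runtime, 310 total steps, effective batch size 16); the variable names are illustrative.

```python
train_runtime_s = 21464.4445  # "seconds used for training" from the output above
total_steps = 310             # from the training banner
effective_batch = 4 * 4       # per-device batch size x gradient accumulation steps

seconds_per_step = train_runtime_s / total_steps
seconds_per_example = seconds_per_step / effective_batch
print(f"{seconds_per_step:.2f} s/step, {seconds_per_example:.2f} s/example")
```

These are averages over long reasoning traces, so individual steps can vary considerably with sequence length.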


<a name="Inference"></a>
### Inference
Let's run the model! You can change the user message in `messages` to try other prompts.

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

The Unsloth template recommends `min_p = 0.1` and `temperature = 1.5`; read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why. The `generate` call below does not pass either value, so it falls back to the model's default generation settings.

You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
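
For readers unfamiliar with min-p sampling, the following sketch shows the idea: after temperature scaling, any token whose probability is below `min_p` times the top token's probability is discarded, and the remainder is renormalised. This is a plain-Python illustration with made-up logits, not the `transformers` implementation, and the order of operations (temperature first) is an assumption.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return {token_index: probability} for tokens that survive min-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep tokens at least min_p times as likely as the most likely token.
    threshold = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}


# Made-up logits for a four-token vocabulary.
filtered = min_p_filter([5.0, 4.0, 0.0, -5.0])
print(filtered)
```

A high temperature flattens the distribution, while the min-p cutoff removes the unlikely tail that the flattening would otherwise expose.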

In [30]:
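# Preview the prompt as plain text; `messages` is the list defined in the generation cell
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = False,  # return the rendered string instead of token ids
    add_generation_prompt = True, # Must add for generation
)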
inputs

'<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n피보나치 수열을 이어서 계속 써봐: 1, 1, 2, 3, 5, 8,<|im_end|>\n<|im_start|>assistant\n'

In [16]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "피보나치 수열을 이어서 계속 써봐: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 8192,
                   use_cache = True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


피보나치 수열은 각 항이 바로 앞의 두 항의 합으로 이루어져 있습니다. 주어진 수열을 이어서 계속 쓰면 다음과 같습니다:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...

수열에서 다음 숫자는 바로 앞의 두 숫자를 더한 값입니다. 예를 들어,

- 8 + 5 = 13
- 13 + 8 = 21
- 21 + 13 = 34
- 34 + 21 = 55

이렇게 계속 더해 나가면 피보나치 수열을 만들 수 있습니다.<|im_end|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [15]:
# Merge to 16bit
if True: model.save_pretrained_merged("qwen-14b-s1.1", tokenizer, save_method = "merged_16bit")
if True: model.push_to_hub_merged("jonhpark/qwen-14b-s1.1", tokenizer, save_method = "merged_16bit")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 1336.4 out of 2015.55 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 48/48 [00:00<00:00, 100.05it/s]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: You are pushing to hub, but you passed your HF username = jonhpark.
We shall truncate jonhpark/qwen-14b-s1.1 to qwen-14b-s1.1


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 1336.12 out of 2015.55 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 48/48 [00:00<00:00, 110.21it/s]


Unsloth: Saving tokenizer...

  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

 Done.


README.md:   0%|          | 0.00/577 [00:00<?, ?B/s]

  0%|          | 0/6 [00:00<?, ?it/s]

model-00001-of-00006.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00006.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00003-of-00006.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00004-of-00006.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00005-of-00006.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00006-of-00006.safetensors:   0%|          | 0.00/4.73G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/jonhpark/qwen-14b-s1.1


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is supported natively. Unsloth clones `llama.cpp` and saves to `q8_0` by default; other methods such as `q4_k_m` are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM news, or need help, feel free to join our [Discord](https://discord.gg/unsloth) channel!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
