<a href="https://colab.research.google.com/github/prasanth-ntu/pookie-llm-finetuning-resources/blob/main/finetuning/unsloth/reasoning_grpo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Creating a reasoning model using unsloth

This notebook uses Unsloth to fine-tune `Qwen/Qwen2.5-3B-Instruct` into a reasoning model with GRPO. Unsloth is recommended over using the Hugging Face libraries alone because it needs less memory and trains faster.

Adapted from the Unsloth notebooks. If something is broken, check:
https://unsloth.ai/

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Elsewhere, use `pip install unsloth vllm`
    !pip install --no-deps unsloth vllm==0.8.5.post1

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Elsewhere, use `pip install unsloth vllm`
    !pip install --no-deps unsloth vllm==0.8.5.post1
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

In [4]:
import importlib.metadata  # the metadata submodule must be imported explicitly
libraries = ["unsloth", "vllm", "bitsandbytes", "accelerate", "xformers", "peft", "trl", "triton", "cut_cross_entropy", "unsloth_zoo", "sentencepiece", "protobuf", "datasets", "huggingface_hub", "hf_transfer", "msgspec", "blake3", "gguf"]
for library in libraries:
  try:
    print(f"{library}: {importlib.metadata.version(library)}")
  except Exception as e:
    print(e)


unsloth: 2025.8.1
vllm: 0.8.5.post1
bitsandbytes: 0.46.1
accelerate: 1.9.0
xformers: 0.0.29.post3
peft: 0.16.0
trl: 0.21.0
triton: 3.2.0
cut_cross_entropy: 25.1.1
unsloth_zoo: 2025.8.1
sentencepiece: 0.2.0
protobuf: 5.29.5
datasets: 3.6.0
huggingface_hub: 0.34.3
hf_transfer: 0.1.9
msgspec: 0.19.0
blake3: 1.0.5
gguf: 0.17.1


### Add LoRA to the base model and patch with Unsloth

**`Qwen/Qwen2.5-3B-Instruct` explained**

Model details: [HF](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)


**LoRA rank explained**

LoRA works by adding small, low-rank matrices to specific layers of the pre-trained model. The rank determines the dimensionality of these matrices, and a higher rank generally means more parameters are added, potentially allowing for more expressiveness during finetuning, but also increasing memory usage and computational cost.

So, `lora_rank = 32` means that the LoRA matrices added to the target modules will have a rank of 32.
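
To make the cost of a higher rank concrete, here is a minimal sketch that counts the parameters LoRA adds for a given rank. Each adapted linear layer with `in` inputs and `out` outputs gains two matrices with `rank * (in + out)` parameters in total. The layer dimensions below are assumptions taken from the published Qwen2.5-3B config (check `config.json` if you use another model), and `lora_param_count` is a hypothetical helper, not part of Unsloth or PEFT.

```python
# Assumed Qwen2.5-3B dimensions (from the model's published config).
HIDDEN = 2048          # hidden size
INTERMEDIATE = 11008   # MLP intermediate size
NUM_LAYERS = 36        # transformer blocks
KV_DIM = 256           # 2 key/value heads x 128 head dimension

# (in_features, out_features) for each projection listed in target_modules
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank: int) -> int:
    """Total LoRA parameters: rank * (in + out) per adapted layer."""
    per_layer = sum(rank * (i + o) for i, o in TARGETS.values())
    return per_layer * NUM_LAYERS

for rank in (8, 16, 32, 64):
    print(f"rank={rank}: {lora_param_count(rank):,} trainable parameters")
```

Under these assumptions, rank 32 gives 59,867,136 parameters, which agrees with the trainable-parameter count Unsloth prints when training starts. The count grows linearly with the rank.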



In [5]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024
lora_rank = 32

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 08-07 09:08:39 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-07 09:08:40 [__init__.py:239] Automatically detected platform cuda.
Unsloth: Patching vLLM v1 graph capture
Unsloth: Patching vLLM v0 graph capture
==((====))==  Unsloth 2025.8.1: Fast Qwen2 patching. Transformers: 4.54.1. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 59.43%
Unsloth: Your GPU has CUDA compute ca

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

INFO 08-07 09:09:27 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-07 09:09:27 [cuda.py:289] Using XFormers backend.
INFO 08-07 09:09:28 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 08-07 09:09:28 [model_runner.py:1108] Starting to load model unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit...
INFO 08-07 09:09:29 [loader.py:1187] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 08-07 09:09:30 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

INFO 08-07 09:10:03 [weight_utils.py:281] Time spent downloading weights for unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit: 33.030224 seconds
INFO 08-07 09:10:03 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 08-07 09:10:08 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 08-07 09:10:09 [model_runner.py:1140] Model loading took 2.3277 GiB and 39.988580 seconds
INFO 08-07 09:10:17 [worker.py:287] Memory profiling takes 8.04 seconds
INFO 08-07 09:10:17 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.59) = 8.76GiB
INFO 08-07 09:10:17 [worker.py:287] model weights take 2.33GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 5.36GiB.
INFO 08-07 09:10:18 [executor_base.py:112] # cuda blocks: 9755, # CPU blocks: 0
INFO 08-07 09:10:18 [executor_base.py:117] Maximum concurrency for 1024 tokens per request: 152.42x
INFO 08-07 09:10:18 [vllm_utils.py:669] Unsloth: Running patched vLLM v0 `capture_model`.
INFO 08-07 09:10:18 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run th

Capturing CUDA graph shapes:   0%|          | 0/27 [00:00<?, ?it/s]

INFO 08-07 09:10:46 [model_runner.py:1592] Graph capturing finished in 28 secs, took 0.56 GiB
INFO 08-07 09:10:46 [vllm_utils.py:676] Unsloth: Patched vLLM v0 graph capture finished in 28 secs.
INFO 08-07 09:10:47 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 38.36 seconds
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'q_norm', 'k_norm', 'post_feedforward_layernorm']
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'q_norm', 'k_norm', 'post_feedforward_layernorm']


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Unsloth 2025.8.1 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


### Data Prep
<a name="Data"></a>

- Uses [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) reward functions.

- The dataset can be found here: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
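
GSM8K reference solutions end with a line of the form `#### <final answer>`, and only that final value is used as the training target. A minimal sketch of the extraction, using a made-up solution string written in the dataset's style (the sample text is an assumption, not a real dataset row):

```python
from typing import Optional

def extract_hash_answer(text: str) -> Optional[str]:
    """Return the text after the '####' marker, or None if it is missing."""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Hypothetical solution string in GSM8K style
sample_solution = "She has 3 + 4 = <<3+4=7>>7 apples.\n#### 7"

print(extract_hash_answer(sample_solution))
print(extract_hash_answer("a solution without the marker"))
```

The extracted value is a string, so the correctness reward later compares strings, not numbers.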

In [6]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Build a chat-style prompt (system + user) and the reference answer for each GSM8K question
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Calculates a reward based on the correctness of the model's extracted answer.

    This function compares the extracted answer from the model's completion
    with the ground truth answer. It assigns a reward of 2.0 if the extracted
    answer exactly matches the ground truth, and 0.0 otherwise.

    Args:
        prompts (list): A list of prompts given to the model.
        completions (list): A list of the model's generated completions.
        answer (list): A list containing the correct answer(s) for the prompts.
        **kwargs: Additional keyword arguments.

    Returns:
        list[float]: A list of reward values (2.0 for correct, 0.0 for incorrect),
                    corresponding to each completion.
    """
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    """Calculates reward based on whether the model's extracted answer consists only of digits.

    This function checks if the extracted answer string contains only digit characters.
    It assigns a reward of 0.5 if the answer is composed solely of digits, and 0.0 otherwise.

    Args:
        completions (list): A list of the model's generated completions.
        **kwargs: Additional keyword arguments.

    Returns:
        list[float]: A list of reward values (0.5 for digit-only, 0.0 otherwise),
                     corresponding to each completion.
    """
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

# Both format rewards give 0.5 when the completion follows the XML tag structure, matched from the start of the response.
# strict_format_reward_func requires a precise layout including newlines; soft_format_reward_func only constrains the whitespace between the two tag pairs.
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    # Strict: requires six newlines at fixed positions (after each opening tag, and before and after each closing tag).
    # The ^ and $ anchors mean the whole completion must match, from beginning to end.
    # Note: "." does not match a newline unless re.DOTALL is passed, so reasoning or answers that span several lines do not match.
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    # Soft: \s* between </reasoning> and <answer> accepts any amount of whitespace (spaces, tabs, newlines, or none).
    # There are no ^ and $ anchors, but re.match still starts at the beginning of the string, so extra text may follow the match but not precede it.
    # Note: "." does not match a newline unless re.DOTALL is passed, so a newline inside either tag pair prevents a match.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        # Penalize trailing characters after the closing answer tag.
        # The "- 1" leaves a single trailing character (presumably a newline) unpenalized.
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    """Calculates a reward based on the count and positioning of XML tags in the completion."""
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

README.md: 0.00B [00:00, ?B/s]

main/train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

main/test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]
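
To see how the format patterns behave before training, the sketch below applies them to two hand-written completions (both hypothetical). It reuses the patterns and `extract_xml_answer` from the cell above. Because `.` does not match a newline by default, the strict pattern only accepts single-line reasoning, and the soft pattern rejects any newline inside the tags. Passing `re.DOTALL` is one possible way to relax this; it is shown for comparison only and is not what the reward functions above do.

```python
import re

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

STRICT = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
SOFT = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

# Hypothetical completions, written for illustration only
single_line = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>\n"
multi_line = (
    "<reasoning>\nFirst add 2 and 2.\nThat gives 4.\n</reasoning>\n"
    "<answer>\n4\n</answer>\n"
)

for name, text in [("single_line", single_line), ("multi_line", multi_line)]:
    print(
        name,
        "| answer:", extract_xml_answer(text),
        "| strict:", bool(re.match(STRICT, text)),
        "| soft:", bool(re.match(SOFT, text)),
        "| strict+DOTALL:", bool(re.match(STRICT, text, flags=re.DOTALL)),
        "| soft+DOTALL:", bool(re.match(SOFT, text, flags=re.DOTALL)),
    )
```

This behavior is worth keeping in mind when reading the training log: the two format rewards stay at 0.0 for completions whose reasoning spans several lines.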

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [7]:
max_prompt_length = 256

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,
    num_generations = 6,
    max_prompt_length = max_prompt_length,
    max_completion_length = max_seq_length - max_prompt_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none",
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6
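
The batch size is tied to `num_generations` because GRPO scores each completion relative to the other completions sampled for the same prompt. A minimal sketch of that group-relative normalization follows. It is not TRL's implementation: `group_advantages` and `eps` are hypothetical names, and the rewards are made up.

```python
import statistics

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # sample standard deviation
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical total rewards for num_generations = 6 completions of one prompt
rewards = [2.5, 0.0, 0.0, 0.5, 0.0, 0.0]
for r, a in zip(rewards, group_advantages(rewards)):
    print(f"reward={r:.1f}  advantage={a:+.2f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; the rest are pushed down. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal. The sample standard deviation is used here because it matches the logged statistics: one correct answer (2.0) among six completions gives a mean of 0.333333 and a standard deviation of 0.816497.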


In [8]:
%%time
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 59,867,136 of 3,145,805,824 (1.90% trained)


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
<reasoning>
Mr. Benson bought 12 concert tickets, so 2 of the tickets were eligible for a 5% discount because he bought more than 10 tickets. The cost of a ticket is $40. The first 10 tickets cost $40 each, and the 12th ticket gets a 5% discount.

Calculate the cost for the first 10 tickets:
10 tickets * $40 per ticket = $400

Next, calculate the discounted price for the 12th ticket:
Original price = $40
Discount = 5% of $40 = 0.05 * 40 = $2
Discounted price = $40 - $2 = $38

Now add the cost of the first 10 tickets with the discounted price of the 12th ticket:
Total cost = $400 + $38 = $438

</reasoning>
<answer>
Mr. Benson paid a total of $438 for the 12 concert tickets.</answer>
 
Extracted:
Mr. Benson paid a total of $438 for the 12 concert tickets.


Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,kl,entropy,rewards / xmlcount_reward_func / mean,rewards / xmlcount_reward_func / std,rewards / soft_format_reward_func / mean,rewards / soft_format_reward_func / std,rewards / strict_format_reward_func / mean,rewards / strict_format_reward_func / std,rewards / int_reward_func / mean,rewards / int_reward_func / std,rewards / correctness_reward_func / mean,rewards / correctness_reward_func / std
1,-0.0,-0.421667,0.107964,267.333344,228.0,303.0,0.0,267.333344,228.0,303.0,0.0,0,-0.421667,0.107964,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.0,-0.536,0.855015,470.0,260.0,630.0,0.0,470.0,260.0,630.0,0.0,No Log,-0.536,0.855015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.0,-0.560333,0.275863,295.833344,205.0,396.0,0.0,295.833344,205.0,396.0,5e-06,No Log,-0.560333,0.275863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.0,-0.351833,0.241953,250.666672,152.0,320.0,0.0,250.666672,152.0,320.0,2.3e-05,No Log,-0.351833,0.241953,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,-0.255667,0.046405,252.666672,223.0,299.0,0.0,252.666672,223.0,299.0,6e-06,No Log,-0.255667,0.046405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,-0.244833,1.086855,402.666687,356.0,511.0,0.0,402.666687,356.0,511.0,6e-06,No Log,-0.6615,0.173457,0.0,0.0,0.0,0.0,0.083333,0.204124,0.333333,0.816497
7,0.0,0.649833,1.415295,444.5,318.0,590.0,0.0,444.5,318.0,590.0,8e-06,No Log,-0.600167,0.38021,0.0,0.0,0.0,0.0,0.25,0.273861,1.0,1.095445
8,0.0,-0.7785,0.491235,320.5,179.0,514.0,0.0,320.5,179.0,514.0,1.1e-05,No Log,-0.7785,0.491235,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,-0.0,-0.363833,0.060605,245.666672,213.0,284.0,0.0,245.666672,213.0,284.0,9e-06,No Log,-0.363833,0.060605,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,-0.537833,0.192977,321.0,264.0,391.0,0.0,321.0,264.0,391.0,6e-06,No Log,-0.537833,0.192977,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


-------------------- Question:
Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer? 
Answer:
1500 
Response:
<reasoning>
To calculate the monthly payments for both the house and the trailer, we need to use the formula for the monthly payment on a loan, which is given by:
\[ M = P \frac{r(1+r)^n}{(1+r)^n-1} \]
where:
- \( M \) is the monthly payment,
- \( P \) is the principal (loan amount),
- \( r \) is the monthly interest rate (annual interest rate divided by 12),
- \( n \) is the number of payments (loan term in months).

However, the house and trailer are not listed in a mortgage calculator, implying they are assumed to have no interest. Given no interest, the monthly payment can be calculated as the total cost of the property divided by the number of years in months (20 years for both 

TrainOutput(global_step=250, training_loss=7.187695266852833e-05, metrics={'train_runtime': 5794.1285, 'train_samples_per_second': 0.259, 'train_steps_per_second': 0.043, 'total_flos': 0.0, 'train_loss': 7.187695266852833e-05})

<a name="Inference"></a>
### Inference
First, run inference without the trained LoRA adapter (`lora_request = None`).

In [9]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

'There are two "r"s in the word "strawberry".'

Now try the LoRA we just trained with GRPO. Save the adapter first:

In [10]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [11]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

'<reasoning>\nTo find out how many r\'s are in the word "strawberry," we need to count each occurrence of the letter r in the word. Counting, we see that r appears 3 times in "strawberry."\n</reasoning>\n<answer>\n3\n</answer>'

In [12]:
print(output)

<reasoning>
To find out how many r's are in the word "strawberry," we need to count each occurrence of the letter r in the word. Counting, we see that r appears 3 times in "strawberry."
</reasoning>
<answer>
3
</answer>


## Saving

### Save lora adapter

The saved adapter is useful both for inference and for reloading the fine-tuned model later.

In [13]:
from google.colab import userdata  # Colab-only helper for reading notebook secrets

model.push_to_hub(
    # "pookie3000/llama-3.1-8B-Instruct-Reasoning-lora",
    "prasanthntu/llama-3.1-8B-Instruct-Reasoning-lora",
    tokenizer,
    token = userdata.get('HF_ACCESS_TOKEN')
)

### Merge model with lora weights and save to gguf

You can then do inference locally with Ollama or llama.cpp

##### Popular quantization methods

- **q4_k_m**  
  4-bit quantization. Low memory. Commonly the default for models pulled with Ollama.
- **q8_0**  
  8bit quantization. Medium memory.
- **f16**  
  16-bit floats. Many models are already stored in 16 bit, in which case no quantization happens.
- **not_quantized**  
  Often same as f16.
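
As a rough guide to what "low" and "medium" memory mean, file size scales with bits per weight. The sketch below uses nominal bit widths and an assumed parameter count for a 3B-class model; `estimate_size_gb` is a hypothetical helper. Real GGUF files are somewhat larger, because the quantized formats also store scaling factors and `q4_k_m` keeps some tensors at higher precision.

```python
# Nominal bits per weight (an approximation, see the note above)
NOMINAL_BITS = {"q4_k_m": 4, "q8_0": 8, "f16": 16}

def estimate_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Lower-bound size estimate: parameters x bits, converted to gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 3.1e9  # assumed parameter count for a 3B-class model

for method, bits in NOMINAL_BITS.items():
    print(f"{method}: ~{estimate_size_gb(NUM_PARAMS, bits):.1f} GB")
```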

In [None]:
from google.colab import userdata  # Colab-only helper for reading notebook secrets

model.push_to_hub_gguf(
    # "pookie3000/Qwen2.5-3B-Reasoning-GGUF",
    "prasanthntu/Qwen2.5-3B-Reasoning-GGUF",
    tokenizer,
    quantization_method = "q4_k_m",
    token = userdata.get('HF_ACCESS_TOKEN')
)