## Creating a reasoning model using Unsloth

This notebook uses Unsloth to fine-tune Qwen2.5-3B-Instruct into a reasoning model with GRPO. I recommend Unsloth over the plain Hugging Face libraries here because it needs less memory and trains faster.

Adapted from the Unsloth notebooks. If something is broken, check:
https://unsloth.ai/

In [None]:
%%capture
import os
!pip install --no-deps unsloth vllm
# [NOTE] Run the steps below ONLY in Colab! Elsewhere, use `pip install unsloth vllm`
# Skip restarting message in Colab
import sys, re, requests; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

# vLLM requirements - vLLM breaks Colab due to reinstalling numpy
f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
with open("vllm_requirements.txt", "wb") as file:
    file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
!pip install -r vllm_requirements.txt

### Add LoRA to the base model and patch with Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024
lora_rank = 32

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct", # 3B-parameter model; GRPO needs a lot of memory, so keep the model small
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-22 17:53:15 [__init__.py:256] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.18: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.8.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 59.43%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 6.45 GB. A

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

INFO 03-22 17:53:43 [cuda.py:234] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 03-22 17:53:43 [cuda.py:282] Using XFormers backend.
INFO 03-22 17:53:43 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-22 17:53:43 [model_runner.py:1110] Starting to load model unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit...
INFO 03-22 17:53:44 [loader.py:1137] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 03-22 17:53:44 [weight_utils.py:257] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

INFO 03-22 17:54:00 [weight_utils.py:273] Time spent downloading weights for unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit: 15.187338 seconds
INFO 03-22 17:54:00 [weight_utils.py:307] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 03-22 17:54:04 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-22 17:54:04 [model_runner.py:1146] Model loading took 2.3276 GB and 20.192531 seconds
INFO 03-22 17:54:15 [worker.py:267] Memory profiling takes 10.70 seconds
INFO 03-22 17:54:15 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.59) = 8.76GiB
INFO 03-22 17:54:15 [worker.py:267] model weights take 2.33GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 5.36GiB.
INFO 03-22 17:54:16 [executor_base.py:111] # cuda blocks: 9755, # CPU blocks: 3640
INFO 03-22 17:54:16 [executor_base.py:116] Maximum concurrency for 1024 tokens per request: 152.42x
INFO 03-22 17:54:18 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:53<00:00,  1.97s/it]

INFO 03-22 17:55:11 [model_runner.py:1570] Graph capturing finished in 53 secs, took 0.56 GiB
INFO 03-22 17:55:11 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 66.78 seconds






Unsloth 2025.3.18 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
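
The training banner later in this notebook reports 59,867,136 trainable parameters. As a rough check, the sketch below recomputes that number from the LoRA rank and the seven target modules. The layer shapes are an assumption taken from the published Qwen2.5-3B configuration (hidden size 2048, key/value width 256, MLP width 11008, 36 layers); they are not stated anywhere in this notebook.

```python
# Assumed Qwen2.5-3B shapes (not stated in this notebook).
HIDDEN = 2048         # model width
KV_DIM = 256          # key/value projection width (grouped-query attention)
INTERMEDIATE = 11008  # MLP width
NUM_LAYERS = 36       # matches "patched 36 layers" in the output above
LORA_RANK = 32        # same value as lora_rank in the cell above

# (in_features, out_features) of every module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, shapes, num_layers):
    """Each adapted module adds A (rank x in) and B (out x rank) matrices."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(LORA_RANK, TARGET_SHAPES, NUM_LAYERS)
print(f"{trainable:,} trainable LoRA parameters")
```

Doubling the rank doubles the number of trainable parameters, which is why `lora_rank` is the main knob for trading adapter capacity against memory.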


### Data Prep
<a name="Data"></a>

Uses the reward functions from [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb), a researcher.

In [None]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip() # GSM8K puts the final answer after "####", so parse it out

# uncomment middle messages for 1-shot prompting
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # OpenAI's GSM8K dataset of math word problems
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses] # 0.5 if the extracted answer consists only of digits

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that follow the exact line-by-line XML layout."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span newlines, so multi-line reasoning can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that contain both tag pairs in order, ignoring exact whitespace."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span newlines, so multi-line reasoning can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]
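
To make the answer parsing concrete, here is a small standalone sketch of the two extraction helpers from the cell above. The reference solution and the completion are made-up strings in the GSM8K and `<reasoning>`/`<answer>` styles, not real dataset rows or model output.

```python
def extract_xml_answer(text: str) -> str:
    # Same logic as the helper above: take what sits between the answer tags
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str):
    # GSM8K reference solutions end with "#### <final answer>"
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Hypothetical reference solution and model completion
reference = "10 * 40 = 400\n2 * 38 = 76\n400 + 76 = 476\n#### 476"
completion = "<reasoning>\n400 + 76 = 476\n</reasoning>\n<answer>\n476\n</answer>\n"

gold = extract_hash_answer(reference)
pred = extract_xml_answer(completion)
reward = 2.0 if pred == gold else 0.0  # same rule as correctness_reward_func
print(gold, pred, reward)

# The comparison is an exact string match, so extra formatting costs the reward
print(extract_xml_answer("<answer>$476</answer>") == gold)
```

Because the match is exact, an answer such as `$476` earns neither the correctness reward nor the `isdigit()` reward, even though the number is right.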

<a name="Train"></a>
### Train the model

Now set up the GRPO trainer and its configuration.
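
GRPO samples several completions per prompt (`num_generations`) and scores each one relative to the others in its group. The sketch below illustrates that idea only. It is a simplified illustration, not TRL's actual implementation: the function name, the `eps` value, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Centre and scale the rewards of one prompt's group of completions.

    Simplified sketch: a completion gets a positive advantage when it
    scored above its group's mean, and a negative one when below.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical summed rewards for the 6 completions of a single prompt
rewards = [2.5, 0.0, 0.0, 0.5, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:4.1f}  advantage={a:+.3f}")
```

If every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal. This is one reason to combine several graded reward functions rather than rely on a single pass/fail check.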

In [None]:
max_prompt_length = 256

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig( # GRPO-specific config, not the config class used for ordinary supervised fine-tuning
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,
    num_generations = 6,
    max_prompt_length = max_prompt_length,
    max_completion_length = max_seq_length - max_prompt_length,
    # num_train_epochs = 1, # Uncomment for a full pass over the dataset
    max_steps = 250, # Short run of 250 steps; the result will not be perfect
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none",
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6
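
The numbers in this configuration can be checked with a little arithmetic. The sketch below assumes that the batch size counts completions rather than prompts, which is consistent with the adjustment message above but is not confirmed elsewhere in this notebook.

```python
max_seq_length = 1024
max_prompt_length = 256
num_generations = 6
per_device_batch_size = 6        # after Unsloth raises it from 1 to num_generations
gradient_accumulation_steps = 1
max_steps = 250
num_train_examples = 7473        # size of the GSM8K train split

# Tokens left for the model's reasoning and answer
max_completion_length = max_seq_length - max_prompt_length

# Assumption: each batch slot holds one completion, so one group of
# num_generations completions corresponds to one unique prompt
completions_per_step = per_device_batch_size * gradient_accumulation_steps
prompts_per_step = completions_per_step // num_generations
prompts_seen = prompts_per_step * max_steps
fraction_of_dataset = prompts_seen / num_train_examples

print(f"completion budget: {max_completion_length} tokens")
print(f"prompts seen: {prompts_seen} ({fraction_of_dataset:.1%} of the train split)")
```

Under that assumption, a 250-step run only touches a small part of the training set, which is worth keeping in mind when judging the results.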


In [None]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [ # each completion is scored by all of these reward functions
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train() # can take more than an hour (about 93 minutes in the run below), so be patient
# Some rewards will be negative during training; this is expected, so don't worry about it.

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 59,867,136/3,000,000,000 (2.00% trained)


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
<reasoning>
Mr. Benson bought 12 concert tickets, so 2 of the tickets were eligible for a 5% discount because he bought more than 10 tickets. The cost of a ticket is $40. The first 10 tickets cost $40 each, and the 12th ticket gets a 5% discount.

First, let's calculate the cost for the first 10 tickets:
\[ 10 \times 40 = 400 \]

The price of the 12th ticket with a 5% discount is:
\[ 40 \times (1 - 0.05) = 40 \times 0.95 = 38 \]

Next, let's add up the total cost:
\[ 400 + 38 = 438 \]

So, Mr. Benson paid a total of $438.

</reasoning>
<answer>
438
</answer> 
Extracted:
438


| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / int_reward_func | rewards / correctness_reward_func |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0945 | 0.904592 | 277.0 | 0.0 | -0.4055 | 0.0 | 0.0 | 0.166667 | 0.333333 |
| 2 | -0.0 | -0.965833 | 0.755753 | 572.833374 | 0.0 | -0.965833 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.0 | -0.753333 | 0.223515 | 362.333344 | 3e-06 | -0.753333 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | -0.1125 | 1.117606 | 280.333344 | 7e-06 | -0.529167 | 0.0 | 0.0 | 0.083333 | 0.333333 |
| 5 | 0.0 | -0.122167 | 0.14232 | 205.0 | 8e-06 | -0.122167 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | -0.557667 | 0.450421 | 434.833344 | 6e-06 | -0.557667 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | -0.800167 | 0.310755 | 453.5 | 6e-06 | -0.800167 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | -0.705333 | 0.115344 | 315.5 | 2.1e-05 | -0.705333 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | -0.448 | 0.113411 | 259.333344 | 9e-06 | -0.448 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 0.0 | -0.4805 | 0.19838 | 321.333344 | 6e-06 | -0.4805 | 0.0 | 0.0 | 0.0 | 0.0 |
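
The negative `xmlcount_reward_func` values in the log above can look surprising. The sketch below copies `count_xml` from the data prep cell and runs it on three made-up completions. It suggests one plausible cause: when the text does not end with `</answer>` followed by a newline, the first length penalty is applied to the whole completion.

```python
def count_xml(text) -> float:
    # Copied unchanged from the data prep cell
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

# Hypothetical completions, not real model output
well_formed = "<reasoning>\nthink\n</reasoning>\n<answer>\n42\n</answer>\n"
no_final_newline = well_formed.rstrip("\n")
long_no_newline = "<reasoning>\n" + "x" * 1000 + "\n</reasoning>\n<answer>\n42\n</answer>"

for name, text in [
    ("well formed", well_formed),
    ("no final newline", no_final_newline),
    ("long, no final newline", long_no_newline),
]:
    print(f"{name:24s} {count_xml(text):+.3f}")
```

A fully well-formed completion earns the maximum of 0.5. A long completion that stops right after `</answer>` loses about 0.001 per character, which pushes the score below zero once the text exceeds a few hundred characters.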


-------------------- Question:
Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer? 
Answer:
1500 
Response:
<reasoning>
To solve this problem, we need to calculate the monthly payment for both the house and the trailer, and then determine the difference between them. 

First, we need to calculate the monthly payment for the trailer:
- Cost of trailer: $120,000
- Number of years: 20
- Monthly payment formula: \(\text{Monthly Payment} = \frac{\text{Total Cost} \times c}{1 - (1 + c)^{-n}}\), where \(c\) is the monthly interest rate and \(n\) is the total number of payments (monthly payments).

Assuming an annual interest rate of 5% (0.05 / 12 for monthly interest rate), the total number of payments \(n\) over 20 years would be \(20 \times 12 = 240\).

Using the formula, we can calculate the m

TrainOutput(global_step=250, training_loss=0.0028290148566410592, metrics={'train_runtime': 5580.4069, 'train_samples_per_second': 0.269, 'train_steps_per_second': 0.045, 'total_flos': 0.0, 'train_loss': 0.0028290148566410592})

<a name="Inference"></a>
### Inference
First, try inference without the trained LoRA (`lora_request = None`) as a baseline.

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.38s/it, est. speed input: 6.88 toks/s, output: 2.98 toks/s]


'There are no letters \'r\' in the word "strawberry".'

And now with the LoRA we just trained with GRPO. We save the LoRA first:

In [None]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [None]:
# Try again 
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.01s/it, est. speed input: 14.29 toks/s, output: 27.25 toks/s]


'<reasoning>\nTo find the number of r\'s in the word "strawberry", I will go through the word character by character and count the occurrences of the letter \'r\'.\ns\nt\nr\na\nw\nb\na\nr\nr\ny\nThere are 3 occurrences of the letter \'r\'.\n</reasoning>\n<answer>\n3\n</answer>'

In [None]:
print(output) # compare with the first output: the reasoning format is now followed

<reasoning>
To find the number of r's in the word "strawberry", I will go through the word character by character and count the occurrences of the letter 'r'.
s
t
r
a
w
b
a
r
r
y
There are 3 occurrences of the letter 'r'.
</reasoning>
<answer>
3
</answer>


## Saving

### Save the LoRA adapter

This is useful both for inference and for loading the model again later.

In [None]:
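from google.colab import userdata # Colab-only secrets store; elsewhere, read the token from an environment variable

model.push_to_hub(
    "jr/llama-3.1-8B-Instruct-Reasoning-lora", # repo name as in the original; the adapter was trained on Qwen2.5-3B-Instruct
    tokenizer,
    token = userdata.get('HF_ACCESS_TOKEN')
)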

### Merge the model with the LoRA weights and save to GGUF

You can then run inference locally with Ollama or llama.cpp.

##### Popular quantization methods

- **q4_k_m**  
  4-bit quantization. Low memory. Many of the default model tags on Ollama use this quantization.
- **q8_0**  
  8-bit quantization. Medium memory.
- **f16**  
  16-bit floats. Many models are already stored in 16-bit, so no quantization happens.
- **not_quantized**  
  Often the same as f16.
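
To get a feel for what these options mean for file size, here is a back-of-the-envelope estimate. The bits-per-weight figures are approximations and assumptions on my part (q8_0 stores a scale per block of weights, and q4_k_m mixes tensor types), and the parameter count is the rounded 3,000,000,000 from the training banner. Real GGUF files will differ somewhat.

```python
PARAMS = 3_000_000_000  # rounded parameter count from the training banner

# Approximate average bits per weight (assumed, not exact)
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q4_k_m": 4.8,
}

def approx_size_gb(params: int, bits_per_weight: float) -> float:
    """Rough file size in GB: parameters x bits per weight, ignoring metadata."""
    return params * bits_per_weight / 8 / 1e9

sizes = {name: approx_size_gb(PARAMS, bpw) for name, bpw in BITS_PER_WEIGHT.items()}
for name, size in sizes.items():
    print(f"{name:8s} ~{size:.1f} GB")
```

The estimate shows why q4_k_m is a common choice for local inference: it needs roughly a third of the space of the 16-bit model.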

In [None]:
from google.colab import userdata # Colab-only secrets store; needed for userdata.get below

model.push_to_hub_gguf(
    "jr/Qwen2.5-3B-Reasoning-GGUF",
    tokenizer,
    quantization_method = "q4_k_m",
    token = userdata.get('HF_ACCESS_TOKEN')
)