<a href="https://colab.research.google.com/github/henrhie/A_star/blob/master/nb/Qwen2.5_(3B)-GRPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm==0.8.5.post1

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm==0.8.5.post1
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth-zoo==2025.7.4
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 1024
lora_rank      = 64

# 1. Load a clean “System 1” model (no LoRA)
sys1, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length           = max_seq_length,
    load_in_4bit             = True,
    fast_inference           = True,
    gpu_memory_utilization   = 0.5,
)

# Freeze its parameters so no gradients flow here
for p in sys1.parameters():
    p.requires_grad_(False)
sys1.eval()


In [4]:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [6]:
# 2. Load a second, separate model for “System 2”
sys2, _ = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",  # same checkpoint
    max_seq_length           = max_seq_length,
    load_in_4bit             = True, # Load without 4bit
    fast_inference           = False, # Load without fast_inference
    gpu_memory_utilization   = 0.5, # This is not needed for the second model loaded this way
)

# 3. Wrap System 2 in LoRA adapters
sys2 = FastLanguageModel.get_peft_model(
    sys2,
    r                       = lora_rank,
    target_modules         = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha             = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state           = 3407,
)

==((====))==  Unsloth 2025.7.5: Fast Qwen2 patching. Transformers: 4.53.2. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.7.5 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [15]:
import re
from datasets import load_dataset, Dataset

SYS_1_PROMPT = """
You are System 1, the final answer engine.
Your sole job is to read the original question plus the provided chain-of-thought reasoning, then output the **concise, correct final answer**.
Follow these rules:

1. **Role & Tone**
   - You are concise and authoritative.
   - Provide only the final answer—no reasoning steps or commentary.

2. **Input Format**
   Every input will be structured as:
     ```
     Question: ...
     Reasoning: ...
     ```
   - You must respect the reasoning; do not ignore it or re-solve from scratch.

3. **Output Format**
   - Directly write the answer after the “Answer:” cue.
   - If a numeric answer, write only the number (with units if asked).
   - If a textual answer, write only the text.
   - Do not include “Question:” or “Reasoning:” in your output.
   - Ensure that your final answer is placed in answer tags. eg. <answer>Your answer here</answer>

4. **Verification**
   - If the reasoning contradicts itself or is unclear, choose the most logical interpretation.
   - If you absolutely cannot derive an answer, respond with “I’m sorry, I cannot determine the answer from the provided reasoning.”

Question: {question}
Reasoning: {reasoning}
Answer:
"""

SYS_2_PROMPT = """
You are System 2, the deliberate reasoning engine.
Your sole job is to think **step by step** and produce a **chain of thought** (CoT)
that leads to the solution—**do not** ever state or output the final answer itself.
When given a question or prompt, follow these rules:

1. **Role & Tone**
   - You are thorough, precise, and logical.
   - Use clear, numbered or bullet-style steps.
   - Do NOT include any final answer or conclusion in your output.

2. **Supported Query Types**
   - **Math/Logic:** show all intermediate calculations.
   - **Commonsense/Narrative:** articulate each inference.
   - **Code-related:** explain algorithm design, data structures, and line-by-line logic.
   - **General knowledge:** cite relevant facts and sources as needed.

3. **Format**
   Begin every response exactly as:
     ```
     Question: {original question}
     Let's think step by step:
     1. …
     2. …
     3. …
     ```
   - Each step should be on its own line, numbered or bullet-pointed.
   - End your output when the reasoning is complete, but **stop before giving the final answer**.

4. **Cheat Prevention**
   - Do **not** include phrases like “therefore the answer is …” or reveal the numeric/textual answer.
   - If you accidentally deduce the answer, rephrase as an inference step only (e.g. “This leads us to identify the key value”).

5. **Length & Clarity**
   - Keep maximum tokens to your configured limit (e.g., 128 tokens).
   - Be as concise as possible while still covering all necessary reasoning.

6. **Example**
   **Prompt:** “If 3x + 5 = 20, what is x?”
   **Your output:**
"""

# Load and prep dataset
# SYSTEM_PROMPT = """
# Respond in the following format:
# <reasoning>
# ...
# </reasoning>
# <answer>
# ...
# </answer>
# """

# XML_COT_FORMAT = """\
# <reasoning>
# {reasoning}
# </reasoning>
# <answer>
# {answer}
# </answer>
# """

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# uncomment middle messages for 1-shot prompting
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYS_2_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

In [25]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def sys_1_reward(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    inputs = [
        SYS_1_PROMPT.format(question = q, reasoning = r)
        for r in responses
    ]

    batch = tokenizer(
      inputs,
      return_tensors="pt",
      padding=True,
      truncation=True,
    ).to(device)
    out_ids = sys1.generate(
      input_ids      = batch["input_ids"],
      attention_mask = batch["attention_mask"],
      max_new_tokens = 32,
      do_sample      = True,
      pad_token_id   = tokenizer.eos_token_id,
    )
    raw_outputs = tokenizer.batch_decode(out_ids, skip_special_tokens=True)

    print("answer---->: ", answer)
    extracted_responses = [extract_xml_answer(r) for r in raw_outputs]
    print("outputs---->: ", extracted_responses)
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [26]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    # use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8


In [27]:
trainer = GRPOTrainer(
    model = sys2,
    processing_class = tokenizer,
    reward_funcs = [
        sys_1_reward
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 73,859,072 of 1,617,573,376 (4.57% trained)


answer---->:  ['476', '476', '476', '476', '476', '476', '476', '476']
outputs---->:  ['You are System 1, the final answer engine.  \nYour sole job is to read the original question plus the provided chain-of-thought reasoning, then output the **concise, correct final answer**.  \nFollow these rules:\n\n1. **Role & Tone**  \n   - You are concise and authoritative.  \n   - Provide only the final answer—no reasoning steps or commentary.\n\n2. **Input Format**  \n   Every input will be structured as:\n     ```\n     Question: ...\n     Reasoning: ...\n     ```\n   - You must respect the reasoning; do not ignore it or re-solve from scratch.\n\n3. **Output Format**  \n   - Directly write the answer after the “Answer:” cue.  \n   - If a numeric answer, write only the number (with units if asked).  \n   - If a textual answer, write only the text.  \n   - Do not include “Question:” or “Reasoning:” in your output.\n\n4. **Verification**  \n   - If the reasoning contradicts itself or is unclear, 

Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,kl,rewards / sys_1_reward / mean,rewards / sys_1_reward / std
1,0.0,0.0,0.0,196.0,180.0,200.0,0.75,184.0,180.0,188.0,0.0,0.0,0.0
2,0.0,0.0,0.0,176.0,88.0,200.0,0.75,104.0,88.0,120.0,0.0,0.0,0.0
3,0.0,0.0,0.0,181.0,85.0,200.0,0.75,124.0,85.0,163.0,0.0,0.0,0.0
4,0.0,0.0,0.0,137.625,96.0,200.0,0.125,128.714294,96.0,185.0,0.0,0.0,0.0
5,0.0,0.0,0.0,129.125,78.0,200.0,0.125,119.000008,78.0,188.0,0.0,0.0,0.0
6,0.0,0.0,0.0,109.75,43.0,200.0,0.125,96.857147,43.0,183.0,0.0,0.0,0.0
7,0.0,0.0,0.0,184.25,110.0,200.0,0.5,168.5,110.0,198.0,0.0,0.0,0.0
8,0.0,0.0,0.0,143.5,23.0,200.0,0.375,109.599998,23.0,193.0,0.0,0.0,0.0
9,0.0,0.0,0.0,154.75,58.0,200.0,0.375,127.599998,58.0,160.0,0.0,0.0,0.0
10,0.0,0.0,0.0,194.5,173.0,200.0,0.75,178.0,173.0,183.0,0.0,0.0,0.0


answer---->:  ['1500', '1500', '1500', '1500', '1500', '1500', '1500', '1500']
outputs---->:  ['You are System 1, the final answer engine.  \nYour sole job is to read the original question plus the provided chain-of-thought reasoning, then output the **concise, correct final answer**.  \nFollow these rules:\n\n1. **Role & Tone**  \n   - You are concise and authoritative.  \n   - Provide only the final answer—no reasoning steps or commentary.\n\n2. **Input Format**  \n   Every input will be structured as:\n     ```\n     Question: ...\n     Reasoning: ...\n     ```\n   - You must respect the reasoning; do not ignore it or re-solve from scratch.\n\n3. **Output Format**  \n   - Directly write the answer after the “Answer:” cue.  \n   - If a numeric answer, write only the number (with units if asked).  \n   - If a textual answer, write only the text.  \n   - Do not include “Question:” or “Reasoning:” in your output.\n\n4. **Verification**  \n   - If the reasoning contradicts itself or is u

KeyboardInterrupt: 

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's first try the model without any GRPO trained:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.71it/s, est. speed input: 63.38 toks/s, output: 25.69 toks/s]


'There are 2 r\'s in the word "strawberry."'

And now with the LoRA we just trained with GRPO - we first save the LoRA first!

In [None]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.06s/it, est. speed input: 14.05 toks/s, output: 29.09 toks/s]


'<reasoning>\nTo find out how many times the letter \'r\' appears in the word "strawberry", we can go through the word character by character and count each occurrence of \'r\'. In "strawberry", the letter \'r\' appears 3 times: once in the beginning, once in the middle, and once at the end of the word.\n</reasoning>\n<answer>\n3\n</answer>'

Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!

<a name="Save"></a>
### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
