### Installation

In [1]:
%%capture
# Skip restarting message in Colab
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

!pip install unsloth vllm
!pip install --upgrade pillow

### Unsloth

Use `PatchFastRL` before all functions to patch GRPO and other RL algorithms!

In [2]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


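Load the model and set the parameters. This run uses `voidful/Llama-3.1-TAIDE-R1-8B-Chat`, a merge built on `taide/Llama-3.1-TAIDE-LX-8B-Chat`; the original TAIDE checkpoint is left commented out in the cell below.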

In [3]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "taide/Llama-3.1-TAIDE-LX-8B-Chat",
    model_name = "voidful/Llama-3.1-TAIDE-R1-8B-Chat", # base: meta-llama/Llama-3.1-8B-Instruct merged from taide/Llama-3.1-TAIDE-LX-8B-Chat + deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

INFO 02-23 08:01:07 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading voidful/Llama-3.1-TAIDE-R1-8B-Chat with actual GPU utilization = 49.48%
Unsloth: Your GPU has CUDA compute capability 8.0 with VRAM = 39.56 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 3.39 GB. Also swap space = 6 GB.
INFO 02-23 08:01:25 config.py:549] This model supports multiple tasks: {'reward', 'embed', 'classify', 'generate', 'score'}. Defa

tokenizer_config.json:   0%|          | 0.00/10.5M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/28.3M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

INFO 02-23 08:02:09 cuda.py:229] Using Flash Attention backend.
INFO 02-23 08:02:09 model_runner.py:1110] Starting to load model voidful/Llama-3.1-TAIDE-R1-8B-Chat...
INFO 02-23 08:02:10 weight_utils.py:254] Using model weights format ['*.safetensors']


model-00001-of-00004.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/2.18G [00:00<?, ?B/s]

INFO 02-23 08:03:02 weight_utils.py:270] Time spent downloading weights for voidful/Llama-3.1-TAIDE-R1-8B-Chat: 51.854443 seconds


model.safetensors.index.json:   0%|          | 0.00/22.8k [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 02-23 08:03:08 model_runner.py:1115] Loading model weights took 15.9693 GB
INFO 02-23 08:03:08 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-23 08:03:17 worker.py:267] Memory profiling takes 8.68 seconds
INFO 02-23 08:03:17 worker.py:267] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.49) = 19.57GiB
INFO 02-23 08:03:17 worker.py:267] model weights take 15.97GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.09GiB; the rest of the memory reserved for KV Cache is 2.42GiB.
INFO 02-23 08:03:18 executor_base.py:111] # cuda blocks: 1238, # CPU blocks: 3072
INFO 02-23 08:03:18 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 19.34x
INFO 02-23 08:03:22 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error 

Capturing CUDA graph shapes: 100%|██████████| 23/23 [00:25<00:00,  1.13s/it]

INFO 02-23 08:03:48 model_runner.py:1562] Graph capturing finished in 26 secs, took 0.31 GiB
INFO 02-23 08:03:48 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 39.74 seconds
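The vLLM log lines above contain enough numbers to check the KV cache budget by hand. The sketch below uses only values printed in the log, plus one assumption that the log does not show: a KV cache block size of 16 tokens (vLLM's default as far as I know).

```python
# Values copied from the vLLM log above.
total_gpu_gib = 39.56
budget_gib = 19.57            # total_gpu_memory x gpu_memory_utilization, as logged
weights_gib = 15.97
non_torch_gib = 0.09
activation_peak_gib = 1.09
num_gpu_blocks = 1238
max_seq_length = 1024

# Assumption: 16 tokens per KV cache block (not printed in the log).
BLOCK_SIZE_TOKENS = 16

# Whatever is left of the budget after weights and overheads goes to the KV cache.
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib
kv_cache_tokens = num_gpu_blocks * BLOCK_SIZE_TOKENS
max_concurrency = kv_cache_tokens / max_seq_length

print(f"effective utilization: {budget_gib / total_gpu_gib:.4f}")
print(f"KV cache budget: {kv_cache_gib:.2f} GiB")
print(f"KV cache capacity: {kv_cache_tokens} tokens")
print(f"max concurrency at {max_seq_length} tokens: {max_concurrency:.2f}x")
```

With these numbers most of the budget goes to the model weights, which is why only about 2.4 GiB remains for the KV cache.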






voidful/Llama-3.1-TAIDE-R1-8B-Chat does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.


Unsloth 2025.2.15 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
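The trainer banner further down reports 167,772,160 trainable parameters. That figure can be reproduced from the LoRA settings above: each adapted linear layer adds `r * (in_features + out_features)` weights. The layer shapes below are the usual Llama 3.1 8B dimensions and are an assumption here, since the notebook never prints them.

```python
# Assumed Llama 3.1 8B shapes (not printed anywhere in this notebook).
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 head dim
INTERMEDIATE = 14336
NUM_LAYERS = 32        # matches "patched 32 layers" in the log above
RANK = 64              # lora_rank from the cell above

# (in_features, out_features) for every module listed in target_modules.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# A LoRA pair is A (r x in) plus B (out x r), so r * (in + out) weights per module.
per_layer = sum(RANK * (n_in + n_out) for n_in, n_out in shapes.values())
total_trainable = per_layer * NUM_LAYERS

print(f"per layer: {per_layer:,}")
print(f"total:     {total_trainable:,}")
```

Because the count scales linearly with `r`, halving `lora_rank` to 32 would halve the number of trainable parameters under the same assumptions.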


In [4]:
tokenizer.special_tokens_map

{'bos_token': '<|begin_of_text|>',
 'eos_token': '<|eot_id|>',
 'pad_token': '<|finetune_right_pad_id|>'}

In [5]:
from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("taide/Llama-3.1-TAIDE-LX-8B-Chat", use_fast=False)

tokenizer_config.json:   0%|          | 0.00/10.5M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/20.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

In [6]:
hf_tokenizer.special_tokens_map

{'bos_token': '<|begin_of_text|>', 'eos_token': '<|eot_id|>'}
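Comparing the two outputs shows what changed between the tokenizer returned by Unsloth and the one loaded directly from `taide/Llama-3.1-TAIDE-LX-8B-Chat`. A minimal sketch of that comparison, with both dictionaries copied from the cell outputs above:

```python
# Copied from the outputs of In [4] and In [6].
unsloth_map = {
    "bos_token": "<|begin_of_text|>",
    "eos_token": "<|eot_id|>",
    "pad_token": "<|finetune_right_pad_id|>",
}
hf_map = {
    "bos_token": "<|begin_of_text|>",
    "eos_token": "<|eot_id|>",
}

# Keys that exist only in the Unsloth-loaded tokenizer.
added = {k: v for k, v in unsloth_map.items() if k not in hf_map}
# Keys present in both but with different values.
changed = {
    k: (hf_map[k], unsloth_map[k])
    for k in hf_map
    if k in unsloth_map and hf_map[k] != unsloth_map[k]
}

print("added:", added)
print("changed:", changed)
```

The only difference is the `pad_token`, which matches the earlier log message saying the checkpoint had no padding token and that `<|finetune_right_pad_id|>` would be used.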

In [46]:
# SYSTEM_PROMPT = """You first thinks about the think process in the mind and then provides the user with the answer while think step by step, and putting the final answer within \\boxed{}.The think process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> think process here </think><answer> answer here </answer>."""
SYSTEM_PROMPT = """一步一步思考推理過程，將推理過程夾在 <think></think>，當推理完成後將答案夾在 <answer></answer> 標籤中。
輸出範例：
<think>{思考過程}</think>
<answer>{答案}</answer> <|eot_id|>"""
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    # {"role" : "user", "content" : "strawberry 有幾個字母 r?"},
    {"role" : "user", "content" : "Bert 每天都會填寫報紙上的每日填字遊戲。他每兩個星期填一次字謎，就要用掉一支鉛筆。平均來說，他用完一支鉛筆需要 1050 個字。每個字謎平均有多少個字？"}
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.9,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text
# 1050 words per pencil / 14 daily puzzles in two weeks = 75 words per puzzle
print("\n正解為：75")
print('\n\n模型輸出：')
print(output)

Processed prompts: 100%|██████████| 1/1 [00:13<00:00, 13.16s/it, est. speed input: 13.45 toks/s, output: 64.46 toks/s]


正解為：75


模型輸出：
首先，Bert 每兩個星期填一次字謎，也就是每兩周他會參加一次。每年有 52 個星期，所以一年有 52 / 2 = 26 個字謎。

他每年要用 26 支鉛筆，一支鉛筆可用 1050 個字，那一年他要用 26 支鉛筆，一支就 1050 次，所以一年用了 26 * 1050 = 273,000 次的字。

每支筆用完的字數都是 1050 次，一支筆的 1050 次是用在每支的字謎上。

每年字謎的字數是 273,000 次，一支筆 1050 次是 1050 次，一年字謎的字數是 273,000 次。

每支筆的 1050 次是用於每支的字謎上，每年字謎的字數是 273,000 次，所以每支的字謎上都是 1050 次。

要算每個字謎的平均字數，就要知道每支筆的 1050 次是用在每支的字謎上，一年字謎的字數是 273,000 次，所以每支的字謎上都是 1050 次。 
然後把 1050 次除以每年字謎的字數 273,000 次，就可以得出每個字謎的平均字數。

<think> 
Bert 每年參加 26 次字謎，一年用到 26 支鉛筆，一支筆每次用 1050 次字，一年用 26 支筆， 26 次字謎的字數是 273,000 次，一支筆 1050 次是用在每支的字謎上，每支的字謎上都是 1050 次。要算每個字謎的平均字數，就要把 1050 次除以 273,000 次字數。 
</think>

<answer>
每個字謎的平均字數是 35 個字。
</answer>

我把 1050 除以 273,000 次，得到 3.5 然後把 0.5 賣掉，因為字數是 3.5 次的字數。字數是 3.5 次的字數，是 3 次的字數。字數是 3 次的字數是 3 次的字數。字數是 3 次的字數是 3 次的字數。字數是 3 次的字數是 3 次的字數。 
</answer> 

這種處理是不合理的。基本上， 1050 除以 273,000 次是 3.5 所以每年字謎的字數是 273,000 次。每支的字謎上都是 1050 次。要算每個字謎的平均字數，就把 1050 次除以 273,000 次字數。 

每支的字謎上都是 1050 次，所以每支的字謎上 1050 次是用在每支的字謎上。一年字謎的字數是 273,000 次。每支的字謎上都是 1050 次。要算每個




In [47]:
text

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n你是一個來自台灣的AI助理，你的名字是 TAIDE，樂於以台灣人的立場幫助使用者，會用繁體中文回答問題。\n一步一步思考推理過程，推理過程夾在 <think></think>，當推理完成後將答案夾在 <answer></answer> 標籤中。\n輸出範例：\n<think>{思考過程}</think>\n<answer>{答案}</answer> <|eot_id|><|eot_id|><|start_header_id|>user<|end_header_id|>\n\nBert 每天都會填寫報紙上的每日填字遊戲。他每兩個星期填一次字謎，就要用掉一支鉛筆。平均來說，他用完一支鉛筆需要 1050 個字。每個字謎平均有多少個字？<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
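Two details in the rendered prompt above are worth checking. The chat template places TAIDE's default system text ahead of `SYSTEM_PROMPT`, and the system turn ends with `<|eot_id|><|eot_id|>`, because `SYSTEM_PROMPT` already ends with a literal `<|eot_id|>` and the template adds its own. The helper below is a hypothetical check (not part of Unsloth or transformers) that locates such doubled tokens, run here on a shortened stand-in for the real prompt.

```python
def find_doubled_token(rendered: str, token: str) -> list[int]:
    """Return the character offsets where `token` appears twice in a row."""
    doubled = token + token
    offsets = []
    start = rendered.find(doubled)
    while start != -1:
        offsets.append(start)
        start = rendered.find(doubled, start + len(token))
    return offsets

EOT = "<|eot_id|>"

# Shortened stand-in with the same tag layout as the rendered prompt above.
rendered = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "system text <|eot_id|><|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\nquestion<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

doubled_at = find_doubled_token(rendered, EOT)
print("doubled token offsets:", doubled_at)

# One possible fix: drop the literal token from the prompt text before templating.
cleaned_system_prompt = "system text <|eot_id|>".removesuffix(EOT).rstrip()
print(repr(cleaned_system_prompt))
```

Whether the extra end-of-turn token hurts generation is not something this notebook measures; the check only makes the duplication visible.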

In [83]:
messages = [
  {"role": "system",
  "content": "You first thinks about the reasoning process in the mind and then provides the user with the answer while reasoning step by step, and putting the final answer within \\boxed{}.The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>."},
  {"role": "user", "content": f"早餐喝早餐店的奶茶會導致烙賽為什麼?"},
]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)

output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text
print('\n\n模型輸出：')
print(output)

Processed prompts: 100%|██████████| 1/1 [00:08<00:00,  8.01s/it, est. speed input: 19.49 toks/s, output: 63.96 toks/s]


模型輸出：
<think> 關於早餐喝早餐店的奶茶會導致「烙賽」的原因，我需要從日常生活的角度和生理反應來分析。烙賽是一個不太常見的醫學術語，通常與腸胃或消化系統的問題相關。 

可能的原因有： 
1. 早餐店的奶茶可能含高糖或高咖啡因的成分，這會刺激胃的分泌，導致胃酸過多，進而引起消化不良或胃灼熱的感覺。
2. 一些早餐店的奶茶可能使用低質的奶精或人工添加物，這會在體內引起不良的化學反應，可能會導致腸胃不舒服或引起其他的症狀。
3. 個人的消化系統狀況不同，可能有胃病、胃食管狹窄、或其他腸胃疾病的人，喝了不合適的飲料會引起反應。
4. 早餐吃了高油高糖的食物，接著又喝高糖的飲料，會增加身體的負擔，導致腸胃的不適。
5. 天氣、壓力、或作息的因素也可能影響個人的腸胃狀況，喝了不合適的飲料可能會引起反應。

綜合上述的原因，早餐喝早餐店的奶茶可能導致「烙賽」的主要原因是個人的胃系統對高糖或高咖啡因的飲料敏感，或是體內已經存在某些不適的狀況，導致在喝了這類飲料後出現不適的症狀。 

至於「烙賽」這個詞的定義，可能是指一種不舒服的、燒灼的感覺，可能是從嘴到胃的整個腸胃都有這種感覺。這可能是因為消化系統的問題，或是其他的生理反應所導致的。 

所以，建議在選擇早餐的飲料時，可以多選擇低熱糖、低咖啡因、或不加奶精的飲料，以減少對身體的刺激和傷害。同時，也要注意自己的身體狀況和消化系統的健康，如果有任何不舒服的症狀出現，應及時就醫檢查。 

在回答這個問題的時候，我主要是從日常生活的角度和生理反應的可能來分析，提供一個比較有可能的解釋。 

在最後，我也建議在選擇早餐的飲料時，可以多選擇低熱糖、低咖啡因、或不加奶精的飲料，以減少對身體的刺激和傷害。同時，也要注意自己的身體狀況和消化系統的健康，如果有任何不舒服的症狀出現，應





### Data Prep
<a name="Data"></a>

We directly reuse the data prep and reward functions from [@willccbb's gist](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb). You are free to create your own!

In [52]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
# SYSTEM_PROMPT = """
# Respond in the following format:
# <think>
# ...
# </think>
# <answer>
# ...
# </answer>
# """

XML_COT_FORMAT = """\
<think>
{think}
</think>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Build chat-style prompts (system + user) from GSM8K and keep the final answer that follows "####"
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Define ANSI color constants for the console logging below
RED = '\033[91m'
GREEN = '\033[92m'
YELLOW = '\033[93m'
CYAN = '\033[96m'
RESET = '\033[0m'



# Reward functions
# def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
#     responses = [completion[0]['content'] for completion in completions]
#     q = prompts[0][-1]['content']
#     extracted_responses = [extract_xml_answer(r) for r in responses]
#     # print('-'*30, f"\n原始問題:\n{q}", f"\n正確答案:\n{answer[0]}", f"\n模型回答:\n{responses[0]}", f"\n解析<answer>標籤:\n{extracted_responses[0]}")
#     print(f"{CYAN}{'-'*30}{RESET}")
#     print(f"{YELLOW}原始問題:{RESET}\n{q}")
#     print(f"{GREEN}正確答案:{RESET}\n{answer[0]}")
#     print(f"{YELLOW}模型回答:{RESET}\n{responses[0]}")
#     print(f"{GREEN}解析<answer>標籤:{RESET}\n{extracted_responses[0]}")
#     return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Give the score whenever the correct answer appears inside the extracted <answer> text."""
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # Log the first sample of the batch for inspection
    print(f"{CYAN}{'-'*30}{RESET}")
    print(f"{YELLOW}原始問題:{RESET}\n{q}")
    print(f"{GREEN}正確答案:{RESET}\n{answer[0]}")
    print(f"{YELLOW}模型回答:{RESET}\n{responses[0]}")
    print(f"{GREEN}解析<answer>標籤:{RESET}\n{extracted_responses[0]}")
    # Substring match: tolerant of units such as "12 秒", but "10" also matches "100"
    return [2.0 if a in r else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that follow the exact line layout of XML_COT_FORMAT."""
    # No re.DOTALL here, so `.` stops at newlines and only single-line
    # think/answer bodies can match.
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that start with <think>...</think> followed by <answer>...</answer>."""
    # re.match anchors at the start of the string; without re.DOTALL the
    # tag bodies cannot contain newlines.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

# def soft_format_reward_func(completions, **kwargs) -> list[float]:
#     """Relaxed format check in which the closing </answer> tag is optional."""
#     pattern = r"<think>.*?</think>\s*<answer>.*?(</answer>\s*)?$"
#     responses = [completion[0]["content"] for completion in completions]
#     matches = [re.match(pattern, r, re.DOTALL) for r in responses]
#     return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<think>\n") == 1:
        count += 0.125
    if text.count("\n</think>\n") == 1:
        count += 0.25
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
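The two format rewards above call `re.match` without `re.DOTALL`, so `.` does not cross newlines. The sketch below copies both patterns and checks them against hand-written completions (the sample strings are made up for illustration) to show which layouts can actually earn the format reward, and how the commented-out `re.DOTALL` variant would behave differently.

```python
import re

# Patterns copied from strict_format_reward_func and soft_format_reward_func.
STRICT = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>\n$"
SOFT = r"<think>.*?</think>\s*<answer>.*?</answer>"

def matches(pattern: str, text: str, flags: int = 0) -> bool:
    """True if the pattern matches at the start of the text."""
    return re.match(pattern, text, flags) is not None

# Hand-written sample completions (illustrative only).
single_line = "<think>2 + 2 = 4</think>\n<answer>4</answer>"
multi_line = "<think>\nfirst step\nsecond step\n</think>\n<answer>\n4\n</answer>\n"

results = {
    "soft, single-line": matches(SOFT, single_line),
    "soft, multi-line": matches(SOFT, multi_line),
    "soft, multi-line, DOTALL": matches(SOFT, multi_line, re.DOTALL),
    "strict, multi-line": matches(STRICT, multi_line),
    "strict, multi-line, DOTALL": matches(STRICT, multi_line, re.DOTALL),
}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

Multi-line reasoning only matches once `re.DOTALL` is passed, which is consistent with the format rewards staying at or near zero in the training log further down.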

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [53]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 12, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 512,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 12
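With `warmup_ratio = 0.1`, `max_steps = 250` and `lr_scheduler_type = "cosine"`, the learning rate should ramp up over the first 25 steps and then decay along a half cosine. The function below is a plain-Python sketch of that shape, assuming the usual linear-warmup plus half-cosine form and a warmup length of `ceil(warmup_ratio * max_steps)`; it is not the trainer's own code.

```python
import math

def lr_at_step(step: int, base_lr: float = 5e-6, max_steps: int = 250,
               warmup_ratio: float = 0.1) -> float:
    """Approximate learning rate at a given optimizer step."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Half-cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 12, 25, 100, 250):
    print(f"step {step:3d}: lr = {lr_at_step(step):.3e}")
```

Under these assumptions the peak rate of `5e-6` is reached at step 25, so the earliest logged steps are trained with a much smaller rate.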


And let's run the trainer! While it runs, it prints a table of rewards like the sample below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [54]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 12 | Gradient Accumulation steps = 1
\        /    Total batch size = 12 | Total steps = 250
 "-____-"     Number of trainable parameters = 167,772,160


------------------------------
原始問題:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all?
正確答案:
476
模型回答:
首先，計算一年內Janet花在各項課程上的總費用。

一年有52周（因為一年有52周每周有一周的時間）。

1. 小提琴課：每周5小時，費用28元/小時。每周費用是5*28=140元。
1年費用：130*1周（5週）=5*28*52周=5*28*52

2. 單簧管課：每周3小時，費用40元/小時。每周費用是3*40=120元。
1年費用：3*40*52周=3*40*52

將兩者的費用相加，即可得知她在一年內花在鋼琴課和管上所有的費用。

1,000,000元是52周的費用，不涉及此處。基本上，1年後的費用是2年，因此將52週的費用*2。


1年花在鋼琴課的費用：5*28*52= 5*28*52= 5*28*52= 1420（彩球）


1年花在管上的費用：3*40*52= 3*40*52= 3*40*52= 3*40*52=  1200（彩球）

 LAlian有52周。

1年花在鋼琴課的費用比1年花在管上的費用多：5*28*52-3*40*52 =  (5*28*52-3*40*52) = "1000-1000-1"，之類。

5*28*52 - 3*40*52 = (5*28*52 - 3*40*52) = 

1年花在鋼琴課的費用比1年花在管上的費多   5*28*52 - 3*40*52 =  (1100)

這是三年的花1000000元，因為她每周上52周的課。 52*5*28的花1000000元，她參加一年等於52周的課。

等，我想，1年花在1周的琴課是5*28*52=1,120,000元。1年花在1周的管課是3*40*52=1,200,000元。1年花在1周的管課上比1年花在1周的琴
解析<answer>標籤:
首先，計算一年內Jan

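| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / int_reward_func | rewards / correctness_reward_func |
|------|---------------|--------|------------|-------------------|----|--------------------------------|-----------------------------------|-------------------------------------|---------------------------|-----------------------------------|
| 1 | 0.0 | 0.278833 | 0.603186 | 512.0 | 0.000772 | 0.112167 | 0.0 | 0.0 | 0.0 | 0.166667 |
| 2 | 0.0 | 0.230833 | 0.701351 | 423.75 | 0.000641 | 0.064167 | 0.0 | 0.0 | 0.0 | 0.166667 |
| 3 | 0.0 | 0.45225 | 0.871405 | 442.5 | 0.000787 | 0.035583 | 0.0 | 0.0 | 0.083333 | 0.333333 |
| 4 | 0.0 | 0.955833 | 1.135106 | 394.0 | 0.000639 | 0.080833 | 0.0 | 0.0 | 0.041667 | 0.833333 |
| 5 | 0.0 | 1.197167 | 1.119453 | 370.916687 | 0.000731 | 0.113833 | 0.041667 | 0.0 | 0.041667 | 1.0 |
| 6 | 0.0 | 0.343333 | 0.580749 | 405.666687 | 0.000625 | 0.135 | 0.0 | 0.0 | 0.041667 | 0.166667 |
| 7 | 0.0 | 0.083333 | 0.144338 | 492.166687 | 0.000492 | 0.041667 | 0.0 | 0.0 | 0.041667 | 0.0 |
| 8 | 0.0 | -0.059417 | 0.247809 | 470.166687 | 0.000561 | -0.059417 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | -0.03775 | 0.13077 | 452.25 | 0.000723 | -0.03775 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 0.0 | 1.79625 | 0.760861 | 492.916687 | 0.000569 | -0.037083 | 0.0 | 0.0 | 0.0 | 1.833333 |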
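The logged `reward` column is the sum of the five per-function columns, which makes it easy to see where the reward comes from. The sketch below checks this for three rows copied from the log and, assuming each column is the mean over the 12 completions generated in that step, converts the correctness column back into a count of correct completions.

```python
NUM_GENERATIONS = 12     # from GRPOConfig above
CORRECT_SCORE = 2.0      # score returned by correctness_reward_func for a hit

# (step, reward, xmlcount, soft_format, strict_format, int, correctness), copied from the log.
rows = [
    (1, 0.278833, 0.112167, 0.0, 0.0, 0.0, 0.166667),
    (5, 1.197167, 0.113833, 0.041667, 0.0, 0.041667, 1.0),
    (10, 1.79625, -0.037083, 0.0, 0.0, 0.0, 1.833333),
]

summary = []
for step, reward, *parts in rows:
    total = sum(parts)
    # Assumption: the column is a mean over the 12 completions of the step.
    n_correct = round(parts[-1] / CORRECT_SCORE * NUM_GENERATIONS)
    summary.append((step, total, n_correct))
    print(f"step {step}: sum of parts = {total:.6f}, logged = {reward}, "
          f"correct completions ~ {n_correct}/{NUM_GENERATIONS}")
```

In these early steps almost all of the reward comes from `correctness_reward_func`, while the two regex-based format rewards contribute close to nothing.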


Streaming output truncated to the last 5000 lines.
紅燈：每 2 秒點亮一次 → 下次亮：t+2 秒
綠燈：每 3 秒亮一次 → 下次亮：t+3 秒
藍燈：每 4 秒亮一次 → 下次亮：t+4 秒
當它們被關掉，然後再啟動的時候，它們下次同時亮的時間是它們被關掉的時間加上它們各自的點燈時間。
但在這種案例中，不是關掉，而是都被啟動的同時。要找的是同時啟動後，它們一次全部亮的時間。
為了使所有三個燈在最短的時間後都亮，我們需要找到一個共同的時間間隔，使它們都在同一個時間點被啟動。
此時間是三個時間的最小公倍數（LCM）。
紅燈的時間是 2 秒，綠燈是 3 秒，藍燈是 4 秒。
三個時間的最小公倍數是 12 秒。
這是因為 12 是 2、3、4 的最小公倍數，因此，它能被每一個燈的啟動間隔所整除。
因此，在 12 秒的時間過後，所有的燈都會被啟動，它們將在這時刻同時亮。
</think>
<answer>
12 秒</answer>
 learner不過，12 秒是指三個燈各自的啟動時間都被啟動，然後一次亮的時間，但實際上在這些燈啟動的同時，它們都亮，這是指在同時啟動的事被解決。換句話說，實際上在同一個時間點，所有燈都被啟動，因此 12 秒是指過了一段時間，在此時，所有燈都會亮。這是可行的。而且，最小的時間間隔是12秒。因此，在12秒的時候，所有的燈都會亮。最短的時間是12秒
解析<answer>標籤:
12 秒
------------------------------
原始問題:
Dale just learned how to make homemade macaroni and cheese. He decided to make a big batch for his family reunion. The original recipe calls for 2 pounds of pasta and serves 7 people. Dale's family reunion will have 35 people. How many pounds of pasta does Dale need to buy?
正確答案:
10
模型回答:
<thi

TrainOutput(global_step=250, training_loss=0.0006522387082877686, metrics={'train_runtime': 3296.5327, 'train_samples_per_second': 0.91, 'train_steps_per_second': 0.076, 'total_flos': 0.0, 'train_loss': 0.0006522387082877686})
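The throughput figures in `TrainOutput` follow from the step count, the batch size of 12 and the runtime. The sketch below reproduces them and estimates how much of the dataset was seen, assuming each step uses one prompt with 12 completions (the batch size equals `num_generations` in this run).

```python
# Values copied from the trainer banner and TrainOutput above.
global_step = 250
completions_per_step = 12
train_runtime_s = 3296.5327
num_examples = 7473

completions = global_step * completions_per_step
samples_per_second = completions / train_runtime_s
steps_per_second = global_step / train_runtime_s

# Assumption: one unique prompt per step, since batch size == num_generations.
prompts_seen = global_step
fraction_of_epoch = prompts_seen / num_examples

print(f"runtime: {train_runtime_s / 60:.1f} minutes")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
print(f"prompts seen: {prompts_seen} of {num_examples} ({fraction_of_epoch:.1%})")
```

Under that assumption the run covers only a few percent of the GSM8K training split, so a full epoch would take far longer than this run of roughly 55 minutes.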

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without the GRPO-trained LoRA:

In [106]:
text = tokenizer.apply_chat_template([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role" : "user", "content" : "strawberry 有幾個字母「r」?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_p = 0.9,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

print("正解為：3個")
print('\n\n模型輸出：')
print(output)

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.48s/it, est. speed input: 89.81 toks/s, output: 62.80 toks/s]

正解為：3個


模型輸出：
<think>我要計算「strawberry」這個字中有幾個「r」這個字母。</think>
「strawberry」從頭開始看，S 是第一個字母，然後是 t、r、a、w、b、r、r、y。所以，從這段內容中，我可以數到有兩個「r」的出現。</think>
<answer>2</answer>





And now with the LoRA we just trained with GRPO. We save the LoRA first:

In [65]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [101]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "strawberry 有幾個字母「-r」?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.9,
    top_p = 0.9,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text
print('\n\n')
print(output)

Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.14s/it, est. speed input: 62.78 toks/s, output: 50.60 toks/s]




<think> 首先，我需要計算 "strawberry" 這個字中 "r" 的出現次數。"strawberry" 可以被拆成每一個字母來看：s-t-r-a-w-b-e-r-r-y。接著，我數出每一個出現的 r：在第三個字的位置是 r，在最後兩個字的位置各有一次 r。所以，"strawberry" 有 3 個 "r" 。</think>  
<answer>3</answer>





Our reasoning model does much better on this example. It is not always correct, since we only trained it for about an hour; extending the sequence length and training for longer should improve it.

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is supported natively. We clone `llama.cpp` and save to `q8_0` by default; other methods such as `q4_k_m` are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [107]:
from google.colab import userdata


In [109]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
# if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if True: model.push_to_hub_gguf("LeeTung/Llama-3.1-TAIDE-R1-8B-Chat-GRPO", tokenizer, quantization_method = "q4_k_m", token = userdata.get('HF_TOKEN')) # userdata is a module; Colab secrets are read with .get()

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 46.59 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:00<00:00, 67.57it/s]


Unsloth: Saving tokenizer... Done.
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at LeeTung/Llama-3.1-TAIDE-R1-8B-Chat-GRPO into bf16 GGUF format.
The output location will be /content/LeeTung/Llama-3.1-TAIDE-R1-8B-Chat-GRPO/unsloth.BF16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: Llama-3.1-TAIDE-R1-8B-Chat-GRPO
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
IN

RuntimeError: *** Unsloth: Failed compiling llama.cpp with WARNING:hf-to-gguf:**************************************************************************************
. Please report this ASAP!