# RL Training

In [1]:
import os
# Fix: https://github.com/unslothai/unsloth/issues/2299#issuecomment-2782067709
os.environ["VLLM_USE_V1"] = '0'
# Statistics must be disabled when running in mainland China; otherwise model loading hangs (no access to external networks)
os.environ["UNSLOTH_DISABLE_STATISTICS"] = "0"

from unsloth import FastLanguageModel, is_bfloat16_supported
import torch

max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/data/countdown/output/models/Qwen2.5-1.5B-Instruct", # change to your model path
    # Use the model merged by merge.ipynb as the RL base to address the cold-start problem.
    # model_name = "/home/cynic/models/qwen2.5-1.5b-sft-distill-merged",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    local_files_only=True,
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank * 2,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-22 15:11:15 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.4.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 22.159 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading /data/countdown/output/models/Qwen2.5-1.5B-Instruct with actual GPU utilization = 32.54%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 22.16 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens 

[W422 15:11:47.714048298 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3


INFO 04-22 15:11:48 [loader.py:1166] Loading weights with BitsAndBytes quantization. May take a while ...


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 04-22 15:11:49 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 04-22 15:11:49 [model_runner.py:1146] Model loading took 1.2131 GiB and 0.970670 seconds
INFO 04-22 15:11:51 [worker.py:267] Memory profiling takes 1.62 seconds
INFO 04-22 15:11:51 [worker.py:267] the current vLLM instance can use total_gpu_memory (22.16GiB) x gpu_memory_utilization (0.33) = 7.21GiB
INFO 04-22 15:11:51 [worker.py:267] model weights take 1.21GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 4.87GiB.
INFO 04-22 15:11:51 [executor_base.py:112] # cuda blocks: 11405, # CPU blocks: 14043
INFO 04-22 15:11:51 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 89.10x
INFO 04-22 15:11:54 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If

Capturing CUDA graph shapes:   0%|          | 0/27 [00:00<?, ?it/s]

INFO 04-22 15:12:19 [model_runner.py:1598] Graph capturing finished in 25 secs, took 2.86 GiB
INFO 04-22 15:12:19 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 30.24 seconds


Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.
Unsloth 2025.3.19 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
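
The memory figures in the vLLM log above can be cross-checked with a little arithmetic. The sketch below copies its numbers from the log and assumes a KV-cache block size of 16 tokens, which the log does not print. Under that assumption the reported KV-cache size and the `89.10x` concurrency figure are both reproduced.

```python
# Numbers copied from the vLLM / Unsloth log above.
total_gpu_gib = 22.16
actual_utilization = 0.3254       # "actual GPU utilization = 32.54%"
weights_gib = 1.21
non_torch_gib = 0.08
activation_peak_gib = 1.05
num_gpu_blocks = 11405            # "# cuda blocks: 11405"
max_seq_length = 2048

# Assumption: 16 tokens per KV-cache block (not shown in the log).
block_size_tokens = 16

budget_gib = total_gpu_gib * actual_utilization
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib
kv_cache_tokens = num_gpu_blocks * block_size_tokens
max_concurrency = kv_cache_tokens / max_seq_length

print(f"vLLM budget:     {budget_gib:.2f} GiB")     # 7.21 GiB
print(f"KV cache:        {kv_cache_gib:.2f} GiB")   # 4.87 GiB
print(f"Max concurrency: {max_concurrency:.2f}x")   # 89.10x
```

Note that the requested `gpu_memory_utilization = 0.5` is not what vLLM ends up using; the log reports an actual utilization of 32.54%.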


In [2]:
from constant import SYSTEM_PROMPT, USER_PROMPT_TPL, parse_user_prompt
from datasets import load_dataset
def get_countdown_questions(data_file):
    data = load_dataset("json", data_files=data_file)["train"]
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', "content": parse_user_prompt(USER_PROMPT_TPL, x["numbers"], x["target"])}
        ],
    })
    return data

dataset = get_countdown_questions("data/rl_data_simple.jsonl")

dataset[0]

{'numbers': [77, 7, 77],
 'target': 7,
 'ground_truth_solution': '(77 / 77) * 7',
 'prompt': [{'content': 'You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.',
   'role': 'system'},
  {'content': 'Using the numbers 77, 7, 77, create an equation that equals 7. You can use basic arithmetic operations (+, -, *, /) one or multiple times but each number can only be used once, and you must use all the numbers. Show your work in <think> </think> tags. And return the final equation in <answer> </answer> tags, for example <answer>(1 + 2) / 3</answer>. Think step by step inside <think> tags.',
   'role': 'user'}]}

In [3]:
eval_dataset = get_countdown_questions("data/test_simple.jsonl")
eval_dataset[0]

{'numbers': [89, 14, 768],
 'target': 843,
 'ground_truth_solution': '(89 - 14) + 768',
 'prompt': [{'content': 'You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.',
   'role': 'system'},
  {'content': 'Using the numbers 89, 14, 768, create an equation that equals 843. You can use basic arithmetic operations (+, -, *, /) one or multiple times but each number can only be used once, and you must use all the numbers. Show your work in <think> </think> tags. And return the final equation in <answer> </answer> tags, for example <answer>(1 + 2) / 3</answer>. Think step by step inside <think> tags.',
   'role': 'user'}]}

In [4]:
import re
def extract_xml_answer(response):
    pattern = r"<answer>(.*?)</answer>"
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1) if match else ""

# Reward function that checks if the completion follows a more relaxed format
def soft_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r, re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

# Correctness reward
def correctness_reward_func(completions, numbers, target, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]

    def correct(numbers, target, solution):
        try:
            if '=' in solution:
                # Drop the equals sign and everything after it
                solution = solution.split('=')[0]
            # Validate before evaluating: only digits, operators, parentheses, dots, and whitespace are allowed
            allowed_pattern = r'^[\d+\-*/().\s]+$'
            if not re.match(allowed_pattern, solution):
                # Reject anything else so that eval never sees arbitrary model output
                return False, "Solution is not allowed"
            if not eval(solution) == target:
                # The expression does not evaluate to the target
                return False, "Solution is not equal to target"
            # Check that the solution uses every number exactly once
            used_numbers = [int(n) for n in re.findall(r'\d+', solution)]
            if sorted(used_numbers) != sorted(numbers):
                # A number is missing, extra, or used more than once
                return False, "Solution is not using all numbers or each number is not used only once"
            return True, "Solution is correct"
        except Exception as e:
            # Evaluation failed, so the solution is incorrect
            return False, f"Solution calculation failed: {e}"
    
    scores = []
    for i, r in enumerate(extracted_responses):
        scores.append(2 if correct(numbers[i], target[i], r)[0] else 0)
    
    return scores
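
The character whitelist in `correct` still admits expressions such as `9**9**9` or `7//2`, which use only allowed characters but are not among the four permitted operations. One possible alternative to `eval`, sketched below, walks the parsed syntax tree and accepts only `+`, `-`, `*`, `/` and unary minus. `safe_eval` is a hypothetical helper and is not part of the original notebook.

```python
import ast
import operator

# Only the four operations the task allows.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expression: str) -> float:
    """Evaluate +, -, *, / arithmetic without calling eval() (hypothetical helper)."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Plain int/float literals only (bool is excluded on purpose).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"Unsupported expression element: {ast.dump(node)}")

    return _eval(ast.parse(expression.strip(), mode="eval"))

print(safe_eval("(77 / 77) * 7"))  # 7.0
```

Swapping this in for `eval(solution)` would keep the existing `try`/`except` behavior, since unsupported input raises `ValueError` or `SyntaxError`.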

In [5]:
# test reward function
from datasets import load_from_disk
eval_dataset = load_from_disk("output/test_vllm__qwen2.5-1.5b-sft-distill-lora_results")
idx = 11
eval_dataset[idx]

{'target': 656,
 'numbers': [85, 36, 16, 519],
 'prompt': 'Using the numbers 85, 36, 16, 519, create an equation that equals 656. You can use basic arithmetic operations (+, -, *, /) one or multiple times but each number can only be used once, and you must use all the numbers. Show your work in <think> </think> tags. And return the final equation in <answer> </answer> tags, for example <answer>(1 + 2) / 3</answer>. Think step by step inside <think> tags.',
 'completion': "<think>Okay, let's see. I need to make an equation using 85, 36, 16, and 519, each exactly once, with basic operations, and the result should be 656. Hmm.\n\nFirst, maybe I should look at the target number, 656. Let me think if there's any way to break it down into parts that could relate to these numbers. For instance, 656 divided by something... Maybe 656 is close to 519 plus something? Wait, 519 is already given as part of the numbers. So maybe 519 plus what else would get me to 656?\n\nWait, 519 + 137 = 656. But h

In [6]:
print(f'Soft format reward: {soft_format_reward_func([[{"content": eval_dataset[idx]["completion"]}]])}')
score = correctness_reward_func(
    [[{"content": eval_dataset[idx]["completion"]}]],
    [eval_dataset[idx]["numbers"]],
    [eval_dataset[idx]["target"]])
print(f'Correctness reward: {score}')

Soft format reward: [0.5]
Correctness reward: [2]


In [7]:
import wandb
wandb.init(project="countdown-rl-simple")

[34m[1mwandb[0m: [32m[41mERROR[0m Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mswulling[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [8]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.01,
    lr_scheduler_type = "constant",
    optim = "adamw_8bit",
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 32,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    vllm_max_model_len= max_seq_length,
    max_prompt_length = 1024,
    max_completion_length = 1024,
    temperature = 1.0, # set to 1.0 for more diverse responses
    num_train_epochs = 4, # Set to 1 for a full training run
    #max_steps = 250,
    save_steps = 100,
    max_grad_norm = 0.1,
    output_dir = "output/rl3",
    beta=0.001,
    report_to = "wandb", # Can use Weights & Biases
    log_completions=True,
    logging_steps=1,
    # unsloth grpo eval bug: https://github.com/unslothai/unsloth/issues/2367
    # do_eval=True,
    # eval_strategy="steps",
    # eval_steps=1,
    # per_device_eval_batch_size = 32,
)

In [9]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        soft_format_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
    # eval_dataset = eval_dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 4 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 1 x 1) = 32
 "-____-"     Trainable parameters = 73,859,072/5,000,000,000 (1.48% trained)


Step,Training Loss,reward,reward_std,completion_length,kl,rewards / soft_format_reward_func,rewards / correctness_reward_func
1,-0.0,0.171875,0.296149,262.3125,0.0,0.109375,0.0625
2,0.0,0.15625,0.319843,284.0,0.00089,0.09375,0.0625
3,0.0,0.125,0.224615,265.40625,0.000851,0.125,0.0
4,0.0,0.203125,0.247236,356.1875,0.001038,0.203125,0.0
5,0.0,0.328125,0.413394,240.3125,0.001162,0.203125,0.125
6,0.0,0.125,0.224615,252.21875,0.001371,0.125,0.0
7,0.0,0.25,0.263018,289.09375,0.002018,0.25,0.0
8,0.0,0.25,0.360842,329.5625,0.001911,0.1875,0.0625
9,0.0,0.171875,0.251945,292.5,0.002283,0.171875,0.0
10,0.0,0.3125,0.409942,238.625,0.003364,0.25,0.0625


Unsloth: Will smartly offload gradients to save VRAM!




TrainOutput(global_step=1000, training_loss=0.0011571628644425607, metrics={'train_runtime': 11039.6631, 'train_samples_per_second': 0.362, 'train_steps_per_second': 0.091, 'total_flos': 0.0, 'train_loss': 0.0011571628644425607})
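
The step count and throughput reported above follow from the configuration. The sketch below assumes that `per_device_train_batch_size` counts completions rather than prompts, so each step covers `32 / 8 = 4` prompts. That reading is consistent with the logged `Total steps = 1,000`, but it is an inference from the log rather than something the notebook states.

```python
# Values taken from GRPOConfig and the training banner above.
num_prompts = 1000                  # "Num examples = 1,000"
num_train_epochs = 4
per_device_train_batch_size = 32    # assumed to count completions, not prompts
gradient_accumulation_steps = 1
num_generations = 8

prompts_per_step = per_device_train_batch_size * gradient_accumulation_steps // num_generations
steps_per_epoch = num_prompts // prompts_per_step
total_steps = steps_per_epoch * num_train_epochs

# Throughput derived from train_runtime in TrainOutput.
train_runtime_s = 11039.6631
seconds_per_step = train_runtime_s / total_steps
prompts_per_second = num_prompts * num_train_epochs / train_runtime_s

print(prompts_per_step, total_steps)   # 4 1000
print(round(seconds_per_step, 2))      # 11.04
print(round(prompts_per_second, 3))    # 0.362
```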

v3: https://wandb.ai/swulling/countdown-rl-simple/runs/8yeq1v0l?nw=nwuserswulling
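
The `Trainable parameters = 73,859,072` figure in the training banner can be reproduced from the LoRA settings. The layer dimensions below (hidden size 1536, key/value projection width 256, MLP width 8960, 28 layers) are assumptions taken from the published Qwen2.5-1.5B configuration, not from this notebook. Each adapted linear layer adds two matrices with `r * (in + out)` parameters in total.

```python
lora_rank = 64
num_layers = 28

# Assumed Qwen2.5-1.5B dimensions (not printed anywhere in this notebook).
hidden = 1536
kv_dim = 256          # 2 key/value heads x 128 head dim
intermediate = 8960

# (in_features, out_features) for every LoRA target module in one layer.
module_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters.
per_layer = sum(lora_rank * (n_in + n_out) for n_in, n_out in module_shapes.values())
total = per_layer * num_layers
print(f"{total:,}")  # 73,859,072
```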

In [10]:
model.save_pretrained("output/qwen2.5-1.5b-rl-v3-lora")  # Local saving lora weights
tokenizer.save_pretrained("output/qwen2.5-1.5b-rl-v3-lora")

('output/qwen2.5-1.5b-rl-v3-lora/tokenizer_config.json',
 'output/qwen2.5-1.5b-rl-v3-lora/special_tokens_map.json',
 'output/qwen2.5-1.5b-rl-v3-lora/vocab.json',
 'output/qwen2.5-1.5b-rl-v3-lora/merges.txt',
 'output/qwen2.5-1.5b-rl-v3-lora/added_tokens.json',
 'output/qwen2.5-1.5b-rl-v3-lora/tokenizer.json')


```bash
vllm serve output/models/Qwen2.5-1.5B-Instruct --port 8100 --api-key NLUKKXIJDZ91rpg1z --enforce-eager  --max-model-len 4096 --enable-lora --max-lora-rank 64 --lora-modules qwen2.5-1.5b-rl-v3-lora=output/qwen2.5-1.5b-rl-v3-lora

CURATOR_VIEWER=1 python eval.py --provider vllm --data_path data/test_simple.jsonl --model_name qwen2.5-1.5b-rl-v3-lora --temperature 0.01 --max_tokens 2048

https://curator.bespokelabs.ai/datasets/45c312d3eab3487aa2310120ba91820b

Accuracy: 49/100 (45.00%)
```

Early in training, apart from formatting errors, the model produces output like this (Step 20):
```
<think>
We need to create an equation that equals 595 using the numbers 490, 52, 53, with each number used exactly once. Let's try to manipulate these numbers.
</think>
<think>
Let's start by adding 52 and 53 to get 105.
490 / 105 = 4.85714285714
So, we have 4.85714285714 left to add to reach 595. But we can't use 105 again.
Let's subtract 53 from 490.
490 - 53 = 437
Now we have 437 left. Let's add 52 to it.
437 + 52 = 489
Then let's subtract 437 from 595.
490 / 52 = 9.61538461538
Since we've run out of numbers, 437 is our final sum.
</think>
<answer>(490 - 53) * 9.61538461538 + 489</answer>
```
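The output contains a lot of natural language, which presumably comes from the model's own pretraining data, but the result is wrong.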

A mid-training example that earns only the 0.5 format reward (Step 114):
```
<think>
100 - 63 = 37
37 + 3 = 40
40 + 34 = 121
</think>
<answer>(100 - 63) + 3 = 40 + 34 = 121</answer>
```
Here the model is simply making things up to collect the 0.5 format reward. At the same time, its output length starts to drop sharply.
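
To see why this completion collects the format reward but nothing else, the two reward checks can be replayed on it by hand. The sketch below reuses the regular expressions from `soft_format_reward_func` and `extract_xml_answer`. The target of 121 is an assumption inferred from the model's answer, since the prompt for this step is not shown.

```python
import re

# The Step 114 completion, copied from above.
completion = (
    "<think>\n100 - 63 = 37\n37 + 3 = 40\n40 + 34 = 121\n</think>\n"
    "<answer>(100 - 63) + 3 = 40 + 34 = 121</answer>"
)
target = 121  # assumption: inferred from the answer, the prompt is not shown

# Same pattern as soft_format_reward_func.
format_ok = re.match(
    r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL
) is not None

# Same extraction and '=' handling as correctness_reward_func.
answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL).group(1)
expression = answer.split("=")[0]
value = eval(expression)  # safe here: the string is a fixed literal from this cell

format_reward = 0.5 if format_ok else 0.0
correctness_reward = 2 if value == target else 0

print(expression.strip(), "->", value)    # (100 - 63) + 3 -> 40
print(format_reward, correctness_reward)  # 0.5 0
```

Only the part before the first `=` is evaluated, so the chained equalities in the answer never reach the target.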


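The model gradually learns to enumerate candidates guided by too-low/too-high feedback, and its outputs grow longer as it makes more attempts (Step 488):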
```
<think>
599 + 44 = 643
643 - 33 = 610 (too high, try different numbers)
44 + 599 = 643
643 - 33 = 610 (too low, try different numbers)
44 * 599 = 25956 (too high, try different numbers)
44 / 599 = 0.0735 (too low, try different numbers)
33 + 44 = 77
77 + 599 = 676
</think>
<answer>(33 + 44) + 599 = 676</answer>
```
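However, the model only solves the simpler problems (addition and subtraction). It gets two- and three-digit multiplication and division wrong because of its own limited arithmetic ability. For example, it computes 44 * 599 as 25956, whereas the correct product is 26356.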
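
For comparison, the enumeration the model is attempting can be done exhaustively in a few lines, which is handy for checking traces like the one above. The following is a minimal brute-force sketch and not part of the original notebook. `solve` is a hypothetical helper that uses exact `Fraction` arithmetic, and the target 676 is inferred from the model's answer.

```python
from fractions import Fraction
from itertools import combinations

def solve(numbers, target):
    """Return one expression that uses every number exactly once, or None."""
    def search(items):
        # items: list of (value, expression) pairs that are still unused
        if len(items) == 1:
            value, expr = items[0]
            return expr if value == target else None
        for i, j in combinations(range(len(items)), 2):
            (a, ea), (b, eb) = items[i], items[j]
            rest = [items[k] for k in range(len(items)) if k not in (i, j)]
            candidates = [
                (a + b, f"({ea} + {eb})"),
                (a * b, f"({ea} * {eb})"),
                (a - b, f"({ea} - {eb})"),
                (b - a, f"({eb} - {ea})"),
            ]
            if b != 0:
                candidates.append((a / b, f"({ea} / {eb})"))
            if a != 0:
                candidates.append((b / a, f"({eb} / {ea})"))
            # Combine the pair into one value and recurse on the smaller list.
            for candidate in candidates:
                found = search(rest + [candidate])
                if found is not None:
                    return found
        return None

    return search([(Fraction(n), str(n)) for n in numbers])

print(44 * 599)                    # 26356, not the 25956 in the trace
print(solve([599, 44, 33], 676))   # target inferred from the model's answer
print(solve([588, 42, 28], 882))
```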


The model begins to handle harder problems (multiplication and division), and its outputs become longer still (Step 761):
```
<think>
588 + 42 = 630
630 + 28 = 658 (too small, try another approach)
42 * 28 = 1176 (too large, try using subtraction)
42 / 28 = 1.5 (not an integer, try another combination)
42 + 588 = 630
630 - 28 = 602 (too small, try another approach)
588 - 42 = 546
546 + 28 = 574 (too small, try another combination)
588 / 28 = 21
21 * 42 = 882 (correct)
</think>
<answer>(588 / 28) * 42 = 882</answer>
```
```
<think>
40 * 74 = 2960
2960 / 25 = 118.4 (not 850, try another approach)
74 * 40 = 2960
2960 / 25 = 118.4 (not 850, try another approach)
74 + 40 = 114
114 * 25 = 2850 (too small, try a different approach)
74 - 40 = 34
34 * 25 = 850 (correct)
</think>
<answer>(74 - 40) * 25 = 850</answer>
```

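At this stage the model has some chance of solving every problem, but its accuracy is still low. This situation is actually well suited to RL: if the model lengthens its output and performs more self-consistency checks, it can reach the correct answer.
However, possibly because of the length limit, the model's output length never increases.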
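
One way to test the length-limit hypothesis is to measure how many completions actually reach `max_completion_length` (1024 in the config above). The helper below is a hypothetical sketch, and the token counts in the example are made up for illustration. In a real check they would come from the per-completion lengths of a training batch.

```python
def clipped_fraction(completion_lengths, max_completion_length=1024):
    """Share of completions whose token count reached the generation limit."""
    if not completion_lengths:
        return 0.0
    clipped = sum(1 for n in completion_lengths if n >= max_completion_length)
    return clipped / len(completion_lengths)

# Hypothetical token counts for one batch; replace with real measurements.
example_lengths = [262, 1024, 310, 1024, 289, 240, 356, 1024]
print(clipped_fraction(example_lengths))  # 0.375
```

If this fraction stays near zero throughout training, the limit is probably not what holds the length down. For reference, the mean `completion_length` logged in the first ten steps is between about 238 and 356 tokens, well below 1024.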
