To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the [installation instructions](https://docs.unsloth.ai/get-started/installing-+-updating) in our documentation.

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

**Read our [blog post](https://unsloth.ai/blog/r1-reasoning) for guidance on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Unsloth

Use `PatchFastRL` before all functions to patch GRPO and other RL algorithms!

In [1]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [2]:
from unsloth import is_bfloat16_supported
import torch
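max_seq_length = 1024 # Maximum sequence length; increase it to support longer reasoning traces
lora_rank = 32 # LoRA rank; larger values make the model more capable but slower to train (previously 16)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/home/projects/unsloth-training/models/Qwen2.5-3B-Instruct", # Local path to the pretrained model, or a Hugging Face model name
    max_seq_length = max_seq_length, # Maximum sequence length
    load_in_4bit = False, # Set to True to load with 4-bit quantization and save VRAM
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank, # Maximum LoRA rank
    gpu_memory_utilization = 0.9, # Fraction of GPU memory to use; lower this if you run out of memory
)

model = FastLanguageModel.get_peft_model(
    model, # The model loaded above
    r = lora_rank, # LoRA rank; suggested values are 8, 16, 32, 64, or 128
    target_modules = ["gate_proj", "up_proj", "down_proj",], # Modules that receive LoRA adapters
    lora_alpha = lora_rank, # LoRA scaling parameter, usually set equal to r
    use_gradient_checkpointing = "unsloth", # Gradient checkpointing to support long-context finetuning
    random_state = 3407,  # Random seed for reproducible results
)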

INFO 03-11 00:49:29 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.8: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.7.3.
   \\   /|    NVIDIA GeForce RTX 4070 Ti SUPER. Num GPUs = 1. Max memory: 15.992 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading /home/projects/unsloth-training/models/Qwen2.5-3B-Instruct with actual GPU utilization = 82.75%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 15.99 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 7.37 GB. Also swap space = 2 GB.
INFO 03-11 00:50:13 config.py:549] This model supports multiple tasks: {'classif



Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 03-11 00:50:29 model_runner.py:1115] Loading model weights took 5.7701 GB
INFO 03-11 00:50:29 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-11 00:50:31 worker.py:267] Memory profiling takes 1.45 seconds
INFO 03-11 00:50:31 worker.py:267] the current vLLM instance can use total_gpu_memory (15.99GiB) x gpu_memory_utilization (0.83) = 13.23GiB
INFO 03-11 00:50:31 worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 6.37GiB.
INFO 03-11 00:50:31 executor_base.py:111] # cuda blocks: 11595, # CPU blocks: 3640
INFO 03-11 00:50:31 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 181.17x
INFO 03-11 00:50:31 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error 

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:14<00:00,  1.81it/s]

INFO 03-11 00:50:46 model_runner.py:1562] Graph capturing finished in 15 secs, took 0.32 GiB
INFO 03-11 00:50:46 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 17.06 seconds



Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.
Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.3.8 patched 36 layers with 36 QKV layers, 0 O layers and 36 MLP layers.
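
To get a feel for what `r = lora_rank` costs, the sketch below counts the adapter parameters that LoRA adds to the three MLP projections listed in `target_modules`. The layer count matches the "patched 36 layers" line in the log, but `hidden_size` and `intermediate_size` are assumed values for illustration; read the real ones from `model.config` before relying on the totals. The helper `lora_params` is hypothetical.

```python
# Rough count of LoRA adapter parameters for the MLP projections.
# ASSUMPTION: hidden_size and intermediate_size are illustrative values;
# check model.config for the actual dimensions.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a d_out x d_in weight."""
    return rank * (d_in + d_out)

hidden_size = 2048         # assumed
intermediate_size = 11008  # assumed
num_layers = 36            # matches "patched 36 layers" in the log above
rank = 32                  # lora_rank in this notebook

per_layer = (
    lora_params(hidden_size, intermediate_size, rank)    # gate_proj
    + lora_params(hidden_size, intermediate_size, rank)  # up_proj
    + lora_params(intermediate_size, hidden_size, rank)  # down_proj
)
total = per_layer * num_layers
print(f"{per_layer:,} adapter parameters per layer, {total:,} in total")
```

Because the count is linear in `rank`, halving `lora_rank` to 16 halves the number of trainable adapter parameters.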


### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!
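
If you do want to write your own reward function, the calling convention used throughout this notebook is simple: the function receives `completions` (a list in which each item is a list of chat messages), accepts extra keyword arguments, and returns one float per completion. The sketch below is a hypothetical example; the name `short_answer_reward_func` and the 50-character budget are illustrative and not part of the original gist.

```python
# A hypothetical custom reward: favour short, non-empty final answers.
# The function name and the max_chars budget are illustrative assumptions.

def short_answer_reward_func(completions, max_chars: int = 50, **kwargs) -> list[float]:
    """Return 0.5 when the <answer> body is non-empty and at most max_chars long."""
    rewards = []
    for completion in completions:
        text = completion[0]["content"]  # each completion is a list of chat messages
        answer = text.split("<answer>")[-1].split("</answer>")[0].strip()
        rewards.append(0.5 if 0 < len(answer) <= max_chars else 0.0)
    return rewards

sample = [
    [{"role": "assistant", "content": "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>\n"}],
    [{"role": "assistant", "content": "no tags at all, " * 10}],
]
print(short_answer_reward_func(sample))
```

A function with this shape can be added to the `reward_funcs` list alongside the others; the per-function rewards are summed for each completion.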

In [None]:
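# Import the required libraries
import re  # Regular expressions, used for string matching and extraction
from datasets import load_dataset, Dataset  # Dataset loading and processing

# System prompt that specifies the response format.
# The prompt text stays in Chinese because it is sent to the model as-is;
# its first line means "Answer in the following format:".
SYSTEM_PROMPT = """
回答遵循以下格式：
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

# Chain-of-thought (CoT) template in XML format
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""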

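def extract_xml_answer(text: str) -> str:
    """
    Extract the answer inside the <answer> tag from XML-tagged text.

    Args:
        text: Text containing XML tags

    Returns:
        str: The extracted answer with surrounding whitespace stripped
    """
    answer = text.split("<answer>")[-1]  # Keep everything after the last <answer> tag
    answer = answer.split("</answer>")[0]  # Keep everything before the first </answer> tag
    return answer.strip()  # Strip surrounding whitespace

def extract_hash_answer(text: str) -> str | None:
    """
    Extract the answer that follows a #### marker (used by some dataset formats).

    Args:
        text: Text containing a #### marker

    Returns:
        str | None: The extracted answer, or None if there is no #### marker
    """
    if "####" not in text:  # No #### marker in the text
        return None
    return text.split("####")[1].strip()  # Text after the first #### marker, stripped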

# Function that loads the dataset
def get_gsm8k_questions(split = "train", local_path="/home/projects/unsloth-training/datasets/ruozhiba_R1/alpaca_output.jsonl") -> Dataset:
    """
    Load a dataset from a local path and convert it to the prompt format.

    Args:
        split: Dataset split, "train" by default
        local_path: Path to the local dataset file

    Returns:
        Dataset: The processed dataset
    """
    # Load the dataset from the local path
    data = load_dataset('json', data_files=local_path, split=split)

    # Inspect the dataset structure by printing the keys of the first sample
    example = data[0]
    print("Dataset keys:", example.keys())

    # Map every sample into the format expected by the trainer
    data = data.map(lambda x: {
        'prompt': [
            # The system prompt goes first
            {'role': 'system', 'content': SYSTEM_PROMPT},
            # Then the user question: prefer the 'instruction' field and fall back to 'input'
            {'role': 'user', 'content': x['instruction'] if 'instruction' in x else x.get('input', '')}
        ],
        # Reference answer (currently disabled): prefer 'output', then fall back to 'response' or 'answer'
        # 'answer': extract_hash_answer(x['output'] if 'output' in x else x.get('response', x.get('answer', '')))
    })
    return data

# Load and process the dataset
dataset = get_gsm8k_questions(local_path="/home/projects/unsloth-training/datasets/ruozhiba_R1/alpaca_output.jsonl")

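# The reward functions below score the quality of the model's generated answers

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """
    Score the correctness of the model's answer against the reference answer.

    Args:
        prompts: List of prompts given to the model
        completions: List of model-generated completions
        answer: List of reference answers
        **kwargs: Additional keyword arguments

    Returns:
        list[float]: 2.0 for a correct answer, 0.0 otherwise
    """
    # Pull the response text out of each completion
    responses = [completion[0]['content'] for completion in completions]
    # Text of the current question
    q = prompts[0][-1]['content']
    # Extract the content of the <answer> tag from each response
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # Debug output: question, reference answer, model response, and extracted answer
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    # Compare the extracted answer with the reference: 2.0 if equal, 0.0 otherwise
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    """
    Check whether the model's answer is an integer.

    Args:
        completions: List of model-generated completions
        **kwargs: Additional keyword arguments

    Returns:
        list[float]: 0.5 if the answer is an integer, 0.0 otherwise
    """
    # Pull the response text out of each completion
    responses = [completion[0]['content'] for completion in completions]
    # Extract the content of the <answer> tag from each response
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # str.isdigit() accepts only digit characters, so signed numbers such as "-3" score 0.0
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]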

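def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Strictly check that the response follows the required XML format.

    The response must match this layout exactly:
    <reasoning>
    [reasoning, may span multiple lines]
    </reasoning>
    <answer>
    [answer, may span multiple lines]
    </answer>

    Args:
        completions: List of model-generated completions
        **kwargs: Additional keyword arguments

    Returns:
        list[float]: 0.5 if the format matches, 0.0 otherwise
    """
    # Strict pattern: exact opening and closing tags, each on its own line
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    # Pull the response text out of each completion
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." match newlines, so multi-line reasoning and answers can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    # 0.5 for a match, 0.0 otherwise
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Loosely check that the response follows the XML format.

    The response must start with a <reasoning> block followed by an <answer>
    block; the newlines around the tags are not enforced.

    Args:
        completions: List of model-generated completions
        **kwargs: Additional keyword arguments

    Returns:
        list[float]: 0.5 if the format matches, 0.0 otherwise
    """
    # Loose pattern: only the tags are required, with no constraint on line breaks
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    # Pull the response text out of each completion
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." match newlines, so multi-line reasoning and answers can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    # 0.5 for a match, 0.0 otherwise
    return [0.5 if match else 0.0 for match in matches]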

def count_xml(text) -> float:
    """
    Score how well the XML tags are used.

    Args:
        text: The text to evaluate

    Returns:
        float: A score based on correct tag usage (at most 0.5)
    """
    count = 0.0
    # One correctly placed <reasoning> tag earns 0.125
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    # One correctly placed </reasoning> tag earns 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    # One correctly placed <answer> tag earns 0.125,
    # minus a penalty for any text that follows </answer>
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001  # Penalize trailing content
    # One correctly placed </answer> tag earns 0.125,
    # minus a penalty for any text that follows </answer>
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001  # Penalize trailing content
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    """
    Score the XML tag usage of each response.

    Args:
        completions: List of model-generated completions
        **kwargs: Additional keyword arguments

    Returns:
        list[float]: XML format score for each response (at most 0.5)
    """
    # Pull the response text out of each completion
    contents = [completion[0]["content"] for completion in completions]
    # Score the tag usage of each response
    return [count_xml(c) for c in contents]

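# Add a function that checks the similarity between the reasoning text and the final answer, so the answer does not simply repeat the reasoning

def reasoning_length_reward_func(completions, max_length=1024, **kwargs) -> list[float]:
    """
    Reward longer reasoning text, up to a maximum of 2.0.

    The reward grows linearly with the length of the reasoning text until it
    reaches max_length characters, after which the full 2.0 is awarded.

    Args:
        completions: List of model-generated completions
        max_length: Number of characters that earns the maximum reward (default: 1024)
        **kwargs: Additional keyword arguments

    Returns:
        list[float]: Length-based reward for each response (between 0.0 and 2.0)
    """
    # Pull the response text out of each completion
    responses = [completion[0]["content"] for completion in completions]
    rewards = []

    for response in responses:
        try:
            # Extract the text between <reasoning> and </reasoning>
            reasoning_match = re.search(r"<reasoning>(.*?)</reasoning>", response, re.DOTALL)

            if reasoning_match:
                reasoning_text = reasoning_match.group(1).strip()
                text_length = len(reasoning_text)

                # Linear scaling: reward = 2.0 * min(1.0, text_length / max_length)
                # The full 2.0 is awarded at max_length characters or more
                reward = 2.0 * min(1.0, text_length / max_length)

                # Debug output
                # print(f"Reasoning length: {text_length} characters, reward: {reward:.2f}")

                rewards.append(reward)
            else:
                # No reasoning tags found
                rewards.append(0.0)
        except Exception:
            # Something went wrong while processing this response
            rewards.append(0.0)

    return rewards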




Dataset keys: dict_keys(['instruction', 'input', 'output'])


Map:   0%|          | 0/2008 [00:00<?, ? examples/s]
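
When writing regex-based format rewards like the ones in the cell above, keep in mind that `.` does not match a newline unless `re.DOTALL` is passed, which matters as soon as the reasoning spans several lines. The sketch below checks the strict pattern against two made-up responses; the sample strings are illustrative only.

```python
import re

# Same strict pattern as in strict_format_reward_func
pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"

# Made-up responses: one with single-line reasoning, one with two lines
single_line = "<reasoning>\nadd the numbers\n</reasoning>\n<answer>\n4\n</answer>\n"
multi_line = "<reasoning>\nstep one\nstep two\n</reasoning>\n<answer>\n4\n</answer>\n"

for name, text in [("single_line", single_line), ("multi_line", multi_line)]:
    plain = bool(re.match(pattern, text))                    # "." stops at newlines
    dotall = bool(re.match(pattern, text, flags=re.DOTALL))  # "." also matches newlines
    print(f"{name}: without DOTALL={plain}, with DOTALL={dotall}")
```

Without the flag, only the single-line response matches, so a model that reasons over several lines would never collect the format reward.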

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [6]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
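    use_vllm = True, # Use vLLM for generation, which speeds up sampling considerably
    learning_rate = 5e-6, # Learning rate
    adam_beta1 = 0.9, # Adam beta1: exponential decay rate of the first-moment estimate
    adam_beta2 = 0.99, # Adam beta2: exponential decay rate of the second-moment estimate
    weight_decay = 0.1, # Weight decay coefficient, a regularizer against overfitting
    warmup_ratio = 0.1, # Warm up the learning rate over the first 10% of training steps
    lr_scheduler_type = "cosine", # Cosine schedule, which lowers the learning rate smoothly
    optim = "paged_adamw_8bit", # 8-bit paged AdamW optimizer to reduce memory use
    logging_steps = 1, # Log after every step so training can be monitored live
    bf16 = is_bfloat16_supported(), # Use bfloat16 when the GPU supports it
    fp16 = not is_bfloat16_supported(), # Otherwise fall back to fp16 mixed precision
    per_device_train_batch_size = 1, # Per-device batch size; Unsloth raises it to match num_generations
    gradient_accumulation_steps = 4, # Accumulate gradients over 4 steps before each optimizer update (previously 1)
    num_generations = 6, # Number of completions sampled per prompt; affects diversity and memory use
    max_prompt_length = 1024, # Maximum prompt length in tokens; longer prompts are truncated
    max_completion_length = 1024,  # Maximum completion length in tokens
    # num_train_epochs = 1, # Number of full passes over the data; commented out because max_steps controls the run length
    max_steps = 100, # Maximum number of training steps; 100 is a quick experimental setting
    save_steps = 250, # Checkpoint interval in steps (larger than max_steps in this run)
    max_grad_norm = 0.1, # Gradient clipping threshold, guards against exploding gradients
    report_to = "none", # Experiment tracker; "none" disables external tools such as W&B
    output_dir = "outputs", # Output directory for models, logs, and checkpoints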
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6
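
The `learning_rate`, `warmup_ratio`, `lr_scheduler_type`, and `max_steps` settings in the configuration above together define a schedule: a linear ramp to the peak learning rate, then a cosine decay to zero. The sketch below approximates that shape in plain Python. The helper `lr_at` is hypothetical, and the exact way the library rounds the number of warmup steps may differ.

```python
import math

def lr_at(step: int, base_lr: float = 5e-6, total_steps: int = 100,
          warmup_ratio: float = 0.1) -> float:
    """Approximate learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase that has elapsed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 5, 10, 55, 100):
    print(step, f"{lr_at(step):.2e}")
```

With these settings the peak of 5e-6 is reached at step 10, and the rate has fallen back to half of that by step 55.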


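And let's run the trainer! While it runs, it prints a table of rewards like the sample below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps before the rewards start to move, and you will probably get 0 reward for the first 100 steps, so please be patient. Note that the configuration above stops at `max_steps = 100`; raise it if you want to train past that point.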

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
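
The `reward` and `reward_std` columns relate to how GRPO builds its learning signal: the summed rewards of the completions sampled for one prompt are normalized within that group, and the result is used as each completion's advantage. The following is a minimal sketch of that normalization, not the trainer's actual code; the helper `group_advantages`, the use of the sample standard deviation, and the `eps` value are assumptions, and the reward values are made up.

```python
from statistics import mean, stdev

def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    # eps keeps the division finite when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in rewards]

# Made-up total rewards for num_generations = 6 completions of one prompt
group = [0.125, 0.125, 0.5, 0.0, 0.125, 0.625]
adv = group_advantages(group)
print([round(a, 3) for a in adv])

# A group whose rewards are all equal carries no learning signal
print(group_advantages([0.125] * 6))
```

This is why a step whose completions all receive the same reward contributes nothing: every advantage is zero, so there is nothing to push the policy toward.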


In [None]:
trainer = GRPOTrainer(
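    model = model, # The model loaded earlier, already prepared with LoRA
    processing_class = tokenizer, # Tokenizer used for text processing and encoding
    reward_funcs = [
        xmlcount_reward_func, # Rewards correct use of the <reasoning> and <answer> tags
        soft_format_reward_func, # Loose check that the response follows the XML format
        strict_format_reward_func, # Strict check of the XML format, including line breaks and order
        int_reward_func, # Rewards answers that are integers
        # correctness_reward_func, # Compares against the reference answer (disabled)
        reasoning_length_reward_func, # Rewards longer reasoning text
    ],
    # Training configuration defined above
    args = training_args,
    # Training dataset, preprocessed into the prompt format (the answer field is currently disabled)
    train_dataset = dataset,
)

# Start training.
# The model adjusts its generation policy through reinforcement learning, using feedback from the reward functions above.
# The goal is to learn to produce XML-formatted responses containing the reasoning and the final answer.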
trainer.train()

<a name="Inference"></a>
### Inference

Now let's try the model we just trained! First, let's try the model without the GRPO-trained LoRA:

In [None]:
# The prompt asks: "Create some GLSL code about a bird for me."
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "给我创建一个有关于鸟的 glsl 代码"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

NameError: name 'tokenizer' is not defined

In [7]:
print(output)

首先，我们来分析给定方程 \(\sqrt{a} - \sqrt{a} + x = x\)。

这个方程可以简化为：\(0 + x = x\) 或者简写为 \(x = x\)。这看起来像是一个恒等式，它对于所有定义域内的 \(x\) 都成立，因此在某种意义上，这表示方程对于所有 \(x\) 都是正确的。但是，我们需要考虑到原始方程中 \(a > 1\) 的条件。实际上，原始方程简化后的等式 \(x = x\) 并没有提供关于 \(x\) 的额外限制，所以它在 \(x\) 的任何值上都成立。

给定 \(a > 1\) 并不会影响到 \(x = x\) 的结论，因为在 \(x = x\) 的情况下，\(x\) 可以是任何实数。因此，没有特定的 \(x\) 值被排除在可能的解之外，说明这个方程的实数解的集合是无限的，它包含了所有的实数。

所以，如果方程 \(\sqrt{a} - \sqrt{a} + x = x\) 的解是所有 \(x\) 的实数，那么解的和依然是所有实数的和，而在数学中，所有实数的和并不存在一个具体的数值，它是未定义的。

总结来说，当 \(a > 1\) 时，方程 \(\sqrt{a} - \sqrt{a} + x = x\) 的实数解之和为未定义。


And now with the LoRA we just trained with GRPO. We save the LoRA first!

In [8]:
model.save_lora("grpo_saved_lora") # Save the LoRA adapters

Now we load the LoRA and test:

In [9]:
# The prompt asks for the sum of the real solutions of the equation when a > 1
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "如果 a > 1，则 √︁ a−√ a + x = x 的实数解之和等于?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:04<00:00,  4.81s/it, est. speed input: 12.49 toks/s, output: 67.43 toks/s]


'<reasoning>\n要解决这个问题，首先需要理解和简化给定的方程 \\(\\sqrt{a} - \\sqrt{a} + x = x\\)。我们可以注意到 \\(\\sqrt{a} - \\sqrt{a} = 0\\)，这意味着原方程简化为 \\(0 + x = x\\)，即 \\(x = x\\)。这意味着原方程对于所有实数 \\(x\\) 都成立，这意味着任何实数都是这个方程的解。由于该方程对于所有 \\(x\\) 都是成立的，所以实数解的集合包含了所有实数。如果要求解的实数解之和，由于实数集合包含所有的实数，而实数集合没有一个确定的和，因此解的和将没有一个明确的数值。但若严格按照题意求所有可能的实数解之和，结果将为所有实数的平均值，这在现实中是不存在的。但是，根据题目的逻辑，实际上每个实数解相加的结果还是保持不变，也就是说，原方程给定条件不影响结果，解仍然是所有实数，且没有一个具体的实数和。考虑到以上情况，我们可以得出结论实数解之和为0，由于题目没有明确限定 \\(x\\) 的范围，假设 \\(x\\) 的取值从负无穷大到正无穷大，实数解之和可以理解为所有 \\(x\\) 的取值相加为0（即中性值）。\n</reasoning>\n<answer>\n0\n</answer>\n'

In [10]:
print(output)

<reasoning>
要解决这个问题，首先需要理解和简化给定的方程 \(\sqrt{a} - \sqrt{a} + x = x\)。我们可以注意到 \(\sqrt{a} - \sqrt{a} = 0\)，这意味着原方程简化为 \(0 + x = x\)，即 \(x = x\)。这意味着原方程对于所有实数 \(x\) 都成立，这意味着任何实数都是这个方程的解。由于该方程对于所有 \(x\) 都是成立的，所以实数解的集合包含了所有实数。如果要求解的实数解之和，由于实数集合包含所有的实数，而实数集合没有一个确定的和，因此解的和将没有一个明确的数值。但若严格按照题意求所有可能的实数解之和，结果将为所有实数的平均值，这在现实中是不存在的。但是，根据题目的逻辑，实际上每个实数解相加的结果还是保持不变，也就是说，原方程给定条件不影响结果，解仍然是所有实数，且没有一个具体的实数和。考虑到以上情况，我们可以得出结论实数解之和为0，由于题目没有明确限定 \(x\) 的范围，假设 \(x\) 的取值从负无穷大到正无穷大，实数解之和可以理解为所有 \(x\) 的取值相加为0（即中性值）。
</reasoning>
<answer>
0
</answer>



Our reasoning model is much better. It's not always correct, since we only trained it for an hour or so, but it should improve if we extend the sequence length and train for longer!

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are allowed too. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
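
To compare the options by size, a back-of-the-envelope estimate is parameters times bits per weight. The sketch below covers only `f16` and `q8_0`, whose storage layouts are fixed; the K-quants mix several block types, so they are left out. The parameter count is a round assumed figure, the helper `estimate_gib` is hypothetical, and file metadata is ignored.

```python
# Back-of-the-envelope GGUF file sizes.
# ASSUMPTIONS: num_params is a round illustrative figure and metadata
# overhead is ignored, so treat the results as rough estimates only.
BITS_PER_WEIGHT = {
    "f16": 16.0,  # two bytes per weight
    "q8_0": 8.5,  # 32 int8 weights plus one fp16 scale per block: 34 bytes / 32 weights
}

def estimate_gib(num_params: float, quant: str) -> float:
    """Estimated size in GiB for a model with num_params weights."""
    return num_params * BITS_PER_WEIGHT[quant] / 8 / 1024**3

num_params = 3.0e9  # assumed: roughly a 3B-parameter model
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_gib(num_params, quant):.2f} GiB")
```

The `f16` estimate lands close to the "Loading model weights took 5.7701 GB" line in the vLLM log earlier in this notebook, which is a useful sanity check on the arithmetic.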

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

