To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

**Read our [blog post](https://unsloth.ai/blog/r1-reasoning) for guidance on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Unsloth

Use `PatchFastRL` before all functions to patch GRPO and other RL algorithms!

In [2]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

In [None]:
from unsloth import is_bfloat16_supported
import torch
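max_seq_length = 1024 # Maximum sequence length; increase it to allow longer reasoning traces
lora_rank = 16 # LoRA rank; larger values add capacity but slow down training

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/home/projects/unsloth-training/models/Qwen2.5-3B-Instruct", # Local path to the pretrained model, or a Hugging Face model name
    max_seq_length = max_seq_length, # Maximum sequence length
    load_in_4bit = False, # False loads 16-bit weights; set True to load in 4-bit and save VRAM
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank, # Maximum LoRA rank
    gpu_memory_utilization = 0.9, # Fraction of GPU memory to use; lower it if you run out of memory
)

model = FastLanguageModel.get_peft_model(
    model, # The model loaded above
    r = lora_rank, # LoRA rank; suggested values are 8, 16, 32, 64, or 128
    target_modules = ["gate_proj", "up_proj", "down_proj",], # Modules that receive LoRA adapters (MLP projections only)
    lora_alpha = lora_rank, # LoRA scaling factor, usually set equal to r
    use_gradient_checkpointing = "unsloth", # Gradient checkpointing to support long-context finetuning
    random_state = 3407,  # Random seed for reproducibility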
)

INFO 03-11 13:42:26 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.8: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.7.3.
   \\   /|    NVIDIA GeForce RTX 4070 Ti SUPER. Num GPUs = 1. Max memory: 15.992 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading /home/projects/unsloth-training/models/Qwen2.5-3B-Instruct with actual GPU utilization = 73.56%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 15.99 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 5.96 GB. Also swap space = 2 GB.
INFO 03-11 13:43:10 config.py:549] This model supports multiple tasks: {'classif



Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 03-11 13:43:25 model_runner.py:1115] Loading model weights took 5.7701 GB
INFO 03-11 13:43:25 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-11 13:43:27 worker.py:267] Memory profiling takes 1.31 seconds
INFO 03-11 13:43:27 worker.py:267] the current vLLM instance can use total_gpu_memory (15.99GiB) x gpu_memory_utilization (0.74) = 11.76GiB
INFO 03-11 13:43:27 worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 4.90GiB.
INFO 03-11 13:43:27 executor_base.py:111] # cuda blocks: 8918, # CPU blocks: 3640
INFO 03-11 13:43:27 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 139.34x
INFO 03-11 13:43:27 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error o

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:15<00:00,  1.76it/s]

INFO 03-11 13:43:43 model_runner.py:1562] Graph capturing finished in 15 secs, took 0.37 GiB
INFO 03-11 13:43:43 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 17.34 seconds



Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.
Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.3.8 patched 36 layers with 36 QKV layers, 0 O layers and 36 MLP layers.
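
As a rough cross-check of the adapter size, here is a minimal sketch that counts the LoRA parameters added to the three MLP projections targeted above. The hidden size (2048) and intermediate size (11008) are assumptions taken from the published Qwen2.5-3B configuration, not from this notebook, and the helper names are hypothetical.

```python
# Assumed Qwen2.5-3B dimensions (from the model's published config, not from this notebook).
HIDDEN_SIZE = 2048
INTERMEDIATE_SIZE = 11008
NUM_LAYERS = 36  # consistent with the "patched 36 layers" log line


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


def mlp_lora_params(rank: int) -> int:
    """Total LoRA parameters when only gate_proj, up_proj, and down_proj are adapted."""
    per_layer = (
        lora_params(HIDDEN_SIZE, INTERMEDIATE_SIZE, rank)    # gate_proj
        + lora_params(HIDDEN_SIZE, INTERMEDIATE_SIZE, rank)  # up_proj
        + lora_params(INTERMEDIATE_SIZE, HIDDEN_SIZE, rank)  # down_proj
    )
    return per_layer * NUM_LAYERS


print(mlp_lora_params(16))  # 22560768
```

With rank 16 this gives 22,560,768, which agrees with the `Trainable parameters` figure printed in the training log further down.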


### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!
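
To make the data shapes concrete, here is a minimal sketch of how one Alpaca-style row becomes a chat prompt, and of the nested structure in which completions reach a reward function. It uses plain dictionaries rather than the `datasets` library; `build_prompt`, the placeholder system prompt, and the sample row are hypothetical.

```python
# Placeholder system prompt; the notebook uses a longer Chinese prompt.
SYSTEM_PROMPT = "You generate GLSL code inside <reasoning> and <code> tags."


def build_prompt(row: dict) -> list[dict]:
    """Turn one dataset row into a two-message chat prompt."""
    # Prefer the 'instruction' field and fall back to 'input' if it is missing.
    user_text = row["instruction"] if "instruction" in row else row.get("input", "")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]


row = {"instruction": "Generate a glow shader", "input": "", "output": "..."}
prompt = build_prompt(row)

# Reward functions receive one list of messages per completion.
completions = [
    [{"role": "assistant", "content": "<reasoning>\n...\n</reasoning>\n<code>\nvoid main() {}\n</code>\n"}],
]
responses = [completion[0]["content"] for completion in completions]
print(prompt[1]["content"], len(responses))
```

Every reward function in this notebook starts with the same `completion[0]["content"]` unpacking step and returns one float per completion.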

In [None]:
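# Imports
import re  # Regular expressions, used for matching and extracting text
from datasets import load_dataset, Dataset  # Dataset loading and processing

# System prompt that fixes the response format. It is kept in Chinese to match the dataset.
# In English: "You are a GLSL file generator. You write complete, runnable code for the user's
# request, use no external resources such as textures (purely procedural generation), and
# answer in the following format:"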
SYSTEM_PROMPT = """
你是一个 glsl 文件制造机，会根据用户的问题生成可运行的完整代码，你不会使用任何外部资源例如贴图，而是使用纯程序化的生成。并且你的回答遵循以下格式：
<reasoning>
...
</reasoning>
<code>
...
</code>
"""

# Chain-of-thought template in the XML-style format
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<code>
{answer}
</code>
"""

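def extract_xml_answer(text: str) -> str:
    """
    Extract the content of the <code> tag from a tagged response.

    Args:
        text: Response text containing the XML-style tags.

    Returns:
        str: The extracted content with surrounding whitespace stripped.
    """
    answer = text.split("<code>")[-1]  # Keep everything after the last <code> tag
    answer = answer.split("</code>")[0]  # Keep everything before the closing </code> tag
    return answer.strip()  # Strip surrounding whitespace

def extract_hash_answer(text: str) -> str | None:
    """
    Extract the answer that follows a #### marker (used by some dataset formats).

    Args:
        text: Text that may contain a #### marker.

    Returns:
        str | None: The extracted answer, or None if there is no #### marker.
    """
    if "####" not in text:  # No marker, so there is nothing to extract
        return None
    return text.split("####")[1].strip()  # Take the text after the marker and strip whitespace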

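# Dataset loading helper (the name is kept from the original GSM8K notebook; here it loads a local JSONL file)
def get_gsm8k_questions(split = "train", local_path="/home/projects/unsloth-training/datasets/ruozhiba_R1/alpaca_output.jsonl") -> Dataset:
    """
    Load the dataset from a local path and convert it to chat prompts.

    Args:
        split: Dataset split, "train" by default.
        local_path: Path to the local JSONL dataset.

    Returns:
        Dataset: The processed dataset.
    """
    # Load the dataset from the local path
    data = load_dataset('json', data_files=local_path, split=split)

    # Inspect the dataset structure by printing the keys of the first example
    example = data[0]
    print("Dataset keys:", example.keys())

    # Map each example to the chat format expected by the trainer
    data = data.map(lambda x: {
        'prompt': [
            # The system prompt is the first message
            {'role': 'system', 'content': SYSTEM_PROMPT},
            # The user question: prefer the 'instruction' field, otherwise fall back to 'input'
            {'role': 'user', 'content': x['instruction'] if 'instruction' in x else x.get('input', '')}
        ],
        # No reference answer is defined here
    })
    return data

# Load and process the dataset
dataset = get_gsm8k_questions(local_path="/home/projects/unsloth-training/datasets/ruozhiba_R1/alpaca_output.jsonl")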

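# Reward functions used to score the quality of the model's completions

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """
    Score correctness by comparing the extracted answer with a reference answer.
    Not used in this notebook, because the dataset has no `answer` column.

    Args:
        prompts: Prompts given to the model.
        completions: Completions generated by the model.
        answer: Reference answers.
        **kwargs: Additional keyword arguments.

    Returns:
        list[float]: 2.0 for a correct answer, 0.0 otherwise.
    """
    # Extract the response text from each completion
    responses = [completion[0]['content'] for completion in completions]
    # The question text of the current prompt
    q = prompts[0][-1]['content']
    # Extract the tagged answer from each response
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # Debug output: question, reference answer, raw response, and extracted answer
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    # 2.0 if the extracted answer equals the reference answer, 0.0 otherwise
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

# Defined for completeness; it is not passed to the trainer below.
def int_reward_func(completions, **kwargs) -> list[float]:
    """
    Check whether the extracted answer is an integer.

    Args:
        completions: Completions generated by the model.
        **kwargs: Additional keyword arguments.

    Returns:
        list[float]: 0.5 if the answer is an integer, 0.0 otherwise.
    """
    # Extract the response text from each completion
    responses = [completion[0]['content'] for completion in completions]
    # Extract the tagged answer from each response
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # 0.5 if the extracted answer is a digit string, 0.0 otherwise
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]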

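def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Strictly check that the response follows the required XML-style format.

    The response must have exactly this layout:
    <reasoning>
    [reasoning, may span several lines]
    </reasoning>
    <code>
    [code, may span several lines]
    </code>

    Args:
        completions: Completions generated by the model.
        **kwargs: Additional keyword arguments.

    Returns:
        list[float]: 0.5 if the format matches, 0.0 otherwise.
    """
    # Strict pattern: exact opening and closing tags, each on its own line
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<code>\n.*?\n</code>\n$"
    # Extract the response text from each completion
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." match newlines, so multi-line reasoning and code can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    # 0.5 for a match, 0.0 otherwise
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Loosely check that the response follows the XML-style format.

    The response only needs to contain a <reasoning> block followed by a <code> block;
    newlines around the tags are not required.

    Args:
        completions: Completions generated by the model.
        **kwargs: Additional keyword arguments.

    Returns:
        list[float]: 0.5 if the format matches, 0.0 otherwise.
    """
    # Loose pattern: only the tags are required, with no constraint on line breaks
    pattern = r"<reasoning>.*?</reasoning>\s*<code>.*?</code>"
    # Extract the response text from each completion
    responses = [completion[0]["content"] for completion in completions]
    # re.search finds the blocks anywhere in the response; re.DOTALL lets "." match newlines
    matches = [re.search(pattern, r, flags=re.DOTALL) for r in responses]
    # 0.5 for a match, 0.0 otherwise
    return [0.5 if match else 0.0 for match in matches]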

def count_xml(text) -> float:
    """
    Score how well the XML-style tags are used.

    Args:
        text: The text to score.

    Returns:
        float: At most 0.5; the trailing-text penalty can push the score below zero.
    """
    count = 0.0
    # <reasoning> opening tag used exactly once: +0.125
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    # </reasoning> closing tag used exactly once: +0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    # <code> opening tag used exactly once: +0.125,
    # minus a penalty for any text after the closing </code> tag
    if text.count("\n<code>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</code>\n")[-1])*0.001  # Penalize trailing text
    # </code> closing tag used exactly once: +0.125,
    # minus a penalty for any text after it
    if text.count("\n</code>") == 1:
        count += 0.125
        count -= (len(text.split("\n</code>")[-1]) - 1)*0.001  # Penalize trailing text
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    """
    Score the use of XML-style tags in each response.

    Args:
        completions: Completions generated by the model.
        **kwargs: Additional keyword arguments.

    Returns:
        list[float]: One tag-usage score per response (at most 0.5, possibly negative).
    """
    # Extract the response text from each completion
    contents = [completion[0]["content"] for completion in completions]
    # Score the tag usage of each response
    return [count_xml(c) for c in contents]


def glsl_validation_reward_func(prompts, completions, **kwargs) -> list[float]:
    """
    Check whether the GLSL code inside the <code> tag compiles without errors.

    Args:
        prompts: Prompts given to the model (used only for debug output).
        completions: Completions generated by the model.
        **kwargs: Additional keyword arguments.

    Returns:
        list[float]: 3.0 if glslangValidator accepts the code, -0.5 if it rejects it,
        and 0.0 if there is no <code> block or the validator could not be run.
    """
    import subprocess
    import tempfile
    import os
    from pathlib import Path

    # Extract the response text from each completion
    responses = [completion[0]["content"] for completion in completions]
    rewards = []

    for response in responses:
        try:
            # Extract the text between <code> and </code>
            code_match = re.search(r"<code>(.*?)</code>", response, re.DOTALL)

            if not code_match:
                rewards.append(0.0)
                continue

            code = code_match.group(1).strip()

            # Write the GLSL code to a temporary file
            with tempfile.NamedTemporaryFile(suffix='.glsl', delete=False) as tmp:
                tmp_path = Path(tmp.name)
                # Wrap the code in a basic main function if it has none
                if "void main()" not in code:
                    code = f"void main() {{\n    {code}\n}}"
                tmp.write(code.encode('utf-8'))

            # Validate the code with glslangValidator
            # Adjust the command path to suit your environment
            try:
                result = subprocess.run(
                    ['glslangValidator', '-S', 'frag', str(tmp_path)],
                    capture_output=True,
                    text=True,
                    timeout=10  # Time out so a stuck validator cannot block training
                )
                # Debug output: the question, the response being validated, and the validator result
                q = prompts[0][-1]['content']
                print('-'*20, f"Question:\n{q}", f"\nResponse:\n{response}" + f"\nglsl Test: {result.returncode}")
                # A return code of 0 means the code is valid
                reward = 3.0 if result.returncode == 0 else -0.5
                rewards.append(reward)
            except (subprocess.SubprocessError, FileNotFoundError) as e:
                # The tool is missing or failed to run
                print(f"GLSL validation could not run: {e}")
                rewards.append(0.0)
            finally:
                # Clean up the temporary file
                os.unlink(tmp_path)

        except Exception as e:
            print(f"GLSL validation error: {e}")
            rewards.append(0.0)

    return rewards


# Defined for completeness; it is not passed to the trainer below.
def glsl_code_detection_reward_func(completions, **kwargs) -> list[float]:
    """
    Check whether the content of the <code> tag looks like GLSL code.

    Args:
        completions: Completions generated by the model.
        **kwargs: Additional keyword arguments.

    Returns:
        list[float]: 1.0 if the heuristic score reaches 0.6, otherwise the raw score;
        0.2 for very short code and 0.0 if there is no <code> block.
    """
    # Keywords and types that are characteristic of GLSL
    glsl_keywords = [
        'void', 'float', 'vec2', 'vec3', 'vec4', 'mat2', 'mat3', 'mat4',
        'uniform', 'varying', 'attribute', 'in', 'out', 'inout',
        'sampler2D', 'samplerCube', 'const', 'gl_Position', 'gl_FragColor',
        'gl_FragCoord', 'texture2D', 'texture', 'normalize', 'reflect',
        'refract', 'length', 'distance', 'dot', 'cross', 'mix', 'smoothstep',
        'clamp', 'fract', 'mod', 'sin', 'cos', 'tan', 'asin', 'acos', 'atan',
        'pow', 'exp', 'log', 'sqrt', 'inversesqrt', 'abs', 'sign', 'floor',
        'ceil', 'min', 'max', 'step', 'dFdx', 'dFdy'
    ]

    # Extract the response text from each completion
    responses = [completion[0]["content"] for completion in completions]
    rewards = []

    for response in responses:
        try:
            # Extract the text between <code> and </code>
            code_match = re.search(r"<code>(.*?)</code>", response, re.DOTALL)

            if not code_match:
                rewards.append(0.0)
                continue

            code = code_match.group(1).strip()

            # Very short code is unlikely to be a valid GLSL program
            if len(code) < 30:
                rewards.append(0.2)  # Give a low score
                continue

            # Accumulate the GLSL feature score
            score = 0.0

            # Count how many GLSL keywords appear
            keyword_count = 0
            for keyword in glsl_keywords:
                # Word boundaries (\b) ensure that whole keywords are matched
                pattern = r'\b' + re.escape(keyword) + r'\b'
                if re.search(pattern, code):
                    keyword_count += 1

            # Score by the number of matched keywords, capped at 0.5
            keyword_score = min(0.5, keyword_count / 5)
            score += keyword_score

            # Check for structures that are typical of GLSL
            structure_score = 0.0

            # A type name followed by a variable name
            if re.search(r'\b(vec[234]|mat[234]|float|int|bool)\s+\w+', code):
                structure_score += 0.2

            # A function definition
            if re.search(r'\b(void|float|vec[234]|mat[234]|int|bool)\s+\w+\s*\([^)]*\)\s*\{', code):
                structure_score += 0.2

            # A typical main function
            if re.search(r'void\s+main\s*\(\s*\)\s*\{', code):
                structure_score += 0.3

            # Calls to typical GLSL built-ins
            if re.search(r'(gl_\w+|texture2D|textureCube)\s*\(', code):
                structure_score += 0.3

            score += min(0.5, structure_score)

            # Final reward: a score of at least 0.6 counts as GLSL code
            final_reward = 1.0 if score >= 0.6 else score
            rewards.append(final_reward)

        except Exception as e:
            print(f"GLSL code detection error: {e}")
            rewards.append(0.0)

    return rewards

def reasoning_length_reward_func(completions, max_length=1024, **kwargs) -> list[float]:
    """
    Reward longer reasoning text, up to a maximum of 1.0.

    The reward grows linearly with the length of the reasoning text until it
    reaches max_length characters; from there on the full reward of 1.0 is given.

    Args:
        completions: Completions generated by the model.
        max_length: Number of characters that earns the full reward (default: 1024).
        **kwargs: Additional keyword arguments.

    Returns:
        list[float]: Length-based rewards between 0.0 and 1.0.
    """
    # Extract the response text from each completion
    responses = [completion[0]["content"] for completion in completions]
    rewards = []

    for response in responses:
        try:
            # Extract the text between <reasoning> and </reasoning>
            reasoning_match = re.search(r"<reasoning>(.*?)</reasoning>", response, re.DOTALL)

            if reasoning_match:
                reasoning_text = reasoning_match.group(1).strip()
                text_length = len(reasoning_text)

                # Linear scaling: reward = 1.0 * min(1.0, text_length / max_length)
                # The full reward of 1.0 is reached at max_length characters or more
                reward = 1 * min(1.0, text_length / max_length)

                # Debug output
                # print(f"Reasoning length: {text_length} characters, reward: {reward:.2f}")

                rewards.append(reward)
            else:
                # No reasoning tag found
                rewards.append(0.0)
        except Exception as e:
            # Something went wrong while processing the response
            rewards.append(0.0)

    return rewards


Dataset keys: dict_keys(['instruction', 'input', 'output'])


Map:   0%|          | 0/554 [00:00<?, ? examples/s]
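
The format rewards depend on regular expressions that have to match responses spanning several lines. The sketch below, which assumes the `<reasoning>`/`<code>` layout from the system prompt and uses hypothetical sample strings, shows how the `re.DOTALL` flag and the choice between `re.match` and `re.search` change the result.

```python
import re

STRICT = r"^<reasoning>\n.*?\n</reasoning>\n<code>\n.*?\n</code>\n$"
SOFT = r"<reasoning>.*?</reasoning>\s*<code>.*?</code>"

# Hypothetical responses for illustration.
good = "<reasoning>\nline one\nline two\n</reasoning>\n<code>\nvoid main() {\n}\n</code>\n"
loose = "Sure!\n<reasoning>why</reasoning> <code>void main() {}</code>"
bad = "void main() {}"


def format_scores(text: str) -> tuple[float, float]:
    """Return (strict, soft) format rewards for one response."""
    strict = 0.5 if re.match(STRICT, text, flags=re.DOTALL) else 0.0
    soft = 0.5 if re.search(SOFT, text, flags=re.DOTALL) else 0.0
    return strict, soft


# Without re.DOTALL, "." stops at newlines and the multi-line response is rejected.
no_dotall = re.match(STRICT, good)

print(format_scores(good), format_scores(loose), format_scores(bad), no_dotall)
```

If both format rewards stay at exactly 0.0 throughout training, a pattern that cannot cross line breaks is a likely cause.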

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!
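
The configuration below combines `warmup_ratio = 0.1` with a cosine scheduler. As a minimal sketch of what that schedule looks like, assuming the common formulation of linear warmup followed by cosine decay to zero (the function name `lr_at` is hypothetical, and the trainer's actual scheduler may differ in details):

```python
import math


def lr_at(step: int, base_lr: float = 5e-6, total_steps: int = 1000,
          warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step):.2e}")
```

With these settings the learning rate peaks at step 100 and is close to half its peak at step 550, midway through the decay.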

In [None]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
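    use_vllm = True, # Generate with vLLM, which speeds up sampling considerably
    learning_rate = 5e-6, # Learning rate for the LoRA parameters
    adam_beta1 = 0.9, # Adam beta1: decay rate of the first-moment estimate
    adam_beta2 = 0.99, # Adam beta2: decay rate of the second-moment estimate
    weight_decay = 0.1, # Weight decay, a regularizer against overfitting
    warmup_ratio = 0.1, # Warm up the learning rate over the first 10% of training steps
    lr_scheduler_type = "cosine", # Cosine schedule, which lowers the learning rate smoothly
    optim = "paged_adamw_8bit", # Paged 8-bit AdamW, which reduces optimizer memory
    logging_steps = 1, # Log every step to monitor training closely
    bf16 = is_bfloat16_supported(), # Use bfloat16 when the GPU supports it
    fp16 = not is_bfloat16_supported(), # Otherwise fall back to fp16 mixed precision
    per_device_train_batch_size = 1, # Per-device batch size; Unsloth raises it to match num_generations
    gradient_accumulation_steps = 4, # Accumulate gradients over 4 batches per optimizer step (1 would update on every batch)
    num_generations = 6, # Completions sampled per prompt; affects diversity and memory use
    max_prompt_length = 256, # Maximum prompt length in tokens; longer prompts are truncated
    max_completion_length = 2048,  # Maximum completion length in tokens. Note: this exceeds max_seq_length = 1024 set above
    # num_train_epochs = 1, # Number of full epochs; commented out because max_steps controls training length
    max_steps = 1000, # Maximum number of training steps (use about 100 for a quick experiment)
    save_steps = 250, # Save a checkpoint every 250 steps, for resuming or evaluation
    max_grad_norm = 0.1, # Gradient clipping threshold, which guards against exploding gradients
    report_to = "tensorboard", # Log to TensorBoard; use "none" to disable, or another tool such as W&B
    output_dir = "outputs_1", # Output directory for the model, logs, and checkpoints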
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6
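
The message above affects how many completions and prompts feed each optimizer step. Here is a minimal sketch of that arithmetic, assuming the batch size is simply raised to `num_generations` when it is not already a multiple (as the message states for this configuration) and that each prompt contributes `num_generations` completions. The function name and the returned keys are hypothetical.

```python
def grpo_batch_plan(per_device_batch: int, num_generations: int,
                    grad_accum: int, num_gpus: int = 1) -> dict:
    """Effective batch sizes for one optimizer step under the stated assumptions."""
    # Assumption: a batch size that is not a multiple of num_generations is replaced by it.
    if per_device_batch % num_generations != 0:
        per_device_batch = num_generations
    total_completions = per_device_batch * grad_accum * num_gpus
    return {
        "per_device_batch": per_device_batch,
        "completions_per_step": total_completions,
        "prompts_per_step": total_completions // num_generations,
    }


print(grpo_batch_plan(per_device_batch=1, num_generations=6, grad_accum=4))
```

For the settings in this notebook the result is 24 completions per optimizer step, matching the `Total batch size (6 x 4 x 1) = 24` line in the training log, which corresponds to 4 distinct prompts under these assumptions.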


And let's run the trainer! While it runs, it prints a table of rewards like the one below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
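
The `reward` and `reward_std` columns matter because GRPO compares completions sampled for the same prompt against each other. A minimal sketch of that group-relative normalization, assuming the standard formulation (reward minus group mean, divided by group standard deviation plus a small epsilon); the function name and epsilon value are hypothetical, and the trainer's implementation may differ in details:

```python
from statistics import mean, stdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize the rewards of one prompt's completions into advantages."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample standard deviation; needs at least 2 rewards
    return [(r - mu) / (sigma + eps) for r in rewards]


# Six completions for one prompt: the better ones get positive advantages.
print(group_advantages([0.0, 0.0, 0.5, 0.5, 3.5, -0.5]))

# If every completion gets the same reward, all advantages are zero
# and that group contributes no learning signal.
print(group_advantages([0.125] * 6))
```

This is why a step with `reward_std` of 0 carries no policy gradient signal, even if the reward itself is nonzero.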


In [None]:
trainer = GRPOTrainer(
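    model = model, # The model prepared with LoRA adapters above
    processing_class = tokenizer, # The tokenizer, used for text processing and encoding
    reward_funcs = [
        xmlcount_reward_func, # Rewards correct use of the <reasoning> and <code> tags
        soft_format_reward_func, # Loose format check: the tags only need to be present
        strict_format_reward_func, # Strict format check, including line breaks and order
        # int_reward_func, # Rewards answers that are integers
        # correctness_reward_func, # Compares the answer with a reference answer
        glsl_validation_reward_func, # Rewards GLSL code that passes glslangValidator
        reasoning_length_reward_func, # Rewards longer reasoning text
        # glsl_code_detection_reward_func, # Heuristic check that the code looks like GLSL
    ],
    # Training arguments defined above
    args = training_args,
    # Training dataset, preprocessed into chat prompts (it has no answer column)
    train_dataset = dataset,
)

# Start training
# The model adjusts its generation policy through reinforcement learning, guided by the reward functions above
# The goal is to produce responses in the XML-style format, with a reasoning section and a final code section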
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 554 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 4 x 1) = 24
 "-____-"     Trainable parameters = 22,560,768/3,108,499,456 (0.73% trained)


-------------------- Question:
请帮我生成印度尼西亚峇里岛的海神庙日落shader代码 
Response:
为了创建一款简单的蒸汽波风格的Shader，我们可以通过调整着色器的输出颜色来实现一种复古、波纹的效果。下面的代码可以作为创建你的蒸汽波效果的起点：
```python
<reasoning>
根据用户需求的蒸汽波效果，我们提出了一个着色器，该着色器使顶点位置可以通过简单的变换，例如缩放和平移，而创建一些波纹效果。我们还通过旋转顶点来增加一些动态感。对于颜色，我们只改变其中的蓝色分量，以模拟蒸汽波色。在片段着色器中，返回一个带有轻微位移的顶点，以便在输出像素中形成类似于波纹的效果。
</reasoning>
<code>
#version 330 core
in vec3 position;
uniform vec3 rotation = vec3(1.0, 1.0, 1.0);
uniform mat4 model;
uniform mat4 view;
uniform mat4 projection;
out vec4 color;
void main(){
    // Rotates the vertex by the defined rotation amount.
    mat4 rot = mat4(cos(rotation.x * 0.5), sin(rotation.x * 0.5), 0, 0, -sin(rotation.x * 0.5), cos(rotation.x * 0.5), 0, 0, 0, 0, 1, 0, 0, 0, 0, 1);
    mat4 rot_y = mat4(cos(rotation.y * 0.5), 0, -sin(rotation.y * 0.5), 0, 0, 1, 0, 0, sin(rotation.y * 0.5), 0, cos(rotation.y * 0.5), 0, 0, 0, 0, 1);
    mat4 rot_z = mat4(cos(rotation.z * 0.5), 0, 0, sin(rotation.z * 0.5), 0, 1, 0, -sin(rotation.z * 0.5), 0, 0, cos(rotation.z * 0.

Step,Training Loss,reward,reward_std,completion_length,kl,rewards / xmlcount_reward_func,rewards / soft_format_reward_func,rewards / strict_format_reward_func,rewards / glsl_validation_reward_func,rewards / reasoning_length_reward_func,rewards / glsl_code_detection_reward_func
1,0.0003,-0.019885,0.373775,660.583344,0.006789,-0.140083,0.0,0.0,-0.083333,0.036865,0.166667
2,0.0003,-0.044036,0.280802,666.708359,0.007089,-0.082,0.0,0.0,-0.020833,0.017131,0.041667
3,0.0006,0.19599,0.484733,616.125015,0.01618,-0.071792,0.0,0.0,0.083333,0.059448,0.125
4,0.0002,0.0,0.0,582.083344,0.004443,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0001,0.317964,0.598919,601.875031,0.003355,-0.105417,0.0,0.0,0.1875,0.027547,0.208333
6,0.0002,0.158777,0.298948,673.083344,0.00466,-0.043208,0.0,0.0,0.104167,0.014486,0.083333
7,0.0001,-0.01377,0.263631,649.541687,0.002951,-0.075375,0.0,0.0,-0.041667,0.019938,0.083333
8,0.0001,0.01169,0.180634,651.791687,0.002515,-0.08975,0.0,0.0,-0.041667,0.059774,0.083333
9,0.0002,0.15815,0.576848,623.416687,0.003885,-0.076917,0.0,0.0,0.083333,0.026733,0.125
10,0.0001,0.411445,0.879498,604.208344,0.002921,-0.052667,0.0,0.0,0.1875,0.026611,0.25


-------------------- Question:
请帮我生成水晶生长过程的shader代码 
Response:
生成一个模拟水晶生长过程的GLSL shader代码是一个复杂的任务，因为它涉及到物理过程的模拟和可视化。为了简化这个过程，我将尝试创造一个简单的模拟模型，使用一个由极简几何图形构成的“水晶”，并添加一些基本的生长动画效果。这一部分将包括基本的色散和时间演化，展示一种生长效果。实际上，这样的效果在真实世界中会有非常复杂和细致的物理处理，但以下代码提供了很基本的可视化效果。

```glsl
// 晶体生长的GLSL Shader
#version 330 core

// 输入和输出变量
layout (location = 0) in vec3 VertexPosition;
layout (location = 1) in vec2 vertexUV;
out vec2 TexCoords;

// 模型的变换矩阵和光源的位置
uniform mat4 _Model;
uniform mat4 _View;
uniform mat4 _Projection;
uniform vec3 _LightPos;

// 模拟生长的参数
uniform float time;

void main() {
    // Position and UV generation
    // Here we make a very simple crystal shape for illustration
    TexCoords = vertexUV;
    float radius = 0.25 + 0.5 * sin(time * 2.0 + VertexPosition.x * 2.0);

    // Create a slight gradient around the center
    float gradientStrength = 0.1;
    float gradientDistance = 0.5;
    float distanceToCenter = length(VertexPosition);
    float gradient = (gradientDistance - distanceToCenter

TrainOutput(global_step=100, training_loss=0.0003112851825062535, metrics={'train_runtime': 5384.2483, 'train_samples_per_second': 0.446, 'train_steps_per_second': 0.019, 'total_flos': 0.0, 'train_loss': 0.0003112851825062535})

<a name="Inference"></a>
### Inference

Now let's try the model we just trained! First, let's try the base model without the GRPO-trained LoRA:

In [10]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "请帮我生成辉光效果的shader代码"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:08<00:00,  8.09s/it, est. speed input: 4.70 toks/s, output: 83.49 toks/s]


'生成辉光效果的着色器代码可以用于增强场景的视觉效果，比如突出灯光或者物体的边缘。以下是一个简单的辉光效果着色器代码示例。这个示例假设我们使用的是OpenGL ES 2.0。在实际应用中，你可能需要根据具体的需求和平台进行调整。\n\n### 顶点着色器 (Vertex Shader)\n\n```glsl\nattribute vec4 position;\nattribute vec4 color;\n\nuniform mat4 projection;\nuniform mat4 view;\n\nvarying vec4 vColor;\n\nvoid main() {\n    gl_Position = projection * view * position;\n    vColor = color;\n}\n```\n\n### 片段着色器 (Fragment Shader)\n\n```glsl\nprecision mediump float;\n\nvarying vec4 vColor;\n\nuniform vec3 lightPos;\nuniform vec3 eyePos;\nuniform vec3 ambient;\nuniform vec3 diffuse;\nuniform vec3 specular;\nuniform float shininess;\n\nvoid main() {\n    // 从光源到片段的距离\n    vec3 lightDir = normalize(lightPos - gl_FragCoord.xyz);\n    vec3 viewDir = normalize(eyePos - gl_FragCoord.xyz);\n    vec3 normal = normalize(gl_NormalMatrix * gl_Normal);\n\n    // 计算环境光\n    vec3 ambient = ambient * vColor.rgb;\n\n    // 计算漫反射\n    vec3 lightDiffuse = diffuse * max(dot(normal, lightDir), 0.0);\n    vec3 lightDiffuseAttenuation = li

In [11]:
print(output)

生成辉光效果的着色器代码可以用于增强场景的视觉效果，比如突出灯光或者物体的边缘。以下是一个简单的辉光效果着色器代码示例。这个示例假设我们使用的是OpenGL ES 2.0。在实际应用中，你可能需要根据具体的需求和平台进行调整。

### 顶点着色器 (Vertex Shader)

```glsl
attribute vec4 position;
attribute vec4 color;

uniform mat4 projection;
uniform mat4 view;

varying vec4 vColor;

void main() {
    gl_Position = projection * view * position;
    vColor = color;
}
```

### 片段着色器 (Fragment Shader)

```glsl
precision mediump float;

varying vec4 vColor;

uniform vec3 lightPos;
uniform vec3 eyePos;
uniform vec3 ambient;
uniform vec3 diffuse;
uniform vec3 specular;
uniform float shininess;

void main() {
    // 从光源到片段的距离
    vec3 lightDir = normalize(lightPos - gl_FragCoord.xyz);
    vec3 viewDir = normalize(eyePos - gl_FragCoord.xyz);
    vec3 normal = normalize(gl_NormalMatrix * gl_Normal);

    // 计算环境光
    vec3 ambient = ambient * vColor.rgb;

    // 计算漫反射
    vec3 lightDiffuse = diffuse * max(dot(normal, lightDir), 0.0);
    vec3 lightDiffuseAttenuation = lightDiffuse * (1.0 / (dot(lightDir, lightDir) 

And now with the LoRA we just trained with GRPO. We save the LoRA first!

In [12]:
model.save_lora("grpo_saved_lora") # Save the LoRA adapter

Now we load the LoRA and test:

In [13]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "请帮我生成辉光效果的shader代码"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.38s/it, est. speed input: 10.98 toks/s, output: 78.22 toks/s]


'辉光效果的实现可以通过渲染多个光源的颜色信息到一个纹理，并在最终的渲染过程中对每个像素进行颜色混合来实现。以下是一个简单的GLSL辉光效果的Shader代码示例，该代码假设光源是点光源，且有多个光源需要渲染到辉光纹理中。我们在这里只考虑一个光源的情况，如果有多个光源，可以扩展代码以支持所有光源。\n\n辉光纹理将用于记录每个像素从光源处接收到的光强度，这可以通过将光源的位置和颜色信息映射到辉光纹理的像素上来实现。\n\n```glsl\n#version 330 core\n\n// 颜色和位置信息由外部提供\nin vec3 lightColor;\nin vec3 lightPosition;\nin vec3 lightIntensity;\nin vec2 TexCoord;\n\n// 辉光纹理的名称\nuniform sampler2D glowTexture;\n\n// 辉光纹理的宽度和高度\nuniform int glowTextureWidth;\nuniform int glowTextureHeight;\n\nout vec4 FragColor;\n\nvoid main()\n{\n    // 计算从光源到每个像素的方向\n    vec3 lightDirection = lightPosition - gl_FragCoord.xy / glowTextureWidth;\n\n    // 计算辉光值，辉光值越高，辉光效果越强\n    float glowValue = dot(lightDirection, normalize(lightDirection)) * lightIntensity;\n\n    // 获取辉光纹理中的颜色信息\n    vec4 glowColor = texture(glowTexture, TexCoord);\n\n    // 用辉光值混合辉光颜色\n    FragColor = vec4(glowColor.rgb * glowValue + lightColor, 1.0);\n}\n```\n\n为了使用这个Shader，你需要为每个光源提供颜色、位置和强度信息。这些数据需要通过Vertex Shader传递到这个Fragment Shader中。\n\nVertex Sh

In [14]:
print(output)

辉光效果的实现可以通过渲染多个光源的颜色信息到一个纹理，并在最终的渲染过程中对每个像素进行颜色混合来实现。以下是一个简单的GLSL辉光效果的Shader代码示例，该代码假设光源是点光源，且有多个光源需要渲染到辉光纹理中。我们在这里只考虑一个光源的情况，如果有多个光源，可以扩展代码以支持所有光源。

辉光纹理将用于记录每个像素从光源处接收到的光强度，这可以通过将光源的位置和颜色信息映射到辉光纹理的像素上来实现。

```glsl
#version 330 core

// 颜色和位置信息由外部提供
in vec3 lightColor;
in vec3 lightPosition;
in vec3 lightIntensity;
in vec2 TexCoord;

// 辉光纹理的名称
uniform sampler2D glowTexture;

// 辉光纹理的宽度和高度
uniform int glowTextureWidth;
uniform int glowTextureHeight;

out vec4 FragColor;

void main()
{
    // 计算从光源到每个像素的方向
    vec3 lightDirection = lightPosition - gl_FragCoord.xy / glowTextureWidth;

    // 计算辉光值，辉光值越高，辉光效果越强
    float glowValue = dot(lightDirection, normalize(lightDirection)) * lightIntensity;

    // 获取辉光纹理中的颜色信息
    vec4 glowColor = texture(glowTexture, TexCoord);

    // 用辉光值混合辉光颜色
    FragColor = vec4(glowColor.rgb * glowValue + lightColor, 1.0);
}
```

为了使用这个Shader，你需要为每个光源提供颜色、位置和强度信息。这些数据需要通过Vertex Shader传递到这个Fragment Shader中。

Vertex Shader示例如下：

```glsl
#version 330 core

in 

Our reasoning model is much better, although it is not always correct, since we only trained it for about an hour and a half (100 steps). It should improve if we extend the sequence length and train for longer!

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# # Merge to 16bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# # Merge to 4bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# # Just LoRA adapters
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# # Save to 8bit Q8_0
# if False: model.save_pretrained_gguf("model", tokenizer,)
# # Remember to go to https://huggingface.co/settings/tokens for a token!
# # And change hf to your username!
# if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# # Save to 16bit GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# # Save to q4_k_m GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# # Save to multiple GGUF options - much faster if you want multiple!
# if False:
#     model.push_to_hub_gguf(
#         "hf/model", # Change hf to your username!
#         tokenizer,
#         quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
#         token = "",
#     )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
