# 基于GRPO的Qwen与Llama强化学习微调

DeepSeek R1论文中，使用强化学习算法GRPO（Group Relative Policy Optimization）将Base模型训练为推理模型（如下图所示）。

![R1&R1Zero](./R1&R1Zero.png)


本实验**设置一个奖励模型，利用GRPO+LoRA训练，使用GSM8K数据集**（Grade School Math 8K是一个包含8.5K高质量、语言多样的小学数学数据集。该数据集内问题需要2～8步推理来解决，并给出了以自然语言表述的上述解题思路及答案）。

**最终将原始开源的Qwen2.5-7B、Llama 3.1-8B（非R1-Distill系列模型）微调为具备推理思考链的模型。** 无需全参微调，最低只需要显存>15GB的GPU（单卡A10/V100/RTX4090即可）


**GRPO 与其他强化学习方法的对比:** GRPO可以在不需要价值函数模型的情况下高效优化响应。与PPO（近端策略优化）等方法相比，可以减少了内存和计算成本。可以提升策略优化的稳定性、数据利用率和收敛速度。

| **方法**    | **主要特点** | **优点** | **缺点** |
|------------|------------|----------|----------|
| **REINFORCE** | 经典的策略梯度方法 | 理论简单，易于实现 | 方差大，收敛慢，样本利用率低 |
| **TRPO（Trust Region Policy Optimization）** | 采用 KL 散度约束策略更新 | 更新稳定，避免策略崩溃 | 计算开销大，优化复杂 |
| **PPO（Proximal Policy Optimization）** | 使用裁剪（clipping）进行策略优化 | 计算开销较小，优化更稳定 | 样本利用率低，策略更新受限 |
| **SAC（Soft Actor-Critic）** | 基于熵的策略优化 | 适用于连续控制任务，稳定性好 | 计算量较大，难以调参 |
| **DDPG（Deep Deterministic Policy Gradient）** | 采用 Actor-Critic 结构 | 适用于高维连续控制任务 | 训练不稳定，探索不足 |
| **TD3（Twin Delayed DDPG）** | DDPG 的改进版本，减少 Q 估计误差 | 改进了 DDPG，避免过估计 | 仍然需要大量调参 |
| **GRPO（Group Relative Policy Optimization）** | 结合 PPO 与 TRPO 优势，采用平滑目标函数 | 样本利用率高，训练稳定，计算开销适中 | 可能需要针对不同任务调节参数 |


## 一、环境配置

首次运行需要安装如下依赖包

In [1]:
!pip install unsloth vllm
!pip install --upgrade pillow
# If you are running this notebook on local, you need to install `diffusers` too
!pip install diffusers
# Temporarily install a specific TRL nightly version
!pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting unsloth
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/f6/71/ec8540316214c21777d2ffb3f0408d06da93358d0402070f2a90353be2b3/unsloth-2025.2.5-py3-none-any.whl (181 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m181.2/181.2 KB[0m [31m10.9 MB/s[0m eta [36m0:00:00[0m
Collecting vllm
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/e7/c0/5b7f019aa798dedfb44c30971e9becf3c6a2db7dde311570178fa66c49c8/vllm-0.7.2-cp38-abi3-manylinux1_x86_64.whl (264.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m264.3/264.3 MB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting bitsandbytes
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/db/9d/9382259196d7ad7f3550702390081224e673a705e75b5660ee377b592fc0/bitsandbytes-0.45.2-py3-none-manylinux_2_24_x86_64.whl (69.7 MB)
[2K    

## 二、训练框架配置

Unsloth是一个微调加速的框架，能够加速模型训练。在使用 GRPO 强化学习算法前，需要利用Unsloth加速框架对其进行一个“补丁”操作，调用 FastLanguageModel 进行训练加速。

In [1]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 02-11 18:00:25 __init__.py:190] Automatically detected platform cuda.


加载Qwen2.5-7B-Instruct或Llama 3.1 8B Instruct, 并且设置训练超参数

In [2]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/mnt/data/LLM_Model/Qwen2.5-7B-Instruct", # 提前下载模型后填写路径，也可直接填写meta-llama/meta-Llama-3.1-8B-Instruct、Qwen/Qwen2.5-3B-Instruct自动下载模型
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.8, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.2.5: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA A10. Max memory: 21.988 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading /mnt/data/LLM_Model/Qwen2.5-7B-Instruct with actual GPU utilization = 79.17%
Unsloth: Your GPU has CUDA compute capability 8.6 with VRAM = 21.99 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 3.07 GB. Also swap space = 6 GB.
INFO 02-11 18:00:31 config.py:542] This model supports multiple tasks: {'score', 'generate', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 02-11 18:00:31 llm_engine.py:234] Initializing a V0 LLM



Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 02-11 18:00:35 model_runner.py:1115] Loading model weights took 14.3620 GB
INFO 02-11 18:00:35 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-11 18:00:36 worker.py:267] Memory profiling takes 1.76 seconds
INFO 02-11 18:00:36 worker.py:267] the current vLLM instance can use total_gpu_memory (21.99GiB) x gpu_memory_utilization (0.79) = 17.41GiB
INFO 02-11 18:00:36 worker.py:267] model weights take 14.36GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 0.88GiB; the rest of the memory reserved for KV Cache is 2.11GiB.
INFO 02-11 18:00:37 executor_base.py:110] # CUDA blocks: 2474, # CPU blocks: 7021
INFO 02-11 18:00:37 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 38.66x
INFO 02-11 18:00:39 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error 

Capturing CUDA graph shapes: 100%|██████████| 23/23 [00:11<00:00,  2.02it/s]

INFO 02-11 18:00:50 model_runner.py:1562] Graph capturing finished in 11 secs, took 1.26 GiB
INFO 02-11 18:00:50 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 15.83 seconds



Unsloth 2025.2.5 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


<a name="Data Prep"></a>
## 三、奖励函数定义


选用GSM8K（Grade School Math 8K）数据集 https://huggingface.co/datasets/openai/gsm8k ，利用脚本处理，并评估基于XML格式的回答生成任务的质量

In [3]:
import re
from datasets import load_dataset, Dataset

# 指导模型回答的系统提示（System Prompt）
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

# 规定了回答的格式，要求模型先写出推理过程 <reasoning>...</reasoning>，然后在 <answer>...</answer> 中提供答案。用于格式化模型的推理过程（Chain of Thought, CoT）和最终答案，便于评估。
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""


# 该函数从模型生成的回答中提取 <answer>...</answer> 之间的答案部分。
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

# GSM8K 的答案格式通常为 #### 42，该函数用于提取 #### 后面的标准答案。
def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# 加载 GSM8K 训练集数据。将数学问题转换成 Chat API 适配的格式。
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('/mnt/data/datasets/openai___gsm8k', 'default')[split] # type: ignore 手动下载数据集，config结构需要替换为default
    # data = load_dataset('openai/gsm8k', 'main')[split] # 默认下载数据集，有可能网络原因无法下载
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()


# 奖励函数（Reward Functions），用于衡量模型的回答质量
## 正确性评估：检查模型的 completions 结果是否等于标准 answer。奖励机制：完全正确 → 2.0 否则 → 0.0
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

## 格式检查，检查答案是否是整数
def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

## 严格 XML 结构检测
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

## 松散 XML 结构检测
def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]


## XML 标签完整性评分
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

<a name="Train"></a>
## 四、模型训练



取消分布式训练，避免环境变量等情况默认加载DDP

In [4]:
import os

for var in ["MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"]:
    print(f"{var}={os.environ.get(var)}")

if torch.distributed.is_initialized():
    print("发现已初始化的分布式进程组，开始销毁...")
    torch.distributed.destroy_process_group()
    print("分布式进程组销毁完成。")

MASTER_ADDR=None
MASTER_PORT=None
RANK=None
WORLD_SIZE=None
发现已初始化的分布式进程组，开始销毁...
分布式进程组销毁完成。


In [5]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 300,
    save_steps = 50,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "/mnt/data/Unsloth/LLM_outputs/Qwen2.5-7B-Instruct",
)

开始训练！如下为训练过程的奖励表。目标是看到`reward`列增加！

可能需要等待150到200步才能看到reward有明显增加。前100步奖励可能接近于0，请耐心等待！

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [6]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 300
 "-____-"     Number of trainable parameters = 80,740,352


-------------------- Question:
Ahmed and Emily are having a contest to see who can get the best grade in the class. There have been 9 assignments and Ahmed has a 91 in the class. Emily has a 92. The final assignment is worth the same amount as all the other assignments. Emily got a 90 on the final assignment. What is the minimum grade Ahmed needs to get to beat Emily if all grades are whole numbers? 
Answer:
100 
Response:
<reasoning>
To find the minimum grade Ahmed needs to get to beat Emily, we first need to calculate Emily's final average after the final assignment, and then determine what score Ahmed needs to surpass this average.

Emily's current average is 92 after 9 assignments. To calculate her final average, we need to find the total points she has and then add the points from the final assignment, which was a 90.

1. Calculate Emily's total points before the final assignment.
2. Add Emily's score from the final assignment to this total.
3. Divide the new total by 10 to find E

Step,Training Loss,reward,reward_std,completion_length,kl
1,-0.0,0.09375,0.0625,200.0,0.0
2,-0.0,-0.13175,0.212107,193.5,0.0
3,0.0,-0.233,0.103541,194.25,0.000467
4,0.0,0.687,1.124,197.75,0.00054
5,0.0,0.125,0.0,200.0,0.0004
6,0.0,-0.09525,0.330492,200.0,0.000325
7,0.0,-0.1235,0.208078,187.0,0.000516
8,0.0,-0.16175,0.09142,190.5,0.000441
9,0.0,-0.07925,0.322922,195.75,0.000772
10,0.0,0.08925,0.113831,109.5,0.000504


-------------------- Question:
The gauge on a water tank shows that the tank is 1/3 full of water. To fill the tank, 16 gallons of water are added. How many gallons of water does the tank hold when full? 
Answer:
24 
Response:
<reasoning>
If the tank is 1/3 full and 16 gallons of water are added to make it full, then those 16 gallons correspond to the difference between 3/3 (full) and 1/3 (current level), which is 2/3 of the tank's capacity.

We can set up a proportion to find the full capacity of the tank:
Let \( x \) be the full capacity of the tank in gallons. Then, \( \frac{2}{3}x = 16 \) gallons.

To find \( x \), we solve the equation:
\[ x = 16 \times \frac{3}{2} \]

</reasoning>
<answer>
\[ x = 24 \]

The tank holds 24 gallons of water when full.
</answer> 
Extracted:
\[ x = 24 \]

The tank holds 24 gallons of water when full.
-------------------- Question:
There are 15 tables in the school's cafeteria. Each table can seat 10 people. Usually, only 1/10 of the seats are left uns

TrainOutput(global_step=300, training_loss=0.0009408841898948594, metrics={'train_runtime': 2520.3605, 'train_samples_per_second': 0.119, 'train_steps_per_second': 0.119, 'total_flos': 0.0, 'train_loss': 0.0009408841898948594})

<a name="Inference"></a>
## 五、模型推理


In [7]:
from vllm import SamplingParams
from IPython.display import display, Markdown

sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)


在对刚刚微调前的模型评测之前，首先尝试一下没有经过任何GRPO训练的原始模型：

In [39]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)


output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

# 将输出转为Markdown格式
display(Markdown(f"```\n{output}\n```"))

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.83it/s, est. speed input: 67.67 toks/s, output: 27.43 toks/s]


```
There are two 'r's in the word "strawberry."
```

现在使用用GRPO训练的LoRA - 首先保存LoRA！

In [10]:
# model.save_lora("grpo_saved_lora_Qwen2.5-7B-Instruct")

Lora_path = "/mnt/data/Unsloth/LLM_outputs/LoraAdapters/Qwen2.5-7B-Instruct_GRPO"
Merge_model_path = "/mnt/data/Unsloth/LLM_outputs/vLLM/Qwen2.5-7B-Instruct_GRPO"

# Merge to 16bit
model.save_pretrained_merged(Merge_model_path, tokenizer, save_method = "merged_16bit",)
# Just LoRA adapters
model.save_pretrained_merged(Lora_path, tokenizer, save_method = "lora",)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 267.3 out of 377.37 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


  0%|          | 0/28 [00:00<?, ?it/s]
We will save to Disk and not RAM now.
100%|██████████| 28/28 [00:15<00:00,  1.77it/s]


Unsloth: Saving tokenizer... Done.
Done.
Unsloth: Saving tokenizer... Done.
 Done.h: Saving model...


In [46]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)


output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora(Lora_path),
)[0].outputs[0].text


# 将输出转为Markdown格式
display(Markdown(f"```\n{output}\n```"))

Processed prompts: 100%|██████████| 1/1 [00:04<00:00,  4.02s/it, est. speed input: 10.71 toks/s, output: 27.64 toks/s]


```
<reasoning>
To answer this question, we need to count the occurrences of the letter 'r' in the word "strawberry". The word "strawberry" contains the following instances of 'r': 'strawberry'. We can break it down as follows:
1. s
2. t
3. r
4. a
5. w
6. b
7. e
8. r
9. r
10. y

</reasoning>
<answer>
3
</answer>
```

In [34]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)


output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

# 将输出转为Markdown格式
display(Markdown(f"```\n{output}\n```"))

Processed prompts: 100%|██████████| 1/1 [00:18<00:00, 18.62s/it, est. speed input: 1.72 toks/s, output: 29.11 toks/s]


```
Calculating the value of pi (π) is an interesting problem that has fascinated mathematicians for centuries. Pi is an irrational number, meaning it cannot be expressed exactly as a simple fraction and its decimal representation goes on infinitely without repeating. 

There are many algorithms to approximate pi to any desired degree of accuracy. Here's a simple method using the Leibniz formula for π, which is an infinite series:

\[ \pi = 4 \times (1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \ldots) \]

Let's calculate the first few terms to get an approximation:

\[ \pi \approx 4 \times \left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9}\right) \]

\[ \pi \approx 4 \times \left(1 - 0.3333 + 0.2 - 0.1429 + 0.1111\right) \]

\[ \pi \approx 4 \times (0.8349) \]

\[ \pi \approx 3.3396 \]

This is a very rough approximation. To get a more accurate value, we need to add more terms. For instance, adding more terms will yield a better approximation:

\[ \pi \approx 4 \times \left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \frac{1}{13} - \frac{1}{15} + \frac{1}{17} - \frac{1}{19}\right) \]

\[ \pi \approx 4 \times (0.8333333333333334) \]

\[ \pi \approx 3.3333333333333335 \]

Clearly, the Leibniz formula converges very slowly. For practical purposes, more advanced algorithms like the Chudnovsky algorithm or the Bailey–Borwein–Plouffe (BBP) formula are used to compute pi to many more decimal places efficiently.

If you need a specific number of decimal places, let me know, and I can provide the value of pi to that level of precision.
```

In [33]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora(Lora_path),
)[0].outputs[0].text

# output = model.fast_generate(
#     [text],
#     sampling_params = sampling_params,
#     lora_request = None,
# )[0].outputs[0].text

# 将输出转为Markdown格式
display(Markdown(f"```\n{output}\n```"))

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.98s/it, est. speed input: 9.55 toks/s, output: 27.65 toks/s]


```
<reasoning>
Calculating the exact value of pi is not possible due to its irrational nature. However, we can approximate pi to any desired degree of accuracy using various methods, such as the Monte Carlo method, the Gregory-Leibniz series, or the Bailey–Borwein–Plouffe (BBP) formula. For practical purposes, pi is often approximated to 3.14159.
</reasoning>
<answer>
3.14159 (approximation)
</answer>
```