<a href="https://colab.research.google.com/github/ickma2311/mycolab/blob/main/LLM_GRPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GRPO Explanation
* [blog](https://huggingface.co/blog/NormalUhr/grpo)
* [paper](https://arxiv.org/abs/2402.03300)

[CN](https://photos.google.com/photo/AF1QipMyeNzqcwt2s9b-tpfxHpky1dTBSf7yfm1GOtGn)

## PPO’s Clipped Objective

The Proximal Policy Optimization (PPO) “clipped” surrogate objective, which is maximized during training (the loss is its negative), is defined as:

$$
{L} = \mathbb{E}\Bigl[\min\bigl(\rho(\theta)\,A,\;\mathrm{clip}(\rho(\theta),\,1-\epsilon,\,1+\epsilon)\,A\bigr)\Bigr].
$$

### Components
* Probability ratio

$\rho(\theta) = \frac{\pi_\theta(a\mid s)}{\pi_{\theta_{\rm old}}(a\mid s)}$

measures how much the new policy $\pi_\theta$ differs from the old policy $\pi_{\theta_{\rm old}}$ when taking action $a$ in state $s$.


* Advantage $A$

$A = r + \gamma\,V(s') - V(s)$

(or another estimator) reflects how much better taking action $a$ in state $s$ is than the baseline $V(s)$.

* Clipping

  * Unclipped term: $\rho(\theta)\,A$.    
  * Clipped term: $\mathrm{clip}(\rho(\theta),\,1-\epsilon,\,1+\epsilon)\,A$.    
  * We take the minimum of these two so that when $\rho$ moves outside $[1-\epsilon,1+\epsilon]$, the update is limited.    

* Expectation
$\mathbb{E}[\cdot]$ indicates averaging over all sampled state–action pairs (e.g., a batch).
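
The components above can be put together in a few lines of NumPy. This is only a minimal sketch: the function name `ppo_clip_objective`, the default $\epsilon = 0.2$, and the three-sample batch are illustrative assumptions, not values taken from this notebook. The ratio is computed from log-probabilities, which is how it is usually done in practice.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize)."""
    ratio = np.exp(logp_new - logp_old)                    # rho(theta)
    unclipped = ratio * advantages                         # rho * A
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()           # E[min(...)]

# Hypothetical batch of three state-action pairs (ratios 1.5, 1.0, 0.5).
logp_old = np.log(np.array([0.5, 0.2, 0.4]))
logp_new = np.log(np.array([0.75, 0.2, 0.2]))
advantages = np.array([1.0, 2.0, -1.0])

objective = ppo_clip_objective(logp_new, logp_old, advantages)
print(objective)  # per-sample terms are 1.2, 2.0 and -0.8, so the mean is about 0.8
```

A training loop would minimize the negative of this value.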

### Intuition
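1. No change.
When $\rho(\theta)$ stays inside $[1-\epsilon,1+\epsilon]$, the clipped and unclipped terms coincide and the objective reduces to the standard policy-gradient term $\rho A$.
2. Preventing large updates.
If $\rho$ deviates from 1 by more than $\epsilon$, the clipped term caps it at $1\pm\epsilon$.
   * For $A>0$, once $\rho>1+\epsilon$ the minimum selects the clipped term, which prevents overly large increases in the action's probability.
   * For $A<0$, once $\rho<1-\epsilon$ the minimum again selects the clipped term, which prevents overly large decreases. In the two remaining cases ($A>0$ with $\rho<1-\epsilon$, or $A<0$ with $\rho>1+\epsilon$) the unclipped term is selected, so the policy can still correct a move in the wrong direction.
3. Stable, efficient learning.
This design keeps the benefit of importance sampling (reusing samples collected under the old policy) while enforcing a trust-region-like constraint, all without requiring second-order methods.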
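
To make the case analysis concrete, the sketch below evaluates the four combinations of the ratio falling outside the clip range and the sign of the advantage, and reports which term the minimum selects. The helper `clipped_term`, the value $\epsilon = 0.2$, and the sample ratios are assumptions chosen for illustration.

```python
def clipped_term(rho, adv, eps=0.2):
    """Return min(rho*A, clip(rho)*A) and which of the two terms was selected."""
    lo, hi = 1.0 - eps, 1.0 + eps
    unclipped = rho * adv
    clipped = min(max(rho, lo), hi) * adv
    chosen = "clipped" if clipped < unclipped else "unclipped"
    return min(unclipped, clipped), chosen

# (rho, A) pairs: ratio outside [0.8, 1.2] on either side, for both signs of A.
cases = [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
results = {case: clipped_term(*case) for case in cases}

for (rho, adv), (value, chosen) in results.items():
    print(f"rho={rho}, A={adv:+}: min={value:+.2f} ({chosen})")
```

The clipped term is selected exactly when the ratio has already moved in the direction the advantage rewards (up for $A>0$, down for $A<0$); there the objective is constant in $\rho$, so its gradient vanishes and the update stops pushing further.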


# Evaluate the Original Model

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt


In [3]:


from unsloth import FastLanguageModel, is_bfloat16_supported
# import torch
max_seq_length = 4096 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-17 21:27:19 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-17 21:27:19 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.5.6: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/Qwen2.5-3B-Instruct with actual GPU utilization = 49.49%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 22.16 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens 

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/266 [00:00<?, ?B/s]

INFO 05-17 21:27:52 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='unsloth/Qwen2.5-3B-Instruct', speculative_config=None, tokenizer='unsloth/Qwen2.5-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/Qwen2.5-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enab

model-00001-of-00002.safetensors:   0%|          | 0.00/3.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

INFO 05-17 21:28:19 [weight_utils.py:281] Time spent downloading weights for unsloth/Qwen2.5-3B-Instruct: 23.328125 seconds


model.safetensors.index.json:   0%|          | 0.00/35.6k [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 05-17 21:28:21 [loader.py:458] Loading weights took 1.80 seconds
INFO 05-17 21:28:21 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 05-17 21:28:21 [gpu_model_runner.py:1347] Model loading took 5.9933 GiB and 26.907532 seconds
INFO 05-17 21:28:41 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/2d51735133/rank_0_0 for vLLM's torch.compile
INFO 05-17 21:28:41 [backends.py:430] Dynamo bytecode transform time: 19.47 s


Inductor Compilation: 100%|██████████| 4/4 [00:00<00:00,  4.79it/s, triton_poi_fused_cat_3]

INFO 05-17 21:28:47 [backends.py:136] Cache the graph of shape None for later use



Inductor Compilation: 100%|██████████| 8/8 [00:00<00:00, 13.06it/s, triton_poi_fused_cat_7]

INFO 05-17 21:29:45 [backends.py:148] Compiling a graph for general shape takes 61.47 s





INFO 05-17 21:32:12 [monitor.py:33] torch.compile takes 80.94 s in total
INFO 05-17 21:32:18 [kv_cache_utils.py:634] GPU KV cache size: 105,680 tokens
INFO 05-17 21:32:18 [kv_cache_utils.py:637] Maximum concurrency for 4,096 tokens per request: 25.80x
INFO 05-17 21:33:38 [gpu_model_runner.py:1686] Graph capturing finished in 79 secs, took 0.77 GiB
INFO 05-17 21:33:38 [core.py:159] init engine (profile, create kv cache, warmup model) took 316.20 seconds


Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.


Unsloth: Just some info: will skip parsing ['q_norm', 'pre_feedforward_layernorm', 'post_feedforward_layernorm', 'k_norm']
Unsloth: Just some info: will skip parsing ['q_norm', 'pre_feedforward_layernorm', 'post_feedforward_layernorm', 'k_norm']



## Evaluate


In [4]:
# %%capture

from datasets import load_dataset


ds = load_dataset("TIGER-Lab/MMLU-Pro")

README.md:   0%|          | 0.00/10.9k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/4.16M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/45.3k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/12032 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/70 [00:00<?, ? examples/s]

In [5]:
test_ds=ds['test'].select(range(1000))

In [6]:
from string import ascii_uppercase
ascii_uppercase

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [7]:
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

"""
from tqdm import tqdm
from string import ascii_uppercase

text=[]
for item in tqdm(test_ds):
    t=f"The following are single choice questions  about {item['category']}. Answer the choice index, e.g. A/B/C ...."
    t+=SYSTEM_PROMPT
    t+=f'\nQuestion: {item["question"]}'
    choices=item['options']
    for index,choice in zip(ascii_uppercase[:len(choices)],choices):
        t+=f'\n{index}) {choice}'
    t+='\n Answer:'

    text.append(t)

text_with_template=[tokenizer.apply_chat_template([
        # {"role" : "system", "content" : SYSTEM_PROMPT},
        {"role":"user",
         'content':t}
    ],tokenize=False,add_generation_prompt=True)

      for t in text]

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.1,
    top_p = 0.95,
    max_tokens = 1024*4,
)

outputs=model.fast_generate(
    text_with_template,
    sampling_params = sampling_params,
    lora_request = None,
    use_tqdm=True

)




100%|██████████| 1000/1000 [00:00<00:00, 7434.01it/s]


Processed prompts:   0%|          | 0/1000 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s…

In [8]:

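def parse_answer(output_):
  # Return the text between the last <answer> tag and the next </answer> tag.
  # If the tags are missing, the whole stripped string is returned.
  try:
    return output_.split('<answer>')[-1].split('</answer>')[0].strip()
  except AttributeError:  # output_ is not a string (e.g. None)
    return ''


def correctness(output_):
  # Fraction of generations whose parsed answer starts with the gold letter.
  correct=0
  for o,a in zip(output_,test_ds['answer']):
    answer=parse_answer(o.outputs[0].text)
    if answer and answer[0]==a:
      correct+=1
  return correct/len(output_)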



In [9]:
correctness(outputs)

0.369
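
The score above depends on how `parse_answer` treats different output shapes. The sketch below repeats the same split logic on three made-up completions (the strings are illustrative, not model outputs from this run) to show what `correctness` ends up comparing against the gold letter.

```python
def parse_answer(output_):
    # Same split logic as the notebook's parse_answer.
    return output_.split('<answer>')[-1].split('</answer>')[0].strip()

well_formed = "<reasoning>\n...\n</reasoning>\n<answer>\nI\n</answer>"
verbose = "<answer>\nI) Unsafe practices\n</answer>"
no_tags = "The answer is I."

parsed = [parse_answer(s) for s in (well_formed, verbose, no_tags)]
print(parsed)
```

Because `correctness` only checks the first character of the parsed text, the first two completions would count as correct for gold answer `I`. The untagged one would not, since its first character is `T`. An untagged output can also match by accident when it happens to begin with the gold letter.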

In [10]:
text_with_template[0]

'<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nThe following are single choice questions  about business. Answer the choice index, e.g. A/B/C ....\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n\n\nQuestion: Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.\nA) Safe practices, Fear, Jealousy, Trivial\nB) Unsafe practices, Distress, Joy, Trivial\nC) Safe practices, Wants, Jealousy, Trivial\nD) Safe practices, Distress, Fear, Trivial\nE) Unsafe practices, Wants, Jealousy, Serious\nF) Safe practices, Distress, Jealousy, Serious\nG) Safe practices, Wants, Fear, Serious\nH) Unsafe practices, Wants, Fear, Trivial\nI) Unsafe practices, Distress, Fear, Serious\n Answer:<|im_end|>\n<|im_start|>assistant\n'

In [11]:
outputs[0].outputs[0].text

'<reasoning>\nThe question is asking about typical advertising regulatory guidelines, which include avoiding content that encourages unsafe practices, causes unnecessary distress or fear, and avoids causing trivial offense. Let\'s analyze each option:\n\nA) Safe practices, Fear, Jealousy, Trivial - This option does not align with the guidelines as it mentions "safe practices," which is the opposite of what the guidelines aim to avoid.\nB) Unsafe practices, Distress, Joy, Trivial - Joy is not mentioned in the guidelines and does not fit the context.\nC) Safe practices, Wants, Jealousy, Trivial - Wants are not discouraged by the guidelines, and the term "trivial" is not relevant to the context.\nD) Safe practices, Distress, Fear, Trivial - This option aligns with the guidelines as it mentions avoiding unsafe practices, causing distress, and avoiding trivial offense.\nE) Unsafe practices, Wants, Jealousy, Serious - Wants are not discouraged by the guidelines, and the term "serious" is not

# GRPO Train

In [12]:
%%capture
!pip install evaluate sacrebleu
import evaluate

In [13]:
sacrebleu = evaluate.load("sacrebleu")

print(sacrebleu.compute(predictions=["this is a test"], references=["this is a test"]))

print(sacrebleu.compute(predictions=["this is another test"], references=["this is a test"]))

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]

{'score': 100.00000000000004, 'counts': [4, 3, 2, 1], 'totals': [4, 3, 2, 1], 'precisions': [100.0, 100.0, 100.0, 100.0], 'bp': 1.0, 'sys_len': 4, 'ref_len': 4}
{'score': 35.35533905932737, 'counts': [3, 1, 0, 0], 'totals': [4, 3, 2, 1], 'precisions': [75.0, 33.333333333333336, 25.0, 25.0], 'bp': 1.0, 'sys_len': 4, 'ref_len': 4}
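
The second score can be reproduced by hand from the fields in that output: BLEU is the brevity penalty times the geometric mean of the n-gram precisions. The short check below uses only the numbers printed above. The non-zero 3-gram and 4-gram precisions despite zero counts appear to come from sacreBLEU's smoothing; that part is an assumption about the library's defaults rather than something this notebook shows.

```python
import math

# Values copied from the second sacrebleu.compute(...) output above.
precisions = [75.0, 33.333333333333336, 25.0, 25.0]
bp = 1.0

# Geometric mean of the precisions (as fractions), scaled back to 0-100.
log_mean = sum(math.log(p / 100) for p in precisions) / len(precisions)
score = bp * math.exp(log_mean) * 100
print(score)  # close to the reported 35.355
```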


In [14]:
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

Unsloth 2025.5.6 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


In [15]:

import numpy as np
def correctness_reward(completions,answer,**kwargs):
  return [5 if a==parse_answer(c) else 0 for (c,a) in zip(completions,answer)]

def len_reward(completions,**kwargs):
  return [len(c)//1000 if len(c)<3000 else 0 for c in completions]

def format_reward_cal(c,a):
  reward=0
  if c.find('<reasoning>')!=-1:
    reward+=0.25
  if c.find('</reasoning>')!=-1:
    reward+=0.25
  if c.find('<answer>')!=-1:
    reward+=0.25
  if c.find('</answer>')!=-1:
    reward+=0.25
  if len(c.split('<answer>')[-1].split('</answer>')[0].strip())==len(a):
    reward+=1
  return reward


def bleu_score(completions,solutions,**kwargs):
  scores=[]
  for c,s in zip(completions,solutions):
    score=sacrebleu.compute(predictions=[c], references=[s])['score']/100
    scores.append(score)
  return scores

def format_reward(completions,answer,**kwargs):
  return [format_reward_cal(c,a) for c,a in zip(completions,answer)]

def extract_answer(a):
  # Return the text after a '####' marker, or '' if the marker is absent.
  try:
    return a.split('####')[1].strip()
  except IndexError:
    return ''

train_dataset_=load_dataset('open-r1/OpenR1-Math-220k','default')
train_dataset=[]
for item in train_dataset_['train']:
  prompt=tokenizer.apply_chat_template([
        {"role" : "system", "content" : SYSTEM_PROMPT},
        {"role":"user",
         'content':item['problem']}],
                                       tokenize=False,
                                      add_generation_prompt=True)
  answer=item['answer']
  solution=item['solution']

  train_dataset.append({'prompt':prompt,'answer':answer,'solutions':solution})




README.md:   0%|          | 0.00/5.13k [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/20 [00:00<?, ?it/s]

data/train-00000-of-00010.parquet:   0%|          | 0.00/214M [00:00<?, ?B/s]

data/train-00001-of-00010.parquet:   0%|          | 0.00/215M [00:00<?, ?B/s]

data/train-00002-of-00010.parquet:   0%|          | 0.00/215M [00:00<?, ?B/s]

data/train-00003-of-00010.parquet:   0%|          | 0.00/217M [00:00<?, ?B/s]

data/train-00004-of-00010.parquet:   0%|          | 0.00/215M [00:00<?, ?B/s]

data/train-00005-of-00010.parquet:   0%|          | 0.00/214M [00:00<?, ?B/s]

data/train-00006-of-00010.parquet:   0%|          | 0.00/216M [00:00<?, ?B/s]

data/train-00007-of-00010.parquet:   0%|          | 0.00/216M [00:00<?, ?B/s]

data/train-00008-of-00010.parquet:   0%|          | 0.00/214M [00:00<?, ?B/s]

data/train-00009-of-00010.parquet:   0%|          | 0.00/215M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/93733 [00:00<?, ? examples/s]
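
To see what the format reward in the cell above actually scores, the sketch below restates `format_reward_cal` (using `in` instead of `find`, which is equivalent here) and applies it to three made-up completions for a reference answer of `"4"`. The completions are illustrative only.

```python
def format_reward_cal(c, a):
    # Restatement of the notebook's rule: 0.25 per tag found,
    # plus 1 if the extracted answer has the same length as the reference.
    reward = 0
    for tag in ('<reasoning>', '</reasoning>', '<answer>', '</answer>'):
        if tag in c:
            reward += 0.25
    extracted = c.split('<answer>')[-1].split('</answer>')[0].strip()
    if len(extracted) == len(a):
        reward += 1
    return reward

full = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>"
wrong_same_len = "<reasoning>\nguess\n</reasoning>\n<answer>\n7\n</answer>"
no_tags = "The answer is 4."

scores = [format_reward_cal(c, "4") for c in (full, wrong_same_len, no_tags)]
print(scores)
```

The maximum is 2.0, and a wrong answer of the right length also receives it: the length check rewards a concise answer field, while correctness is left to `correctness_reward`.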

In [16]:
from trl import GRPOConfig, GRPOTrainer
grpo_sampling_params = SamplingParams(
    temperature = 0.9,
    top_p = 1,
    max_tokens = 1024*3,
)
training_args = GRPOConfig(
    # use_vllm = True, # use vLLM for fast inference!,
    vllm_sampling_params = grpo_sampling_params,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 1024,
    max_completion_length = 1024*3,
    num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 1500, # A positive max_steps takes precedence over num_train_epochs
    save_steps = 200,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)
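
With `num_generations = 8`, each prompt is sampled eight times and the rewards are compared within that group. In the GRPO paper linked at the top, this group comparison replaces PPO's learned value baseline: each completion's advantage is its reward standardized by the group's mean and standard deviation. The sketch below illustrates that idea only. The function name, the `eps` stabilizer, and the reward values are assumptions, and the exact normalization inside `GRPOTrainer` may differ.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-4):
    """Standardize rewards within one group of completions for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for 8 completions of one prompt (two were correct).
rewards = [5.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
print(adv.round(3))
```

Completions that beat the group average get a positive advantage and the rest a negative one. If all eight rewards are identical, every advantage is zero and that prompt contributes no learning signal.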

In [None]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
       correctness_reward,
       format_reward,
       bleu_score
      #  len_reward
    ],
    args = training_args,
    train_dataset = train_dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 93,733 | Num Epochs = 1 | Total steps = 1,500
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 119,734,272/3,205,672,960 (3.74% trained)


| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / correctness_reward | rewards / format_reward | rewards / bleu_score |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.101195 | 0.019351 | 852.0 | 0.0 | 0.0 | 0.0 | 0.101194 |
| 2 | 0.0 | 1.084355 | 0.018024 | 776.25 | 0.0 | 0.0 | 1.0 | 0.084355 |
| 3 | 0.0 | 0.986796 | 0.407799 | 969.375 | 0.00031 | 0.0 | 0.78125 | 0.205546 |
| 4 | 0.0 | 5.625926 | 2.774296 | 183.875 | 0.000623 | 3.75 | 1.75 | 0.125926 |
| 5 | 0.0 | 0.293965 | 0.423473 | 484.625 | 0.000328 | 0.0 | 0.21875 | 0.075215 |
| 6 | 0.0 | 1.263578 | 0.097031 | 1250.375 | 0.000221 | 0.0 | 0.96875 | 0.294828 |
| 7 | 0.0 | 0.66875 | 0.531594 | 548.875 | 0.000516 | 0.0 | 0.625 | 0.04375 |


Unsloth: Will smartly offload gradients to save VRAM!


In [None]:
model.save_lora("grpo_saved_lora")

In [None]:
outputs=model.fast_generate(
    text_with_template,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
    use_tqdm=True
)


In [None]:
correctness(outputs)

In [None]:
from getpass import getpass
hf_token=getpass()

In [None]:
model.push_to_hub_merged("ickma2311/Qwen2.5-3B-GRPO", tokenizer, save_method = "merged_16bit", token = hf_token)