<a href="https://colab.research.google.com/github/pramodith/llm_exploration/blob/pramodith%2Friddllama/riddle_reasoning_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Riddle LLama
This notebook trains a reasoning model to answer riddles. Riddles are reasoning-heavy tasks: to reach the right answer, a model must learn to associate different facts and concepts coherently.

In [1]:
import os
if "COLAB_" not in "".join(os.environ.keys()):
    %pip install unsloth vllm flashinfer-python datasets litellm scikit-learn --resume-retries 3
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    %pip install --no-deps unsloth vllm

%pip install -U ipywidgets

Collecting unsloth
  Downloading unsloth-2025.5.6-py3-none-any.whl.metadata (46 kB)
Collecting vllm
  Downloading vllm-0.8.5.post1-cp38-abi3-manylinux1_x86_64.whl.metadata (14 kB)
Collecting flashinfer-python
  Downloading flashinfer_python-0.2.5.tar.gz (2.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.5/2.5 MB 32.8 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting litellm
  Downloading litellm-1.70.0-py3-none-any.whl.metadata (38 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting unsloth_zoo>=2025.5.7 (from unsloth)
  Downloading unsloth_zoo-2025.5.7-py3-none-any.whl.metadata (8.0 kB)
Collecting torch>=2.4.0 (

In [None]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

In [1]:
import unsloth
from datasets import load_dataset, Dataset

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-20 13:36:43 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-20 13:36:43 [__init__.py:239] Automatically detected platform cuda.


In [65]:
# riddles_dataset = load_dataset("mlfoundations-dev/riddle_sense")
riddles_dataset  = load_dataset("Hypersniper/riddles_v1")["train"]

In [12]:
riddles_dataset[7]

{'answer': 'cards',
 'output': "The answer to this question lies in the realm of playing cards. Let's break down the logic. \n\nThe number 13 is significant in a deck of cards as each suit has 13 cards: Ace through 10, and the three picture cards: Jack, Queen, and King. Now, let's think about the 'hearts' part of the question. In a standard deck of cards, there are four suits: hearts, diamonds, clubs, and spades. One of these suits is hearts.\n\nTherefore, in a deck of cards, there are 13 hearts. However, these hearts do not beat as they are not living, they are simply a suit in a deck of cards. They are symbolic hearts, not biological ones. So, based on this logical deduction, we can conclude that a deck of cards is what has 13 hearts but none that beat. \n\nThis type of riddle requires both literal and metaphorical thinking. The number 13 and the word 'hearts' might initially lead one to think of something biological or living because hearts are typically associated with living being

In [66]:
# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format, the answer section must be as concise as possible and all the thinking/reasoning should be within the think tags:
<think>
...
</think>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<think>
{think}
</think>
<answer>
{answer}
</answer>
"""


In [67]:
reformatted_riddles_dataset = riddles_dataset.map(
    lambda x: {
        "question": x["instruction"],
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["instruction"]},
        ],
    },
)

Format the dataset to contain a __prompt__ key and an __answer__ key.
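As a rough illustration of the target row layout, the sketch below builds a single record by hand. `build_record` is a hypothetical helper, the system prompt is a shortened stand-in for `SYSTEM_PROMPT`, and the riddle text is a paraphrase of the "13 hearts" example shown earlier.

```python
# Shortened stand-in for the notebook's SYSTEM_PROMPT (assumption, for illustration only).
SYSTEM_PROMPT = "Respond with <think>...</think> followed by <answer>...</answer>."

def build_record(question: str, answer: str) -> dict:
    """Build one training row: a chat-style prompt plus the reference answer."""
    return {
        "question": question,
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        "answer": answer,
    }

record = build_record("What has 13 hearts but none that beat?", "cards")
print(record["prompt"][-1]["content"])  # the user turn carries the riddle
print(record["answer"])                 # the reference answer used by the reward functions
```

Keeping the riddle in both `question` and the user message means the raw text stays available for scoring while `prompt` is ready for a chat template.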

In [68]:
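def ml_foundations_reformat():
    # Expects `riddles_dataset` to be the full `mlfoundations-dev/riddle_sense`
    # DatasetDict (the commented-out load above), not the Hypersniper split.
    reformatted_riddles_dataset = []
    for doc in riddles_dataset["train"]:
        # Rebuild the label -> text map for every riddle so choices do not leak between rows.
        labels = {}
        for choice in doc["question"]["choices"]:
            labels[choice["label"]] = choice["text"]

        answer = labels[doc["answerKey"]]
        prompt = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": doc["question"]["stem"]},
        ]
        reformatted_riddles_dataset.append({
            "question": doc["question"]["stem"],
            "prompt": prompt,
            "answer": answer,
        })
    return reformatted_riddles_dataset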

In [69]:
from dotenv import load_dotenv
load_dotenv()

True

## Dataset Filtering

In [70]:
import litellm
from litellm import batch_completion
litellm.num_retries = 2  # Retry failed requests up to 2 times

from jinja2 import Environment

environment = Environment()
riddle_quality_system_prompt = """
You are a helpful assistant that evaluates the quality of a riddle.
You will be given a question and its answer. Score the question and answer based on the
following rubric:

0 if the question is not a riddle or the answer is not a riddle answer.
1 if the riddle is very easy and the answer is obvious.
2 if the riddle is moderately hard to solve.
3 if the riddle is hard and needs a lot of thinking and multiple associations to get to the answer.

Here are some examples:
Question: Where does a person put their phone when they are walking?
Answer: In their pocket.
Score: 0

Question: What has keys but can't open locks?
Answer: A piano.
Score: 1

Question: What has a heart that doesn't beat?
Answer: An artichoke.
Score: 2

Question: What has a head and a tail but no body?
Answer: A coin.
Score: 1

Question: Many minds, but not a single face,
I make decisions with layered grace.
Each vote I cast makes outcomes clear,
With randomness keeping bias in fear.
What am I?
Answer: Random forest.
Score: 3
"""

In [71]:
reformatted_riddles_dataset[0:2]

{'answer': ['music', 'onion'],
 'output': ["The question seems to be a riddle that is focusing on an entity that can evoke a range of emotional responses. The given clues are that it can make someone dream or stamp their feet, and it can also make someone laugh or weep. \n\nLet's break down these clues. Dreaming and stamping feet are both reactions that can be associated with feelings of joy, excitement, or anticipation. On the other hand, laughing and weeping are expressions of happiness and sadness respectively. So, we are looking for something that can induce these varied emotional reactions.\n\nConsidering these clues, one plausible answer could be 'music'. Here's why: \n\nMusic has a profound impact on our emotions. It has the power to uplift our spirits, soothe our nerves, make us feel happy, sad, excited, calm, and even lead us into a state of introspection or dreaming. \n\nWhen we listen to upbeat music, it often makes us want to move or dance. Hence, the reference to 'stamp th

In [72]:
import time
from tqdm import tqdm
def score_riddles(dataset, batch_size=16):
    """Ask the grader model for a quality score for every riddle, one batch at a time."""
    scores = []
    for start in tqdm(range(0, len(dataset), batch_size)):
        # Slicing a Dataset returns a dict of columns for the rows in this batch.
        batch = dataset[start:start + batch_size]
        responses = batch_completion(
            model="openai/gpt-4.1-2025-04-14",
            temperature=0.0,
            max_tokens=6,
            messages=[
                [
                    {"role": "system", "content": riddle_quality_system_prompt},
                    {"role": "user", "content": f"Please score the following riddle:\nQuestion: {question}\nAnswer: {answer}\nScore: "},
                ]
                for question, answer in zip(batch["question"], batch["answer"])
            ],
        )
        time.sleep(10)  # Pause between batches
        try:
            scores.extend([r.choices[0].message.content for r in responses])
        except Exception as e:
            # Return the raw responses for inspection if any of them cannot be read.
            print(e)
            return responses
    return scores
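The batching arithmetic can be checked in isolation. The sketch below only computes slice boundaries and makes no API calls; `batch_bounds` is a hypothetical helper, and the row count of 469 is taken from the "Saving the dataset" progress line further down.

```python
def batch_bounds(n_rows: int, batch_size: int) -> list[tuple[int, int]]:
    """Return the (start, end) slice boundaries used to walk a dataset in batches."""
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]

bounds = batch_bounds(469, 16)
print(len(bounds))   # number of requests batches
print(bounds[-1])    # the final, shorter batch
```

With 469 rows and a batch size of 16 this gives 30 batches, which agrees with the `30/30` shown by the progress bar in the scoring run.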

In [73]:
import re
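def extract_score(scores, dataset):
    """Parse the digit after 'Score: ' in each response; use -1 when none is found.

    `dataset` is unused and kept only so existing calls keep working.
    """
    extracted_scores = []
    for score in scores:
        # Non-string entries (e.g. failed responses) also get -1, so the result
        # always has one entry per response and stays aligned with the dataset rows.
        matches = re.findall(r"Score: (\d)", score) if isinstance(score, str) else []
        extracted_scores.append(int(matches[0]) if matches else -1)

    return extracted_scores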
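As a quick check of the parsing rule, the sketch below applies the same `Score: (\d)` pattern to strings shaped like the grader's replies. `parse_score` is a hypothetical stand-alone helper, and the third sample is an invented failure case.

```python
import re

def parse_score(text: str) -> int:
    """Return the digit that follows 'Score: ', or -1 if the reply has none."""
    match = re.search(r"Score: (\d)", text)
    return int(match.group(1)) if match else -1

samples = ["Score: 2", "Score: 3\n\nExplanation", "I cannot rate this."]
results = [parse_score(s) for s in samples]
print(results)
```

Replies that were cut off after the score (the grader is limited to 6 tokens) still parse, because only the digit after `Score: ` is read.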


In [78]:
scores = score_riddles(reformatted_riddles_dataset)
# reformatted_riddles_dataset = extract_score(scores, reformatted_riddles_dataset)

100%|██████████| 30/30 [05:42<00:00, 11.42s/it]


In [79]:
scores

['Score: 2',
 'Score: 1',
 'Score: 2',
 'Score: 2',
 'Score: 2\n\nExplanation',
 'Score: 2',
 'Score: 1',
 'Score: 1',
 'Score: 2\n\nExplanation',
 'Score: 2\n\nExplanation',
 'Score: 2',
 'Score: 2\n\nReason',
 'Score: 2',
 'Score: 2\n\nExplanation',
 'Score: 3\n\nExplanation',
 'Score: 2',
 'Score: 3\n\nExplanation',
 'Score: 2',
 'Score: 2',
 'Score: 3\n\nExplanation',
 'Score: 2\n\nExplanation',
 'Score: 2\n\nExplanation',
 'Score: 2\n\nExplanation',
 'Score: 2',
 'Score: 2',
 'Score: 2',
 'Score: 2',
 'Score: 2\n\nExplanation',
 'Score: 2\n\nExplanation',
 'Score: 2\n\nExplanation',
 'Score: 1\n\nExplanation',
 'Score: 1',
 'Score: 1',
 'Score: 2',
 'Score: 3\n\nExplanation',
 'Score: 2',
 'Score: 2',
 'Score: 2\n\nExplanation',
 'Score: 2',
 'Score: 2\n\nExplanation',
 'Score: 2',
 'Score: 2',
 'Score: 2\n\nExplanation',
 'Score: 2',
 'Score: 1',
 'Score: 2',
 'Score: 2\n\nExplanation',
 'Score: 2',
 'Score: 2\n\nExplanation',
 'Score: 2',
 'Score: 2',
 'Score: 1',
 'Score: 2\n\n

In [80]:
extracted_scores = extract_score(scores, reformatted_riddles_dataset)
reformatted_riddles_dataset = reformatted_riddles_dataset.add_column("quality_score", extracted_scores)

In [81]:
reformatted_riddles_dataset[0]

{'answer': 'music',
 'output': "The question seems to be a riddle that is focusing on an entity that can evoke a range of emotional responses. The given clues are that it can make someone dream or stamp their feet, and it can also make someone laugh or weep. \n\nLet's break down these clues. Dreaming and stamping feet are both reactions that can be associated with feelings of joy, excitement, or anticipation. On the other hand, laughing and weeping are expressions of happiness and sadness respectively. So, we are looking for something that can induce these varied emotional reactions.\n\nConsidering these clues, one plausible answer could be 'music'. Here's why: \n\nMusic has a profound impact on our emotions. It has the power to uplift our spirits, soothe our nerves, make us feel happy, sad, excited, calm, and even lead us into a state of introspection or dreaming. \n\nWhen we listen to upbeat music, it often makes us want to move or dance. Hence, the reference to 'stamp their feet'. I

In [82]:
reformatted_riddles_dataset = Dataset.from_list(reformatted_riddles_dataset)
reformatted_riddles_dataset.save_to_disk("./riddles_dataset")

Saving the dataset (0/1 shards):   0%|          | 0/469 [00:00<?, ? examples/s]

Load scored/graded dataset

In [83]:
reformatted_riddles_dataset = Dataset.load_from_disk("./riddles_dataset")

In [84]:
reformatted_riddles_dataset = [dict(row) for row in reformatted_riddles_dataset if row["quality_score"] >1]

In [85]:
reformatted_riddles_dataset[0]

{'answer': 'music',
 'output': "The question seems to be a riddle that is focusing on an entity that can evoke a range of emotional responses. The given clues are that it can make someone dream or stamp their feet, and it can also make someone laugh or weep. \n\nLet's break down these clues. Dreaming and stamping feet are both reactions that can be associated with feelings of joy, excitement, or anticipation. On the other hand, laughing and weeping are expressions of happiness and sadness respectively. So, we are looking for something that can induce these varied emotional reactions.\n\nConsidering these clues, one plausible answer could be 'music'. Here's why: \n\nMusic has a profound impact on our emotions. It has the power to uplift our spirits, soothe our nerves, make us feel happy, sad, excited, calm, and even lead us into a state of introspection or dreaming. \n\nWhen we listen to upbeat music, it often makes us want to move or dance. Hence, the reference to 'stamp their feet'. I

Create train, dev (validation), and test splits of the dataset in an 80/10/10 ratio.

In [86]:
from sklearn.model_selection import train_test_split
train, not_train = train_test_split(reformatted_riddles_dataset, test_size=0.2, random_state=42)
dev, test = train_test_split(not_train, test_size=0.5, random_state=42)

In [87]:
# Convert list of dicts to a huggingface dataset
from datasets import Dataset
train_dataset = Dataset.from_list(train)
dev_dataset = Dataset.from_list(dev)
test_dataset = Dataset.from_list(test)

In [88]:
train_dataset[0]

{'answer': 'pride',
 'output': 'Pride is a concept that fits the description of the question perfectly. Here\'s why:\n\nSwallowing pride: This phrase is often used in our daily conversations and literature. It means to put aside one\'s personal feelings, especially when they are perceived as being too self-important, to accept something humiliating or to admit one\'s mistakes. In this context, "swallowing" is metaphorical. \n\nPride swallowing you: On the other hand, pride can also "swallow" a person. This means that when someone is overly proud, it can consume them and negatively affect their judgement, relationships, and overall wellbeing. This is usually the case when pride turns into arrogance. Once again, "swallowing" is used metaphorically here.\n\nThus, pride is something one can swallow, and it is also something that can swallow a person. The interpretation of the question and answer revolves around understanding the metaphorical usage of the word "swallow". It\'s a great examp

In [89]:
print(f"Train size: {len(train)}")
print(f"Dev size: {len(dev)}")
print(f"Test size: {len(test)}")

Train size: 304
Dev size: 38
Test size: 39
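The sizes above follow from the two split fractions. The sketch below reproduces them with plain arithmetic, assuming the held-out side of each split is rounded up and the remainder goes to the other side; `split_sizes` is a hypothetical helper, and 381 is the sum of the three printed sizes.

```python
import math

def split_sizes(n_rows: int, held_out_fraction: float) -> tuple[int, int]:
    """Return (kept, held_out), rounding the held-out side up (assumed rounding rule)."""
    held_out = math.ceil(held_out_fraction * n_rows)
    return n_rows - held_out, held_out

n_train, n_rest = split_sizes(381, 0.2)   # first split: train vs. everything else
n_dev, n_test = split_sizes(n_rest, 0.5)  # second split: dev vs. test
print(n_train, n_dev, n_test)
```

Because 77 held-out rows cannot be halved evenly, the test split ends up one row larger than the dev split.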


In [90]:
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_xml_think(text: str) -> str:
    think = text.split("<think>")[-1]
    think = think.split("</think>")[0]
    return think.strip()
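A small usage sketch shows how the answer extractor behaves on well-formed and untagged text. The function is copied from the cell above so the block runs on its own; both sample completions are invented for illustration.

```python
def extract_xml_answer(text: str) -> str:
    # Same logic as the notebook cell: keep what sits between the last
    # <answer> tag and the first </answer> tag that follows it.
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

well_formed = "<think>\nThirteen hearts suggests a suit.\n</think>\n<answer>\ncards\n</answer>\n"
untagged = "The answer is an ant."

print(repr(extract_xml_answer(well_formed)))
print(repr(extract_xml_answer(untagged)))
```

Note the fallback: when no tags are present, the whole completion is returned, so downstream checks that compare against the reference answer see the full untagged text.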


### Unsloth
Load `Qwen2.5-3B-Instruct` and set the sequence-length and LoRA parameters.

In [None]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 512 # Can increase for longer think traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.8, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    # layers_pattern=r"model.layers\.\d+\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)$",  # This is a pattern to match layer modules

    # layers_to_transform = list(range(20, 28)),
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.5.6: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA A40. Num GPUs = 1. Max memory: 44.451 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 79.46%
Unsloth: Your GPU has CUDA compute capability 8.6 with VRAM = 44.45 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 512. Num Sequences = 320.
Unsloth: vLLM's KV Cache can use up to 32.9 GB. Also swap space = 6 GB.
INFO 05-20 14:49:51 [config.py:717] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
INFO 05-20 14:49:52 [config.py:200

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

INFO 05-20 14:49:56 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=512, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit, num_scheduler_steps=1, multi_step_str

2025-05-20 14:49:56,682 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


INFO 05-20 14:49:57 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-20 14:49:57 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 05-20 14:49:57 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
INFO 05-20 14:49:57 [gpu_model_runner.py:1329] Starting to load model unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit...
INFO 05-20 14:49:57 [loader.py:1187] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 05-20 14:49:58 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.36G [00:00<?, ?B/s]
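To get a feel for how `lora_rank` and `target_modules` translate into trainable parameters, the sketch below counts LoRA weights for the seven targeted projections. Each adapted linear layer adds `rank * (d_in + d_out)` parameters. The layer dimensions are assumptions taken from the published Qwen2.5-1.5B configuration (the size named when the model is saved later), not from this notebook; a 3B checkpoint has larger layers and gives a different total. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(hidden: int, intermediate: int, kv_dim: int,
                     n_layers: int, rank: int) -> int:
    """Count LoRA parameters when all attention and MLP projections are adapted."""
    # (d_in, d_out) of each adapted linear layer inside one transformer block.
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    # LoRA adds an A matrix (rank x d_in) and a B matrix (d_out x rank) per layer.
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * n_layers

# Assumed Qwen2.5-1.5B dimensions: hidden 1536, MLP 8960, 2 KV heads x 128 = 256, 28 blocks.
total = lora_param_count(hidden=1536, intermediate=8960, kv_dim=256, n_layers=28, rank=64)
print(f"{total:,}")
```

Under these assumptions the total matches the `Trainable parameters = 73,859,072` figure printed in the training banner further down.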

### Reward functions

In [None]:
import random
# Reward functions
def correctness_reward_func(prompts, completions, answer, quality_score, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    if random.random() < 0.05:
        print('-'*20)
        print(f"Question:\n{q}")
        print(f"Correct Answer:\n{answer[0]}")
        print(f"Responses:\n{responses[0]}")
        print(f"Extracted:\n{extracted_responses[0]}")
    
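    rewards = []
    for r, a, qs in zip(extracted_responses, answer, quality_score):
        # A non-empty answer counts as correct when it contains the reference
        # answer or is contained in it (case-insensitive).
        is_match = r.strip() != "" and (a.lower() in r.lower() or r.lower() in a.lower())
        # Correct answers to harder riddles (higher quality score) earn more.
        rewards.append(1 + qs if is_match else 0)
    return rewards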

def answer_length_reward_func(prompts, completions, answer, quality_score, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    rewards  = []
    for r, a in zip(extracted_responses, answer):
        rewards.append(0)
        if a in r or r in a:
            if len(r) > len(a):
                rewards[-1] = -0.01 * (len(r) - len(a))
        else:
            if len(r) < len(a):
                rewards[-1] = -0.05 * (len(a) - len(r))
    return rewards


def think_length_reward_func(prompts, completions, answer, quality_score, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_think(r) for r in responses]
    rewards = [0.5 if 50 < len(r) < 500 else -0.5 for r in extracted_responses]
    return rewards

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." span the newlines that the requested format places inside the tags;
    # without it, multi-line completions can never match.
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.1 if match else -0.5 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<think>\n") == 1:
        count += 0.125
    if text.count("\n</think>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
    if text.count("\n</answer>") == 1:
        count += 0.125
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
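To see how the tag-count reward accrues, the sketch below runs a copy of `count_xml` on three invented completions: one fully formatted, one with only the think tags, and one with no tags at all.

```python
def count_xml(text: str) -> float:
    # Copy of the notebook function: 0.125 for each tag that appears
    # exactly once with the expected surrounding newlines.
    count = 0.0
    if text.count("<think>\n") == 1:
        count += 0.125
    if text.count("\n</think>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
    if text.count("\n</answer>") == 1:
        count += 0.125
    return count

full = "<think>\nThirteen hearts suggests a suit.\n</think>\n<answer>\ncards\n</answer>\n"
think_only = "<think>\nMaybe a deck?\n</think>\ncards"
untagged = "The answer is an ant."

for name, text in [("full", full), ("think_only", think_only), ("untagged", untagged)]:
    print(name, count_xml(text))
```

The partial credit gives the model a gradient toward the format even before it produces a fully well-formed completion.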

## Model Training

In [None]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 1e-5,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    do_eval = True,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 500,
    eval_steps = 100,
    save_steps = 100,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 4 to the `num_generations` of 8


In [None]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        correctness_reward_func,
        answer_length_reward_func,
        think_length_reward_func
    ],
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = dev_dataset
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 375 | Num Epochs = 2 | Total steps = 500
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 73,859,072/5,000,000,000 (1.48% trained)


| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / correctness_reward_func | rewards / answer_length_reward_func | rewards / think_length_reward_func |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | -0.375 | 0.517549 | 108.875 | 0.0 | 0.0 | -0.5 | 0.0 | 0.0 | 0.125 |
| 2 | 0.0 | -0.5 | 0.534522 | 100.0 | 0.0 | 0.0 | -0.5 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | -0.25 | 0.46291 | 72.0 | 0.001014 | 0.0 | -0.5 | 0.0 | 0.0 | 0.25 |
| 4 | 0.0 | -0.375 | 0.517549 | 120.375 | 0.001121 | 0.0 | -0.5 | 0.0 | 0.0 | 0.125 |
| 5 | 0.0002 | -0.5 | 0.534522 | 40.875 | 0.0049 | 0.0 | -0.5 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | -1.26125 | 2.795969 | 109.125 | 0.000995 | 0.0 | -0.5 | 0.375 | -1.26125 | 0.125 |
| 7 | 0.0 | -1.485 | 2.042771 | 88.75 | 0.001179 | 0.0 | -0.5 | 1.125 | -2.235 | 0.125 |
| 8 | 0.0 | -0.125 | 0.353553 | 82.25 | 0.001052 | 0.0 | -0.5 | 0.0 | 0.0 | 0.375 |
| 9 | 0.0 | -0.75 | 0.46291 | 126.75 | 0.001025 | 0.0 | -0.5 | 0.0 | 0.0 | -0.25 |
| 10 | 0.0 | 0.0 | 0.0 | 49.25 | 0.001002 | 0.0 | -0.5 | 0.0 | 0.0 | 0.5 |
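The `reward` and `reward_std` columns can be read as statistics over one group of 8 completions for a single prompt. The sketch below reconstructs step 1 under an assumption consistent with its logged means: every completion gets -0.5 from the format check, five get +0.5 and three get -0.5 from the think-length check. It then shows group-relative normalization in the style GRPO uses; the `eps` value and the use of the sample standard deviation are assumptions, not details confirmed by this notebook.

```python
import statistics

# Assumed per-completion totals for step 1: five completions at 0.0, three at -1.0.
group_rewards = [0.0] * 5 + [-1.0] * 3

mean = statistics.mean(group_rewards)
std = statistics.stdev(group_rewards)  # sample standard deviation
eps = 1e-4                             # guards against division by zero (assumed value)

# Each completion is scored relative to its own group, not on an absolute scale.
advantages = [(r - mean) / (std + eps) for r in group_rewards]

print(round(mean, 3), round(std, 6))
print([round(a, 3) for a in advantages])
```

This also explains step 10: when all completions in a group receive the same reward, `reward_std` is 0 and every advantage is 0, so that step carries no learning signal.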


Unsloth: Will smartly offload gradients to save VRAM!
--------------------
Question:
I’m lighter than a feather yet the strongest person can’t hold me for more than six minutes. What am I?
Correct Answer:
Your breath.
Responses:
['The answer is a thought. Weighing nothing, thoughts can be lighter than feathers, but they are difficult to hold onto due to their ephemeral nature, and people are limited in how deeply they can sink their concentration into just one thought for longer than six minutes.', 'The answer is an ant.', "You are the wind. Knowing this, while still maintaining lightness, the question seems to reference the extremely minimal resistance that the wind offers, even to steel cables. This self-explaining nature keeps the wind light and negligible yet strong enough (often used in hurricanes and wind turbines) to cause damage. Six minutes due to the wind's consistent speed (at typical outdoors conditions) challenges the physical strength and endurance of humans. The wind met

KeyboardInterrupt: 

In [48]:
model.save_lora("grpo_saved_lora")

In [49]:
model.save_pretrained_merged("riddle_qwen2.5-1.5B", tokenizer, save_method = "merged_16bit")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 79.59 out of 125.71 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 129.72it/s]


Unsloth: Saving tokenizer... Done.
Done.


In [50]:
import os
hf_token = os.getenv("hf_token")

In [51]:
model.push_to_hub_merged("Pramodith/riddle_qwen2.5-1.5B", tokenizer, save_method = "merged_16bit", token = hf_token)


Unsloth: You are pushing to hub, but you passed your HF username = Pramodith.
We shall truncate Pramodith/riddle_qwen2.5-1.5B to riddle_qwen2.5-1.5B


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 79.61 out of 125.71 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 163.30it/s]


Unsloth: Saving tokenizer...

No files have been modified since last commit. Skipping to prevent empty commit.


 Done.


README.md:   0%|          | 0.00/621 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/Pramodith/riddle_qwen2.5-1.5B


In [None]:
!pip install "lighteval[math]"
# !pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6/ --no-deps

In [1]:
!lighteval vllm "pretrained=Pramodith/riddle_qwen2.5-1.5B,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}" "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir .

[2025-04-11 16:32:21,050] [INFO]: PyTorch version 2.6.0 available. (config.py:54)
INFO 04-11 16:32:23 [__init__.py:239] Automatically detected platform cuda.
[2025-04-11 16:32:24,390] [INFO]: --- LOADING MODEL --- (pipeline.py:188)
[2025-04-11 16:32:31,807] [INFO]: This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'. (config.py:585)
[2025-04-11 16:32:31,871] [INFO]: Chunked prefill is enabled with max_num_batched_tokens=8192. (config.py:1697)
[2025-04-11 16:32:33,810] [INFO]: Initializing a V1 LLM engine (v0.8.2) with config: model='Pramodith/riddle_qwen2.5-1.5B', speculative_config=None, tokenizer='Pramodith/riddle_qwen2.5-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=aut