<a href="https://colab.research.google.com/github/pramodith/llm_exploration/blob/pramodith%2Friddllama/riddle_reasoning_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Riddle LLama
This notebook trains a reasoning model to answer riddles. Riddles are reasoning-heavy tasks: to reach the right answer, a model must learn to associate different facts and concepts coherently.

In [2]:
%pip install unsloth vllm flashinfer-python datasets litellm scikit-learn bitsandbytes ipywidgets

Note: you may need to restart the kernel to use updated packages.


In [4]:
import unsloth
from datasets import load_dataset, Dataset

In [None]:
# riddles_dataset = load_dataset("mlfoundations-dev/riddle_sense")
riddles_dataset  = load_dataset("Hypersniper/riddles_v1")["train"]

In [None]:
riddles_dataset[7]

In [5]:
# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format, the answer section must be as concise as possible and all the thinking/reasoning should be within the think tags:
<think>
...
</think>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<think>
{think}
</think>
<answer>
{answer}
</answer>
"""


In [6]:
reformatted_riddles_dataset = riddles_dataset.map(
    lambda x: {
        "question": x["instruction"],
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["instruction"]},
        ],
    },
)

NameError: name 'riddles_dataset' is not defined

Format the dataset to contain a __prompt__ key and an __answer__ key.
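
As a rough illustration of the record shape described above, the sketch below builds one chat-style record in plain Python. The helper name `to_record` and the shortened system prompt are assumptions made for this example; the notebook itself builds records with `Dataset.map`.

```python
# Shortened stand-in for the notebook's SYSTEM_PROMPT (assumption for this sketch).
SYSTEM_PROMPT = "Respond with <think>...</think> followed by <answer>...</answer>."

def to_record(instruction: str, answer: str) -> dict:
    """Hypothetical helper: build one record with question, prompt and answer keys."""
    return {
        "question": instruction,
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
        "answer": answer,
    }

record = to_record("What has keys but can't open locks?", "A piano")
print(record["prompt"][-1]["content"])
print(record["answer"])
```

The `prompt` list is what the trainer feeds to the model, while extra columns such as `answer` are handed to the reward functions as keyword arguments.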

In [None]:
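def ml_foundations_reformat():
    # Reformat the multiple-choice "mlfoundations-dev/riddle_sense" dataset
    # (the commented-out load above) into question/prompt/answer records.
    reformatted_riddles_dataset = []
    for doc in riddles_dataset["train"]:
        # Map each choice label to its text; rebuilt per riddle so labels never leak between riddles.
        labels = {choice["label"]: choice["text"] for choice in doc["question"]["choices"]}

        answer = labels[doc["answerKey"]]
        prompt = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": doc["question"]["stem"]},
        ]
        reformatted_riddles_dataset.append({
            "question": doc["question"]["stem"],
            "prompt": prompt,
            "answer": answer,
        })
    # The original version built the list but never returned it.
    return reformatted_riddles_dataset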

In [None]:
from dotenv import load_dotenv
load_dotenv()

## Dataset Filtering

In [7]:
import litellm
from litellm import batch_completion
litellm.num_retries = 2  # Retry failed requests up to 2 times

from jinja2 import Environment

environment = Environment()
riddle_quality_system_prompt = """
You are a helpful assistant that evaluates the quality of a riddle.
You will be given a question and its answer. Score the question and answer based on the
following rubric:

0 if the question is not a riddle or the answer is not a riddle answer.
1 if the riddle is very easy and the answer is obvious.
2 if the riddle is moderately hard to solve.
3 if the riddle is hard and needs a lot of thinking and multiple associations to get to the answer.

Here are some examples:
Question: Where does a person put their phone when they are walking?
Answer: In their pocket.
Score: 0

Question: What has keys but can't open locks?
Answer: A piano.
Score: 1

Question: What has a heart that doesn't beat?
Answer: An artichoke.
Score: 2

Question: What has a head and a tail but no body?
Answer: A coin.
Score: 1

Question: Many minds, but not a single face,
I make decisions with layered grace.
Each vote I cast makes outcomes clear,
With randomness keeping bias in fear.
What am I?
Answer: Random forest.
Score: 3
"""

In [8]:
reformatted_riddles_dataset[0:2]

NameError: name 'reformatted_riddles_dataset' is not defined

In [9]:
import time
from tqdm import tqdm

BATCH_SIZE = 16  # Number of riddles sent per batch_completion call

def score_riddles(dataset):
    scores = []
    for start in tqdm(range(0, len(dataset), BATCH_SIZE)):
        # Slicing a Dataset returns a dict of columns for that batch.
        batch = dataset[start:start + BATCH_SIZE]
        questions, answers = batch["question"], batch["answer"]
        responses = batch_completion(
            model = "openai/gpt-4.1-2025-04-14",
            temperature=0.0,
            max_tokens=6,
            messages = [
                [
                    {"role": "system", "content": riddle_quality_system_prompt},
                    {"role": "user", "content": f"Please score the following riddle:\nQuestion: {question}\nAnswer: {answer}\nScore: " }
                ]
                for question, answer in zip(questions, answers)
            ],
        )
        # Pause between batches, e.g. to avoid hitting API rate limits.
        time.sleep(10)
        try:
            scores.extend([r.choices[0].message.content for r in responses])
        except Exception as e:
            # A failed request has no `.choices`; return the raw batch for inspection.
            print(e)
            return responses
    return scores
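
The batching logic above can be tried without calling any API. The sketch below only builds the per-batch message lists; `chunk`, `build_messages`, and the placeholder data are hypothetical names introduced for illustration, and the row count of 469 is taken from the scored dataset shown later in the notebook.

```python
def chunk(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_messages(system_prompt, pairs):
    """Build one [system, user] conversation per (question, answer) pair."""
    return [
        [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Please score the following riddle:\nQuestion: {q}\nAnswer: {a}\nScore: "},
        ]
        for q, a in pairs
    ]

# Placeholder riddles (not real data), one per dataset row.
pairs = [(f"riddle {i}", f"answer {i}") for i in range(469)]
batches = [build_messages("You score riddles.", batch) for batch in chunk(pairs, 16)]
print(len(batches), len(batches[-1]))
```

Each inner list is one independent conversation, which is the shape `batch_completion` expects for its `messages` argument.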

In [None]:
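import re
def extract_score(scores, dataset):
    # `dataset` is unused but kept so the existing calls keep working.
    extracted_scores = []
    for score in scores:
        # The scoring prompt ends with "Score: ", so the reply may be a bare digit
        # or may repeat the prefix; take the first digit in either case.
        match = re.search(r"\d", score) if isinstance(score, str) else None
        # Use -1 for anything unparseable so the list stays aligned with the dataset rows.
        extracted_scores.append(int(match.group()) if match else -1)

    return extracted_scores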


In [None]:
scores = score_riddles(reformatted_riddles_dataset)
# reformatted_riddles_dataset = extract_score(scores, reformatted_riddles_dataset)

In [None]:
scores

In [None]:
extracted_scores = extract_score(scores, reformatted_riddles_dataset)
reformatted_riddles_dataset = reformatted_riddles_dataset.add_column("quality_score", extracted_scores)

In [None]:
reformatted_riddles_dataset[0]

In [None]:
reformatted_riddles_dataset = Dataset.from_list(reformatted_riddles_dataset)
reformatted_riddles_dataset.save_to_disk("./riddles_dataset")
reformatted_riddles_dataset.push_to_hub("Pramodith/riddles_dataset_scored", private=True, token=True)

Load scored/graded dataset

In [13]:
reformatted_riddles_dataset = load_dataset("Pramodith/riddles_dataset_scored", token=True)["train"]

In [12]:
reformatted_riddles_dataset

DatasetDict({
    train: Dataset({
        features: ['answer', 'output', 'instruction', 'question', 'prompt', 'quality_score'],
        num_rows: 469
    })
})

In [14]:
reformatted_riddles_dataset = [dict(row) for row in reformatted_riddles_dataset if row["quality_score"] > 1]
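
To see what the `quality_score > 1` threshold keeps, here is a small sketch on made-up rows. The scores below are invented for illustration; `-1` stands for a reply that could not be parsed into a score.

```python
from collections import Counter

# Invented scores for illustration only; -1 marks an unparseable judge reply.
rows = [{"answer": f"a{i}", "quality_score": s} for i, s in enumerate([-1, 0, 1, 2, 3, 2])]

# Same condition as the cell above: keep only moderately hard (2) and hard (3) riddles.
kept = [row for row in rows if row["quality_score"] > 1]

print(Counter(row["quality_score"] for row in rows))
print([row["quality_score"] for row in kept])
```

Rows scored -1, 0, or 1 are all dropped, so unparseable judge replies are silently excluded along with easy riddles and non-riddles.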

In [15]:
reformatted_riddles_dataset[0]

{'answer': 'music',
 'output': "The question seems to be a riddle that is focusing on an entity that can evoke a range of emotional responses. The given clues are that it can make someone dream or stamp their feet, and it can also make someone laugh or weep. \n\nLet's break down these clues. Dreaming and stamping feet are both reactions that can be associated with feelings of joy, excitement, or anticipation. On the other hand, laughing and weeping are expressions of happiness and sadness respectively. So, we are looking for something that can induce these varied emotional reactions.\n\nConsidering these clues, one plausible answer could be 'music'. Here's why: \n\nMusic has a profound impact on our emotions. It has the power to uplift our spirits, soothe our nerves, make us feel happy, sad, excited, calm, and even lead us into a state of introspection or dreaming. \n\nWhen we listen to upbeat music, it often makes us want to move or dance. Hence, the reference to 'stamp their feet'. I

Create train, dev, and test splits of the dataset.

In [16]:
from sklearn.model_selection import train_test_split
train, not_train = train_test_split(reformatted_riddles_dataset, test_size=0.2, random_state=42)
dev, test = train_test_split(not_train, test_size=0.5, random_state=42)

In [18]:
# Convert list of dicts to a huggingface dataset
from datasets import Dataset
train_dataset = Dataset.from_list(train)
dev_dataset = Dataset.from_list(dev)
test_dataset = Dataset.from_list(test)

In [19]:
train_dataset[0]

{'answer': 'nose',
 'output': 'The question seems to be a riddle. To understand the answer, we need to break down the riddle and analyze each part. \n\nThe first part, "Two little holes in the side of a hill," is referring to something that has two small openings. This might make you think of a variety of things, like burrows or caves. However, it\'s crucial to remember that riddles often use metaphorical or symbolic language, and \'the side of a hill\' could be referring to the shape or structure of something rather than a literal hill.\n\nThe second part, "Just as you come to the cherry-red mill," might seem unrelated at first. But when you think about what a \'cherry-red mill\' could symbolize, it starts to make sense. A mill is a place where things are processed or ground, and \'cherry-red\' could be referring to the color. So, it\'s something that is red and processes or grinds.\n\nNow, let\'s put these two parts together. What has two openings and is located on a structure that c

In [20]:
print(f"Train size: {len(train)}")
print(f"Dev size: {len(dev)}")
print(f"Test size: {len(test)}")

Train size: 305
Dev size: 38
Test size: 39
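
The printed sizes can be reproduced with a little arithmetic. The sketch below assumes the filtered dataset has 382 rows (the sum of the three sizes above) and that the held-out side of each split is rounded up, which is how `train_test_split` handles a fractional `test_size` as far as I know. The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (kept, held_out) sizes, rounding the held-out part up."""
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

n_train, n_rest = split_sizes(382, 0.2)    # 80/20 split of the filtered riddles
n_dev, n_test = split_sizes(n_rest, 0.5)   # split the remainder in half

print(n_train, n_dev, n_test)
```

This gives an 80/10/10 split in which the odd leftover row lands in the test set.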


In [21]:
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_xml_think(text: str) -> str:
    think = text.split("<think>")[-1]
    think = think.split("</think>")[0]
    return think.strip()
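
A quick round trip shows how these helpers behave. The two functions and the template are copied from the cells above so the snippet runs on its own; the sample strings are made up for illustration.

```python
XML_COT_FORMAT = """\
<think>
{think}
</think>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_xml_think(text: str) -> str:
    think = text.split("<think>")[-1]
    think = think.split("</think>")[0]
    return think.strip()

# Well-formed completion built from the template (illustrative content).
well_formed = XML_COT_FORMAT.format(think="It fills the house but cannot be scooped.", answer="smoke")
# Completion that ignores the format entirely.
untagged = "The answer to this riddle is a clock."

print(extract_xml_answer(well_formed))
print(extract_xml_think(well_formed))
print(extract_xml_answer(untagged))
```

Note the fallback: when the tags are missing, both helpers return the whole completion. An untagged response that merely mentions the gold answer somewhere can therefore still pass a substring-based correctness check.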


### Unsloth
Load `Qwen2.5-1.5B-Instruct` and set the parameters.

In [19]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 512 # Can increase for longer think traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.8, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    layers_pattern=r"model.layers\.\d+\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)$",  # This is a pattern to match layer modules

    layers_to_transform = list(range(20, 28)),
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.5.7: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit with actual GPU utilization = 19.25%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 512. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 1.5 GB. Also swap space = 2 GB.
INFO 05-22 16:01:26 [config.py:717] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config 

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 05-22 16:01:30 [model_runner.py:1140] Model loading took 1.5630 GiB and 1.935294 seconds
INFO 05-22 16:01:32 [worker.py:287] Memory profiling takes 1.09 seconds
INFO 05-22 16:01:32 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.19) = 2.84GiB
INFO 05-22 16:01:32 [worker.py:287] model weights take 1.56GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.69GiB; the rest of the memory reserved for KV Cache is 0.58GiB.
INFO 05-22 16:01:32 [executor_base.py:112] # cuda blocks: 1367, # CPU blocks: 4681
INFO 05-22 16:01:32 [executor_base.py:117] Maximum concurrency for 512 tokens per request: 42.72x
INFO 05-22 16:01:36 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasi

Capturing CUDA graph shapes:   0%|          | 0/19 [00:00<?, ?it/s]

INFO 05-22 16:02:12 [model_runner.py:1592] Graph capturing finished in 36 secs, took 0.30 GiB
INFO 05-22 16:02:12 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 41.86 seconds
Unsloth: Just some info: will skip parsing ['post_feedforward_layernorm', 'q_norm', 'pre_feedforward_layernorm', 'k_norm']


Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.5.7 patched 28 layers with 28 QKV layers, 8 O layers and 8 MLP layers.
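
As a sanity check on the LoRA setup, the trainable-parameter count reported by the trainer later in this notebook (21,102,592) can be reproduced by hand. The sketch assumes the projection shapes of Qwen2.5-1.5B (hidden size 1536, key/value projection width 256, MLP width 8960); these values come from the model's published configuration, not from this notebook. A LoRA adapter of rank `r` on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters.

```python
RANK = 64                      # lora_rank from the cell above
ADAPTED_LAYERS = range(20, 28) # layers_to_transform from the cell above

# Assumed Qwen2.5-1.5B dimensions (not stated in this notebook).
HIDDEN, KV_DIM, INTERMEDIATE = 1536, 256, 8960

# (in_features, out_features) for each targeted module.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter has an A matrix (rank x in) and a B matrix (out x rank).
per_layer = sum(RANK * (n_in + n_out) for n_in, n_out in shapes.values())
total = per_layer * len(ADAPTED_LAYERS)

print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumptions the total matches the trainer's figure exactly, which suggests adapters are attached to all seven modules in layers 20 through 27 only.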


### Reward functions

In [20]:
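import random
# Reward functions
def correctness_reward_func(prompts, completions, answer, quality_score, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # Log roughly 5% of calls so training can be spot-checked.
    if random.random() < 0.05:
        print('-'*20)
        print(f"Question:\n{q}")
        print(f"Correct Answer:\n{answer[0]}")
        print(f"Responses:\n{responses[0]}")
        print(f"Extracted:\n{extracted_responses[0]}")

    # A non-empty answer counts as correct when it contains, or is contained in,
    # the gold answer (case-insensitive). Correct answers earn 1 + quality_score,
    # so harder riddles pay more; everything else earns 0.
    rewards = []
    for r, a, qs in zip(extracted_responses, answer, quality_score):
        r, a = r.strip().lower(), a.lower()
        is_correct = r != "" and (a in r or r in a)
        rewards.append(float(1 + qs) if is_correct else 0.0)
    return rewards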

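def answer_length_reward_func(prompts, completions, answer, quality_score, **kwargs) -> list[float]:
    # Penalize answers whose character length strays from the gold answer's length.
    # Note: the overlap check here is case-sensitive, unlike correctness_reward_func.
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    rewards  = []
    for r, a in zip(extracted_responses, answer):
        rewards.append(0)
        if a in r or r in a:
            # Overlapping answer: small penalty per character beyond the gold answer.
            if len(r) > len(a):
                rewards[-1] = -0.001 * (len(r) - len(a))
        else:
            # Non-overlapping answer: larger penalty per character it falls short.
            if len(r) < len(a):
                rewards[-1] = -0.005 * (len(a) - len(r))
    return rewards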
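
A few worked cases make the length penalty easier to read. The helper `length_penalty` below is a hypothetical single-pair version of the loop body above, written only for illustration; the example strings are invented.

```python
def length_penalty(response: str, gold: str) -> float:
    """Hypothetical single-pair version of the branches in answer_length_reward_func."""
    if gold in response or response in gold:
        # Overlapping answer: penalize only the extra characters.
        if len(response) > len(gold):
            return -0.001 * (len(response) - len(gold))
    elif len(response) < len(gold):
        # Non-overlapping answer: penalize the characters it falls short.
        return -0.005 * (len(gold) - len(response))
    return 0.0

cases = {
    "exact": length_penalty("smoke", "smoke"),
    "verbose": length_penalty("the smoke from a fire", "smoke"),  # 16 extra characters
    "short_wrong": length_penalty("fog", "smoke"),                # 2 characters short
    "long_wrong": length_penalty("a clock", "smoke"),
    "empty": length_penalty("", "smoke"),
}
for name, value in cases.items():
    print(f"{name}: {value}")
```

Two things stand out: a wrong answer that is longer than the gold answer is not penalized here, and an empty answer also scores 0 because the empty string is a substring of everything. The other reward functions have to cover those cases.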


def think_length_reward_func(prompts, completions, answer, quality_score, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_think(r) for r in responses]
    # Reward think traces of 51 to 499 characters; penalize shorter or longer ones.
    rewards = [0.5 if 50 < len(r) < 500 else -0.5 for r in extracted_responses]
    return rewards

def count_xml(text) -> float:
    count = 0.0
    if text.count("<think>") == 1:
        count += 0.125
    if text.count("</think>") == 1:
        count += 0.125
    if text.count("<answer>") == 1:
        count += 0.125
    if text.count("</answer>") == 1:
        count += 0.125
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
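
To get a feel for the scale of these rewards, the sketch below scores a few made-up completions with `count_xml` (copied from above so the snippet is self-contained) and then adds up the best case across all four reward functions. It assumes the trainer sums the reward functions with equal weights.

```python
def count_xml(text) -> float:
    count = 0.0
    if text.count("<think>") == 1:
        count += 0.125
    if text.count("</think>") == 1:
        count += 0.125
    if text.count("<answer>") == 1:
        count += 0.125
    if text.count("</answer>") == 1:
        count += 0.125
    return count

# Made-up completions for illustration.
well_formed = "<think>\nIt drifts everywhere.\n</think>\n<answer>\nsmoke\n</answer>"
missing_close = "<think>\nIt drifts everywhere.\n</think>\n<answer>\nsmoke"
repeated_tag = "<think>a</think><think>b</think><answer>smoke</answer>"

print(count_xml(well_formed), count_xml(missing_close), count_xml(repeated_tag))

# Best case for a riddle with quality_score = 3:
# format 0.5 + correctness (1 + 3) + answer length 0.0 + think length 0.5
best_case = 0.5 + (1 + 3) + 0.0 + 0.5
print(best_case)
```

Correctness dominates the total, so the formatting and length terms act mostly as shaping signals while the model is still getting answers wrong.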

## Model Training

In [29]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 1e-5,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    do_eval = True,
    eval_strategy='steps',
    per_device_eval_batch_size=8,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 512,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 50,
    eval_steps = 20,
    save_steps = 20,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8
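
For intuition about why `num_generations` matters, here is a simplified sketch of the group-relative normalization that GRPO is built around: each prompt's sampled completions are scored, and every completion's advantage is its reward relative to the group mean, scaled by the group's spread. This is not TRL's implementation; the function name, the use of the sample standard deviation, and the `eps` value are assumptions for illustration, and the rewards are invented.

```python
from statistics import mean, stdev

def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Hypothetical helper: normalize rewards within one group of completions."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Invented total rewards for the 8 completions sampled for one riddle.
rewards = [5.0, 0.5, 0.5, 1.0, 0.0, 0.5, 4.0, 0.5]
advantages = group_advantages(rewards)
print([round(a, 2) for a in advantages])

# If every completion earns the same reward, there is nothing to learn from.
print(group_advantages([0.5] * 8))
```

When all completions in a group receive identical rewards, every advantage is zero and that prompt contributes no gradient signal, which is one reason to sample several generations per prompt.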


In [30]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        correctness_reward_func,
        answer_length_reward_func,
        think_length_reward_func
    ],
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = dev_dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 305 | Num Epochs = 1 | Total steps = 50
O^O/ \_/ \    Batch size per device = 16 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (16 x 1 x 1) = 16
 "-____-"     Trainable parameters = 21,102,592/5,000,000,000 (0.42% trained)


Step,Training Loss,Validation Loss,reward,reward_std,completion_length,kl,rewards / xmlcount_reward_func,rewards / correctness_reward_func,rewards / answer_length_reward_func,rewards / think_length_reward_func
20,0.0,8e-06,0,0,0,0,0,0,0,0
40,0.0,1.8e-05,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log


Unsloth: Will smartly offload gradients to save VRAM!


Unsloth: Input IDs of length 513 > the model's max sequence length of 512.
We shall truncate it ourselves. It's imperative if you correct this issue first.
Unsloth: Input IDs of length 533 > the model's max sequence length of 512.
We shall truncate it ourselves. It's imperative if you correct this issue first.


--------------------
Question:
A house full, a yard full, a chimney full, no one can get a spoonful.
Correct Answer:
smoke
Responses:
The answer to this riddle is a clock. This six-letter word fits all the clues provided: the phrase "A clock is a timepiece" contains the word "clock," and the phrase "timepiece is intricately related to timekeeping and a clock, but can be seen as part of the clock itself, like a plate or bowl in a central location of the clock face, while 'timepiece' itself can also be used as a synonym for a plate, bowl or a pot (milk or tea) indirectly relating to a coffee cup or a spoonful of sugar is a prime example." You may visualize a clock face with a minute hand (30) and an hour hand (12). When the hands point to each other, they coincide at 9:30. So, visually attracting the hand to the 'spoonful' is an apt metaphor. The phrase 'a clock full' can indeed be used to represent a situation where someone or something is completely saturated or filled to the brim with

TrainOutput(global_step=50, training_loss=1.2345230390007344e-05, metrics={'train_runtime': 1563.4253, 'train_samples_per_second': 0.512, 'train_steps_per_second': 0.032, 'total_flos': 0.0, 'train_loss': 1.2345230390007344e-05})

In [None]:
model.save_lora("grpo_saved_lora")

In [None]:
model.save_pretrained_merged("riddle_qwen2.5-1.5B", tokenizer, save_method = "merged_16bit")

In [None]:
import os
hf_token = os.getenv("hf_token")

In [None]:
model.push_to_hub_merged("Pramodith/riddle_qwen2.5-3B", tokenizer, save_method = "merged_16bit", token = hf_token)


In [None]:
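sample_riddle = "Question:\nI have keys but open no locks, I have space but no room, You can enter, but I’m not a door. What am I?"

messages = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": sample_riddle}]

# With tokenize=True the chat template returns the prompt as a list of token ids.
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
)
output_tokens = model.generate(torch.LongTensor([input_ids]).to("cuda:0"), max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_tokens[0][len(input_ids):], skip_special_tokens=True))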

In [None]:
model

In [None]:
!pip install lighteval[math]
# !pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6/ --no-deps

In [None]:
!lighteval vllm "pretrained=Pramodith/riddle_qwen2.5-3B,dtype=float16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}" "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir .