<a href="https://colab.research.google.com/github/pramodith/llm_exploration/blob/pramodith%2Friddllama/riddle_reasoning_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Riddle LLama
This notebook trains a reasoning model to answer riddles. Riddles are reasoning-heavy tasks: to arrive at the right answer, a model must learn to associate different facts and concepts coherently.

In [19]:
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm==0.8.2 --no-deps
    !pip install triton==3.1.0
    !pip install scikit-learn
    !pip install numpy==1.26.4
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib
Successfully installed joblib-1.4.2 threadpoolctl-3.6.0

In [None]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

In [2]:
from datasets import load_dataset, Dataset

In [3]:
# riddles_dataset = load_dataset("mlfoundations-dev/riddle_sense")
riddles_dataset  = load_dataset("Hypersniper/riddles_v1")["train"]

README.md:   0%|          | 0.00/2.33k [00:00<?, ?B/s]

riddles.json:   0%|          | 0.00/555k [00:00<?, ?B/s]

riddles_2.json:   0%|          | 0.00/112k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/469 [00:00<?, ? examples/s]

In [4]:
riddles_dataset[7]

{'answer': 'cards',
 'output': "The answer to this question lies in the realm of playing cards. Let's break down the logic. \n\nThe number 13 is significant in a deck of cards as each suit has 13 cards: Ace through 10, and the three picture cards: Jack, Queen, and King. Now, let's think about the 'hearts' part of the question. In a standard deck of cards, there are four suits: hearts, diamonds, clubs, and spades. One of these suits is hearts.\n\nTherefore, in a deck of cards, there are 13 hearts. However, these hearts do not beat as they are not living, they are simply a suit in a deck of cards. They are symbolic hearts, not biological ones. So, based on this logical deduction, we can conclude that a deck of cards is what has 13 hearts but none that beat. \n\nThis type of riddle requires both literal and metaphorical thinking. The number 13 and the word 'hearts' might initially lead one to think of something biological or living because hearts are typically associated with living being

In [31]:
# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format, the answer section must be as concise as possible and all the thinking/reasoning should be within the reasoning tags:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""
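
To make the target format concrete, the template can be filled in with a reasoning trace and an answer. This is only an illustration: the reasoning and answer strings are made up for the example, and the template is repeated so the snippet runs on its own.

```python
# Same template as above, repeated so this snippet is self-contained.
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

# Illustrative values, not taken from the dataset.
example_completion = XML_COT_FORMAT.format(
    reasoning="A deck of cards has a suit of 13 hearts, and none of them beat.",
    answer="cards",
)
print(example_completion)
```

A completion shaped like this is what the system prompt asks the model to produce: free-form reasoning first, then a short answer on its own line.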


In [6]:
reformatted_riddles_dataset = riddles_dataset.map(
    lambda x: {
        "question": x["instruction"],
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["instruction"]},
        ],
    },
)

Map:   0%|          | 0/469 [00:00<?, ? examples/s]

Format the dataset to contain a __prompt__ key and an __answer__ key.

In [7]:
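def ml_foundations_reformat():
    """Reformat the multiple-choice `mlfoundations-dev/riddle_sense` dataset.

    Assumes `riddles_dataset` was loaded from that dataset without selecting
    a split (see the commented-out `load_dataset` call above).
    """
    reformatted_riddles_dataset = []
    for doc in riddles_dataset["train"]:
        # Map each choice label to its text, for this question only.
        labels = {}
        for choice in doc["question"]["choices"]:
            labels[choice["label"]] = choice["text"]

        answer = labels[doc["answerKey"]]
        prompt = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": doc["question"]["stem"]},
        ]
        reformatted_riddles_dataset.append({
            "question": doc["question"]["stem"],
            "prompt": prompt,
            "answer": answer,
        })
    # Without this return the reformatted rows were silently discarded.
    return reformatted_riddles_dataset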

## Dataset Filtering

In [8]:
import litellm
from litellm import batch_completion
litellm.num_retries = 2  # Retry failed requests up to 2 times

from jinja2 import Environment

environment = Environment()
riddle_quality_system_prompt = """
You are a helpful assistant that evaluates the quality of a riddle.
You will be given a question and its answer. Score the question and answer based on the
following rubric:

0 if the question is not a riddle or the answer is not a riddle answer.
1 if the riddle is very easy and the answer is obvious.
2 if the riddle is moderately hard to solve.
3 if the riddle is hard and needs a lot of reasoning and multiple associations to get to the answer.

Here are some examples:
Question: Where does a person put their phone when they are walking?
Answer: In their pocket.
Score: 0

Question: What has keys but can't open locks?
Answer: A piano.
Score: 1

Question: What has a heart that doesn't beat?
Answer: An artichoke.
Score: 2

Question: What has a head and a tail but no body?
Answer: A coin.
Score: 1

Question: Many minds, but not a single face,
I make decisions with layered grace.
Each vote I cast makes outcomes clear,
With randomness keeping bias in fear.
What am I?
Answer: Random forest.
Score: 3
"""

ModuleNotFoundError: No module named 'litellm'

In [9]:
reformatted_riddles_dataset[0:2]

{'answer': ['music', 'onion'],
 'output': ["The question seems to be a riddle that is focusing on an entity that can evoke a range of emotional responses. The given clues are that it can make someone dream or stamp their feet, and it can also make someone laugh or weep. \n\nLet's break down these clues. Dreaming and stamping feet are both reactions that can be associated with feelings of joy, excitement, or anticipation. On the other hand, laughing and weeping are expressions of happiness and sadness respectively. So, we are looking for something that can induce these varied emotional reactions.\n\nConsidering these clues, one plausible answer could be 'music'. Here's why: \n\nMusic has a profound impact on our emotions. It has the power to uplift our spirits, soothe our nerves, make us feel happy, sad, excited, calm, and even lead us into a state of introspection or dreaming. \n\nWhen we listen to upbeat music, it often makes us want to move or dance. Hence, the reference to 'stamp th

In [10]:
from tqdm import tqdm

def score_riddles(dataset, batch_size=16):
    """Ask the judge model to score every riddle, `batch_size` riddles at a time."""
    scores = []
    for start in tqdm(range(0, len(dataset), batch_size)):
        # Slicing a Dataset returns a dict mapping column names to lists.
        batch = dataset[start:start + batch_size]
        responses = batch_completion(
            model="openai/gpt-4o-mini",
            temperature=0.0,
            max_tokens=6,
            messages=[
                [
                    {"role": "system", "content": riddle_quality_system_prompt},
                    {"role": "user", "content": f"Please score the following riddle:\nQuestion: {question}\nAnswer: {answer}"},
                ]
                for question, answer in zip(batch["question"], batch["answer"])
            ],
        )
        try:
            scores.extend([r.choices[0].message.content for r in responses])
        except Exception as e:
            # If any response lacks `.choices` (for example, a failed request),
            # print the error and return the raw batch so it can be inspected.
            print(e)
            return responses
    return scores

In [11]:
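import re

def extract_score(scores, dataset=None):
    """Parse the integer after "Score: " from each judge reply.

    Returns -1 for replies that are not strings or contain no score, so the
    result always has one entry per input and can be added as a column.
    `dataset` is unused and kept only so existing calls still work.
    """
    extracted_scores = []
    for score in scores:
        found = re.findall(r"Score: (\d)", score) if isinstance(score, str) else []
        extracted_scores.append(int(found[0]) if found else -1)

    return extracted_scores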
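
As a quick check of the parsing step, here is a minimal sketch that applies the same regular expression to a few made-up judge replies. `parse_score` is a hypothetical single-reply helper written for this example, and mapping non-string replies to -1 is an assumption of the sketch.

```python
import re

def parse_score(reply):
    """Hypothetical helper: integer after 'Score: ', or -1 when none is found."""
    if not isinstance(reply, str):
        return -1
    found = re.findall(r"Score: (\d)", reply)
    return int(found[0]) if found else -1

# Made-up replies covering the cases the parser has to handle.
replies = ["Score: 2", "I would rate this riddle 3.", None, "Score: 0"]
parsed = [parse_score(r) for r in replies]
print(parsed)  # [2, -1, -1, 0]
```

Only replies that literally contain `Score: <digit>` are parsed; anything else falls back to -1.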


In [None]:
scores = score_riddles(reformatted_riddles_dataset)
# reformatted_riddles_dataset = extract_score(scores, reformatted_riddles_dataset)

In [None]:
scores[-1]

In [None]:
scores = extract_score(scores, reformatted_riddles_dataset)
reformatted_riddles_dataset = reformatted_riddles_dataset.add_column("quality_score", scores)

In [None]:
reformatted_riddles_dataset[0]

In [None]:
reformatted_riddles_dataset = Dataset.from_list(reformatted_riddles_dataset)
reformatted_riddles_dataset.save_to_disk("./riddles_dataset")

ds = Dataset.load_from_disk("riddles_dataset")

In [20]:
reformatted_riddles_dataset = [dict(row) for row in reformatted_riddles_dataset]

Create train, dev, and test splits of the dataset (80% / 10% / 10%).

In [21]:
from sklearn.model_selection import train_test_split
train, not_train = train_test_split(reformatted_riddles_dataset, test_size=0.2, random_state=42)
dev, test = train_test_split(not_train, test_size=0.5, random_state=42)

In [22]:
# Convert list of dicts to a huggingface dataset
from datasets import Dataset
train_dataset = Dataset.from_list(train)
dev_dataset = Dataset.from_list(dev)
test_dataset = Dataset.from_list(test)

In [23]:
train_dataset[0]

{'answer': 'A mirror.',
 'output': 'The riddle is referring to an object that reacts to both physical and emotional stimuli. Let\'s break it down:\n\nFirstly, the line "Drop me and I’m sure to crack" indicates that the object in question is fragile. When you drop fragile items, they often break or crack. This could apply to many things such as glass, ceramics, eggs, and more. However, to figure out the exact object, we need to consider the second part of the riddle as well.\n\nThe second line "lend me a smile and I’ll certainly smile back" suggests that the object reflects what is presented in front of it. This reflection is not necessarily physical, but in this case, it is. When you smile at this object, it appears to smile back at you. Not many objects have this reflective quality, which narrows down our options.\n\nConsidering both lines of the riddle, the object must be something that is both fragile (can crack when dropped) and reflective (can mimic a smile). The object that fits 

In [24]:
print(f"Train size: {len(train)}")
print(f"Dev size: {len(dev)}")
print(f"Test size: {len(test)}")

Train size: 375
Dev size: 47
Test size: 47
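
The sizes above can be reproduced by hand. The sketch below assumes that, for a fractional `test_size`, the held-out part is rounded up and the remainder becomes the training part; that assumption matches the printed sizes. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Hypothetical helper: (kept, held_out) sizes, rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

# 469 riddles: first hold out 20%, then halve the held-out part.
n_train, n_not_train = split_sizes(469, 0.2)
n_dev, n_test = split_sizes(n_not_train, 0.5)
print(n_train, n_dev, n_test)  # 375 47 47
```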


In [25]:
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()
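
A short usage sketch of `extract_xml_answer`, with the function repeated so the snippet runs on its own. The two input strings are illustrative. Note the fallback behaviour: when no `<answer>` tag is present, the whole text is returned.

```python
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

# Illustrative inputs: one follows the requested format, one ignores it.
well_formed = "<reasoning>\nHearts are a suit.\n</reasoning>\n<answer>\ncards\n</answer>\n"
untagged = "The answer is a circle."

print(repr(extract_xml_answer(well_formed)))  # 'cards'
print(repr(extract_xml_answer(untagged)))     # 'The answer is a circle.'
```

Because untagged text passes through unchanged, a completion that skips the format is still compared against the reference answer as a whole.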

### Unsloth
Load up `Qwen 2.5 1.5B Instruct`, and set parameters

In [26]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!




INFO 04-11 15:46:19 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.51.2. vLLM: 0.8.2.
   \\   /|    NVIDIA A100-PCIE-40GB. Num GPUs = 1. Max memory: 39.394 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit with actual GPU utilization = 49.43%
Unsloth: Your GPU has CUDA compute capability 8.0 with VRAM = 39.39 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 288.
Unsloth: vLLM's KV Cache can use up to 18.14 GB. Also swap space = 6 GB.
INFO 04-11 15:46:31 [config.py:585] This model supports multiple tasks: {'reward', 'classify', 'gene

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/270 [00:00<?, ?B/s]

INFO 04-11 15:46:35 [cuda.py:291] Using Flash Attention backend.
INFO 04-11 15:46:36 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-11 15:46:36 [model_runner.py:1110] Starting to load model unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit...
INFO 04-11 15:46:36 [loader.py:1155] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 04-11 15:46:36 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/1.53G [00:00<?, ?B/s]

INFO 04-11 15:46:53 [weight_utils.py:281] Time spent downloading weights for unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit: 16.465836 seconds
INFO 04-11 15:46:53 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 04-11 15:46:54 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 04-11 15:46:54 [model_runner.py:1146] Model loading took 1.5708 GB and 18.396662 seconds
INFO 04-11 15:46:56 [worker.py:267] Memory profiling takes 1.19 seconds
INFO 04-11 15:46:56 [worker.py:267] the current vLLM instance can use total_gpu_memory (39.39GiB) x gpu_memory_utilization (0.49) = 19.47GiB
INFO 04-11 15:46:56 [worker.py:267] model weights take 1.57GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.57GiB; the rest of the memory reserved for KV Cache is 16.24GiB.
INFO 04-11 15:46:56 [executor_base.py:111] # cuda blocks: 38013, # CPU blocks: 14043
INFO 04-11 15:46:56 [executor_base.py:116] Maximum concurrency for 1024 tokens per request: 593.95x
INFO 04-11 15:47:00 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.

Capturing CUDA graph shapes: 100%|██████████| 39/39 [00:24<00:00,  1.57it/s]

INFO 04-11 15:47:25 [model_runner.py:1570] Graph capturing finished in 25 secs, took 3.72 GiB
INFO 04-11 15:47:25 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 30.78 seconds



Unsloth 2025.3.19 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


### Reward functions

In [41]:
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Give 5.0 when the extracted answer and the reference contain one another (case-insensitive)."""
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # Log the first completion of the group for inspection.
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    # An empty extraction is a substring of every answer, so reject it explicitly.
    return [
        5.0 if r and (a.lower() in r.lower() or r.lower() in a.lower()) else 0.0
        for r, a in zip(extracted_responses, answer)
    ]

def answer_length_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    rewards  = []
    for r, a in zip(extracted_responses, answer):
        rewards.append(0)
        if a in r or r in a:
            if len(r) > len(a):
                rewards[-1] = -0.001 * (len(r) - len(a))
        else:
            if len(r) > len(a):
                rewards[-1] = -0.005 * (len(r) - len(a))
    return rewards


def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span newlines; the requested format puts the
    # reasoning and the answer on their own lines.
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
    if text.count("\n</answer>") == 1:
        count += 0.125
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
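
To see how these rewards behave on concrete text, here is a small sketch. `count_xml` is repeated from above so the snippet is self-contained; `answers_match` is a hypothetical helper that isolates the two-way, case-insensitive substring test used for correctness. All sample strings are illustrative.

```python
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
    if text.count("\n</answer>") == 1:
        count += 0.125
    return count

def answers_match(response: str, reference: str) -> bool:
    """Hypothetical helper: two-way, case-insensitive substring test."""
    r, a = response.lower(), reference.lower()
    return a in r or r in a

well_formed = "<reasoning>\nIt reflects a smile.\n</reasoning>\n<answer>\nmirror\n</answer>\n"
one_line = "<reasoning>It reflects.</reasoning><answer>mirror</answer>"

print(count_xml(well_formed))  # 0.5
print(count_xml(one_line))     # 0.0

print(answers_match("mirror", "A mirror."))       # True
print(answers_match("A darkroom.", "A school."))  # False
print(answers_match("a", "A mirror."))            # True
```

The last line shows how lenient the substring test is: a very short response can match by accident. The length penalty only discourages answers longer than the reference, so it does not counter this case.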

## Model Training

In [42]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 500,
    save_steps = 100,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8
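
The message above has a practical consequence for how much data each step covers. A rough sketch of the arithmetic, assuming the batch size counts completions (so a batch of 8 with 8 generations holds one unique prompt) and a single GPU:

```python
import math

num_examples = 375               # training set size
num_generations = 8
per_device_batch_size = 8        # after the adjustment reported above
gradient_accumulation_steps = 1
max_steps = 500

# Assumption: the batch is filled with completions, num_generations per prompt.
completions_per_step = per_device_batch_size * gradient_accumulation_steps
prompts_per_step = completions_per_step // num_generations
prompts_seen = prompts_per_step * max_steps
passes = prompts_seen / num_examples

print(prompts_per_step, prompts_seen, round(passes, 2), math.ceil(passes))  # 1 500 1.33 2
```

Under these assumptions, 500 steps cover the training set about 1.33 times.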


In [43]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        correctness_reward_func,
        answer_length_reward_func
    ],
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = dev_dataset
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 375 | Num Epochs = 2 | Total steps = 500
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 73,859,072/5,000,000,000 (1.48% trained)


-------------------- Question:
Flat as a leaf, round as a ring. Has two eyes, can't see a thing. 
Answer:
button 
Response:
The answer is a circle. 
Extracted:
The answer is a circle.


| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / correctness_reward_func | rewards / answer_length_reward_func |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | -0.260625 | 0.396092 | 15.25 | 0.000527 | 0.0 | 0.0 | 0.0 | -0.260625 |
| 2 | 0.0001 | -1.77425 | 1.437216 | 75.875 | 0.001777 | -0.0655 | 0.0 | 0.0 | -1.70875 |
| 3 | 0.0001 | -0.963125 | 1.836532 | 41.75 | 0.00182 | 0.0 | 0.0 | 0.0 | -0.963125 |
| 4 | 0.0 | -2.065625 | 1.095078 | 86.375 | 0.001138 | 0.0 | 0.0 | 0.0 | -2.065625 |
| 5 | 0.0001 | -0.058125 | 0.115941 | 6.75 | 0.002262 | 0.0 | 0.0 | 0.0 | -0.058125 |
| 6 | 0.0 | -1.548125 | 2.728174 | 100.875 | 0.001143 | 0.0 | 0.0 | 0.625 | -2.173125 |
| 7 | 0.0001 | 1.518625 | 2.875422 | 20.25 | 0.001769 | 0.0 | 0.0 | 1.875 | -0.356375 |
| 8 | 0.0001 | -0.8125 | 0.436913 | 40.0 | 0.001364 | 0.0 | 0.0 | 0.0 | -0.8125 |
| 9 | 0.0001 | -0.445 | 0.850643 | 24.5 | 0.001251 | 0.0 | 0.0 | 0.0 | -0.445 |
| 10 | 0.0001 | -0.65375 | 0.496522 | 31.25 | 0.001353 | 0.0 | 0.0 | 0.0 | -0.65375 |
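
The logged columns can be read together: the `reward` column is the sum of the four per-function columns, and because a correct completion earns 5.0, the correctness column reveals how many of the 8 completions in a group matched the reference. A small sketch using three rows copied from the log above:

```python
import math

NUM_GENERATIONS = 8
CORRECT_REWARD = 5.0

# step -> (xmlcount, soft_format, correctness, answer_length, total reward)
logged = {
    2: (-0.0655, 0.0, 0.0, -1.70875, -1.77425),
    6: (0.0, 0.0, 0.625, -2.173125, -1.548125),
    7: (0.0, 0.0, 1.875, -0.356375, 1.518625),
}

correct_counts = {}
for step, (*components, total) in logged.items():
    # The logged reward is the sum of the per-function mean rewards.
    assert math.isclose(sum(components), total, abs_tol=1e-6)
    # Mean correctness reward -> number of correct completions in the group.
    correct_counts[step] = round(components[2] * NUM_GENERATIONS / CORRECT_REWARD)

print(correct_counts)  # {2: 0, 6: 1, 7: 3}
```

So the jump in reward at step 7 corresponds to three of the eight sampled completions containing the reference answer.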


-------------------- Question:
Different lights do make me strange, thus into different sizes I will change. 
Answer:
pupil 
Response:
Different lights can affect how you perceive size and shape, making you appear larger or smaller depending on the lighting conditions. This can make you seem strange or out of place in certain situations. 
Extracted:
Different lights can affect how you perceive size and shape, making you appear larger or smaller depending on the lighting conditions. This can make you seem strange or out of place in certain situations.
-------------------- Question:
There is a house. One enters it blind and comes out seeing. What is it? 
Answer:
A school. 
Response:
A darkroom. 
Extracted:
A darkroom.
-------------------- Question:
Although my cow is dead, I still beat her
 What a racket she makes. 
Answer:
drum 
Response:
The phrase "Although my cow is dead, I still beat her
 What a racket she makes" is a humorous exaggeration suggesting someone continues to engage in

TrainOutput(global_step=500, training_loss=0.08801518931858664, metrics={'train_runtime': 461.66, 'train_samples_per_second': 8.664, 'train_steps_per_second': 1.083, 'total_flos': 0.0, 'train_loss': 0.08801518931858664})

In [48]:
model.save_lora("grpo_saved_lora")

In [49]:
model.save_pretrained_merged("riddle_qwen2.5-1.5B", tokenizer, save_method = "merged_16bit")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 79.59 out of 125.71 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 129.72it/s]


Unsloth: Saving tokenizer... Done.
Done.


In [50]:
import os
hf_token = os.getenv("hf_token")

In [51]:
model.push_to_hub_merged("Pramodith/riddle_qwen2.5-1.5B", tokenizer, save_method = "merged_16bit", token = hf_token)


Unsloth: You are pushing to hub, but you passed your HF username = Pramodith.
We shall truncate Pramodith/riddle_qwen2.5-1.5B to riddle_qwen2.5-1.5B


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 79.61 out of 125.71 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 163.30it/s]


Unsloth: Saving tokenizer...

No files have been modified since last commit. Skipping to prevent empty commit.


 Done.


README.md:   0%|          | 0.00/621 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/Pramodith/riddle_qwen2.5-1.5B


In [None]:
!pip install "lighteval[math]"
# !pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6/ --no-deps

In [1]:
!lighteval vllm "pretrained=Pramodith/riddle_qwen2.5-1.5B,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}" "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir .

[2025-04-11 16:32:21,050] [    INFO]: PyTorch version 2.6.0 available. (config.py:54)
INFO 04-11 16:32:23 [__init__.py:239] Automatically detected platform cuda.
[2025-04-11 16:32:24,390] [    INFO]: --- LOADING MODEL --- (pipeline.py:188)
[2025-04-11 16:32:31,807] [    INFO]: This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'. (config.py:585)
[2025-04-11 16:32:31,871] [    INFO]: Chunked prefill is enabled with max_num_batched_tokens=8192. (config.py:1697)
[2025-04-11 16:32:33,810] [    INFO]: Initializing a V1 LLM engine (v0.8.2) with config: model='Pramodith/riddle_qwen2.5-1.5B', speculative_config=None, tokenizer='Pramodith/riddle_qwen2.5-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=aut