<a href="https://colab.research.google.com/github/pramodith/llm_exploration/blob/pramodith%2Friddllama/riddle_reasoning_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Riddle LLama
This notebook trains a reasoning model to answer riddles. Riddles are reasoning-heavy tasks: to arrive at the right answer, a model must learn to associate different facts and concepts coherently.

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

In [4]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

In [6]:
from datasets import load_dataset, Dataset

In [4]:
# riddles_dataset = load_dataset("mlfoundations-dev/riddle_sense")
riddles_dataset  = load_dataset("Hypersniper/riddles_v1")["train"]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/2.33k [00:00<?, ?B/s]

riddles.json:   0%|          | 0.00/555k [00:00<?, ?B/s]

riddles_2.json:   0%|          | 0.00/112k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/469 [00:00<?, ? examples/s]

In [5]:
riddles_dataset[7]

{'answer': 'cards',
 'output': "The answer to this question lies in the realm of playing cards. Let's break down the logic. \n\nThe number 13 is significant in a deck of cards as each suit has 13 cards: Ace through 10, and the three picture cards: Jack, Queen, and King. Now, let's think about the 'hearts' part of the question. In a standard deck of cards, there are four suits: hearts, diamonds, clubs, and spades. One of these suits is hearts.\n\nTherefore, in a deck of cards, there are 13 hearts. However, these hearts do not beat as they are not living, they are simply a suit in a deck of cards. They are symbolic hearts, not biological ones. So, based on this logical deduction, we can conclude that a deck of cards is what has 13 hearts but none that beat. \n\nThis type of riddle requires both literal and metaphorical thinking. The number 13 and the word 'hearts' might initially lead one to think of something biological or living because hearts are typically associated with living being

In [7]:
# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""


In [8]:
reformatted_riddles_dataset = riddles_dataset.map(
    lambda x: {
        "question": x["instruction"],
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["instruction"]},
        ],
    },
)

Map:   0%|          | 0/469 [00:00<?, ? examples/s]

Format the dataset to contain a __prompt__ key and an __answer__ key.
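
As a concrete illustration of that layout, the sketch below builds a single record by hand. `build_record` is a hypothetical helper, not part of the notebook; the riddle and its answer are taken from the training log, and the reasoning text passed to `XML_COT_FORMAT` is a placeholder.

```python
# Sketch only: `build_record` is a hypothetical helper used for illustration.
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""


def build_record(question: str, answer: str) -> dict:
    """Wrap one riddle in the chat-style prompt layout plus its reference answer."""
    return {
        "question": question,
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        "answer": answer,
    }


record = build_record(
    "Flat as a leaf, round as a ring. Has two eyes, can't see a thing.",
    "button",
)

# A completion in the requested format (the reasoning text is a placeholder).
target = XML_COT_FORMAT.format(reasoning="placeholder reasoning", answer=record["answer"])
print(record["prompt"][-1]["content"])
print(target)
```

The `prompt` key holds the chat messages sent to the model, while `answer` is kept separately so it can be compared against the model's output.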

In [None]:
def ml_foundations_reformat(dataset):
    """Reformat the multiple-choice riddle_sense split into prompt/answer records."""
    reformatted = []
    for doc in dataset:
        # Map each choice label to its text; rebuilt for every riddle so
        # labels from a previous riddle cannot leak into this one.
        labels = {choice["label"]: choice["text"] for choice in doc["question"]["choices"]}
        answer = labels[doc["answerKey"]]
        prompt = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": doc["question"]["stem"]},
        ]
        reformatted.append({
            "question": doc["question"]["stem"],
            "prompt": prompt,
            "answer": answer,
        })
    return reformatted

## Dataset Filtering

In [None]:
import litellm
from litellm import batch_completion
litellm.num_retries = 2  # Retry up to 2 times

from jinja2 import Environment

environment = Environment()
riddle_quality_system_prompt = """
You are a helpful assistant that evaluates the quality of a riddle.
You will be given a question and its answer. Score the question and answer based on the
following rubric:

0 if the question is not a riddle or the answer is not a riddle answer.
1 if the riddle is very easy and the answer is obvious.
2 if the riddle is moderately hard to solve.
3 if the riddle is hard and needs a lot of reasoning and multiple associations to get to the answer.

Here are some examples:
Question: Where does a person put their phone when they are walking?
Answer: In their pocket.
Score: 0

Question: What has keys but can't open locks?
Answer: A piano.
Score: 1

Question: What has a heart that doesn't beat?
Answer: An artichoke.
Score: 2

Question: What has a head and a tail but no body?
Answer: A coin.
Score: 1

Question: Many minds, but not a single face,
I make decisions with layered grace.
Each vote I cast makes outcomes clear,
With randomness keeping bias in fear.
What am I?
Answer: Random forest.
Score: 3
"""

In [None]:
reformatted_riddles_dataset[0:2]

In [None]:
from tqdm import tqdm

def score_riddles(dataset, batch_size=16):
    """Ask the judge model for a quality score per riddle, one batch at a time."""
    scores = []
    for start in tqdm(range(0, len(dataset), batch_size)):
        # Slicing a Hugging Face Dataset returns a dict of column lists.
        batch = dataset[start:start + batch_size]
        responses = batch_completion(
            model = "openai/gpt-4o-mini",
            temperature=0.0,
            max_tokens=6,
            messages = [
                [
                    {"role": "system", "content": riddle_quality_system_prompt},
                    {"role": "user", "content": f"Please score the following riddle:\nQuestion: {question}\nAnswer: {answer}" }
                ]
                for question, answer in zip(batch["question"], batch["answer"])
            ],
        )
        try:
            scores.extend([r.choices[0].message.content for r in responses])
        except Exception as e:
            # Stop early and return the raw responses so the failure can be inspected.
            print(e)
            return responses
    return scores

In [None]:
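import re

def extract_score(scores, dataset=None):
    """Parse "Score: N" from each judge response; -1 marks a missing or unparseable score.

    `dataset` is unused and kept only so existing calls keep working.
    """
    extracted_scores = []
    for score in scores:
        # Non-string responses (e.g. errors) also get -1 so the output
        # stays aligned with the dataset rows.
        match = re.search(r"Score: (\d)", score) if isinstance(score, str) else None
        extracted_scores.append(int(match.group(1)) if match else -1)
    return extracted_scores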


In [None]:
scores = score_riddles(reformatted_riddles_dataset)
# reformatted_riddles_dataset = extract_score(scores, reformatted_riddles_dataset)

In [None]:
scores[-1]

In [None]:
scores = extract_score(scores, reformatted_riddles_dataset)
reformatted_riddles_dataset = reformatted_riddles_dataset.add_column("quality_score", scores)
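
Once each row carries a `quality_score`, the scores can drive the actual filtering. The sketch below is one possible approach, assuming that riddles scored 2 or higher are worth keeping and that unparseable scores (-1) should be dropped. The threshold `MIN_QUALITY` and the helper `filter_by_quality` are assumptions; the sample rows reuse the examples from the rubric prompt, plus one hypothetical row whose score failed to parse.

```python
# Sketch only: the threshold and helper name are assumptions, not the notebook's method.
MIN_QUALITY = 2  # assumed cut-off: keep "moderately hard" riddles and above

rows = [
    {"question": "Where does a person put their phone when they are walking?",
     "answer": "In their pocket.", "quality_score": 0},
    {"question": "What has keys but can't open locks?",
     "answer": "A piano.", "quality_score": 1},
    {"question": "What has a heart that doesn't beat?",
     "answer": "An artichoke.", "quality_score": 2},
    {"question": "Many minds, but not a single face, ... What am I?",
     "answer": "Random forest.", "quality_score": 3},
    # Hypothetical row whose judge response could not be parsed.
    {"question": "placeholder", "answer": "placeholder", "quality_score": -1},
]


def filter_by_quality(rows, min_score=MIN_QUALITY):
    """Keep rows whose quality score meets the threshold; -1 rows are dropped too."""
    return [row for row in rows if row["quality_score"] >= min_score]


kept = filter_by_quality(rows)
print([row["answer"] for row in kept])
```

On a Hugging Face `Dataset`, `dataset.filter(lambda row: row["quality_score"] >= MIN_QUALITY)` applies the same rule without converting to a list.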

In [None]:
reformatted_riddles_dataset[0]

In [None]:
reformatted_riddles_dataset = Dataset.from_list(reformatted_riddles_dataset)
reformatted_riddles_dataset.save_to_disk("./riddles_dataset")

ds = Dataset.load_from_disk("riddles_dataset")

In [9]:
reformatted_riddles_dataset = [dict(row) for row in reformatted_riddles_dataset]

Create train, dev, and test splits of the dataset (80/10/10).

In [10]:
from sklearn.model_selection import train_test_split
train, not_train = train_test_split(reformatted_riddles_dataset, test_size=0.2, random_state=42)
dev, test = train_test_split(not_train, test_size=0.5, random_state=42)

In [11]:
# Convert list of dicts to a huggingface dataset
from datasets import Dataset
train_dataset = Dataset.from_list(train)
dev_dataset = Dataset.from_list(dev)
test_dataset = Dataset.from_list(test)

In [12]:
train_dataset[0]

{'answer': 'A mirror.',
 'output': 'The riddle is referring to an object that reacts to both physical and emotional stimuli. Let\'s break it down:\n\nFirstly, the line "Drop me and I’m sure to crack" indicates that the object in question is fragile. When you drop fragile items, they often break or crack. This could apply to many things such as glass, ceramics, eggs, and more. However, to figure out the exact object, we need to consider the second part of the riddle as well.\n\nThe second line "lend me a smile and I’ll certainly smile back" suggests that the object reflects what is presented in front of it. This reflection is not necessarily physical, but in this case, it is. When you smile at this object, it appears to smile back at you. Not many objects have this reflective quality, which narrows down our options.\n\nConsidering both lines of the riddle, the object must be something that is both fragile (can crack when dropped) and reflective (can mimic a smile). The object that fits 

In [13]:
print(f"Train size: {len(train)}")
print(f"Dev size: {len(dev)}")
print(f"Test size: {len(test)}")

Train size: 375
Dev size: 47
Test size: 47
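
These sizes follow from how the two `train_test_split` calls round. The sketch below reproduces the arithmetic, assuming scikit-learn rounds the held-out portion up (ceiling) and gives the remainder to the other side; the variable names are illustrative.

```python
import math

n_total = 469  # rows in the riddles dataset

# First split: test_size=0.2, held-out portion is rounded up.
n_not_train = math.ceil(0.2 * n_total)
n_train = n_total - n_not_train

# Second split: test_size=0.5 of the held-out rows.
n_test = math.ceil(0.5 * n_not_train)
n_dev = n_not_train - n_test

print(n_train, n_dev, n_test)
```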


In [14]:
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()
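
Note that the split-based extractor returns the entire completion when no `<answer>` tag is present, so an untagged response is treated as if it were the answer. A stricter variant is sketched here, assuming an untagged completion should count as "no answer"; `extract_xml_answer_strict` is a hypothetical name, not a function the notebook defines.

```python
import re

# Non-greedy match between the tags; DOTALL lets the answer span several lines.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_xml_answer_strict(text: str) -> str:
    """Return the contents of the last <answer>...</answer> pair, or "" if none exists."""
    matches = _ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else ""


tagged = "<reasoning>\nTwo holes look like eyes.\n</reasoning>\n<answer>\nbutton\n</answer>\n"
untagged = "The answer is: A pancake."

print(repr(extract_xml_answer_strict(tagged)))
print(repr(extract_xml_answer_strict(untagged)))
```

An empty string is a substring of every string, so any caller that compares answers with `in` should treat an empty extraction as incorrect before doing the comparison.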

### Unsloth
Load up `Qwen 2.5 1.5B Instruct` and set the parameters.

In [15]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-09 15:05:17 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.50.3. vLLM: 0.8.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit with actual GPU utilization = 49.53%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 5.97 GB.

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/270 [00:00<?, ?B/s]

INFO 04-09 15:05:53 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-09 15:05:53 [cuda.py:289] Using XFormers backend.
INFO 04-09 15:05:54 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-09 15:05:54 [model_runner.py:1110] Starting to load model unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit...
INFO 04-09 15:05:54 [loader.py:1155] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 04-09 15:05:57 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/1.53G [00:00<?, ?B/s]

INFO 04-09 15:06:04 [weight_utils.py:281] Time spent downloading weights for unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit: 7.224819 seconds
INFO 04-09 15:06:05 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 04-09 15:06:06 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 04-09 15:06:07 [model_runner.py:1146] Model loading took 1.5708 GiB and 12.235105 seconds
INFO 04-09 15:06:15 [worker.py:267] Memory profiling takes 7.22 seconds
INFO 04-09 15:06:15 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.50) = 7.30GiB
INFO 04-09 15:06:15 [worker.py:267] model weights take 1.57GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 4.66GiB.
INFO 04-09 15:06:15 [executor_base.py:112] # cuda blocks: 10899, # CPU blocks: 4681
INFO 04-09 15:06:15 [executor_base.py:117] Maximum concurrency for 1024 tokens per request: 170.30x
INFO 04-09 15:06:18 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. I

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:50<00:00,  1.89s/it]

INFO 04-09 15:07:09 [model_runner.py:1598] Graph capturing finished in 51 secs, took 0.41 GiB
INFO 04-09 15:07:09 [llm_engine.py:448] init engine (profile, create kv cache, warmup model) took 62.69 seconds





Unsloth 2025.3.19 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
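
The number of trainable LoRA parameters can be checked by hand. Each adapted linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` parameters. The sketch below assumes the Qwen2.5-1.5B dimensions (hidden size 1536, MLP size 8960, 12 attention heads, 2 key/value heads, 28 layers); these values are assumptions taken from the model's published configuration rather than from this notebook.

```python
# Assumed Qwen2.5-1.5B dimensions (not stated in the notebook).
hidden, intermediate, n_layers = 1536, 8960, 28
n_heads, n_kv_heads = 12, 2
head_dim = hidden // n_heads      # per-head width
kv_dim = n_kv_heads * head_dim    # grouped-query attention shrinks K and V

lora_rank = 64

# (d_in, d_out) for every module listed in target_modules.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# LoRA adds an A matrix (r x d_in) and a B matrix (d_out x r) per module.
per_layer = sum(lora_rank * (d_in + d_out) for d_in, d_out in shapes.values())
total = per_layer * n_layers
print(f"{total:,}")
```

Under these assumptions the total agrees with the `Trainable parameters = 73,859,072` figure that the trainer prints when training starts.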


### Reward functions

In [16]:
# Reward functions
import re

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Give 2.0 when the extracted answer and the reference contain one another (case-insensitive)."""
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    # An empty extraction is a substring of every answer, so it must not earn the reward.
    return [
        2.0 if r and (a.lower() in r.lower() or r.lower() in a.lower()) else 0.0
        for r, a in zip(extracted_responses, answer)
    ]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that match the exact tag-and-newline layout of the format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` cross newlines so multi-line reasoning can match.
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that contain both tag pairs in order, ignoring exact whitespace."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.search tolerates leading text; re.DOTALL allows multi-line tag contents.
    matches = [re.search(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
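
To see how the partial-credit counter behaves, the sketch below repeats `count_xml` unchanged and applies it to three hand-written completions. The completions are illustrative, except the untagged one, which is the first response shown in the training log.

```python
def count_xml(text) -> float:
    """Same scoring as above: 0.125 per correctly placed tag, minus a trailing-text penalty."""
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count


well_formed = "<reasoning>\nTwo holes look like eyes.\n</reasoning>\n<answer>\nbutton\n</answer>\n"
with_trailing = well_formed + "extra"  # five stray characters after the closing tag
untagged = "The answer is: A pancake."

print(count_xml(well_formed))    # all four tags placed correctly, no penalty
print(count_xml(with_trailing))  # about 0.49: each of the two penalties removes 0.005
print(count_xml(untagged))       # no tags, no credit
```

A completion that satisfies every reward function can earn at most 0.5 + 0.5 + 0.5 + 2.0 = 3.5, so the correctness term dominates the formatting terms.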

## Model Training

In [17]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 100,
    save_steps = 100,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8
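
With `num_generations = 8`, every prompt yields a group of eight completions, and GRPO scores each completion relative to its own group. The sketch below shows the usual group-normalized advantage, assuming the sample standard deviation and a small epsilon; `group_advantages` is a hypothetical helper and the exact normalization may differ between implementations.

```python
from statistics import mean, stdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample standard deviation (n - 1 in the denominator)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every completion fails: identical rewards carry no learning signal.
flat = [0.0] * 8
print(group_advantages(flat))

# Three of eight completions earn the 2.0 correctness reward.
mixed = [2.0] * 3 + [0.0] * 5
print(mean(mixed), round(stdev(mixed), 6))
print([round(a, 3) for a in group_advantages(mixed)])
```

When all eight rewards in a group are equal, every advantage is zero and the step contributes nothing to the policy update, which is consistent with the zero training loss logged at steps where `reward_std` is 0. The `mixed` group gives a mean of 0.75 and a standard deviation close to 1.035, matching the `reward` and `reward_std` logged at step 6.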


In [18]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = dev_dataset
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 375 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 73,859,072/5,000,000,000 (1.48% trained)


-------------------- Question:
Flat as a leaf, round as a ring. Has two eyes, can't see a thing. 
Answer:
button 
Response:
The answer is: A pancake.
Explanation: A pancake is flat as a leaf when viewed from one direction, and round when viewed from another. A pancake has a hole in the center, so it can't see through itself. 
Extracted:
The answer is: A pancake.
Explanation: A pancake is flat as a leaf when viewed from one direction, and round when viewed from another. A pancake has a hole in the center, so it can't see through itself.


| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / correctness_reward_func |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 65.875 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 116.875 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 66.375 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 143.625 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 15.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.75 | 1.035098 | 96.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.75 |
| 7 | 0.0 | 0.487875 | 0.933904 | 88.125 | 2e-05 | -0.012125 | 0.0 | 0.0 | 0.5 |
| 8 | 0.0 | 0.0 | 0.0 | 51.75 | 1.6e-05 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | -0.009875 | 0.027931 | 79.75 | 3.4e-05 | -0.009875 | 0.0 | 0.0 | 0.0 |
| 10 | 0.0 | 0.0 | 0.0 | 37.375 | 5e-05 | 0.0 | 0.0 | 0.0 | 0.0 |


-------------------- Question:
Different lights do make me strange, thus into different sizes I will change. 
Answer:
pupil 
Response:
This is a metaphorical representation. Different people respond to stimuli (like light) differently, causing them to perceive or act in various ways, which could lead to them becoming "strange" or in different "sizes" based on their individual experiences, reactions, and perceptions. The light acts as an intermediary, facilitating different responses and perceptions, which are then condensed into the concept of "changing into different sizes." 
Extracted:
This is a metaphorical representation. Different people respond to stimuli (like light) differently, causing them to perceive or act in various ways, which could lead to them becoming "strange" or in different "sizes" based on their individual experiences, reactions, and perceptions. The light acts as an intermediary, facilitating different responses and perceptions, which are then condensed into the c

TrainOutput(global_step=100, training_loss=4.7379250443704276e-05, metrics={'train_runtime': 880.2307, 'train_samples_per_second': 0.909, 'train_steps_per_second': 0.114, 'total_flos': 0.0, 'train_loss': 4.7379250443704276e-05})

In [19]:
model.save_lora("grpo_saved_lora")

In [20]:
model.save_pretrained_merged("riddle_qwen2.5-1.5B", tokenizer, save_method = "merged_16bit",)


Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 1.5G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 1.41 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 21.85it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving riddle_qwen2.5-1.5B/pytorch_model.bin...
Done.


RuntimeError: Unsloth: Please supply a token!
Go to https://huggingface.co/settings/tokens

In [21]:
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')

In [22]:
model.push_to_hub_merged("Pramodith/riddle_qwen2.5-1.5B", tokenizer, save_method = "merged_16bit", token = hf_token)


Unsloth: You are pushing to hub, but you passed your HF username = Pramodith.
We shall truncate Pramodith/riddle_qwen2.5-1.5B to riddle_qwen2.5-1.5B


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 1.33 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 53.03it/s]


Unsloth: Saving tokenizer...

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

 Done.
Unsloth: Saving riddle_qwen2.5-1.5B/pytorch_model.bin...


README.md:   0%|          | 0.00/614 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/Pramodith/riddle_qwen2.5-1.5B


In [2]:
!pip install -qqq lighteval[math]

In [5]:
import lighteval
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters
from lighteval.utils.utils import EnvConfig
from lighteval.utils.imports import is_accelerate_available
from datetime import timedelta

In [6]:
is_accelerate_available()

True

In [11]:
if is_accelerate_available():
    from accelerate import Accelerator, InitProcessGroupKwargs
    accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=3000))])
else:
    accelerator = None

def run_benchmarks():
    evaluation_tracker = EvaluationTracker(
        output_dir="./results",
        save_details=True,
        push_to_hub=True,
        hub_results_org="Pramodith",
    )

    pipeline_params = PipelineParameters(
        launcher_type=ParallelismManager.ACCELERATE,
        env_config=EnvConfig(cache_dir="tmp/"),
        override_batch_size = 32
    )

    model_config = TransformersModelConfig(
            pretrained="Qwen/Qwen2.5-0.5B-Instruct",
            accelerator=accelerator,
            dtype="float16",
            use_chat_template=True,
    )

    task = "lighteval|gsm8k|0|1"

    pipeline = Pipeline(
        tasks=task,
        pipeline_parameters=pipeline_params,
        evaluation_tracker=evaluation_tracker,
        model_config=model_config,
    )

    pipeline.evaluate()
    # pipeline.save_and_push_results()
    pipeline.show_results()

In [12]:
run_benchmarks()

config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Splits:   0%|          | 0/2 [00:00<?, ?it/s]

Greedy generation:  31%|███       | 13/42 [03:28<07:44, 16.03s/it]
Splits:   0%|          | 0/2 [03:37<?, ?it/s]

KeyboardInterrupt: 
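
The task string and the progress bar can both be read off with a little arithmetic. The sketch below assumes the `suite|task|num_fewshot|truncate` layout for lighteval task specifications, where the last field is a 0/1 flag allowing the few-shot count to be reduced when the prompt is too long; `parse_task_spec` and its key names are hypothetical and not part of lighteval.

```python
import math


def parse_task_spec(spec: str) -> dict:
    """Split a 'suite|task|num_fewshot|truncate' string into named fields (assumed layout)."""
    suite, task, num_fewshot, truncate = spec.split("|")
    return {
        "suite": suite,
        "task": task,
        "num_fewshot": int(num_fewshot),
        "truncate_fewshot": truncate == "1",
    }


spec = parse_task_spec("lighteval|gsm8k|0|1")
print(spec)

# 1319 GSM8K test examples at override_batch_size = 32.
n_batches = math.ceil(1319 / 32)
print(n_batches)
```

The batch count agrees with the 42 steps shown in the progress bar, so the interrupted run had finished 13 of 42 batches, roughly 31% of the test set.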