[GRPO][Eval] Add letter counting eval #1574
Conversation
    ):
        count += 1

    return EvaluationResult(
after the recent change, just return {"count_letters": {"accuracy": count / total}}
Should just be {"accuracy": count / total}, right?
@wizeng23 @kaisopos , I think it would be nice to have a default behavior for this / future custom evals to (1) utilize standard metrics (most of the common ones are available in sklearn) and (2) return both balanced and unbalanced metrics by default (balanced accuracy, certain kinds of F1 scores are reasonable choices in most cases). You could group by letter count or by word difficulty (this is already done for BerryBench) for the balancing. Also for this particular eval, it makes sense to also report distance-sensitive scores; cosine similarity and MAE seem like reasonable choices.
Thanks for the feedback Ben! I plan to do a follow-up PR to support BerryBench, but started with this dataset since we already had a working dataset class for it. Wanted to get this skeleton checked in first then check with you on the specifics of the eval for a future PR.
Utility functions to generate common metrics could be useful for folks writing custom evals, but IMO we may not want to be too opinionated on custom evals, so users have flexibility in how they run them. We could also wait for a few more custom eval examples to be checked in to see if there are common patterns among them. I'll mostly leave it up to Kostas though for custom evals. For the letter counting eval though, happy to do whatever you think is most useful!
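A minimal sketch of a helper along the lines Ben suggests, assuming per-example integer letter counts are available as `y_true`/`y_pred` (the function name and metric keys are illustrative, not part of the Oumi API):

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, mean_absolute_error

def standard_count_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Balanced/unbalanced metrics plus a distance-sensitive score for letter counts."""
    return {
        # Unbalanced: plain fraction of exact matches.
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        # Balanced: averages recall per true count, so rare counts aren't drowned out.
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        # Macro F1 treats each distinct count value as its own class.
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        # Distance-sensitive: being off by 1 is penalized less than being off by 5.
        "mae": mean_absolute_error(y_true, y_pred),
    }
```

This sketch assumes every prediction was successfully parsed; handling of unparseable responses is discussed further down in the thread.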
    # Config to eval an LLM's ability to count letters in words.
    #
    # Requirements:
    #   - Log into WandB (`wandb login`) or disable `enable_wandb`
It is disabled by default, right? So maybe you should say something like:
"If you want to use WandB, log in (`wandb login`) and enable it with `enable_wandb`."
I blindly copied that; it shouldn't be there, good catch! I did log a bug for wandb support for custom evals: https://linear.app/oumi/issue/OPE-1173
    -        return self.transform_grpo_example(sample)
    +        return self._transform_grpo_example(sample)

         def conversation(self, idx: int) -> Conversation:
`conversation` and `conversations` seem very generic.
How come they are not defined in `BaseMapDataset`?
My take is that `BaseMapDataset` is a much more generic base class, which is why those functions were defined in `BaseSftDataset`. I'm copying these functions over almost line-for-line as the GRPO dataset class needs similar behavior. I'll leave it up to Oussama if it makes sense to move it into `BaseMapDataset` or not.
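Purely for illustration, the duplicated helper pattern looks roughly like the sketch below; the class, field, and transform names are hypothetical stand-ins, not the actual Oumi class hierarchy.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Conversation:
    """Stand-in for oumi's Conversation type."""
    messages: list[dict[str, Any]]

class GrpoDatasetSketch:
    """Illustrative map-style dataset exposing conversation/conversations helpers."""

    def __init__(self, rows: list[dict[str, Any]]):
        self._rows = rows

    def transform_conversation(self, row: dict[str, Any]) -> Conversation:
        # Hypothetical per-row transform; the real classes build oumi Messages.
        return Conversation(messages=[{"role": "user", "content": row["prompt"]}])

    def conversation(self, idx: int) -> Conversation:
        """Returns the conversation at the given index."""
        return self.transform_conversation(self._rows[idx])

    def conversations(self) -> list[Conversation]:
        """Returns all conversations in the dataset."""
        return [self.conversation(i) for i in range(len(self._rows))]
```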
| "function, using the decorator `@register_evaluation_function`." | ||
| ) | ||
| # Import to ensure custom evaluation functions are added to REGISTRY. | ||
| import oumi.evaluation.registry as evaluation_registry # noqa: F401 |
How come this is not at the top of the file together with all the other imports?
Does this take too much time and we want delayed initialization or something like that?
Is there any other reason? (It's weird to have an import inside a private function.)
This is a pattern we follow for other registry imports, e.g. oumi/src/oumi/builders/rewards.py (line 26 in b66ff54).
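A self-contained sketch of that registry pattern, modeled on the names visible in the snippet above (`register_evaluation_function`, `REGISTRY`); everything else here is an illustrative stand-in rather than Oumi's actual internals:

```python
from typing import Callable

# Hypothetical stand-ins for the registry internals.
REGISTRY: dict[str, Callable] = {}

def register_evaluation_function(name: str) -> Callable:
    """Decorator that records a custom eval function in REGISTRY under `name`."""
    def decorator(fn: Callable) -> Callable:
        REGISTRY[name] = fn
        return fn
    return decorator

@register_evaluation_function("count_letters")
def count_letters_eval(results: list) -> dict:
    """Placeholder eval body, registered as a side effect of being decorated."""
    return {"accuracy": 0.0}

def _get_evaluation_function(name: str) -> Callable:
    # In the real code this is where the module defining the decorated
    # functions is imported lazily, e.g.:
    #   import oumi.evaluation.registry as evaluation_registry  # noqa: F401
    # The import's side effect is that the decorators run and fill REGISTRY;
    # deferring it until first use avoids circular imports at module load time.
    return REGISTRY[name]

print(_get_evaluation_function("count_letters"))  # <function count_letters_eval ...>
```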
    def _extract_prediction(response: str) -> Optional[int]:
        r"""Returns the numeric answer extracted from `\boxed{...}`, or returns None."""
Do you need this `r` at the beginning of the docstring?
super super nit: "Returns A or B" (instead of "Returns A, or returns B")
Yes, it's needed whenever there's a backslash (\) in the text. Addressed the second point!
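Concretely, without the `r` prefix Python treats `\b` as a backspace escape, so the docstring would silently contain a control character instead of the literal `\boxed` text. A tiny check:

```python
plain = "Extracted from `\boxed{...}`"   # "\b" becomes a backspace character
raw = r"Extracted from `\boxed{...}`"    # backslash is kept literally

assert "\\boxed" not in plain  # the literal backslash is gone
assert "\\boxed" in raw        # the raw string preserves it
```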
        number_str = regex_result[0]
        # Except clause shouldn't trigger because the regex should only find ints.
        try:
            return int(number_str)
Does this work if `number_str` is a bool?
Does `int(False)` throw or is it 0?
I would double-check this works for corner cases.
AFAIK line 32 should always work, but I put the except clause there as a precaution. This is because the regex explicitly grabs only a sequence of numeric digits from the string, which should always convert to an int.
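For context, a self-contained sketch of that extraction logic with an assumed regex (the exact pattern in the PR may differ); note that `re.findall` returns strings here, never bools, so the `int(False)` corner case can't arise:

```python
import re
from typing import Optional

# Assumed pattern: the digits inside a \boxed{...} marker.
_BOXED_RE = re.compile(r"\\boxed\{(\d+)\}")

def _extract_prediction(response: str) -> Optional[int]:
    r"""Returns the numeric answer extracted from `\boxed{...}`, or None."""
    regex_result = _BOXED_RE.findall(response)
    if not regex_result:
        return None
    number_str = regex_result[0]
    # Except clause shouldn't trigger: the capture group only matches digits.
    try:
        return int(number_str)
    except ValueError:
        return None

print(_extract_prediction(r"The letter appears \boxed{3} times."))  # 3
```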
    ):
        count += 1

    return {"accuracy": count / total}
Very debatable but just FYI:
If your letter counter fails to produce a prediction (prediction is None), then you count this as a model failure, while it could be a prompt failure (your prompt may work better for some models than others).
An alternative (not sure which solution is more fair, maybe check with Jeremy too): if prediction is None, ignore it (count does not increase, but total is also decremented).
I think there's value in both. While part of the responsibility is on the user to prompt the model well, some models may not be good enough at instruction following to properly format the example, which is a different type of failure than the failure to count letters.
I now log both metrics in addition to some other useful numbers. Feedback on the names welcome!
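A hedged sketch of reporting both framings plus a parse rate; the metric names are illustrative, not necessarily the ones used in the PR:

```python
from typing import Optional

def letter_count_metrics(
    predictions: list[Optional[int]], targets: list[int]
) -> dict[str, float]:
    """Scores predictions, reporting both treatments of unparseable responses."""
    total = len(targets)
    parsed = [(p, t) for p, t in zip(predictions, targets) if p is not None]
    correct = sum(1 for p, t in parsed if p == t)
    return {
        # Unparseable responses count against the model.
        "accuracy": correct / total if total else 0.0,
        # Unparseable responses are dropped from the denominator.
        "accuracy_of_parsed": correct / len(parsed) if parsed else 0.0,
        # How often the model produced an answer in the expected format at all.
        "parse_rate": len(parsed) / total if total else 0.0,
    }
```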
Description
- `src/oumi/evaluation/registry` directory for holding registered eval fns
- Tested on GCP.
  - `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`: 74.9%
  - `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`: 93.4%

Future work
Related issues
Towards OPE-1122