[GRPO][Eval] Add letter counting eval #1574
Conversation
    ):
        count += 1

    return EvaluationResult(
after the recent change, just return {"count_letters": {"accuracy": count / total}}
Should just be {"accuracy": count / total}, right?
@wizeng23 @kaisopos , I think it would be nice to have a default behavior for this / future custom evals to (1) utilize standard metrics (most of the common ones are available in sklearn) and (2) return both balanced and unbalanced metrics by default (balanced accuracy, certain kinds of F1 scores are reasonable choices in most cases). You could group by letter count or by word difficulty (this is already done for BerryBench) for the balancing. Also for this particular eval, it makes sense to also report distance-sensitive scores; cosine similarity and MAE seem like reasonable choices.
Thanks for the feedback Ben! I plan to do a follow-up PR to support BerryBench, but started with this dataset since we already had a working dataset class for it. Wanted to get this skeleton checked in first then check with you on the specifics of the eval for a future PR.
Utility functions to generate common metrics could be useful for folks writing custom evals, but IMO we may not want to be too opinionated on custom evals, so users have flexibility in how they run them. We could also wait for a few more custom eval examples to be checked in to see if there are common patterns among them. I'll mostly leave it up to Kostas though for custom evals. For the letter counting eval though, happy to do whatever you think is most useful!
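A minimal sketch of a helper along the lines Ben suggests, assuming per-example integer letter counts are available as `y_true`/`y_pred` (the function name and metric keys are illustrative, not part of the Oumi API):

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, mean_absolute_error

def standard_count_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Balanced/unbalanced metrics plus a distance-sensitive score for letter counts."""
    return {
        # Unbalanced: plain fraction of exact matches.
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        # Balanced: averages recall per true count, so rare counts aren't drowned out.
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        # Macro F1 treats each distinct count value as its own class.
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        # Distance-sensitive: being off by 1 is penalized less than being off by 5.
        "mae": mean_absolute_error(y_true, y_pred),
    }
```

This sketch assumes every prediction was successfully parsed; handling of unparseable responses is discussed further down in the thread.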
    # Config to eval an LLM's ability to count letters in words.
    #
    # Requirements:
    #   - Log into WandB (`wandb login`) or disable `enable_wandb`
It is disabled by default, right? So maybe you should say something like:
"If you want to use WandB, log in (`wandb login`) and enable it with `enable_wandb`."
I blindly copied that; it shouldn't be there, good catch! I did log a bug for wandb support for custom evals: https://linear.app/oumi/issue/OPE-1173
    -        return self.transform_grpo_example(sample)
    +        return self._transform_grpo_example(sample)

         def conversation(self, idx: int) -> Conversation:
`conversation` and `conversations` seem very generic.
How come they are not defined in `BaseMapDataset`?
My take is that `BaseMapDataset` is a much more generic base class, which is why those functions were defined in `BaseSftDataset`. I'm copying these functions over almost line-for-line as the GRPO dataset class needs similar behavior. I'll leave it up to Oussama if it makes sense to move it into `BaseMapDataset` or not.
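Purely for illustration, the duplicated helper pattern looks roughly like the sketch below; the class, field, and transform names are hypothetical stand-ins, not the actual Oumi class hierarchy.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Conversation:
    """Stand-in for oumi's Conversation type."""
    messages: list[dict[str, Any]]

class GrpoDatasetSketch:
    """Illustrative map-style dataset exposing conversation/conversations helpers."""

    def __init__(self, rows: list[dict[str, Any]]):
        self._rows = rows

    def transform_conversation(self, row: dict[str, Any]) -> Conversation:
        # Hypothetical per-row transform; the real classes build oumi Messages.
        return Conversation(messages=[{"role": "user", "content": row["prompt"]}])

    def conversation(self, idx: int) -> Conversation:
        """Returns the conversation at the given index."""
        return self.transform_conversation(self._rows[idx])

    def conversations(self) -> list[Conversation]:
        """Returns all conversations in the dataset."""
        return [self.conversation(i) for i in range(len(self._rows))]
```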
| "function, using the decorator `@register_evaluation_function`." | ||
| ) | ||
| # Import to ensure custom evaluation functions are added to REGISTRY. | ||
| import oumi.evaluation.registry as evaluation_registry # noqa: F401 |
How come this is not at the top of the file together with all the other imports?
Does this take too much time and we want delayed initialization or something like that?
Is there any other reason? (It's weird to have an import inside a private function.)
This is a pattern we follow for other registry imports, e.g. oumi/src/oumi/builders/rewards.py (line 26 in b66ff54).
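A self-contained sketch of that registry pattern, modeled on the names visible in the snippet above (`register_evaluation_function`, `REGISTRY`); everything else here is an illustrative stand-in rather than Oumi's actual internals:

```python
from typing import Callable

# Hypothetical stand-ins for the registry internals.
REGISTRY: dict[str, Callable] = {}

def register_evaluation_function(name: str) -> Callable:
    """Decorator that records a custom eval function in REGISTRY under `name`."""
    def decorator(fn: Callable) -> Callable:
        REGISTRY[name] = fn
        return fn
    return decorator

@register_evaluation_function("count_letters")
def count_letters_eval(results: list) -> dict:
    """Placeholder eval body, registered as a side effect of being decorated."""
    return {"accuracy": 0.0}

def _get_evaluation_function(name: str) -> Callable:
    # In the real code this is where the module defining the decorated
    # functions is imported lazily, e.g.:
    #   import oumi.evaluation.registry as evaluation_registry  # noqa: F401
    # The import's side effect is that the decorators run and fill REGISTRY;
    # deferring it until first use avoids circular imports at module load time.
    return REGISTRY[name]

print(_get_evaluation_function("count_letters"))  # <function count_letters_eval ...>
```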
    def _extract_prediction(response: str) -> Optional[int]:
        r"""Returns the numeric answer extracted from `\boxed{...}`, or returns None."""
Do you need this `r` at the beginning of the docstring?
super super nit: "Returns A or B" (instead of "Returns A, or returns B")
Yes, it's needed whenever there's a backslash (\) in the text. Addressed the second point!
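Concretely, without the `r` prefix Python treats `\b` as a backspace escape, so the docstring would silently contain a control character instead of the literal `\boxed` text. A tiny check:

```python
plain = "Extracted from `\boxed{...}`"   # "\b" becomes a backspace character
raw = r"Extracted from `\boxed{...}`"    # backslash is kept literally

assert "\\boxed" not in plain  # the literal backslash is gone
assert "\\boxed" in raw        # the raw string preserves it
```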
        number_str = regex_result[0]
        # Except clause shouldn't trigger because the regex should only find ints.
        try:
            return int(number_str)
Does this work if `number_str` is a bool?
Does `int(False)` throw or is it 0?
I would double-check this works for corner cases.
AFAIK line 32 should always work, but I put the except clause there as a precaution. This is because the regex explicitly grabs only a sequence of numeric digits from the string, which should always convert to an int.
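For context, a self-contained sketch of that extraction logic with an assumed regex (the exact pattern in the PR may differ); note that `re.findall` returns strings here, never bools, so the `int(False)` corner case can't arise:

```python
import re
from typing import Optional

# Assumed pattern: the digits inside a \boxed{...} marker.
_BOXED_RE = re.compile(r"\\boxed\{(\d+)\}")

def _extract_prediction(response: str) -> Optional[int]:
    r"""Returns the numeric answer extracted from `\boxed{...}`, or None."""
    regex_result = _BOXED_RE.findall(response)
    if not regex_result:
        return None
    number_str = regex_result[0]
    # Except clause shouldn't trigger: the capture group only matches digits.
    try:
        return int(number_str)
    except ValueError:
        return None

print(_extract_prediction(r"The letter appears \boxed{3} times."))  # 3
```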
    ):
        count += 1

    return {"accuracy": count / total}
Very debatable but just FYI:
If your letter counter fails to produce a prediction (prediction is None), then you count this as a model failure, while it could be a prompt failure (your prompt may work better for some models than others).
An alternative (not sure which solution is more fair, maybe check with Jeremy too): if prediction is None, ignore it (count does not increase, but total is also decremented).
I think there's value in both. While part of the responsibility is on the user to prompt the model well, some models may not be good enough at instruction following to properly format the example, which is a different type of failure than the failure to count letters.
I now log both metrics in addition to some other useful numbers. Feedback on the names welcome!
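A hedged sketch of reporting both framings plus a parse rate; the metric names are illustrative, not necessarily the ones used in the PR:

```python
from typing import Optional

def letter_count_metrics(
    predictions: list[Optional[int]], targets: list[int]
) -> dict[str, float]:
    """Scores predictions, reporting both treatments of unparseable responses."""
    total = len(targets)
    parsed = [(p, t) for p, t in zip(predictions, targets) if p is not None]
    correct = sum(1 for p, t in parsed if p == t)
    return {
        # Unparseable responses count against the model.
        "accuracy": correct / total if total else 0.0,
        # Unparseable responses are dropped from the denominator.
        "accuracy_of_parsed": correct / len(parsed) if parsed else 0.0,
        # How often the model produced an answer in the expected format at all.
        "parse_rate": len(parsed) / total if total else 0.0,
    }
```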
Description
- `src/oumi/evaluation/registry` directory for holding registered eval fns
- Tested on GCP.
  - `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`: 74.9%
  - `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`: 93.4%

Future work
Related issues
Towards OPE-1122