chore: add tool call eval example #225
Conversation
Warning: Rate limit exceeded

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 36 minutes and 41 seconds.

⌛ How to resolve this issue?
After the wait time has elapsed, a review can be triggered using the `@coderabbitai review` command. We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?
CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.
Walkthrough

Introduces a new "Tool Call Evaluation" example consisting of documentation, configuration, and evaluation dataset files. Defines the `custom:tool_eval` metric in system.yaml and the expected tool call data in eval_data.yaml.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks: ✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
@coderabbitai review
✅ Actions performed
Review triggered.
Actionable comments posted: 2
🧹 Nitpick comments (1)
examples/02_metrics/tool_evaluation/system.yaml (1)
8-13: Consider `threshold: 0.5` for the binary `custom:tool_eval` metric.

Tool eval emits exactly `0.0` or `1.0`. A `0.5` threshold cleanly partitions fail vs. pass and is the convention used elsewhere in the project for this binary metric. `threshold: 1` works only if the comparison is `>=`; if it ever changes to strict `>`, a passing `1.0` would be misclassified.

♻️ Proposed change
```diff
 custom:tool_eval:
-  threshold: 1
+  threshold: 0.5
   description: Binary validation of tool calls (exact/regex matching)
```

Based on learnings: "For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/02_metrics/tool_evaluation/system.yaml` around lines 8-13, change the binary metric configuration for custom:tool_eval by setting its threshold from 1 to 0.5 so that tool_eval scores of 0.0 are treated as fail and 1.0 as pass; update the threshold field under the custom:tool_eval block (replace threshold: 1 with threshold: 0.5) to match the project convention for binary metrics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/02_metrics/tool_evaluation/eval_data.yaml`:
- Around line 82-90: The example YAML currently nests the two tool approaches as
a single alternative sequence, causing evaluate_tool_calls to treat them as one
2-step expectation; fix by making each approach its own top-level alternative in
expected_tool_calls so the parser yields two alternatives (e.g., outer list
contains two entries, each a list of step-lists), ensuring evaluate_tool_calls
(src/lightspeed_evaluation/core/metrics/custom/tool_eval.py) can match either
[{kubectl_get}] OR [{oc_get}] when full_match/ordered are true.
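A minimal sketch of the two-alternative layout described above; the `tool_name`/`arguments` field names and values are illustrative placeholders, not copied from the actual eval_data.yaml:

```yaml
expected_tool_calls:
  -                              # alternative 1: kubectl approach
    -                            # step 1 of this alternative
      - tool_name: kubectl_get   # placeholder key names and arguments
        arguments:
          resource: pods
  -                              # alternative 2: oc approach
    -                            # step 1 of this alternative
      - tool_name: oc_get        # placeholder key names and arguments
        arguments:
          resource: pods
```

With this shape, evaluate_tool_calls sees two single-step alternatives and can pass on either the kubectl_get or the oc_get call even when full_match/ordered are true.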
In `@examples/02_metrics/tool_evaluation/README.md`:
- Around line 30-35: The expected_tool_calls YAML alternatives are missing one
nesting level; the matcher/validator expects each alternative as a
sequence-of-steps (list[list[dict]]), but the current examples provide
list[dict]; update the expected_tool_calls values so each alternative is wrapped
in an additional list (e.g., change entries under expected_tool_calls to
[[{...}]] instead of [{...}]) so the structure matches the list[list[dict]]
schema.
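A minimal sketch of the extra nesting level this asks for, using placeholder field names that may not match the README's actual keys:

```yaml
expected_tool_calls:
  # wrong: alternative provided as list[dict]
  # - - tool_name: kubectl_get
  #     arguments: {resource: pods}
  # right: alternative wrapped in one more list -> list[list[dict]]
  - - - tool_name: kubectl_get
        arguments: {resource: pods}
```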
---
Nitpick comments:
In `@examples/02_metrics/tool_evaluation/system.yaml`:
- Around line 8-13: Change the binary metric configuration for custom:tool_eval
by setting its threshold from 1 to 0.5 so that tool_eval scores of 0.0 are
treated as fail and 1.0 as pass; update the threshold field under the
custom:tool_eval block (replace threshold: 1 with threshold: 0.5) to match the
project convention for binary metrics.
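For reference, the adjusted block from the proposed diff would read roughly as follows; any surrounding keys in system.yaml are omitted here:

```yaml
custom:tool_eval:
  threshold: 0.5    # binary metric: 0.0 scores fail, 1.0 scores pass
  description: Binary validation of tool calls (exact/regex matching)
```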
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6f22a458-7153-4a10-af5d-9e6d0127bf6c
📒 Files selected for processing (3)
examples/02_metrics/tool_evaluation/README.md
examples/02_metrics/tool_evaluation/eval_data.yaml
examples/02_metrics/tool_evaluation/system.yaml
7faf0f3 to 28775ce (Compare)
Description
chore: add tool call eval example
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit