Add E2 reporting metrics and refresh reports #58
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 21826e9641
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
if tool_style:
    primary_prefix = "kube-linter/" if (fixture_type or "k8s") == "k8s" else "semgrep/"
    oracle = [v for v in oracle if v.get("id", "").startswith(primary_prefix)]
    predicted = [v for v in predicted if v.get("id", "").startswith(primary_prefix)]
```
Avoid dropping non-prefixed predictions
When any oracle or prediction uses tool-style IDs, `tool_style` becomes true and this block filters predictions down to `kube-linter/` or `semgrep/`. The model output schema allows arbitrary string IDs, so it's valid (and common) for a model to return a bare `rule_id` without a tool prefix. In those cases the new filter silently discards the prediction, which can turn real false positives into "clean passes" and inflate precision/F1. Consider normalizing unprefixed IDs, or only filtering when both the oracle and the predictions are consistently tool-prefixed.
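One way the suggested normalization could look, as a minimal sketch: `_normalize_id` and `align_ids` are hypothetical names, not functions from this PR, and the surrounding variables are taken from the snippet above.

```python
def _normalize_id(violation_id: str, primary_prefix: str) -> str:
    """Hypothetical helper: attach the expected tool prefix to bare rule IDs."""
    if "/" in violation_id:
        return violation_id
    return f"{primary_prefix}{violation_id}"


def align_ids(oracle, predicted, fixture_type, tool_style):
    """Sketch of the reviewer's suggestion: normalize IDs instead of filtering,
    so bare rule_ids stay in the prediction set rather than being dropped."""
    if not tool_style:
        return oracle, predicted
    primary_prefix = "kube-linter/" if (fixture_type or "k8s") == "k8s" else "semgrep/"
    oracle = [{**v, "id": _normalize_id(v.get("id", ""), primary_prefix)} for v in oracle]
    predicted = [{**v, "id": _normalize_id(v.get("id", ""), primary_prefix)} for v in predicted]
    return oracle, predicted
```

With this shape, a prediction whose `id` is a bare rule name is scored against the prefixed oracle entry instead of being silently excluded, so precision/F1 reflects the model's actual output.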
## Summary
- add positive-only F1 and clean-pass metrics to E2 summaries and reports
- document new metrics and schema updates
- add batch report generator + skill docs
- tighten env-config verification scoring and adapters

## Testing
- make check
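The PR description does not spell out the metric definitions here; one plausible reading, assumed purely for illustration, is that positive-only F1 is computed only over fixtures whose oracle contains at least one violation, while a clean pass counts an oracle-clean fixture where the model also predicts nothing. A sketch under that assumption:

```python
def positive_only_f1(per_fixture):
    """Assumed definition: micro-averaged F1 over fixtures whose oracle has at
    least one violation; oracle-clean fixtures are excluded from this score.

    per_fixture: list of (oracle_ids, predicted_ids) pairs, each a set of IDs.
    """
    tp = fp = fn = 0
    for oracle_ids, predicted_ids in per_fixture:
        if not oracle_ids:
            continue  # positive-only: skip fixtures with an empty oracle
        tp += len(oracle_ids & predicted_ids)
        fp += len(predicted_ids - oracle_ids)
        fn += len(oracle_ids - predicted_ids)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def clean_pass_rate(per_fixture):
    """Assumed definition: share of oracle-clean fixtures where the model
    also predicts no violations."""
    clean = [(o, p) for o, p in per_fixture if not o]
    if not clean:
        return None
    return sum(1 for _, p in clean if not p) / len(clean)
```

Splitting the two views this way is consistent with the review comment above: dropped or unmatched predictions show up in positive-only F1, while clean-pass rate isolates how often the model stays quiet on fixtures that should be quiet.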