Add E2 reporting metrics and refresh reports#58

Merged
intertwine merged 1 commit into main from codex/metrics-report-refresh
Feb 4, 2026

@intertwine
Owner

## Summary
- add positive-only F1 and clean-pass metrics to E2 summaries and reports
- document new metrics and schema updates
- add batch report generator + skill docs
- tighten env-config verification scoring and adapters

## Testing
- make check

@claude

claude Bot commented Feb 4, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21826e9641

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread bench/report.py
Comment on lines +395 to +398
if tool_style:
    primary_prefix = "kube-linter/" if (fixture_type or "k8s") == "k8s" else "semgrep/"
    oracle = [v for v in oracle if v.get("id", "").startswith(primary_prefix)]
    predicted = [v for v in predicted if v.get("id", "").startswith(primary_prefix)]

P2: Avoid dropping non-prefixed predictions

When any oracle/prediction uses tool-style IDs, tool_style becomes true and this block filters predictions to kube-linter/ or semgrep/. The model output schema allows arbitrary string IDs, so it’s valid (and common) for a model to return a bare rule_id without a tool prefix. In those cases the new filter silently discards the prediction, which can turn real false positives into “clean passes” and inflate precision/F1. Consider normalizing unprefixed IDs or only filtering when both oracle and predictions are consistently tool-prefixed.
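
One possible shape for the first option (normalizing unprefixed IDs before filtering) is sketched below. It assumes violations are dicts with an `id` key, as in the snippet above, and that an ID without a `/` is a bare rule ID. `primary_prefix_for`, `normalize_id`, and `filter_primary` are hypothetical helper names used for illustration, not functions in `bench/report.py`, and `rule-a` / `rule-b` are placeholder rule IDs.

```python
from typing import Any, Dict, List, Optional, Tuple

Violation = Dict[str, Any]


def primary_prefix_for(fixture_type: Optional[str]) -> str:
    # Same prefix choice as the reviewed snippet.
    return "kube-linter/" if (fixture_type or "k8s") == "k8s" else "semgrep/"


def normalize_id(raw_id: str, prefix: str) -> str:
    # Assumption: an ID with no "/" is a bare rule ID that belongs to the
    # primary tool. IDs that already carry a tool prefix are left unchanged.
    return raw_id if "/" in raw_id else prefix + raw_id


def filter_primary(
    oracle: List[Violation],
    predicted: List[Violation],
    fixture_type: Optional[str] = None,
) -> Tuple[List[Violation], List[Violation]]:
    prefix = primary_prefix_for(fixture_type)

    def keep(violations: List[Violation]) -> List[Violation]:
        # Normalize first, then filter, so bare IDs are not silently dropped.
        normalized = [
            {**v, "id": normalize_id(v.get("id", ""), prefix)} for v in violations
        ]
        return [v for v in normalized if v["id"].startswith(prefix)]

    return keep(oracle), keep(predicted)


oracle = [{"id": "kube-linter/rule-a"}]
predicted = [{"id": "rule-a"}, {"id": "rule-b"}, {"id": "semgrep/rule-c"}]
kept_oracle, kept_predicted = filter_primary(oracle, predicted, "k8s")
```

With this approach the bare `rule-b` prediction survives the filter and still counts as a false positive, while the `semgrep/`-prefixed prediction is excluded as before. Whether every bare ID should be attributed to the primary tool is a judgment call for the maintainers.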

Useful? React with 👍 / 👎.

@intertwine intertwine merged commit f60a411 into main Feb 4, 2026
7 checks passed
@intertwine intertwine deleted the codex/metrics-report-refresh branch February 4, 2026 02:05
