Dev/steven/jb eval #55
Conversation
steven10a commented on Nov 18, 2025
- Updating docs with the new multi-turn results
- Also fixed an error in the roc_auc graphic generation that was causing its values to differ from those in the tables
Pull Request Overview
This PR fixes a bug in ROC AUC calculation and updates benchmark documentation with new multi-turn evaluation results. The key fix changes the ROC AUC computation from using np.trapz to the correct roc_auc_score function, which was causing discrepancies between the visualizations and reported metrics.
- Corrected ROC AUC calculation method in visualization code
- Refactored data extraction to use confidence scores instead of binary values
- Updated benchmark metrics and latency results in documentation
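The np.trapz-to-roc_auc_score change can be pictured with a minimal sketch (assuming scikit-learn; the data below is made up and the variable names are illustrative, not taken from visualizer.py):

```python
# Hedged sketch of the fix described above (assumed names, not the repo's code):
# compute ROC AUC from labels and scores with sklearn.metrics.roc_auc_score
# rather than integrating a hand-built curve with np.trapz, which depends on
# how the points happen to be ordered.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]                      # 1 = sample is a jailbreak
y_scores = [0.10, 0.40, 0.35, 0.80, 0.05, 0.90]  # guardrail confidence per sample

auc = roc_auc_score(y_true, y_scores)                # thresholding and ordering handled internally
fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # still available for plotting the curve
print(f"ROC AUC = {auc:.3f}")                        # 0.889 for these made-up scores
```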
Reviewed Changes
Copilot reviewed 2 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/guardrails/evals/core/visualizer.py | Fixed ROC AUC calculation bug and improved score extraction to use confidence values |
| docs/ref/checks/jailbreak.md | Updated benchmark metrics, latency data, and ROC curve image reference |
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
In src/guardrails/evals/core/visualizer.py:

```diff
         y_scores = []

         for result in results:
-            if guardrail_name in result.expected_triggers:
-                expected = result.expected_triggers[guardrail_name]
-                actual = result.triggered.get(guardrail_name, False)
-                y_true.append(1 if expected else 0)
-                y_scores.append(1 if actual else 0)
+            if guardrail_name not in result.expected_triggers:
+                logger.warning("Guardrail '%s' not found in expected_triggers for sample %s", guardrail_name, result.id)
+                continue
+
+            expected = result.expected_triggers[guardrail_name]
+            y_true.append(1 if expected else 0)
+            y_scores.append(self._get_confidence_score(result, guardrail_name))

         return y_true, y_scores

+    def _get_confidence_score(self, result: Any, guardrail_name: str) -> float:
+        """Extract the model-reported confidence score for plotting."""
+        if guardrail_name in result.details:
+            guardrail_details = result.details[guardrail_name]
+            if isinstance(guardrail_details, dict) and "confidence" in guardrail_details:
+                return float(guardrail_details["confidence"])
+
+        return 1.0 if result.triggered.get(guardrail_name, False) else 0.0
```
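A small self-contained sketch of how the new extraction behaves (FakeResult below is a hypothetical stand-in; its attribute names mirror those used in the diff, not a confirmed schema):

```python
# Hypothetical stand-in for an eval result; attribute names mirror the diff
# (expected_triggers, triggered, details) but this is not the repo's real class.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeResult:
    id: str
    expected_triggers: dict[str, bool]
    triggered: dict[str, bool]
    details: dict[str, Any] = field(default_factory=dict)


def get_confidence_score(result: FakeResult, guardrail_name: str) -> float:
    """Mirror of the diff's fallback logic: prefer a reported confidence,
    otherwise fall back to the binary trigger decision."""
    guardrail_details = result.details.get(guardrail_name)
    if isinstance(guardrail_details, dict) and "confidence" in guardrail_details:
        return float(guardrail_details["confidence"])
    return 1.0 if result.triggered.get(guardrail_name, False) else 0.0


# With a reported confidence, the score is continuous ...
r1 = FakeResult("s1", {"jailbreak": True}, {"jailbreak": True}, {"jailbreak": {"confidence": 0.92}})
print(get_confidence_score(r1, "jailbreak"))  # 0.92

# ... without one, it degrades to the old binary behaviour.
r2 = FakeResult("s2", {"jailbreak": False}, {"jailbreak": False})
print(get_confidence_score(r2, "jailbreak"))  # 0.0
```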
Misinterpreting guardrail confidence as positive score
The new _extract_roc_data now feeds _get_confidence_score directly into the ROC calculator, so benign samples that the guardrail confidently classified as safe (e.g., flagged=False, confidence=0.95 in tests/unit/checks/test_jailbreak.py lines 270‑282) are now given a score of 0.95 and ranked as if they were very likely jailbreaks. The confidence field represents certainty in the decision that was made, not probability that the sample is positive, so ignoring result.triggered in _get_confidence_score inverts the ranking for the majority of benign data. This change makes the ROC curve and all downstream metrics grossly misleading because negative samples with high confidence are now interpreted as highly positive. The previous implementation at least plotted the actual binary decisions; after this commit the visualizations are outright wrong unless the score is conditioned on whether the guardrail fired (e.g., use confidence when triggered and 1 - confidence when not).
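For reference, the conditioning Codex proposes would look roughly like the sketch below (a hypothetical helper, not repo code); the author's reply further down disputes its premise about what the confidence field means.

```python
# Sketch of Codex's suggested conditioning: treat `confidence` as certainty in
# the decision that was made, so a confidently-safe verdict maps to a low
# positive score instead of a high one.
def decision_conditioned_score(triggered: bool, confidence: float) -> float:
    return confidence if triggered else 1.0 - confidence


print(decision_conditioned_score(False, 0.95))  # 0.05 -- confidently benign
print(decision_conditioned_score(True, 0.95))   # 0.95 -- confidently a jailbreak
```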
"confidence" may be a poor name for it, but our model returns 0-1 with 0 being not a jailbreak and 1 being a jailbreak. It is confidence that the content is a jailbreak. So this logic from codex is not correct.