Added model summary and risk assessment for commands that violate sandbox policy #5536
Conversation
@codex review
Codex Review: Didn't find any major issues. 👍
LGTM after @jif-oai's comments.
* Moved the prompt into its own file and switched it to use askama for templating
* Refactored the sandbox_retry_data trait for simplification
* Fixed otel telemetry so the assessment conversation doesn't appear as a new task
* Added an otel telemetry point for recording the latency of the assessment
* Removed defensive JSON parsing of the assessment response

Removed the new experimental config key from the public documentation for now. We're going to roll this out internally first to get feedback.
* Simplified config handling by leveraging the "features" mechanism
* Moved approvals-related schemas out of protocol.rs to simplify
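The approvals-related schema mentioned above could be sketched as a small set of Rust types. This is a hypothetical illustration, not the actual code from the PR; the names `RiskLevel` and `SandboxCommandAssessment` are assumptions based on the description.

```rust
// Hypothetical sketch of an approvals-related schema: a risk level the
// model assigns to a blocked command, plus a free-form category.
// These names are assumptions; the PR's actual types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskLevel {
    Low,
    Medium,
    High,
}

impl std::str::FromStr for RiskLevel {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "low" => Ok(RiskLevel::Low),
            "medium" => Ok(RiskLevel::Medium),
            "high" => Ok(RiskLevel::High),
            other => Err(format!("unknown risk level: {other}")),
        }
    }
}

// The assessment produced for a command that violated the sandbox policy.
#[derive(Debug)]
pub struct SandboxCommandAssessment {
    pub summary: String,
    pub risk_level: RiskLevel,
    pub risk_category: String, // e.g. "data deletion" or "data exfiltration"
}
```

Keeping the risk level as a closed enum while leaving the category free-form matches the description: the level drives UI emphasis, while the category is explanatory text.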
@codex review
Codex Review: Didn't find any major issues. Keep them coming!
This PR adds support for a model-based summary and risk assessment for commands that violate the sandbox policy and require user approval. This aids the user in evaluating whether the command should be approved.
The feature works by taking a failed command and passing it back to the model, asking it to summarize the command, assign a risk level (low, medium, or high), and assign a risk category (e.g. "data deletion" or "data exfiltration"). It uses a new conversation thread so that the context in the existing thread doesn't influence the answer. If the call to the model fails or takes longer than 5 seconds, it falls back to the current behavior.
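The timeout-with-fallback behavior described above can be sketched with a side thread and a channel. This is a minimal illustration using only the standard library; the function name and `Option<String>` result type are assumptions, not the PR's actual API (which presumably uses async).

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical sketch: run the assessment request off the main path and
// fall back to `None` (i.e. the existing approval prompt with no
// assessment) if it fails or exceeds the time budget.
fn assess_with_timeout<F>(request: F, budget: Duration) -> Option<String>
where
    F: FnOnce() -> Option<String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // A model-call error surfaces as `None`, which also takes the
        // fallback path; send errors (receiver dropped) are ignored.
        let _ = tx.send(request());
    });
    match rx.recv_timeout(budget) {
        Ok(result) => result, // assessment arrived within the budget
        Err(_) => None,       // timed out: fall back to current behavior
    }
}
```

The key property is that a slow or failing assessment can delay the approval prompt by at most the budget (5 seconds in the PR), never block it.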
For now, this is an experimental feature gated by the config key `experimental_sandbox_command_assessment`.

Here is a screenshot of the approval prompt showing the risk assessment and summary.
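Based on the key name in the description, enabling the gate would presumably look like the following config fragment; the exact file location and placement of the key are assumptions.

```toml
# Hypothetical entry in the Codex config file; the precise location of
# this key (top level vs. a features table) is an assumption.
experimental_sandbox_command_assessment = true
```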