NO-JIRA: Add find-token skill and custom verification in eval framework #22
|
|
||
| # Load the schema from test_cases.yaml so it's defined in one place. | ||
| _CASES = yaml.safe_load((Path(__file__).parent / "test_cases.yaml").read_text()) | ||
| SCHEMA: dict[str, Any] = _CASES[0]["schema"] |
declared the schema, but forgot to use it below.
Removed in 40f89d1 — the framework already validates via jsonschema.validate() in test_eval.py before verify functions run, so it was redundant.
|
|
||
| ## Adding a New Skill | ||
|
|
||
| Two things are needed: eval definitions (what to test) and a workspace symlink (so the agent can access the skill inside the container). |
Do we no longer need the symlinks? If we do still need them, can we preserve this advice to create them when onboarding a new skill somewhere in this evals/README.md?
Yes, symlinks are still needed. Restored the instructions in f33d279 — each eval skill needs a symlink under evals/workspace/skills/ pointing to the actual skill directory. run.sh dereferences them when building the container workspace.
Symlinks are still needed. The instructions are preserved in the current version under "Adding a New Skill Eval" — each eval skill needs a symlink under evals/workspace/skills/ pointing to the actual skill directory.
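As an illustration of the symlink step discussed above, here is a minimal Python sketch run against a temporary directory. Only the `evals/workspace/skills/` location comes from this thread; the `skills/find-token` source path and the helper names `link_skill` and `build_workspace` are hypothetical, and the dereferencing copy only stands in for whatever run.sh actually does.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_skill(skill_dir: Path, workspace_skills: Path) -> Path:
    """Create workspace_skills/<skill name> as a symlink to skill_dir."""
    workspace_skills.mkdir(parents=True, exist_ok=True)
    link = workspace_skills / skill_dir.name
    if link.is_symlink():
        link.unlink()
    # A relative target keeps the link valid if the repository is moved.
    target = os.path.relpath(skill_dir, start=workspace_skills)
    link.symlink_to(target, target_is_directory=True)
    return link


def build_workspace(workspace: Path, dest: Path) -> None:
    """Copy the workspace, replacing symlinks with the files they point to."""
    # symlinks=False (the default) makes copytree follow links while copying.
    shutil.copytree(workspace, dest, symlinks=False)


# Hypothetical layout inside a throwaway directory.
root = Path(tempfile.mkdtemp())
skill = root / "skills" / "find-token"
skill.mkdir(parents=True)
(skill / "SKILL.md").write_text("name: find-token\n")

link = link_skill(skill, root / "evals" / "workspace" / "skills")
build_workspace(root / "evals" / "workspace", root / "container")
copied = root / "container" / "skills" / "find-token"
```

After the copy, `copied` is a real directory rather than a link, which is what lets a container see the skill files without access to the link target.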
|
/override ci/prow/eval |
|
@harche: Overrode contexts on behalf of harche: ci/prow/eval, ci/prow/images
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/retest ci/prow/images |
|
/test images |
|
Tested locally with Claude Opus 4.6 — both find-token test cases pass:
|
|
/jira no-jira |
|
@harche: This PR has been marked as verified by
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@harche: This pull request explicitly references no jira issue.
|
@Cali0707 can you take a look? |
| --- | ||
| name: find-token | ||
| description: Find the hidden verification token. Run the find-token script to retrieve a unique token. | ||
| allowed-tools: Bash(bash:*) |
| allowed-tools: Bash(bash:*) | |
| allowed-tools: Bash |
This was picked up by the skill scanner. IMO, since we already seem to allow executing any bash command, we should just simplify this to Bash.
Done — simplified to allowed-tools: Bash in the squashed commit.
|
|
||
| The script returns JSON with a unique token: | ||
| ```json | ||
| {"token": "TOKEN_..."} |
I don't think this is the output format? Maybe we can just not mention the format in the skill? Or document the real format (not sure that is necessary; Claude seems good at understanding a JSON blob).
Good catch. Removed the incorrect {"token": "TOKEN_..."} example. The script outputs a complex structured JSON blob, so rather than documenting the full format, the description now just says: "The script returns JSON with verification tokens embedded in a structured analysis response."
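Since the real output format is deliberately left undocumented here, the following is only a sketch of how a consumer could pull tokens out of an arbitrarily nested JSON blob without relying on its exact shape. The `TOKEN_` prefix is borrowed from the removed example, and the sample document, `iter_strings`, and `find_tokens` are all invented for illustration; none of them reflect the script's actual output.

```python
import json
from typing import Any, Iterator


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value found anywhere in a decoded JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_tokens(raw_json: str, prefix: str) -> list[str]:
    """Return all string values that start with the given prefix."""
    return [s for s in iter_strings(json.loads(raw_json)) if s.startswith(prefix)]


# Invented sample; the real script output is not documented in this thread.
sample = json.dumps({
    "analysis": {
        "summary": "ok",
        "findings": [{"id": 1, "verification": "TOKEN_abc123"}],
    },
    "meta": {"checks": ["TOKEN_def456", "none"]},
})
tokens = find_tokens(sample, "TOKEN_")
```

Walking the whole structure keeps the consumer independent of where the tokens sit, which matches the decision not to pin the format down in SKILL.md.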
- Add find-token skill with SKILL.md and token generation script
- Add custom verification support (_fn) in eval framework
- Add find-token eval with both static matching and tool execution tests
- Simplify skill loading to use symlinks only
- Address review feedback: simplify allowed-tools, fix output docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0d963c8 to 0f732bb
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Cali0707, harche.
|
/verified by #22 (comment) |
|
@Cali0707: This PR has been marked as verified by
|
/override ci/prow/eval This is a pre-existing failure from before this PR (it relates to the Docs skill) |
|
@Cali0707: Overrode contexts on behalf of Cali0707: ci/prow/eval
|
/override ci/prow/eval |
|
@harche: Overrode contexts on behalf of harche: ci/prow/eval
|
@harche: all tests passed!
Summary
- find-token skill that generates random verification tokens, proving the agent can discover, execute, and return tool output
- _fn in test_cases.yaml for validating against runtime data (alongside existing static matching)
- evals/skills/find-token/ serves as reference implementation for both patterns

Test plan
- bash evals/run.sh -k "find-token": both test cases pass

🤖 Generated with Claude Code
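To make the two verification patterns in the summary concrete, here is a rough Python sketch of static matching combined with a custom verify function that checks the response against runtime tool output. Everything named here is an assumption for illustration: the `expect_contains` and `verify_fn` keys, the `VERIFIERS` registry, the function signature, and the `TOKEN_` pattern are not taken from the framework's actual code.

```python
import re
from typing import Any, Callable

# Assumed token shape, for illustration only.
TOKEN_RE = re.compile(r"TOKEN_[A-Za-z0-9]+")


def verify_token_returned(response: str, tool_outputs: list[str]) -> bool:
    """Pass only if a token produced at runtime also appears in the response."""
    produced = {t for out in tool_outputs for t in TOKEN_RE.findall(out)}
    returned = set(TOKEN_RE.findall(response))
    return bool(produced & returned)


# Hypothetical registry mapping names used in a test case to callables.
VERIFIERS: dict[str, Callable[[str, list[str]], bool]] = {
    "verify_token_returned": verify_token_returned,
}


def check_case(case: dict[str, Any], response: str, tool_outputs: list[str]) -> bool:
    """Apply static substring checks first, then the optional custom verifier."""
    for needle in case.get("expect_contains", []):
        if needle not in response:
            return False
    fn_name = case.get("verify_fn")
    if fn_name is not None:
        return VERIFIERS[fn_name](response, tool_outputs)
    return True


case = {"expect_contains": ["token"], "verify_fn": "verify_token_returned"}
outputs = ['{"result": {"token": "TOKEN_9f3a"}}']
good = check_case(case, "The token is TOKEN_9f3a", outputs)
bad = check_case(case, "The token is TOKEN_0000", outputs)
```

Static matching alone cannot catch the second response, because the token is random per run; comparing against the captured tool output is what shows the agent actually executed the script.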