Tests negative validation coverage #76

Merged
justindobbs merged 14 commits into main from tests-negative-validation-coverage on Mar 6, 2026

@justindobbs
Owner

Summary

  • What problem does this PR solve?
  • How does it solve it (major bullets)?

Testing

  • python -m pytest
  • python -m ruff check agent_bench
  • Additional targeted tests (list):

Checklist

  • Spec/docs updated (README, SPEC_FREEZE, changelog, etc., as needed)
  • New tasks/tests added to registry + SPEC if applicable
  • Roadmap items tracked in appropriate boards (if applicable)
  • Security/privacy review (if touching telemetry, signing, or bundles)
  • Verified CI status once pushed

…equirements

- Add test_cli_tasks_validate_surfaces_manifest_validation_errors verifying CLI surfaces sandbox schema errors
- Add test_registry_rejects_deterministic_manifest_without_sandbox checking missing sandbox table rejection
- Add test_validate_task_path_reports_invalid_sandbox_shape verifying filesystem_roots type validation
- Test deterministic tasks with invalid sandbox configurations (missing table, wrong types)
- Verify error
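
For readers unfamiliar with this kind of negative test, the following is a minimal sketch of the two cases named above (missing sandbox table, wrong `filesystem_roots` type). The `validate_sandbox` function, the `deterministic` key, and the error strings are assumptions made for illustration; they are not the project's actual validator or schema.

```python
from __future__ import annotations

from typing import Any


def validate_sandbox(manifest: dict[str, Any]) -> list[str]:
    """Return human-readable errors for a task manifest (hypothetical schema)."""
    errors: list[str] = []
    # Assumption: only deterministic tasks are required to declare a sandbox table.
    if not manifest.get("deterministic", False):
        return errors
    sandbox = manifest.get("sandbox")
    if not isinstance(sandbox, dict):
        errors.append("sandbox: table is required for deterministic tasks")
        return errors
    roots = sandbox.get("filesystem_roots")
    if not isinstance(roots, list) or not all(isinstance(r, str) for r in roots):
        errors.append("sandbox.filesystem_roots: expected a list of strings")
    return errors


def test_missing_sandbox_table_is_rejected() -> None:
    assert validate_sandbox({"deterministic": True}) == [
        "sandbox: table is required for deterministic tasks"
    ]


def test_wrong_filesystem_roots_type_is_reported() -> None:
    manifest = {"deterministic": True, "sandbox": {"filesystem_roots": "/tmp"}}
    assert validate_sandbox(manifest) == [
        "sandbox.filesystem_roots: expected a list of strings"
    ]
```

Returning a list of errors rather than raising on the first one lets a CLI surface every schema problem in a single pass.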
…action execution

- Add action_trace entry when SandboxViolation occurs in action execution with error result
- Consume io_audit before appending trace entry to capture violation context
- Include step, observation, action, budget tracking in violation trace entry
- Add test_action_side_sandbox_violation_emits_correct_taxonomy_and_trace verifying trace capture
- Add test_agent_side_sandbox_violation_emits_correct_taxonomy verifying agent
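
The ordering described above (drain `io_audit`, then append the trace entry) might look roughly like the sketch below. The `run_action` helper, the `state` dictionary layout, and the `budget_remaining` key are hypothetical; only the names `SandboxViolation`, `action_trace`, and `io_audit` come from the commit message, and the class here is a local stand-in.

```python
from __future__ import annotations

from typing import Any, Callable


class SandboxViolation(Exception):
    """Local stand-in; the project's real exception may carry more context."""


def run_action(
    state: dict[str, Any],
    action: dict[str, Any],
    execute: Callable[[dict[str, Any]], Any],
) -> dict[str, Any]:
    """Execute one action and record a trace entry if the sandbox rejects it."""
    try:
        return {"ok": True, "value": execute(action)}
    except SandboxViolation as exc:
        result = {"ok": False, "error": str(exc)}
        # Drain the audit buffer first so the entry carries the I/O that led here.
        audit = list(state["io_audit"])
        state["io_audit"].clear()
        state["action_trace"].append(
            {
                "step": state["step"],
                "observation": state["observation"],
                "action": action,
                "result": result,
                "io_audit": audit,
                "budget_remaining": dict(state["budget_remaining"]),
            }
        )
        return {**result, "failure_type": "sandbox_violation"}
```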
…_type mismatch detection

- Add test_check_replay_termination_reason_mismatch verifying steps_exhausted vs tool_calls_exhausted detection
- Add test_check_replay_failure_type_mismatch verifying sandbox_violation vs logic_failure detection
- Set success=False and populate failure fields in baseline and fresh results for both tests
- Update action_trace result to match termination/failure context in each scenario
- Verify check_replay returns ok
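
As an illustration of the mismatch detection these tests exercise, here is a small sketch that compares outcome fields between a baseline and a fresh result. The `compare_runs` function and its return shape are assumptions; the real `check_replay` may compare more fields or report differently.

```python
from __future__ import annotations

from typing import Any

# Fields taken from the commit message; the real check may cover more.
OUTCOME_FIELDS = ("success", "termination_reason", "failure_type")


def compare_runs(baseline: dict[str, Any], fresh: dict[str, Any]) -> dict[str, Any]:
    """Report every outcome field whose value differs between the two runs."""
    errors = [
        f"{key} mismatch: baseline={baseline.get(key)!r} fresh={fresh.get(key)!r}"
        for key in OUTCOME_FIELDS
        if baseline.get(key) != fresh.get(key)
    ]
    return {"ok": not errors, "errors": errors}


def test_termination_reason_mismatch_is_detected() -> None:
    baseline = {"success": False, "termination_reason": "steps_exhausted"}
    fresh = {"success": False, "termination_reason": "tool_calls_exhausted"}
    report = compare_runs(baseline, fresh)
    assert report["ok"] is False
    assert any("termination_reason" in e for e in report["errors"])


def test_failure_type_mismatch_is_detected() -> None:
    baseline = {"success": False, "failure_type": "sandbox_violation"}
    fresh = {"success": False, "failure_type": "logic_failure"}
    report = compare_runs(baseline, fresh)
    assert report["ok"] is False
    assert any("failure_type" in e for e in report["errors"])
```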
… strict mode budget enforcement

- Add test_check_record_failure_type_mismatch verifying sandbox_violation vs logic_failure detection
- Add test_check_strict_steps_used_exceeded_baseline verifying steps_used regression detection
- Add test_check_strict_tool_calls_used_exceeded_baseline verifying tool_calls_used regression detection
- Import write_bundle and check_strict from runner modules
- Create baseline bundles with write_bundle for
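
The strict-mode regression checks above can be pictured as a comparison of budget counters against the baseline. The sketch below assumes that "exceeded" means strictly greater than the baseline value; `strict_budget_errors` is a hypothetical helper, not the project's `check_strict`.

```python
from __future__ import annotations

# Counter names come from the commit message.
BUDGET_COUNTERS = ("steps_used", "tool_calls_used")


def strict_budget_errors(baseline: dict[str, int], fresh: dict[str, int]) -> list[str]:
    """List each budget counter where the fresh run used more than the baseline."""
    return [
        f"{key} exceeded baseline: {fresh[key]} > {baseline[key]}"
        for key in BUDGET_COUNTERS
        if fresh[key] > baseline[key]
    ]


def test_steps_used_regression_is_flagged() -> None:
    baseline = {"steps_used": 5, "tool_calls_used": 2}
    fresh = {"steps_used": 6, "tool_calls_used": 2}
    assert strict_budget_errors(baseline, fresh) == ["steps_used exceeded baseline: 6 > 5"]


def test_equal_usage_passes() -> None:
    usage = {"steps_used": 5, "tool_calls_used": 2}
    assert strict_budget_errors(usage, dict(usage)) == []
```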
…d failure handling

- Add test_verify_with_bundle_and_run_enforces_strict verifying check_strict is called when --strict flag is set
- Add test_verify_with_bundle_and_run_strict_failure_returns_nonzero verifying non-zero exit code on strict violations
- Mock _load_run_from_ref, verify_bundle, _load_cli_session, and check_strict for isolated testing
- Verify check_strict receives correct bundle path and run artifact parameters
- Verify strict mode failure
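
A rough sketch of the isolation pattern described above, using `unittest.mock.Mock` from the standard library. The `cmd_verify` function below takes its collaborators as keyword arguments purely to keep the example self-contained; that signature, the `args` fields, and the report shape are assumptions, and the real tests patch module-level names instead.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import Any, Callable
from unittest.mock import Mock


def cmd_verify(
    args: Any,
    *,
    load_run: Callable[[str], Any],
    verify_bundle: Callable[[str], dict],
    check_strict: Callable[[str, Any], dict],
) -> int:
    """Return a process exit code: 0 on success, 1 on any verification failure."""
    run = load_run(args.run)
    if not verify_bundle(args.bundle)["ok"]:
        return 1
    # Strict checks only run when the flag is set.
    if args.strict and not check_strict(args.bundle, run)["ok"]:
        return 1
    return 0


def test_strict_failure_returns_nonzero() -> None:
    run = {"run_id": "r1"}
    check_strict = Mock(return_value={"ok": False, "errors": ["steps_used exceeded"]})
    args = SimpleNamespace(run="r1", bundle="bundle-dir", strict=True)
    code = cmd_verify(
        args,
        load_run=Mock(return_value=run),
        verify_bundle=Mock(return_value={"ok": True}),
        check_strict=check_strict,
    )
    assert code == 1
    # The strict check should receive the bundle path and the loaded run artifact.
    check_strict.assert_called_once_with("bundle-dir", run)
```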
…ling

- Add test_cmd_run_strict_spec_failure_returns_nonzero verifying non-zero exit code on spec violations
- Mock _resolve_run_inputs, _run_with_timeout, persist_run, _session_after_run, _print_run_summary, and _maybe_print_star_nudge for isolated testing
- Mock check_spec_compliance to return spec violation (artifact_hash missing)
- Verify exit code 1 when strict_spec=True and spec check fails
- Verify stderr contains "[STRICT-SPEC FAILED]
… prefer_success fallback behavior

- Add test_verify_latest_stale_session_pointer_reports_missing_run verifying error handling when session points to deleted run
- Add test_verify_defaults_to_latest_success_when_prefer_success verifying fallback to latest_success_run_id
- Mock _latest_run_id, _load_cli_session, _load_run_from_ref, and verify_bundle for isolated testing
- Verify FileNotFoundError propagates with "run artifact not found"
…ilure handling

- Add test_verify_uses_session_bundle_and_reports_integrity_failure verifying error reporting when session bundle fails verification
- Mock _latest_run_id, _load_run_from_ref, _load_cli_session, verify_bundle, and check_replay for isolated testing
- Mock verify_bundle to return hash mismatch error for manifest.json
- Set latest_bundle_dir in session to trigger bundle verification path
- Verify exit code 1 when bundle integrity check fails
…test success run is missing

- Add test_bundle_seal_fails_when_latest_success_run_missing verifying error handling when latest success run is deleted
- Mock _latest_run_id to return "missing-success-run" and _load_run_from_ref to raise FileNotFoundError
- Verify exit code 1 when run artifact cannot be found
- Verify JSON output contains ok=False and "run artifact not found" error message
…nd signing failure handling

- Add test_bundle_seal_reports_verify_failure_after_write verifying error reporting when bundle verification fails after write
- Add test_bundle_seal_reports_sign_failure verifying error reporting when bundle signing fails
- Mock _load_run_from_ref, write_bundle, verify_bundle, and _session_after_bundle for isolated testing
- Mock verify_bundle to return hash mismatch error in verification test
- Mock sign_bundle to return signing
… empty and mixed states

- Add test_bundle_status_empty_text_reports_no_bundles verifying message when no bundles exist
- Add test_bundle_status_json_reports_mixed_bundle_states verifying status reporting for valid and invalid bundles
- Mock verify_bundle to return ok=True for "ok-bundle" and hash mismatch error for "bad-bundle"
- Create ok-bundle with signature.json and bad-bundle without signature for mixed state testing
- Verify exit
…ency ordering

- Add test_bundle_status_json_respects_limit_and_recency verifying --limit flag and mtime-based sorting
- Create three baseline bundles (oldest, middle, newest) with distinct modification times
- Mock Path.iterdir to return _FakeDir instances with controlled mtime values (100.0, 200.0, 300.0)
- Mock verify_bundle to return ok=True for all bundles
- Verify --limit=2 returns only newest and middle bundles in descending m
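
The limit-and-recency behavior being tested amounts to sorting by modification time in descending order and then truncating. A minimal sketch follows, with bundles represented as `(name, mtime)` pairs; `newest_first` is a hypothetical helper, and the real command reads mtimes from directory entries.

```python
from __future__ import annotations


def newest_first(entries: list[tuple[str, float]], limit: int | None = None) -> list[str]:
    """Return bundle names ordered by mtime, newest first, optionally truncated."""
    ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
    names = [name for name, _mtime in ordered]
    return names if limit is None else names[:limit]


def test_limit_keeps_the_most_recent_bundles() -> None:
    # Same mtimes as the commit message: 100.0, 200.0, 300.0.
    entries = [("oldest", 100.0), ("newest", 300.0), ("middle", 200.0)]
    assert newest_first(entries, limit=2) == ["newest", "middle"]
```

Controlling the mtimes explicitly, as the commit does with fake directory entries, avoids flaky ordering on filesystems with coarse timestamp resolution.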
…dling

- Add test_inspect_missing_artifact_path_fails verifying error when artifact file does not exist
- Add test_inspect_corrupt_artifact_json_fails verifying error when artifact contains invalid JSON
- Add test_inspect_without_run_uses_default_runs_dir_and_fails_when_empty verifying error when no artifacts found in default runs directory
- Verify exit code 1 for all error conditions
- Verify appropriate error messages in stderr
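
The first two error paths above (missing file, corrupt JSON) can be sketched as follows. `inspect_artifact` and its messages are illustrative assumptions and do not reproduce the project's CLI output.

```python
from __future__ import annotations

import json
import sys
from pathlib import Path
from typing import TextIO


def inspect_artifact(path: Path, stderr: TextIO = sys.stderr) -> int:
    """Return 0 if the artifact exists and parses as JSON, otherwise 1."""
    if not path.is_file():
        print(f"error: artifact not found: {path}", file=stderr)
        return 1
    try:
        json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        print(f"error: invalid JSON in {path}: {exc.msg}", file=stderr)
        return 1
    return 0
```

Under pytest, a test would typically write a broken file into `tmp_path`, call the function, and assert on both the exit code and the text captured from stderr.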
… and LLM telemetry capture

- Add test_run_artifact_runtime_metadata_has_expected_shape verifying runtime_identity, budgets, and artifact_hash structure
- Add test_failure_artifact_preserves_failure_invariants verifying failure_type, termination_reason, and failure_reason fields on failed runs
- Add test_action_trace_llm_telemetry_shape verifying LLM trace capture in action_trace entries
- Mock load_task and load_agent to inject
@justindobbs justindobbs merged commit 0ef5675 into main Mar 6, 2026
9 checks passed