Tests negative validation coverage by justindobbs · Pull Request #76 · justindobbs/Tracecore

justindobbs · 2026-03-06T19:27:12Z

Summary

What problem does this PR solve?
How does it solve it (major bullets)?

Testing

python -m pytest
python -m ruff check agent_bench
Additional targeted tests (list):

Checklist

Spec/docs updated (README, SPEC_FREEZE, changelog, etc., as needed)
New tasks/tests added to registry + SPEC if applicable
Roadmap items tracked in appropriate boards (if applicable)
Security/privacy review (if touching telemetry, signing, or bundles)
Verified CI status once pushed

…equirements
- Add test_cli_tasks_validate_surfaces_manifest_validation_errors verifying CLI surfaces sandbox schema errors
- Add test_registry_rejects_deterministic_manifest_without_sandbox checking missing sandbox table rejection
- Add test_validate_task_path_reports_invalid_sandbox_shape verifying filesystem_roots type validation
- Test deterministic tasks with invalid sandbox configurations (missing table, wrong types)
- Verify error
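
The negative cases in this commit (missing sandbox table, wrong type for filesystem_roots) can be illustrated with a minimal sketch. The function name `validate_sandbox`, the `deterministic` manifest key, and the error strings are assumptions made for illustration; the project's actual validator is not shown on this page and may differ.

```python
def validate_sandbox(manifest: dict) -> list[str]:
    """Return a list of error strings; an empty list means the manifest passed.

    Hypothetical validator: only the `sandbox` table and its
    `filesystem_roots` field come from the commit message.
    """
    errors: list[str] = []
    # Assumption: only deterministic tasks are required to declare a sandbox.
    if not manifest.get("deterministic", False):
        return errors

    sandbox = manifest.get("sandbox")
    if not isinstance(sandbox, dict):
        errors.append("sandbox: table is required for deterministic tasks")
        return errors

    roots = sandbox.get("filesystem_roots")
    if not isinstance(roots, list) or not all(isinstance(r, str) for r in roots):
        errors.append("sandbox.filesystem_roots: expected a list of strings")
    return errors
```

A negative test then asserts on the returned errors rather than on a successful load, which is the pattern the test names above suggest.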

…action execution
- Add action_trace entry when SandboxViolation occurs in action execution with error result
- Consume io_audit before appending trace entry to capture violation context
- Include step, observation, action, budget tracking in violation trace entry
- Add test_action_side_sandbox_violation_emits_correct_taxonomy_and_trace verifying trace capture
- Add test_agent_side_sandbox_violation_emits_correct_taxonomy verifying agent
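
As a rough sketch of the behavior described above, the following shows one way a violation could be turned into an error result and recorded in the trace after the audit buffer is consumed. `SandboxViolation` here is a local stand-in class, and `execute_step`, its parameters, and the shape of the trace entry are hypothetical; the real runner code is not shown on this page.

```python
class SandboxViolation(Exception):
    """Local stand-in for the sandbox error raised during an action."""


def execute_step(step, observation, action, run_action, io_audit, budget, action_trace):
    """Run one action and record a trace entry (hypothetical signature)."""
    try:
        result = run_action(action)
    except SandboxViolation as exc:
        # Convert the violation into an error result instead of losing it.
        result = {"ok": False, "error": "sandbox_violation", "message": str(exc)}

    # Consume the audit buffer before appending, so the entry carries the
    # I/O events that led up to the violation.
    audit = list(io_audit)
    io_audit.clear()

    action_trace.append(
        {
            "step": step,
            "observation": observation,
            "action": action,
            "result": result,
            "io_audit": audit,
            "budget": dict(budget),  # snapshot, not a live reference
        }
    )
    return result
```

This sketch records an entry for every step; the commit message only states that an entry is added when a violation occurs.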

…_type mismatch detection
- Add test_check_replay_termination_reason_mismatch verifying steps_exhausted vs tool_calls_exhausted detection
- Add test_check_replay_failure_type_mismatch verifying sandbox_violation vs logic_failure detection
- Set success=False and populate failure fields in baseline and fresh results for both tests
- Update action_trace result to match termination/failure context in each scenario
- Verify check_replay returns ok
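
The mismatch detection exercised by these tests can be sketched as a field-by-field comparison of a baseline result against a fresh one. `compare_outcomes` and the shape of its report are assumptions for illustration and are not the project's `check_replay`; only the field names and values (`termination_reason`, `failure_type`, `steps_exhausted`, and so on) come from the commit message.

```python
def compare_outcomes(baseline: dict, fresh: dict) -> dict:
    """Report any outcome field that differs between two run results."""
    fields = ("success", "termination_reason", "failure_type")
    errors = [
        f"{name}: baseline={baseline.get(name)!r} fresh={fresh.get(name)!r}"
        for name in fields
        if baseline.get(name) != fresh.get(name)
    ]
    return {"ok": not errors, "errors": errors}


baseline = {"success": False, "termination_reason": "steps_exhausted",
            "failure_type": "logic_failure"}
fresh = {"success": False, "termination_reason": "tool_calls_exhausted",
         "failure_type": "logic_failure"}
report = compare_outcomes(baseline, fresh)
```

Here `report["ok"]` is false and the single error names `termination_reason`, which mirrors the first scenario listed above.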

… strict mode budget enforcement
- Add test_check_record_failure_type_mismatch verifying sandbox_violation vs logic_failure detection
- Add test_check_strict_steps_used_exceeded_baseline verifying steps_used regression detection
- Add test_check_strict_tool_calls_used_exceeded_baseline verifying tool_calls_used regression detection
- Import write_bundle and check_strict from runner modules
- Create baseline bundles with write_bundle for
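
A budget regression check of the kind these tests target might look like the sketch below. The helper `budget_regressions` and its message format are hypothetical; only the counter names `steps_used` and `tool_calls_used` come from the commit message, and the assumption is that a fresh run fails strict mode when it uses more of either counter than the baseline did.

```python
def budget_regressions(baseline: dict, fresh: dict) -> list[str]:
    """List counters where the fresh run used more budget than the baseline."""
    errors = []
    for counter in ("steps_used", "tool_calls_used"):
        # Assumption: equal usage is acceptable; only an increase is a regression.
        if fresh[counter] > baseline[counter]:
            errors.append(
                f"{counter} exceeded baseline: {fresh[counter]} > {baseline[counter]}"
            )
    return errors
```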

…d failure handling
- Add test_verify_with_bundle_and_run_enforces_strict verifying check_strict is called when --strict flag is set
- Add test_verify_with_bundle_and_run_strict_failure_returns_nonzero verifying non-zero exit code on strict violations
- Mock _load_run_from_ref, verify_bundle, _load_cli_session, and check_strict for isolated testing
- Verify check_strict receives correct bundle path and run artifact parameters
- Verify strict mode failure
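
The mocking pattern described above can be sketched with the standard library's `unittest.mock`. The `runner` namespace and the `cmd_verify` function below are hypothetical stand-ins for the CLI code, which is not shown on this page; the point is only how a patched `check_strict` lets a test assert both the call arguments and the exit code.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the module that owns check_strict.
runner = SimpleNamespace(
    check_strict=lambda bundle_dir, run: {"ok": True, "errors": []}
)


def cmd_verify(bundle_dir: str, run: dict, strict: bool = False) -> int:
    """Return a process exit code; strict mode adds the check_strict gate."""
    if strict:
        report = runner.check_strict(bundle_dir, run)
        if not report["ok"]:
            return 1
    return 0


def test_strict_failure_returns_nonzero():
    run = {"run_id": "r1"}
    failing = {"ok": False, "errors": ["steps_used exceeded baseline"]}
    with mock.patch.object(runner, "check_strict", return_value=failing) as fake:
        exit_code = cmd_verify("bundles/b1", run, strict=True)
    # The mock records the call, so the test can check what was passed in.
    fake.assert_called_once_with("bundles/b1", run)
    assert exit_code == 1
```

Patching at the boundary keeps the test independent of real bundles on disk, which appears to be the intent of the mocks listed in the commit.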

…ling
- Add test_cmd_run_strict_spec_failure_returns_nonzero verifying non-zero exit code on spec violations
- Mock _resolve_run_inputs, _run_with_timeout, persist_run, _session_after_run, _print_run_summary, and _maybe_print_star_nudge for isolated testing
- Mock check_spec_compliance to return spec violation (artifact_hash missing)
- Verify exit code 1 when strict_spec=True and spec check fails
- Verify stderr contains "[STRICT-SPEC FAILED]

… prefer_success fallback behavior
- Add test_verify_latest_stale_session_pointer_reports_missing_run verifying error handling when session points to deleted run
- Add test_verify_defaults_to_latest_success_when_prefer_success verifying fallback to latest_success_run_id
- Mock _latest_run_id, _load_cli_session, _load_run_from_ref, and verify_bundle for isolated testing
- Verify FileNotFoundError propagates with "run artifact not found"

…ilure handling
- Add test_verify_uses_session_bundle_and_reports_integrity_failure verifying error reporting when session bundle fails verification
- Mock _latest_run_id, _load_run_from_ref, _load_cli_session, verify_bundle, and check_replay for isolated testing
- Mock verify_bundle to return hash mismatch error for manifest.json
- Set latest_bundle_dir in session to trigger bundle verification path
- Verify exit code 1 when bundle integrity check fails

…test success run is missing
- Add test_bundle_seal_fails_when_latest_success_run_missing verifying error handling when latest success run is deleted
- Mock _latest_run_id to return "missing-success-run" and _load_run_from_ref to raise FileNotFoundError
- Verify exit code 1 when run artifact cannot be found
- Verify JSON output contains ok=False and "run artifact not found" error message

…nd signing failure handling
- Add test_bundle_seal_reports_verify_failure_after_write verifying error reporting when bundle verification fails after write
- Add test_bundle_seal_reports_sign_failure verifying error reporting when bundle signing fails
- Mock _load_run_from_ref, write_bundle, verify_bundle, and _session_after_bundle for isolated testing
- Mock verify_bundle to return hash mismatch error in verification test
- Mock sign_bundle to return signing

… empty and mixed states
- Add test_bundle_status_empty_text_reports_no_bundles verifying message when no bundles exist
- Add test_bundle_status_json_reports_mixed_bundle_states verifying status reporting for valid and invalid bundles
- Mock verify_bundle to return ok=True for "ok-bundle" and hash mismatch error for "bad-bundle"
- Create ok-bundle with signature.json and bad-bundle without signature for mixed state testing
- Verify exit

…ency ordering
- Add test_bundle_status_json_respects_limit_and_recency verifying --limit flag and mtime-based sorting
- Create three baseline bundles (oldest, middle, newest) with distinct modification times
- Mock Path.iterdir to return _FakeDir instances with controlled mtime values (100.0, 200.0, 300.0)
- Mock verify_bundle to return ok=True for all bundles
- Verify --limit=2 returns only newest and middle bundles in descending m
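
The ordering rule under test (newest first by modification time, then truncate to the limit) is small enough to sketch directly. `FakeDir` and `latest_bundles` are illustrative names, not the project's `_FakeDir` or its status command; the mtime values match the ones quoted in the commit message.

```python
from dataclasses import dataclass


@dataclass
class FakeDir:
    """Illustrative stand-in for a bundle directory with a controlled mtime."""
    name: str
    mtime: float


def latest_bundles(dirs: list[FakeDir], limit: int) -> list[str]:
    """Return bundle names sorted newest first, truncated to `limit`."""
    ordered = sorted(dirs, key=lambda d: d.mtime, reverse=True)
    return [d.name for d in ordered[:limit]]


dirs = [FakeDir("middle", 200.0), FakeDir("oldest", 100.0), FakeDir("newest", 300.0)]
selected = latest_bundles(dirs, limit=2)  # ["newest", "middle"]
```

Using fixed mtime values rather than real file timestamps keeps the ordering assertion deterministic across filesystems.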

…dling
- Add test_inspect_missing_artifact_path_fails verifying error when artifact file does not exist
- Add test_inspect_corrupt_artifact_json_fails verifying error when artifact contains invalid JSON
- Add test_inspect_without_run_uses_default_runs_dir_and_fails_when_empty verifying error when no artifacts found in default runs directory
- Verify exit code 1 for all error conditions
- Verify appropriate error messages in stderr
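
The first two error paths above (missing file, corrupt JSON) can be sketched as a function that reports to stderr and returns exit code 1. `inspect_artifact` and its messages are hypothetical; the actual inspect command and its wording are not shown on this page.

```python
import json
import sys
from pathlib import Path


def inspect_artifact(path: Path) -> int:
    """Return 0 if the artifact loads as JSON, 1 otherwise (hypothetical)."""
    if not path.is_file():
        print(f"artifact not found: {path}", file=sys.stderr)
        return 1
    try:
        json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        print(f"invalid artifact JSON in {path}: {exc}", file=sys.stderr)
        return 1
    return 0
```

In a pytest suite, the `tmp_path` and `capsys` fixtures would be a natural fit for creating the corrupt file and asserting on the stderr message.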

… and LLM telemetry capture
- Add test_run_artifact_runtime_metadata_has_expected_shape verifying runtime_identity, budgets, and artifact_hash structure
- Add test_failure_artifact_preserves_failure_invariants verifying failure_type, termination_reason, and failure_reason fields on failed runs
- Add test_action_trace_llm_telemetry_shape verifying LLM trace capture in action_trace entries
- Mock load_task and load_agent to inject

justindobbs added 14 commits March 6, 2026 11:48

justindobbs merged commit 0ef5675 into main Mar 6, 2026
9 checks passed

justindobbs merged 14 commits into main from tests-negative-validation-coverage
