Fix semantic diff uncommitted mode #27

Merged
orban merged 20 commits into main from feature/semantic-diff-iteration-3 on Apr 15, 2026

@orban (Owner) commented Apr 4, 2026

Summary

  • fix uncommitted-change mode in scripts/explain_semantic_diff.sh so staged changes are not double-counted
  • classify .intent-layer/* and log artifacts separately from config files so internal bookkeeping diffs do not imply config drift
  • add regression coverage for staged-only uncommitted diffs and internal-artifact-only diffs
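
The two fixes above can be illustrated with a minimal sketch. It is written in Python for readability, not taken from scripts/explain_semantic_diff.sh. It assumes uncommitted mode collects one path list from the staged diff and one from the unstaged diff. The function names, category labels, and the list of config extensions are all hypothetical.

```python
from pathlib import PurePosixPath


def merge_uncommitted(staged, unstaged):
    """Union staged and unstaged paths so a file present in both counts once."""
    seen = set()
    merged = []
    for path in list(staged) + list(unstaged):
        if path not in seen:
            seen.add(path)
            merged.append(path)
    return merged


def classify_path(path):
    """Return a coarse category. Labels are illustrative, not the script's own."""
    parts = PurePosixPath(path).parts
    # Internal bookkeeping is checked first so it never falls through to "config".
    if parts and parts[0] == ".intent-layer":
        return "internal-artifact"
    if path.endswith(".log"):
        return "internal-artifact"
    # Assumed set of config extensions, for illustration only.
    if path.endswith((".json", ".yaml", ".yml", ".toml")):
        return "config"
    return "source"
```

Checking the internal-artifact rules before the config rule is what keeps a file such as `.intent-layer/state.json` from being reported as config drift.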

Scope and assumptions

This keeps the first-version standalone CLI intact and only addresses the correctness issues found in review. It does not refactor the broader agent or eval flow.

Tests

  • ./tests/test_explain_semantic_diff.sh
  • ./tests/test_suggest_updates.sh

Sample output

# Semantic Diff Explainer

Range: uncommitted changes
Changed files: 1

## src/AGENTS.md

Summary: Changes are concentrated in 1 file(s) under this node.
Changed files: 1
Confidence: medium

Behavioral impact: No material behavioral change detected.
Contract impact: No contract change detected.
Internal-only signal: No clear internal-only signal.

Notes

This supersedes the earlier review state on feature/semantic-diff-fix / PR #26 with the two missing regressions included.


Automated by nightshift

orban added 20 commits February 23, 2026 20:30
…to eval harness

- store run_config in checkpoints; warn on resume if config mismatches
- classify errors as infra/timeout/genuine; skip genuine failures on resume
- add --retry-all flag to override and re-run genuine failures too
- write per-trial JSON to results/trials/ for ls-level observability
- 13 new tests covering all three features

Entire-Checkpoint: 7fb98ee370d1
Entire-Checkpoint: 5384ed119f2a
Entire-Checkpoint: 81278d45e27d
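
A rough sketch of the resume policy described in this commit: infra and timeout errors are retried, genuine failures are skipped unless the caller opts in (the `--retry-all` case). The marker strings, the result-dict shape, and both function names are assumptions made for illustration. The harness may classify errors differently.

```python
# Hypothetical substrings that mark an infrastructure problem.
INFRA_MARKERS = ("docker", "connection refused")


def classify_error(message):
    """Bucket an error message as 'timeout', 'infra', or 'genuine'."""
    text = message.lower()
    if "timeout" in text or "timed out" in text:
        return "timeout"
    if any(marker in text for marker in INFRA_MARKERS):
        return "infra"
    return "genuine"


def should_rerun(result, retry_all=False):
    """Decide whether a checkpointed result is re-run on resume."""
    if result["passed"]:
        return False
    kind = classify_error(result.get("error", ""))
    if kind in ("infra", "timeout"):
        return True
    # Genuine failures are only re-run when the caller asks for it.
    return retry_all
```
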
Pre-validation failures trip at task level (all conditions share
Docker setup), other failures trip at task+condition level.
Threshold of 2 accounts for in-flight parallel workers.

Entire-Checkpoint: b2252e9cb3d3
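
The two trip levels can be sketched as a failure counter keyed differently by failure type. This is a sketch under assumptions, not the harness code: the class name and method names are hypothetical, and only the threshold of 2 comes from the commit message.

```python
from collections import Counter


class CircuitBreaker:
    """Counts failures per task, or per (task, condition), and trips at a threshold."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.failures = Counter()

    def record_failure(self, task_id, condition, pre_validation=False):
        # Pre-validation failures are shared by every condition of the task,
        # so they are counted under a task-level key with condition None.
        key = (task_id, None) if pre_validation else (task_id, condition)
        self.failures[key] += 1

    def is_open(self, task_id, condition):
        """True when further work on this task+condition should be skipped."""
        task_level = self.failures[(task_id, None)]
        pair_level = self.failures[(task_id, condition)]
        return task_level >= self.threshold or pair_level >= self.threshold
```
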
After each batch, classifies failures as infra vs genuine. If infra
failures detected: checks Docker health, restarts if needed, reduces
parallelism, resets circuit breaker, and retries. Max 2 retry rounds.

Entire-Checkpoint: 806ec810be18
Status file (.eval-status.json) updated after each result with
machine-readable state: workers, pass/fail rates, paused flag.
Control dir (.eval-control/) accepts commands: pause, resume,
set-workers N, skip-task <id>. Commands consumed on read.
Enables Ralph Loop or any external agent to manage running evals.

Entire-Checkpoint: 5f55311d2e2a
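
One way "commands consumed on read" could work is sketched below. The on-disk layout is an assumption: each file in the control directory is taken to hold one command line, and files are processed in name order. The function names and the tuple return shape are hypothetical. Only the four command names come from the commit message.

```python
from pathlib import Path


def parse_command(text):
    """Parse one command line into (name, argument), or None if unrecognised."""
    parts = text.split()
    if not parts:
        return None
    name, args = parts[0], parts[1:]
    if name in ("pause", "resume") and not args:
        return (name, None)
    if name == "set-workers" and len(args) == 1 and args[0].isdigit():
        return (name, int(args[0]))
    if name == "skip-task" and len(args) == 1:
        return (name, args[0])
    return None


def consume_commands(control_dir):
    """Read every command file once, deleting each file as it is read."""
    commands = []
    for path in sorted(Path(control_dir).glob("*")):
        if not path.is_file():
            continue
        text = path.read_text().strip()
        path.unlink()  # consumed on read: a command is never applied twice
        parsed = parse_command(text)
        if parsed is not None:
            commands.append(parsed)
    return commands
```
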
- Wire fisher_exact_test into reporter for per-task significance testing
- Add _compute_recommendations() flagging ceiling/floor/infra-only tasks
- Add Per-Task Analysis table and Recommendations section to markdown output
- Add lib/monitor.py: polling-based eval supervisor with stall detection,
  Docker recovery, infra-task skipping, and worker scaling
- 79 new tests across stats, reporter, and monitor modules

Entire-Checkpoint: 722abb53d525
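
For readers unfamiliar with the per-task significance test, here is a self-contained sketch of a two-sided Fisher exact test on a 2x2 pass/fail table. It is not the repository's `fisher_exact_test`, whose signature is not shown here. The function name and argument order below are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the table [[a, b], [c, d]].

    Rows are two conditions, columns are pass and fail counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x passes in row 1 with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    low = max(0, col1 - row2)
    high = min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    total = sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)
```

With 3 passes out of 3 under one condition and 0 out of 3 under the other, `fisher_exact_two_sided(3, 0, 0, 3)` gives 0.1, which shows how little power three trials per condition provide.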
- AGENTbench loader (HuggingFace dataset) and runner (Docker-based eval)
- run-agentbench CLI command with 4 conditions (none/flat/human/intent_layer)
- Dynamic condition discovery in reporter (no more hardcoded condition list)
- Path traversal protection in write_test_infrastructure
- Docker --network none for test isolation
- Thread-safe temp dirs (PID + thread ID)
- Checkpoint batching (every 10 results vs every 1)
- Monitor uses Reporter.INFRA_ERROR_PREFIXES (no drift)
- Set-based infra result filtering (replaces fragile list.remove)
- Empty dict all() guard in pre-validation

Entire-Checkpoint: 815f2403cb52
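
The path traversal protection mentioned for write_test_infrastructure could look roughly like the guard below. This is a generic sketch, not the repository's code. `safe_join` is a hypothetical helper name.

```python
from pathlib import Path


def safe_join(root, relative):
    """Resolve `relative` under `root`, refusing anything that escapes `root`."""
    root_path = Path(root).resolve()
    # Joining with an absolute path discards root_path, so that case is
    # caught by the same containment check below.
    target = (root_path / relative).resolve()
    if target != root_path and root_path not in target.parents:
        raise ValueError(f"path escapes {root_path}: {relative}")
    return target
```

Resolving both paths before comparing them means `..` segments and symlinks are taken into account, which a plain string-prefix check would miss.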
--network none breaks setup commands that need pip install.
Default to bridge (Docker's default) and expose the parameter
so callers can opt into none for pure-test phases later.

Entire-Checkpoint: 8118f4f3017a
- agentbench_loader: use split="train" (dataset has no "test" split)
- docker_runner: use bash instead of sh (AGENTbench setup uses `source`)

Entire-Checkpoint: 0afc9c1d9169
AGENTbench images install tools like uv to /root/.local/bin which
only gets added to PATH via /etc/profile in login shells. bash -lc
instead of bash -c fixes exit 127 for repos using uv.

Entire-Checkpoint: 5d8c2af91c5d
…ote docker support

Three fixes from run 4 analysis:

- Regression eval: only fail when a golden-passing test now fails (was
  requiring 100% pass rate, which is impossible when repos have 14-83
  pre-existing failures in the baseline)
- strip_docs: preserve README.md variants (setup.py reads them)
- docker_runner: add EVAL_DOCKER_HOST support for remote x86 execution
  via rsync, avoiding QEMU emulation on Apple Silicon

Also removes .index-cache-preserve/ (context files now generated
dynamically per run).

Entire-Checkpoint: 9f4d4a6248d7
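
The regression rule in the first fix reduces to a set comparison. The sketch below assumes each run is summarised as a mapping from test name to a pass flag, and that a test missing from the current run counts as failing. Both are assumptions, as are the function names.

```python
def regressions(golden, current):
    """Tests that passed in the golden baseline but do not pass now."""
    return sorted(
        name
        for name, passed in golden.items()
        if passed and not current.get(name, False)
    )


def regression_eval_passes(golden, current):
    """Pass unless a golden-passing test broke. Pre-existing failures are ignored."""
    return not regressions(golden, current)
```
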
UGREEN NAS (chronos) runs an rsync daemon that intercepts all rsync
connections and rejects paths outside configured modules. tar piped
through SSH bypasses this entirely and works reliably.

Entire-Checkpoint: e86fbbd52da2
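
A sketch of the tar-over-SSH transfer, assuming `tar` and `ssh` are on PATH on both sides. The function names, the `mkdir -p` step, and the default timeout are illustrative choices, not taken from docker_runner.

```python
import shlex
import subprocess


def tar_over_ssh_commands(local_dir, host, remote_dir):
    """Build the producer (local tar) and consumer (remote tar) command lines."""
    pack = ["tar", "-C", local_dir, "-cf", "-", "."]
    quoted = shlex.quote(remote_dir)
    unpack = ["ssh", host, f"mkdir -p {quoted} && tar -C {quoted} -xf -"]
    return pack, unpack


def sync_to_remote(local_dir, host, remote_dir, timeout=300):
    """Stream a local directory to the remote host without using rsync."""
    pack, unpack = tar_over_ssh_commands(local_dir, host, remote_dir)
    with subprocess.Popen(pack, stdout=subprocess.PIPE) as producer:
        # The archive is piped straight into ssh; nothing is written to disk.
        consumer = subprocess.run(unpack, stdin=producer.stdout, timeout=timeout)
    # Leaving the with-block waits for tar, so its return code is set here.
    return producer.returncode == 0 and consumer.returncode == 0
```
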
The pre-pull step was running `docker pull` locally even when Docker
execution happens on a remote host via SSH. Now uses `ssh $host docker
pull` when EVAL_DOCKER_HOST is configured.

Entire-Checkpoint: cfe3bf0bccca
Prevents two issues from overnight eval runs:
- SSH agent key expiry caused all workers to hang indefinitely on
  stale connections. Added ConnectTimeout, ServerAliveInterval, and
  subprocess timeout=300s so failures surface within 5 minutes.
- sync_from_remote was transferring .venv dirs (4GB+) back from
  chronos. Added excludes for .venv, node_modules, __pycache__, etc.

Entire-Checkpoint: aa17a443b57d
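
The hang fix combines SSH-level timeouts with a subprocess-level timeout. In the sketch below, `ConnectTimeout` and `ServerAliveInterval` are real OpenSSH options and 300 seconds is the value from the commit message. The option values of 10 and 15 seconds, the helper names, and the exact exclude list are assumptions.

```python
import subprocess

# Assumed values; the commit names the options but not their settings.
SSH_OPTS = ["-o", "ConnectTimeout=10", "-o", "ServerAliveInterval=15"]

# Directories that should never be copied back from the remote host.
EXCLUDES = [".venv", "node_modules", "__pycache__"]


def ssh_command(host, remote_cmd):
    """Build an ssh invocation that fails fast on dead connections."""
    return ["ssh", *SSH_OPTS, host, remote_cmd]


def tar_exclude_args(patterns=EXCLUDES):
    """Turn exclude patterns into tar command-line arguments."""
    return [f"--exclude={pattern}" for pattern in patterns]


def run_remote(host, remote_cmd, timeout=300):
    """Run a remote command; return None if it exceeds the timeout."""
    try:
        return subprocess.run(
            ssh_command(host, remote_cmd),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
```
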
Replace ephemeral docker run with persistent containers (docker run -d +
docker exec) so setup runs once per task instead of 3x. Add
start_container, exec_in_container, stop_container, copy_into_container
to docker_runner.py.

Fix git "dubious ownership" error caused by macOS tar overlay changing
file UIDs inside containers (CVE-2022-24765). Add safe.directory config
before overlay. Include stderr/stdout tail in setup error messages.

Entire-Checkpoint: c11a4a4ced3d
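
The persistent-container lifecycle can be sketched as three Docker command lines plus the safe.directory step. These builders are hypothetical and are not the signatures of start_container, exec_in_container, or stop_container. Keeping the container alive with `sleep infinity` and marking every directory safe with `'*'` are assumptions.

```python
def start_cmd(image, name):
    """Start a detached container that idles until explicitly removed."""
    return ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"]


def exec_cmd(name, script):
    """Run a script in the container through a login shell (bash -lc)."""
    return ["docker", "exec", name, "bash", "-lc", script]


def stop_cmd(name):
    """Force-remove the container once the task's runs are done."""
    return ["docker", "rm", "-f", name]


def safe_directory_cmd(name):
    """Mark directories as safe for git before files with foreign UIDs arrive."""
    return exec_cmd(name, "git config --global --add safe.directory '*'")
```

Setup then runs once through `exec_cmd` after `start_cmd`, and each subsequent run reuses the same container instead of paying the setup cost again.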
Nightshift-Task: semantic-diff
Nightshift-Ref: https://github.com/marcus/nightshift
Nightshift-Task: semantic-diff

Nightshift-Ref: https://github.com/marcus/nightshift
@orban merged commit bb47145 into main on Apr 15, 2026
@orban deleted the feature/semantic-diff-iteration-3 branch on April 15, 2026 20:28