Fix semantic diff uncommitted mode by orban · Pull Request #27 · orban/intent-layer

orban · 2026-04-04T06:38:48Z

Summary

- fix uncommitted-change mode in scripts/explain_semantic_diff.sh so staged changes are not double-counted
- classify .intent-layer/* and log artifacts separately from config files so internal bookkeeping diffs do not imply config drift
- add regression coverage for staged-only uncommitted diffs and internal-artifact-only diffs
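
The first two points can be illustrated with a minimal sketch. It is written in Python rather than shell and is not the script's actual logic. It assumes the staged and unstaged file lists are gathered separately (for example with `git diff --cached --name-only` and `git diff --name-only`); the function names, the `.log` rule, and the config suffix list are all hypothetical.

```python
from pathlib import PurePosixPath


def merge_uncommitted(staged, unstaged):
    """Union two changed-file lists, keeping first-seen order, so a file
    that is both staged and further modified is counted once."""
    seen = set()
    merged = []
    for path in [*staged, *unstaged]:
        if path not in seen:
            seen.add(path)
            merged.append(path)
    return merged


def classify(path):
    """Return 'internal' for bookkeeping artifacts, 'config' for config
    files, and 'source' otherwise. The suffix lists are assumptions."""
    p = PurePosixPath(path)
    if p.parts and p.parts[0] == ".intent-layer":
        return "internal"
    if p.suffix == ".log":
        return "internal"
    if p.suffix in {".json", ".yaml", ".yml", ".toml"}:
        return "config"
    return "source"
```

Checking the internal-artifact rules before the config rule is what keeps a file such as `.intent-layer/state.json` from being reported as config drift.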

Scope and assumptions

This keeps the first-version standalone CLI intact and only addresses the correctness issues found in review. It does not refactor the broader agent or eval flow.

Tests

./tests/test_explain_semantic_diff.sh
./tests/test_suggest_updates.sh

Sample output

# Semantic Diff Explainer

Range: uncommitted changes
Changed files: 1

## src/AGENTS.md

Summary: Changes are concentrated in 1 file(s) under this node.
Changed files: 1
Confidence: medium

Behavioral impact: No material behavioral change detected.
Contract impact: No contract change detected.
Internal-only signal: No clear internal-only signal.

Notes

This supersedes the earlier review state on feature/semantic-diff-fix (PR #26) and includes the two regression tests that were missing there.

Automated by nightshift

…to eval harness

- store run_config in checkpoints; warn on resume if config mismatches
- classify errors as infra/timeout/genuine; skip genuine failures on resume
- add --retry-all flag to override and re-run genuine failures too
- write per-trial JSON to results/trials/ for ls-level observability
- 13 new tests covering all three features

Entire-Checkpoint: 7fb98ee370d1
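
The resume rules above could look roughly like the following sketch. The names `ErrorKind`, `config_mismatches`, and `should_rerun` are hypothetical, and re-running timeouts on resume is an assumption; the commit message only says that genuine failures are skipped.

```python
from enum import Enum


class ErrorKind(Enum):
    INFRA = "infra"
    TIMEOUT = "timeout"
    GENUINE = "genuine"


def config_mismatches(saved, current):
    """Return the sorted keys whose values differ between two run configs."""
    keys = set(saved) | set(current)
    return sorted(k for k in keys if saved.get(k) != current.get(k))


def should_rerun(kind, retry_all=False):
    """On resume, re-run infra and timeout failures; skip genuine failures
    unless the caller asked to retry everything."""
    if kind is None:  # the trial passed, nothing to redo
        return False
    if kind is ErrorKind.GENUINE:
        return retry_all
    return True
```

A caller would warn when `config_mismatches(checkpoint["run_config"], current_config)` is non-empty, where the `run_config` key is taken from the commit message and the surrounding dictionary shape is assumed.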

After each batch, classifies failures as infra vs genuine. If infra failures detected: checks Docker health, restarts if needed, reduces parallelism, resets circuit breaker, and retries. Max 2 retry rounds.

Entire-Checkpoint: 806ec810be18
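
As a rough sketch of that retry loop, assuming the batch runner, the infra classifier, and the recovery step are passed in as callables. The function name and the policy of halving the worker count are hypothetical; the commit only says parallelism is reduced.

```python
def run_with_recovery(tasks, run_batch, is_infra, recover, workers=8, max_rounds=2):
    """Run tasks, then retry infra failures up to max_rounds times.

    run_batch(tasks, workers) -> {task: error_or_None}
    is_infra(error) -> bool
    recover() runs before each retry (health check, restart, breaker reset).
    """
    results = run_batch(tasks, workers)
    for _ in range(max_rounds):
        infra = [t for t, err in results.items() if err and is_infra(err)]
        if not infra:
            break
        recover()
        workers = max(1, workers // 2)  # halving is an assumed policy
        results.update(run_batch(infra, workers))
    return results
```

Genuine failures are never retried here: only tasks whose error is classified as infra are sent back to `run_batch`.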

Status file (.eval-status.json) updated after each result with machine-readable state: workers, pass/fail rates, paused flag. Control dir (.eval-control/) accepts commands: pause, resume, set-workers N, skip-task <id>. Commands consumed on read. Enables Ralph Loop or any external agent to manage running evals.

Entire-Checkpoint: 5f55311d2e2a
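
One way "consumed on read" could work is sketched below. The on-disk format is an assumption: here each file in the control directory holds one command as plain text, and the function name `consume_commands` is hypothetical.

```python
from pathlib import Path


def consume_commands(control_dir):
    """Read every command file in control_dir, delete it, and return the
    parsed commands as (name, argument) tuples in filename order."""
    commands = []
    for path in sorted(Path(control_dir).iterdir()):
        if not path.is_file():
            continue
        text = path.read_text().strip()
        path.unlink()  # consumed on read: a command is applied at most once
        name, _, arg = text.partition(" ")
        if name in {"pause", "resume"} and not arg:
            commands.append((name, None))
        elif name == "set-workers" and arg.isdigit():
            commands.append((name, int(arg)))
        elif name == "skip-task" and arg:
            commands.append((name, arg))
        # Unrecognized commands are dropped rather than raising.
    return commands
```

Deleting the file before acting on it means a crash mid-command loses that command instead of replaying it, which is the safer default for actions like `skip-task`.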

- Wire fisher_exact_test into reporter for per-task significance testing
- Add _compute_recommendations() flagging ceiling/floor/infra-only tasks
- Add Per-Task Analysis table and Recommendations section to markdown output
- Add lib/monitor.py: polling-based eval supervisor with stall detection, Docker recovery, infra-task skipping, and worker scaling
- 79 new tests across stats, reporter, and monitor modules

Entire-Checkpoint: 722abb53d525
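
For reference, a two-sided Fisher exact test on a 2x2 pass/fail table can be written with the standard library alone. This is a generic sketch, not the project's `fisher_exact_test`; the function name and argument order are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are conditions; columns are pass and fail counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x passes in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)
```

With three reps per condition, a 3/3 versus 0/3 split gives `fisher_exact_two_sided(3, 0, 0, 3)`, a p-value of 0.1, which shows why small rep counts rarely reach conventional significance thresholds per task.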

- AGENTbench loader (HuggingFace dataset) and runner (Docker-based eval)
- run-agentbench CLI command with 4 conditions (none/flat/human/intent_layer)
- Dynamic condition discovery in reporter (no more hardcoded condition list)
- Path traversal protection in write_test_infrastructure
- Docker --network none for test isolation
- Thread-safe temp dirs (PID + thread ID)
- Checkpoint batching (every 10 results vs every 1)
- Monitor uses Reporter.INFRA_ERROR_PREFIXES (no drift)
- Set-based infra result filtering (replaces fragile list.remove)
- Empty dict all() guard in pre-validation

Entire-Checkpoint: 815f2403cb52
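
Two of the hardening items above, path traversal protection and per-thread temp dir names, can be sketched as follows. The helper names `safe_join` and `temp_dir_name` are hypothetical and are not taken from `write_test_infrastructure`.

```python
import os
import threading
from pathlib import Path


def safe_join(root, relative):
    """Resolve relative under root, raising ValueError if it escapes root."""
    base = Path(root).resolve()
    target = (base / relative).resolve()
    if target != base and base not in target.parents:
        raise ValueError(f"path escapes {base}: {relative}")
    return target


def temp_dir_name(prefix="eval"):
    """Name unique per process and thread, so parallel workers do not collide."""
    return f"{prefix}-{os.getpid()}-{threading.get_ident()}"
```

Resolving both paths before comparing them is what catches inputs such as `a/../../outside` and absolute paths, which a plain string prefix check would miss.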

…ote docker support

Three fixes from run 4 analysis:

- Regression eval: only fail when a golden-passing test now fails (was requiring 100% pass rate, which is impossible when repos have 14-83 pre-existing failures in the baseline)
- strip_docs: preserve README.md variants (setup.py reads them)
- docker_runner: add EVAL_DOCKER_HOST support for remote x86 execution via rsync, avoiding QEMU emulation on Apple Silicon

Also removes .index-cache-preserve/ (context files now generated dynamically per run).

Entire-Checkpoint: 9f4d4a6248d7
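
The regression rule in the first fix reduces to a set comparison. A minimal sketch, assuming each run is summarized as a mapping from test name to pass/fail; the function names are hypothetical, and treating a test that is missing from the candidate run as a failure is an assumption.

```python
def regression_failures(golden, candidate):
    """Return tests that passed in the golden run but fail (or are missing)
    in the candidate run. Both arguments map test name -> passed (bool)."""
    return sorted(
        name
        for name, passed in golden.items()
        if passed and not candidate.get(name, False)
    )


def regression_eval_passes(golden, candidate):
    """Pre-existing failures in the golden run never count against the candidate."""
    return not regression_failures(golden, candidate)
```

Under this rule a repo with dozens of pre-existing baseline failures can still pass, as long as nothing that passed in the golden run breaks.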

UGREEN NAS (chronos) runs an rsync daemon that intercepts all rsync connections and rejects paths outside configured modules. tar piped through SSH bypasses this entirely and works reliably.

Entire-Checkpoint: e86fbbd52da2
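
A tar-over-SSH transfer of this kind might be assembled as below. This is a sketch under assumptions: the function names, the directory layout, and the 300-second timeout default are hypothetical, and the actual commands in docker_runner.py may differ.

```python
import shlex
import subprocess


def tar_over_ssh_commands(local_dir, host, remote_dir):
    """Return the (producer, consumer) argv pair for streaming a directory."""
    pack = ["tar", "-C", local_dir, "-cf", "-", "."]
    # ssh hands its arguments to a remote shell, so quote the remote path.
    unpack = ["ssh", host, f"tar -C {shlex.quote(remote_dir)} -xf -"]
    return pack, unpack


def push_dir(local_dir, host, remote_dir, timeout=300):
    """Stream local_dir to host:remote_dir as a tar archive over SSH."""
    pack, unpack = tar_over_ssh_commands(local_dir, host, remote_dir)
    with subprocess.Popen(pack, stdout=subprocess.PIPE) as producer:
        consumer = subprocess.run(unpack, stdin=producer.stdout, timeout=timeout)
    if producer.returncode != 0 or consumer.returncode != 0:
        raise RuntimeError("tar-over-ssh transfer failed")
```

Because the archive travels over the SSH channel itself, no rsync process is started on the remote side and the daemon's module restrictions never apply.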

Prevents two issues from overnight eval runs:

- SSH agent key expiry caused all workers to hang indefinitely on stale connections. Added ConnectTimeout, ServerAliveInterval, and subprocess timeout=300s so failures surface within 5 minutes.
- sync_from_remote was transferring .venv dirs (4GB+) back from chronos. Added excludes for .venv, node_modules, __pycache__, etc.

Entire-Checkpoint: aa17a443b57d
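
The two safeguards could be wired up along these lines. The option values (10 and 15 seconds), the added `BatchMode` and `ServerAliveCountMax` options, and the helper names are assumptions; only ConnectTimeout, ServerAliveInterval, the 300-second subprocess timeout, and the three named excludes come from the commit message.

```python
import subprocess

SSH_OPTIONS = [
    "-o", "BatchMode=yes",           # assumed: fail instead of prompting
    "-o", "ConnectTimeout=10",       # assumed value
    "-o", "ServerAliveInterval=15",  # assumed value
    "-o", "ServerAliveCountMax=3",   # assumed: drop after 3 missed probes
]
SYNC_EXCLUDES = [".venv", "node_modules", "__pycache__"]


def ssh_command(host, remote_command):
    """Build an ssh argv that gives up on dead or unreachable hosts."""
    return ["ssh", *SSH_OPTIONS, host, remote_command]


def run_remote(host, remote_command, timeout=300):
    """Run a remote command; raises subprocess.TimeoutExpired after timeout."""
    return subprocess.run(
        ssh_command(host, remote_command),
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def tar_exclude_flags(excludes=SYNC_EXCLUDES):
    """Turn directory names into tar --exclude flags for the sync-back step."""
    return [f"--exclude={name}" for name in excludes]
```

The subprocess timeout acts as a backstop: even if the SSH keepalive options fail to detect a stale connection, the worker is released after five minutes.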

Replace ephemeral docker run with persistent containers (docker run -d + docker exec) so setup runs once per task instead of 3x. Add start_container, exec_in_container, stop_container, copy_into_container to docker_runner.py.

Fix git "dubious ownership" error caused by macOS tar overlay changing file UIDs inside containers (CVE-2022-24765). Add safe.directory config before overlay.

Include stderr/stdout tail in setup error messages.

Entire-Checkpoint: c11a4a4ced3d
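
The container lifecycle described above maps onto a handful of Docker CLI invocations. The sketch below only builds the argument lists; the `*_cmd` helper names and their signatures are hypothetical and do not reproduce the functions added to docker_runner.py. Keeping the container alive with `sleep infinity` is an assumption that requires a `sleep` supporting that argument in the image.

```python
import shlex


def start_container_cmd(image, name):
    """Start a detached container that stays alive for later exec calls."""
    return ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"]


def exec_in_container_cmd(name, command, workdir=None):
    """Run a command in the running container through a login shell."""
    cmd = ["docker", "exec"]
    if workdir:
        cmd += ["-w", workdir]
    return cmd + [name, "bash", "-lc", command]


def mark_safe_directory_cmd(name, repo_path):
    """Tell git to trust repo_path even if file ownership looks wrong."""
    git = f"git config --global --add safe.directory {shlex.quote(repo_path)}"
    return exec_in_container_cmd(name, git)


def stop_container_cmd(name):
    """Force-remove the container once every condition has run."""
    return ["docker", "rm", "-f", name]
```

Setup then runs once through `exec_in_container_cmd`, and each condition reuses the same container instead of paying the setup cost again.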

orban added 20 commits February 23, 2026 20:30

default eval model to sonnet for reproducibility

4aae36a

Entire-Checkpoint: 5384ed119f2a

bump default parallelism from 2 to 8 workers

b981018

Entire-Checkpoint: 81278d45e27d

add circuit breaker to skip remaining reps after repeated failures

d5bc415

Pre-validation failures trip at task level (all conditions share Docker setup), other failures trip at task+condition level. Threshold of 2 accounts for in-flight parallel workers.

Entire-Checkpoint: b2252e9cb3d3
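
The two trip levels can be modeled with one counter keyed differently per failure type. This is a minimal sketch; the class and method names are hypothetical and the real breaker may track more state.

```python
from collections import Counter


class CircuitBreaker:
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.failures = Counter()

    @staticmethod
    def _key(task_id, condition, pre_validation):
        # Pre-validation failures are shared by every condition of a task.
        return (task_id, None) if pre_validation else (task_id, condition)

    def record_failure(self, task_id, condition, pre_validation=False):
        self.failures[self._key(task_id, condition, pre_validation)] += 1

    def is_open(self, task_id, condition):
        """True when remaining reps for this task/condition should be skipped."""
        return (
            self.failures[(task_id, None)] >= self.threshold
            or self.failures[(task_id, condition)] >= self.threshold
        )
```

With a threshold of 2, a single failure reported by a worker that was already in flight does not trip the breaker on its own.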

fix stale test references to renamed setup_workspace method

785c257

Entire-Checkpoint: 94115d70169e

add network parameter to run_in_docker, default to bridge

b53c614

--network none breaks setup commands that need pip install. Default to bridge (Docker's default) and expose the parameter so callers can opt into none for pure-test phases later.

Entire-Checkpoint: 8118f4f3017a

fix agentbench loader split and docker shell compatibility

2d42eab

- agentbench_loader: use split="train" (dataset has no "test" split)
- docker_runner: use bash instead of sh (AGENTbench setup uses `source`)

Entire-Checkpoint: 0afc9c1d9169

use login shell in docker to pick up ~/.local/bin PATH

9ab8632

AGENTbench images install tools like uv to /root/.local/bin which only gets added to PATH via /etc/profile in login shells. bash -lc instead of bash -c fixes exit 127 for repos using uv.

Entire-Checkpoint: 5d8c2af91c5d
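
Taken together with the earlier network and shell commits, the resulting `docker run` invocation might be assembled as in this sketch. The helper name `docker_run_command` and its parameters are hypothetical and do not reproduce `run_in_docker`.

```python
def docker_run_command(image, command, network="bridge", login_shell=True):
    """Build a docker run argv.

    network: "bridge" lets setup steps reach package indexes; "none"
             isolates pure-test phases.
    login_shell: use `bash -lc` so /etc/profile is sourced and tools under
                 ~/.local/bin are on PATH.
    """
    shell_flag = "-lc" if login_shell else "-c"
    return [
        "docker", "run", "--rm",
        "--network", network,
        image,
        "bash", shell_flag, command,
    ]
```

For example, `docker_run_command("img:1", "uv sync")` uses the bridge network and a login shell, while passing `network="none"` opts a test phase into isolation.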

pull docker images on remote host when EVAL_DOCKER_HOST is set

2c9ea2e

The pre-pull step was running `docker pull` locally even when Docker execution happens on a remote host via SSH. Now uses `ssh $host docker pull` when EVAL_DOCKER_HOST is configured.

Entire-Checkpoint: cfe3bf0bccca
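
The branch on EVAL_DOCKER_HOST can be sketched in a few lines. The function name and the injectable `env` parameter are hypothetical and exist only to make the behavior easy to check.

```python
import os


def docker_pull_command(image, env=None):
    """Return the argv that pulls image where the containers will run."""
    env = os.environ if env is None else env
    host = env.get("EVAL_DOCKER_HOST")
    pull = ["docker", "pull", image]
    # With a remote host configured, the pull must happen over SSH.
    return ["ssh", host, *pull] if host else pull
```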

Add semantic diff explainer

0682873

Nightshift-Task: semantic-diff
Nightshift-Ref: https://github.com/marcus/nightshift

Fix semantic diff direct-node handling

3a68906

Nightshift-Task: semantic-diff
Nightshift-Ref: https://github.com/marcus/nightshift

Fix semantic diff uncommitted mode

b82925d

Nightshift-Task: semantic-diff
Nightshift-Ref: https://github.com/marcus/nightshift

This was referenced Apr 15, 2026

Add semantic diff explainer #25

Closed

Fix semantic diff handling for direct node edits #26

Closed

orban merged commit bb47145 into main Apr 15, 2026

orban deleted the feature/semantic-diff-iteration-3 branch April 15, 2026 20:28

orban merged 20 commits into main from feature/semantic-diff-iteration-3
