Dev#8

Merged
im4codes merged 151 commits into master from dev
Apr 21, 2026
Conversation

@im4codes
Owner

No description provided.

IM.codes and others added 30 commits April 17, 2026 18:34
Template-prompt filter (recall-only): excludes built-in OpenSpec / P2P /
slash-command / skill-template prompts from memory recall via the shared
`isTemplatePrompt` / `isTemplateOriginSummary` predicates. Locale-aware
across all 7 supported UI languages (en, zh-CN, zh-TW, es, ru, ja, ko) and
covers every `openspec.*_prompt` + `p2p.*_prompt` built-in template, the
`P2P_BASELINE_PROMPT`, `roundPrompt()` headers, harness `<command-name>`
tags, and slash-command / plugin-namespaced skill invocations.

Recall cap rule: `RECALL_MIN_FLOOR = 0.5`, `RECALL_DEFAULT_CAP = 3`,
`RECALL_EXTEND_BAR = 0.6`, `RECALL_EXTEND_CAP = 5`. Drop below floor; take
top 3; extend to 5 iff every top-3 item clears 0.6. Applied at process
`prependLocalMemory`, transport `buildTransportMessageRecall`, and server
`POST /memory/recall`.
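
A minimal sketch of the cap rule, assuming the caller passes items sorted
by descending score (constant names from above; the function name is
hypothetical):

    const RECALL_MIN_FLOOR = 0.5;
    const RECALL_DEFAULT_CAP = 3;
    const RECALL_EXTEND_BAR = 0.6;
    const RECALL_EXTEND_CAP = 5;

    interface RecallItem {
      id: string;
      score: number; // similarity in [0, 1]
    }

    // ranked: sorted descending by score
    function applyRecallCap(ranked: RecallItem[]): RecallItem[] {
      // Drop everything below the floor.
      const eligible = ranked.filter((item) => item.score >= RECALL_MIN_FLOOR);
      const top = eligible.slice(0, RECALL_DEFAULT_CAP);
      // Extend to 5 only when all three top items clear the extend bar.
      const extend =
        top.length === RECALL_DEFAULT_CAP &&
        top.every((item) => item.score >= RECALL_EXTEND_BAR);
      return extend ? eligible.slice(0, RECALL_EXTEND_CAP) : top;
    }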

Per-session de-dup: daemon-side LRU of 10 past injection events keyed by
`sessionKey`; prevents re-injecting the same memory across consecutive
turns of the same session. Cleared on `session.clear` (both transport and
process paths) and on `TransportSessionRuntime.kill()`. Server endpoint
does not apply this — it has no per-session context.
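
A sketch of the de-dup bookkeeping, assuming an array-backed history per
session (real field names and the event payload may differ):

    // LRU of the last 10 injection events, keyed by sessionKey.
    const INJECTION_HISTORY_SIZE = 10;
    const injectionHistory = new Map<string, string[]>(); // sessionKey -> memory ids

    function shouldInject(sessionKey: string, memoryId: string): boolean {
      const seen = injectionHistory.get(sessionKey) ?? [];
      if (seen.includes(memoryId)) return false; // already injected this session
      seen.push(memoryId);
      if (seen.length > INJECTION_HISTORY_SIZE) seen.shift(); // evict oldest
      injectionHistory.set(sessionKey, seen);
      return true;
    }

    // Invoked from session.clear and TransportSessionRuntime.kill():
    function clearInjectionHistory(sessionKey: string): void {
      injectionHistory.delete(sessionKey);
    }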

Hit-count credit: only for items that actually entered the prompt
(survived floor + LRU + cap). Items dropped upstream no longer receive a
spaced-repetition credit.

Intentional scope boundaries:
- Ingestion / materialization is NOT filtered — template events remain
  part of the project's recorded history.
- Startup bootstrap (`selectStartupMemoryItems`) is NOT filtered — it is
  project-scoped memory load, not a query-driven recall.
- CLI `imcodes memory` / WS `memory.search` / web UI browsing are NOT
  capped — they use client-supplied explicit limits.

Tests: 158 added or updated (template patterns × 7 locales, recall cap
rule, injection history LRU, server recall endpoint rewrites for new
semantics, materialization coordinator reverse-pin asserting template
content is still recorded).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y-idle session

Why: user reported "after idle it still doesn't trigger any action or
effect" (orig. "idle 后依旧不触发任何动作和效果") — enabling Auto on a
session whose last turn had already finished left supervision dormant
forever. Two gaps in the idle-boundary handler combined:

1. `handleTimelineEvent` deleted `recentTaskCandidates` on every idle
   where snapshot was null/off, so by the time the user flipped Auto on
   there was no candidate left to evaluate.
2. `applySnapshotUpdate` only mutated existing active runs; it never
   kicked off an implicit run for a just-enabled session, so the
   already-completed turn sat there unevaluated until the next user
   message — which is "nothing happens" from the user's perspective.

Fix: preserve the candidate when supervision is off at idle (only drop
it when supervision is ON but preconditions fail — those can't
self-heal), and make `applySnapshotUpdate` re-run the same implicit-
trigger preconditions as the idle path when a freshly-enabled snapshot
arrives against a dormant session.

Regression guard: `test/daemon/supervision-idle-integration.test.ts`
wires the real `timelineEmitter` + real `supervisionAutomation` through
the `handleWebCommand('session.send')` → idle → broker path with only
the broker/runtime/store mocked, so the production seam is actually
exercised. Previous coverage mocked `supervisionAutomation` wholesale
and never would have caught this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IM.codes and others added 29 commits April 20, 2026 15:18
…trawman

The previous hero sentence had two problems:

1. "instead of relying on blind auto-continue" framed supervision as
   replacing a fictitious default. IM.codes sessions do NOT auto-continue
   by default — the baseline is manual input every turn. Supervision
   ADDS gated continuation, it doesn't replace blind continuation.

2. "supported transport-backed agent sessions" read as if only a subset
   of transports were covered, even though
   SUPERVISION_SUPPORTED_TARGET_SESSION_TYPES = TRANSPORT_SESSION_AGENT_TYPES
   and recent commit 46e79a5 extended coverage to qwen presets. The
   doubled qualifier also hid the actual leverage: you write the
   supervisor's instructions.

The new hero leads with the custom-instructions differentiator, names
all three decision outcomes (auto-continue / hand back / audit loop),
pins the evaluation point at the idle boundary, and contrasts with the
honest baseline (manual "continue" loops) rather than a strawman.

Synced across all 7 README files (EN + ES / JA / KO / RU / zh-CN / zh-TW).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production daemon on a self-hosted deployment grew from 600MB to 3.3GB in
~40 minutes and OOM-crashed. Observed on the leaking instance:

  - 7 parallel ESTAB WS connections from the daemon to the server
    (should be exactly 1)
  - 10 orphan `cat /tmp/imcodes-pty-<pid>-XXXX/stream.fifo` child
    processes (pipe-pane readers that were never reaped)
  - sessions.json truncated to 0 bytes by a write interrupted by OOM
  - daemon.log flapping between "ServerLink: not connected" and
    "write EPIPE" for several minutes prior

## Bug A — ServerLink.connect() leaks previous WebSocket

`connect()` overwrote `this.ws` without closing the old WebSocket. The
stale-check guards (`if (this.ws !== ws) return`) in the open / message /
close handlers correctly drop handler-level events for the old socket,
but nothing ever calls `close()` on it. The OS keeps the TCP socket
ESTAB for minutes until network timeout, and the Node WebSocket
instance keeps its internal buffers, TLS state, and event emitter
closures alive the whole time. Every `scheduleReconnect()` cycle under
error/close flapping added another live WS on top of the previous ones.

`forceReconnect()` already closes the old ws before scheduling a
reconnect — the error/close → `scheduleReconnect()` → `connect()` path
did not. This commit adds the same close to the top of `connect()`.
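
A sketch of the shape of the fix, under the assumption that the class
holds a single `ws` field (handler details elided):

    import WebSocket from 'ws';

    class ServerLink {
      private ws: WebSocket | null = null;

      connect(url: string): void {
        // The fix: tear down the previous socket first, mirroring forceReconnect().
        if (this.ws) {
          try {
            this.ws.close();
          } catch {
            // closing an already-dead socket must not abort the reconnect
          }
          this.ws = null;
        }
        const ws = new WebSocket(url);
        this.ws = ws;
        // Stale-check guard: events from a superseded socket are dropped.
        ws.on('message', (data) => {
          if (this.ws !== ws) return;
          // ...handle message
        });
      }
    }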

## Bug B — terminal-streamer.handlePipeClose leaks `cat` subprocess

`handlePipeClose()` is called from the stream's unexpected `error` /
`close` events. It deleted the pipeState tracking entry and scheduled a
rebind, but it did NOT:

  - call `stream.destroy()` — so the Node readable side kept buffering
    incoming FIFO data with no consumer
  - call `pipeState.cleanup()` — so provider-side resources kept
    accumulating
  - call `stopPipePaneStream(sessionName)` — so tmux kept the
    underlying `cat /tmp/.../stream.fifo` child running, feeding
    indefinitely-buffered data into the daemon

Each subsequent rebind spawned another `cat`; the previous one was
never reaped. With ~10 FIFO readers each feeding a few hundred KB/s
into the daemon with no drain, Node's readable buffer grew unbounded
— consistent with the ~425MB/min growth we observed.

This matches `stopPipe()`, which already does all three cleanup steps;
the unexpected-close path just needs the same teardown before it
schedules a rebind.
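
A sketch of the unified teardown, assuming the pipe map and helpers
described above (exact signatures may differ):

    interface PipeState {
      stream: NodeJS.ReadableStream & { destroy(): void };
      cleanup: () => void; // provider-side resource release
    }

    function handlePipeClose(
      pipes: Map<string, PipeState>,
      sessionName: string,
      stopPipePaneStream: (sessionName: string) => void,
      scheduleRebind: (sessionName: string) => void,
    ): void {
      const pipeState = pipes.get(sessionName);
      if (!pipeState) return;
      pipes.delete(sessionName);
      pipeState.stream.destroy(); // stop buffering a consumer-less FIFO
      pipeState.cleanup(); // release provider-side resources
      stopPipePaneStream(sessionName); // reap the tmux-side `cat` child
      scheduleRebind(sessionName);
    }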

## Tests

- server-link.test.ts: regression — `connect()` called twice in a row
  must `close()` the previous WebSocket instance before creating the
  new one.
- terminal-streamer-snapshot.test.ts: regression — triggering
  `stream.on('close')` after a successful pipe start must invoke
  `stream.destroy()`, the pipeState `cleanup()` closure, and
  `stopPipePaneStream(sessionName)`.
- Daemon unit suite: 2271 pass / 0 fail.
- Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…w JSON

Production screenshot on the iPhone client had every codex `WebSearch`
tool call rendering as literal `{"query":"","action":{"type":"other"}}`
in the chat row. Two compounding bugs:

## Daemon-side (src/agent/providers/codex-sdk.ts)

`toolFromItem` for the `webSearch` case was putting BOTH the extracted
query AND the raw `action` object into the flat `input` payload:

    input: { query: effectiveQuery, ..., action }

When Codex emits a `webSearch` item with no resolvable query
(`item/started` before the search is picked, or `item/completed` with
`action.type: 'other'`), `effectiveQuery` fell through to `''`. That
leaves `input = { query: '', action: { type: 'other' } }` — two keys
where the first is an empty string.

## UI-side (web/src/components/ChatView.tsx summarizeToolInput)

`formatToolPayloadValue` walks `TOOL_INPUT_SUMMARY_KEYS` in order
(`query`, `command`, `cmd`, `path`, ...). It treats an empty-string
value as "not useful" and keeps scanning for another key; none of the
other keys match; `entries.length !== 1` so it falls through to
`JSON.stringify(value)`. The raw shape `{"query":"","action":{"type":"other"}}`
ended up stamped directly into the chat row.

## Fix

Strip raw `action` out of the flat `input` payload and force `query` to
a non-empty human-readable label derived from the best available signal
(top-level query → action.query → pattern → url → `(<actionType>)`).
The expand/detail panel still gets the full raw `action` via
`detail.input` and `detail.raw`.
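
A sketch of the fallback chain (types assumed from the payloads above);
the Result rows below follow from it:

    interface WebSearchAction {
      type: string;
      query?: string;
      pattern?: string;
      url?: string;
    }

    // Always returns a non-empty, human-readable label for the chat row.
    function webSearchLabel(topLevelQuery?: string, action?: WebSearchAction): string {
      return (
        topLevelQuery ||
        action?.query ||
        action?.pattern ||
        action?.url ||
        `(${action?.type ?? 'web_search'})`
      );
    }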

Result:
- `{ action: { type: 'search', query: 'nvidia a100' } }` → row reads "WebSearch nvidia a100"
- `{ action: { type: 'open_page', url: '...' } }`       → row reads "WebSearch <url>"
- `{ action: { type: 'find_in_page', pattern: '...' } }` → row reads "WebSearch <pattern>"
- `{ action: { type: 'other' } }`                        → row reads "WebSearch (other)"
- `item/started` before any action materializes           → row reads "WebSearch (other)" or "(web_search)"

## Tests

- Extended existing `falls back to action url/pattern/type for non-search
  WebSearch actions` to also assert `input.query` is non-empty AND that
  `input.action` is not present (the root cause of the rendering bug).
- New `WebSearch started lifecycle with no action surfaces a readable
  label` — covers the exact payload shape from the production screenshot.
- Full daemon unit suite: 2272 pass / 0 fail. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the HTTP /timeline/history/full backfill only fired on WS
`session.event connected` — i.e., WebSocket reconnects. That left a
subtle gap: opening a session window while the WS is already connected
(session switch, reopening a minimized pane, returning to a tab after
background throttling, opening a sub-session window while the parent
was already alive) rendered from memory cache / IDB / WS replay without
any authoritative daemon-side check. Events written by the daemon while
the window wasn't visible could be missed until the next full WS
reconnect — potentially never, on long-lived connections.

User-visible request: "do a background pull every time it opens, please"
(orig. "每次打开 都背景拉一下吧").

## Change

Extracted the inline reconnect-path backfill in `useTimeline` into a
single reusable `fireHttpBackfill(delayMs)` helper, now called from all
three mount paths (memory cache hit, already-loaded short-circuit, cold
IDB-backed load) with a short ~200ms delay so the UI renders from cache
first and the network read is strictly additive.

The existing reconnect call now also routes through the helper (600ms
delay retained — the reconnect case has an extra race-settling concern
that a session mount doesn't share).

## Safety

- Unchanged dedup path: `mergeTimelineEvents` is eventId-keyed so a WS
  event and its HTTP-recovered twin collapse to one.
- Cursor computed at fire time (not call time), so events that arrive
  between mount and the 200ms tick don't get redownloaded.
- cacheKey guard: if the user switches sessions during the delay window
  the timeout no-ops.
- Skipped entirely when `serverId` is unknown (self-hosted pod-sticky
  routing requires it).

The helper depends on five values (serverId, sessionId, cacheKey,
mergeEvents, idbPutEvents); it is stored in `fireHttpBackfillRef` so the
mount effect and WS-message effect can call the latest version without
having to list all five in their own dep arrays (which would cause
spurious effect re-runs).

## Tests

- New `fires HTTP backfill on session mount (memory-cache path) even
  without a WS reconnect` — the exact regression this change targets.
- Updated the existing reconnect and failure-swallow tests to drain
  the mount-time backfill before asserting the reconnect-time fire.
- Full web suite: 848 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-ups to a38da6f (HTTP timeline backfill on session mount):

## 1) 60s cooldown on mount-time backfill

Flicking session A → B → A → B within a minute used to fire a fresh
HTTP read on every visit. Now `fireHttpBackfill(delayMs, { cooldownMs })`
checks a module-level `lastHttpBackfillOkAt` map keyed by cacheKey;
mount callers pass `cooldownMs: 60_000`. The cooldown is armed only on
a non-null response (`{events: []}` counts as confirmed-no-gap); null
or rejected responses leave the stamp untouched so the next mount
retries promptly.

## 2) WS reconnect deliberately bypasses the cooldown

Reconnects imply a real gap where live events were probably dropped
by the bridge's subscribe-race window. Suppressing the reconnect
backfill would defeat its purpose — the reconnect call site stays
cooldown-free.

## 3) Reset cooldowns when the app reopens after a long hide

Module state survives across mounts, so without this the 60s cooldown
would suppress the backfill even after backgrounding a PWA for hours.
A `visibilitychange` listener tracks hiddenAt; on return-to-visible
with `hidden_duration >= 60s` the map is wiped. Short blurs (alt-tab
to Slack for 5s) leave it intact. `pageshow` also wipes on bfcache
restore (`event.persisted`). Guarded with `typeof document !== 'undefined'`
so the hook still imports under SSR / the vitest node env.
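
A sketch of the cooldown state plus the reopen wipe, assuming module-level
storage as described (names hypothetical beyond those quoted above):

    const MOUNT_BACKFILL_COOLDOWN_MS = 60_000;
    const lastHttpBackfillOkAt = new Map<string, number>(); // cacheKey -> ts

    function isCoolingDown(cacheKey: string, cooldownMs: number): boolean {
      const last = lastHttpBackfillOkAt.get(cacheKey);
      return cooldownMs > 0 && last !== undefined && Date.now() - last < cooldownMs;
    }

    // Armed only on a non-null response; failures leave the stamp untouched.
    function markBackfillOk(cacheKey: string): void {
      lastHttpBackfillOkAt.set(cacheKey, Date.now());
    }

    if (typeof document !== 'undefined') {
      let hiddenAt: number | null = null;
      document.addEventListener('visibilitychange', () => {
        if (document.visibilityState === 'hidden') {
          hiddenAt = Date.now();
        } else if (hiddenAt !== null && Date.now() - hiddenAt >= MOUNT_BACKFILL_COOLDOWN_MS) {
          lastHttpBackfillOkAt.clear(); // long hide: next mount fires fresh
        }
      });
      window.addEventListener('pageshow', (event) => {
        if (event.persisted) lastHttpBackfillOkAt.clear(); // bfcache restore
      });
    }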

## Tests

- `skips the mount-time backfill when the same session was successfully
  backfilled in the last 60 seconds`
- `app-reopen wipe ... clears the cooldown so the next mount fires fresh`
- `reconnect-path backfill bypasses the mount cooldown`
- Full web suite: 851 pass / 0 fail. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…phan cat processes

Second leak path missed by 20440d7. After 57 minutes on a production
daemon we observed 11 orphan `cat /tmp/imcodes-pty-*/stream.fifo`
children, all spawned in a 15-second window during a network flap.
The log had 215 "Pipe-pane stream started" lines but 0 close /
rebind events — meaning these pipes were spawned, never hit the
cleanup path, and still accumulated cats.

## Root cause

`startPipe` is async and assigns `this.pipes.set(sessionName, ...)`
only AFTER `await startPipePaneStream(...)` resolves. Between the
outer-most guards (`isTransportSessionName` etc.) and that assignment
there's a ~50–300ms window where `this.pipes.has(sessionName)` still
returns false.

When a network flap causes two web clients to reconnect in the same
tick and both subscribe to the same session, two `bootstrapSubscriber`
calls each check `this.pipes.has(sessionName) === false` and each call
`startPipe`. Both then spawn their own `cat` via `startPipePaneStream`.
First to finish writes its `PipeState` into the map; second's
`this.pipes.set(sessionName, …)` **overwrites** it. The first
`PipeState` is now orphaned — its `cat` keeps running and feeding
bytes into a Node stream that `handlePipeClose` can never find (the
map lookup returns the SECOND pipeState, not the one whose stream
just closed). The leak compounds over each reconnect storm.

Same shape matches the earlier 10-cat-orphan / 425MB/min OOM pattern —
except this one is *not* fixed by 20440d7's handlePipeClose cleanup
because the map entry was overwritten, not gracefully handed off.

## Fix

Introduce `pipeStartLocks: Set<string>` on the streamer. At the top of
`startPipe` (after the transport-session short-circuit), bail if the
session already has a pipe OR a start is in flight:

    if (this.pipes.has(sessionName) || this.pipeStartLocks.has(sessionName)) {
      return;
    }
    this.pipeStartLocks.add(sessionName);
    try { ... } finally { this.pipeStartLocks.delete(sessionName); }

The guard covers both shapes of the race:
- "Already have a live pipe" (second subscribe after first completed)
- "First start still awaiting startPipePaneStream" (both subscribes
  in the same tick)

Legitimate retry paths are unaffected:
- `rebindSession` explicitly calls `stopPipe` (clears `pipes`) before
  `startPipe`.
- `scheduleRebind` fires only after `handlePipeClose` removed the
  dead entry.
- `retryPipeIfSubscribers` already bails if a pipe exists.

## Test

New regression test mocks `startPipePaneStream` with a gated promise
(first caller awaits, subsequent callers must be dropped by the
guard). Two concurrent `subscribe()` calls for the same session now
produce exactly ONE `startPipePaneStream` invocation. Previously they
produced two.

Full daemon suite: 2273 pass / 0 fail. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SessionRecord was rebuilt solely from opts on every transport
(re)launch, so any caller that forgot ccPreset — rebuildSubSessions,
provider auto-reconnect, P2P helper clone — silently wiped the preset
and Qwen reverted to the OAuth coder-model placeholder.

Resolve ccPreset/userCreated/parentSession/recentInjectionHistory from
the existing record when opts doesn't override (opts.fresh still wins),
same pattern already used for transportConfig and startupMemoryInjected.
Both Qwen and claude-code-sdk preset branches now use effectiveCcPreset
so the defense holds even if a new caller omits it.

Belt-and-suspenders in subsession-manager: startSubSession and
rebuildSubSessions now pass ccPreset explicitly, and the non-transport
rebuild upsert carries ccPreset/description/userCreated/memory-dedup
state forward so daemon restart no longer resets them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iles into memory

Root cause of the 80MB/min sustained RSS growth on a long-running
production daemon (pid 4074851 on 211, weeks of runtime):

  /home/k/.imcodes/transport/deck_sub_2f4d1346.jsonl → 170 MB
  /home/k/.imcodes/transport/deck_sub_25504q48.jsonl →  67 MB
  /home/k/.imcodes/transport/deck_sub_6u2l3o0j.jsonl →  65 MB
  (transport/ total: 449 MB across 39 session files)

`replayTransportHistory(sessionId)` called `readFile(path, 'utf8')`
to slurp the entire file — then kept only the last 200 lines via
`content.trim().split('\n').slice(-MAX_REPLAY_LINES)`. On a 170 MB
file that allocates:

  - ~170 MB raw utf-8 string (V8 stores as UTF-16 ≈ 340 MB)
  - a full per-line array (often another ~340 MB)
  - an intermediate `.trim()` copy

per browser subscribe / session resume. With ~3 simultaneous
subscribes + the new mount-time HTTP backfill firing history replays
every time a window opens (commit a38da6f), this compounded into
multi-GB transient V8 allocations and 80 MB/min sustained RSS growth
as GC couldn't keep up.

Also explains why a fresh local dev box never reproduced it: session
JSONLs only reach tens of MB after days of real use.

## Fix

Replace the `readFile` path with a bounded tail read:

  fs.open → stat → read last 1 MiB → drop the first (partial) line →
  split → slice(-200) → parse → fh.close()

1 MiB is enough headroom for 200 tail lines even on sessions with
~5 KB tool-output payloads per event. The allocation ceiling is now
O(1) in file size: ~1 MiB buffer + a 200-entry array regardless of
whether the source file is 10 MB or 10 GB.

Also adds an explicit `fh.close()` in a `finally` — previously
`readFile` closed the fd implicitly, but with a manual `open` we
must release it ourselves to avoid the per-replay fd leak we
observed via `sudo lsof -p <daemon>` (the same file was held at 4
separate fds concurrently on 211).
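
A sketch of the bounded tail read using node:fs/promises (the helper name
is hypothetical; constants follow the text above):

    import { open } from 'node:fs/promises';

    const MAX_REPLAY_LINES = 200;
    const TAIL_BYTES = 1024 * 1024; // 1 MiB ceiling regardless of file size

    async function readTailLines(path: string): Promise<string[]> {
      const fh = await open(path, 'r');
      try {
        const { size } = await fh.stat();
        const length = Math.min(size, TAIL_BYTES);
        const buffer = Buffer.alloc(length);
        await fh.read(buffer, 0, length, size - length);
        let text = buffer.toString('utf8');
        // If the window didn't start at byte 0, the first line is partial — drop it.
        if (length < size) {
          text = text.slice(text.indexOf('\n') + 1);
        }
        return text.split('\n').filter(Boolean).slice(-MAX_REPLAY_LINES);
      } finally {
        await fh.close(); // manual open means we own the fd lifecycle
      }
    }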

## Verification

- New regression test `replay stays bounded on multi-megabyte JSONL
  files (tail-read only)` writes a 5000-entry × 5 KB = ~25 MB file and
  asserts replay returns exactly the last 200 entries in order
  (lastIdx=4999, firstIdx=4800). With the old full-file slurp path a
  large enough fixture would have failed CI under memory pressure; now
  that we only look at the tail it succeeds trivially.
- Daemon unit suite: 2274 pass / 0 fail. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mmary

Follow-up to bc1a496. Two upgrades in the same shape-bug family
("read the whole file just to slice off the tail"):

## 1) transport-history — precise N-line reverse scan

bc1a496 replaced the full-file `readFile` with a fixed 1 MiB tail
window. That worked for typical sessions but would silently return
FEWER than 200 entries on sessions whose lines are larger than 5 KB
average (a single tool-output payload can run 5–100 KB) — 200 tail
lines × 8 KB = 1.6 MB, outside the fixed window.

Rewritten as a reverse-chunk scan:
  - `open` + `stat`
  - repeatedly read 64 KiB chunks backward from EOF
  - count newlines in each chunk before concatenating
  - stop as soon as we've seen MAX_REPLAY_LINES + 1 newlines (the +1
    lets us drop the partial leading line cleanly)
  - hard cap at MAX_TAIL_BYTES (16 MiB) to protect against pathological
    huge-line files
  - `finally { fh.close() }` keeps fd lifecycle explicit

Allocation is now O(min(file_size, 200·line_size, 16 MiB)) — matches
the actual "last 200 events" contract instead of a rough byte heuristic.

## 2) p2p-orchestrator — same shape at a different call site

The audit turned up a second instance at
`src/daemon/p2p-orchestrator.ts:1026`:

  fullContent = await readFile(run.contextFilePath, 'utf8');
  run.resultSummary = fullContent.slice(-2000);

A multi-round discussion across several hops can produce megabytes
of markdown; reading all of it just to slice off the last 2000 chars
is the same anti-pattern. Replaced with a bounded 2 KiB `fh.read`
from EOF, still with `finally { fh.close() }`.

## Audit summary (for other readers)

Scanned the rest of the daemon for full-file-slurp-to-tail patterns:
  - `jsonl-watcher.ts` — already bounded 256 KB tail read (safe)
  - `codex-watcher.ts` — already bounded 256 KB tail read (safe)
  - `gemini-watcher.ts` — reads small fixed Gemini session JSONs,
    bounded by CLI format (safe, severity 1)
  - `timeline-store.ts` — already uses reverse-chunk with 16 MiB cap
    (safe)
  - agent drivers' `.slice(-N)` calls operate on in-memory tmux
    capture arrays (~50 lines), not files (safe)
  - store/* JSON files are fixed small config blobs (safe)

No other leak sites in this family.

## Tests

- New `returns exactly MAX_REPLAY_LINES entries even when each line is
  large (reverse-chunk scans back as far as needed)`: 250 lines × 6 KB
  = ~1.5 MB file, asserts exactly 200 entries returned with correct
  idx range [50..249]. Would fail with the old fixed 1 MiB window.
- Previous `replay stays bounded on multi-megabyte JSONL files` still
  passes (5000×5KB fixture).
- Full daemon suite: 2275 pass / 0 fail. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…retry storm

Observed on production daemon (pid 72905, 211):
  - main-thread CPU pinned at 85% sustained; daemon message dispatch
    feels noticeably laggy under load
  - logs show a self-reinforcing storm:
      "Codex SDK session is already busy" → retry
      "Compression timed out after 60000ms" → retry
      repeated every 10–15 s for minutes

Root cause: ~40 materialization targets fire on a 10 s cadence and each
`materializeTarget` invokes `compressWithSdk` independently. The shared
Codex sub-session used by the compression path only accepts ONE `send`
in flight; concurrent callers race it, the second hits
`PROVIDER_ERROR: "Codex SDK session is already busy"`, and the retry
loop kicks in. Meanwhile N streams of delta callbacks pile up on the
main event loop, all competing with WS heartbeat, user command
dispatch, and the replication poller. Even though every `await` is
theoretically async,
the callback fan-out (`onDelta` per stream chunk × N parallel sessions)
saturates the single JS thread.

Fix: single global compression chain in `compressWithSdk` — each
incoming call awaits the previous one before entering the inner
provider path. Releases in `finally` so a thrown / timed-out run
cannot stall the queue behind it. Callers (`materialization-coordinator`
and friends) stay fire-and-forget and naturally observe backpressure.
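
A sketch of the serialization chain (the wrapper shape is assumed; the
real function threads provider arguments through):

    // Single global lane: each caller awaits the previous one.
    let compressionChain: Promise<void> = Promise.resolve();

    async function compressWithSdk<T>(run: () => Promise<T>): Promise<T> {
      const previous = compressionChain;
      let release!: () => void;
      compressionChain = new Promise<void>((resolve) => (release = resolve));
      await previous; // backpressure: wait for the earlier caller to finish
      try {
        return await run(); // inner provider path goes here
      } finally {
        release(); // a throw or timeout must not stall the queue behind it
      }
    }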

Also shrank `COMPRESSION_TIMEOUT_MS` 60 s → 20 s. With serial execution
the queue IS the budget, and a genuinely-broken call used to block
every subsequent compression for a full minute. 20 s still accommodates
a warm-context structured summary; slow/broken calls release the lane
3× faster and the per-backend circuit breaker trips sooner so we fall
back to the local summarizer.

## Tests

- `never runs two SDK query() calls concurrently, even with 3 callers
  firing at the same tick`: fires 3 concurrent `compressWithSdk`, mocks
  the Claude Agent SDK `query()` to hold each call 30 ms, asserts
  `peakInFlight === 1` and that every `start:` is followed by its
  matching `end:` before the next `start:`.
- `releases the lane even when the current call throws, so the queue
  does not stall`: mock throws on the first caller's stream; asserts
  the second caller still runs and never overlaps.
- Full daemon suite: 2277 pass / 0 fail. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: on system boot before tmux socket was up, `tmux list-sessions`
failed with "error connecting to /tmp/tmux-1000/default", propagated to
`program.parseAsync().catch` → `process.exit(1)` → systemd restart →
same failure. Log shows 479 fatal errors over multiple boots, all the
same recoverable tmux-not-ready case.

- `src/agent/tmux.ts`: `ensureTmuxServer()` gains 5-try exponential
  backoff (0/0.5/1/2/4s) — sketched after this list. Socket races during
  early boot now self-heal.
- `src/agent/session-manager.ts`: `initOnStartup()` isolates each tmux
  cleanup step in its own try/catch so one transient failure doesn't
  abort the whole startup sequence.
- `src/index.ts`: `start --foreground` no longer re-throws on startup
  failure (except for duplicate-instance). Logs the error, forwards it
  via `forwardDaemonError`, and keeps the event loop alive — aligned
  with the existing "daemon must NEVER die from uncaught errors"
  policy already enforced by the global handlers.
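
A sketch of the backoff loop, assuming a probe that throws while the tmux
socket is still down:

    async function ensureTmuxServer(probe: () => Promise<void>): Promise<void> {
      const delaysMs = [0, 500, 1000, 2000, 4000]; // 5 tries, exponential
      let lastError: unknown;
      for (const delay of delaysMs) {
        if (delay > 0) await new Promise((resolve) => setTimeout(resolve, delay));
        try {
          await probe(); // e.g. `tmux list-sessions`; throws during early boot
          return; // socket is up
        } catch (error) {
          lastError = error;
        }
      }
      throw lastError; // genuinely unavailable after ~7.5s of retries
    }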

Verified locally: cold-boot simulation with a read-only TMUX_TMPDIR
now yields exit 124 (killed by timeout) instead of exit 1 (self-exit);
daemon sits in the idle wait and degrades gracefully. All daemon/server/
web typechecks clean; startup-cleanup + session-restoration unit tests
pass (18/18). Post-restart daemon (PID 360424) runs clean with 0
fatal/error entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: `supervision-broker` and `summary-compressor` create transport
provider sessions directly (their own per-call onComplete/onError listeners
filtered by sid). They never call `registerProviderRoute` because no
IM.codes user-facing session exists. But their deltas still flow through
the globally-wired `transport-relay.onDelta`, hit `resolveSessionName` →
undefined, and logged `level=40` warn per delta. Observed: ~38 warns/min
on a busy daemon (339 warns / 9 min on PID 360424).

- `session-manager.ts`: add `ephemeralProviderSids` Set with
  `markEphemeralProviderSid` / `unmarkEphemeralProviderSid` /
  `isEphemeralProviderSid` helpers.
- `supervision-broker.ts`: mark sid after `createSession`, unmark in
  finally alongside `endSession`.
- `summary-compressor.ts`: mark sid after `createSession`, unmark in
  `endActiveCompressionSession`.
- `transport-relay.ts`: onDelta returns silently when sid is ephemeral
  (still warns for truly unbound sids — real bugs remain visible).

Verified: restarted daemon (PID 481083), 0 "unresolved route" warns in
3 min of active supervision/compression traffic (was 38/min). 0 fatal/
error. All typechecks clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User complaint: tapping a push notification to open the app still showed
stale messages — sometimes for a long time until a new WS event arrived.

Root cause: the existing mount-time HTTP backfill has a 60s cooldown and
fires ONLY on session mount. Three failure modes bypass it:
  1. Target session was already mounted → `setActiveSession` no-ops, no
     mount effect re-run, no backfill.
  2. `visibilitychange` visible-transition only WIPED cooldowns (if hide
     >=60s); it never actively TRIGGERED a backfill.
  3. Mobile `App.appStateChange` resume on Capacitor sometimes doesn't
     fire `visibilitychange` reliably in WebView.

Changes:
- `useTimeline.ts`: new `ACTIVE_TIMELINE_REFRESH_EVENT` + exported name;
  the `onVisibility` handler now dispatches this event on every
  hidden→visible transition (cooldown wipe stays gated on >=60s to
  protect other cached sessions). A per-hook listener force-fires
  `fireHttpBackfill(0, {cooldownMs: 0})` for the active session.
- `push-notifications.ts`: `pushNotificationActionPerformed` dispatches
  the same event after `deck:navigate` — covers the "already on that
  session" case where navigation is a no-op.
- `app.tsx`: native `App.appStateChange` resume also dispatches the
  event, in addition to forcing WS reconnect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cy metrics

The history HTTP backfill path (browser → CF Worker → server → daemon
bridge → timelineStore.read) previously had ZERO timing data anywhere.
Adding lightweight metrics so p50/p95/p99 latency is observable from
daemon + server logs without additional tooling.

Daemon side (`src/daemon/command-handler.ts handleTimelineHistory`):
- `readMs`: timelineStore.read() disk-scan + JSONL parse
- `synthesizeMs`: OpenCode-only synthesis fallback (0 for transport/cc
  sessions — the common case)
- `totalMs`: full handler wall clock
- `eventsReturned` / `eventsRead` / `limit` / `afterTs`
- Emits `timeline.history served` info line per pull

Server side (`server/src/routes/watch.ts`):
- `bridgeMs`: WsBridge.requestTimelineHistory round-trip (server ↔
  daemon WS). Subtracting the daemon's `totalMs` from `bridgeMs` isolates
  network/WS overhead.
- `totalMs`: full route wall clock incl. response serialization
- Emits `timeline.history/full served` on success, `... failed` on
  daemon-offline / timeout / bridge errors
- Added `logger` import (already used by siblings in routes/)

Overhead: ~2 Date.now() calls + one pino/console JSON.stringify on a
7-field object per pull. Daemon pino is `sync:false` (non-blocking
libuv write). Net cost <100 μs on a path that takes 20–150 ms — noise-
free instrumentation.

Benchmark reference (20-iter local runs before this commit):
- deck_cd_brain, limit=300:  p50=19ms p95=27ms  (typical web backfill)
- deck_cd_brain, limit=2000: p50=114ms p95=144ms
- deck_sub_0v0k3e6c, limit=300 (28MB file): p50=66ms p95=87ms

Post-deploy: `jq 'select(.msg == "timeline.history served") |
{sessionName, readMs, totalMs, eventsReturned}' ~/.imcodes/logs/daemon.log`
or server-side equivalent surfaces real-world distribution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sion

User-visible bug: setting "Global custom instructions" (e.g. "Always commit
and push if asked!") in the Session Settings dialog had no effect on the
supervisor's behavior. Typical symptom: the supervisor never enforces the
instruction on any session other than the one that was open when saving.

Root cause:
- The web client mirrors `supervisorDefaultsCustomInstructions` into
  `transportConfig.supervision.globalCustomInstructions` only for the
  CURRENTLY-edited session on save (SessionSettingsDialog.tsx:582-584).
- Every other session's snapshot retains its old (often empty) mirror.
- `resolveEffectiveCustomInstructions(snapshot)` merges `snapshot
  .globalCustomInstructions` with `snapshot.customInstructions`; when the
  mirror is empty it merges nothing, so the supervisor prompt has no
  custom instructions section regardless of what the user saved to
  `supervision.user_default` pref.
- The daemon comment in `shared/supervision-config.ts:158-164` explicitly
  states "daemon does not itself read user-default prefs; web client keeps
  this in sync" — but the web client only keeps the current session in sync.

Fix (runtime fallback layer):
- Server: new `GET /api/server/:id/supervision/user-defaults/daemon` endpoint
  (Bearer server-token auth, same pattern as runtime-config/daemon) returns
  the user's `supervision.user_default` pref JSON.
- Daemon: new `src/daemon/supervisor-defaults-cache.ts` caches the
  user-default `customInstructions` string in memory. Primed on daemon
  startup in `lifecycle.ts`; refreshed on every ServerLink (re)connect in
  `server-link.ts` so user edits land within one WS round-trip, not next
  restart.
- Daemon: `enrichSnapshotWithGlobalDefaults()` helper in
  `supervision-automation.ts` layers the cache onto the session snapshot
  when the snapshot's own mirror is empty. Called at both dispatch sites:
  `supervisionBroker.decide({ snapshot: ... })` and
  `buildSupervisionContinuePrompt(..., resolveEffectiveCustomInstructions(...))`.
  A sketch of the helper follows this list.
- When the session snapshot already carries a non-empty
  `globalCustomInstructions` (most common case, from the in-sync session
  path) the helper is a no-op and returns the original reference.
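
A sketch of the layering helper (field names follow the text above; the
snapshot type is simplified):

    interface SupervisionSnapshot {
      customInstructions?: string;
      globalCustomInstructions?: string;
    }

    function enrichSnapshotWithGlobalDefaults(
      snapshot: SupervisionSnapshot,
      cachedGlobal: string, // from supervisor-defaults-cache
    ): SupervisionSnapshot {
      // No-op when the session's own mirror is already populated.
      if (snapshot.globalCustomInstructions?.trim()) return snapshot;
      if (!cachedGlobal.trim()) return snapshot;
      return { ...snapshot, globalCustomInstructions: cachedGlobal };
    }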

Behavior after fix:
1. User types "Always commit and push if asked!" in Global defaults,
   clicks save.
2. Web PUT /api/preferences/supervision.user_default persists the pref
   (existing flow, untouched).
3. Next time the daemon's WS reconnects (or on next startup), the cache
   refreshes within ~100ms.
4. Every supervisor dispatch thereafter — for EVERY session, including ones
   never edited — sees the custom instructions in the prompt.

No shared/*.ts changes: the merge logic in `resolveEffectiveCustomInstructions`
already tolerates a hydrated `globalCustomInstructions` field. Only the
daemon-side plumbing changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clicking a preset chip (e.g. "minimax") now also sets the model picker to
the preset's ANTHROPIC_MODEL. The UI previously let preset and model drift
apart — the daemon's getQwenPresetTransportConfig overrides the model at
launch anyway, so showing a stale Qwen default in the dropdown was purely
misleading. Applies to both the session-scoped picker and the global
supervision defaults picker.

Clearing the preset leaves the current model selection untouched so the
user's last pick isn't silently lost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ader

The file preview modal already exposes Edit and Download on its header
toolbar. On mobile and in the chat composer's file panel the user often
needs the absolute path — either to paste elsewhere (copy) or to drop it
straight into the chat input (insert). Until now both required closing
the preview and re-opening the file picker to use its Confirm button.

- `FileBrowser.tsx`:
  - New optional `onInsertPath?: (path: string) => void` prop. When the
    host wires it (ChatView does; standalone preview hosts may leave it
    out), an "Insert path" button appears on the preview header and
    dismisses the preview on click.
  - "Copy path" button always available when a path is known; uses
    `navigator.clipboard.writeText` with a 1.5s "Copied!" label flip,
    keyed by path so rapid preview-switches can't leave a stale badge.
  - Both buttons are placed between Download and the Close (✕) button.

- `ChatView.tsx`: forwards its existing `onInsertPath` callback to both
  FileBrowser instances (inline panel + floating preview). The floating
  preview variant also closes the panel on insert for the expected
  "click and go" UX.

- i18n: `fileBrowser.copyPath`, `fileBrowser.insertPath`,
  `fileBrowser.copied` added to all 7 locales
  (en / zh-CN / zh-TW / es / ru / ja / ko).

No regression surface: preview toolbar was already overflowing on very
narrow viewports; the two new buttons use the same `fb-diff-toggle`
class as Edit/Download so flex-wrap handling is identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The server broadcasts DAEMON_MSG.DISCONNECTED the instant the daemon WS
closes, then waits RECONNECT_GRACE_MS (3s) before actually declaring the
daemon offline — inflight commands are replayed silently if the daemon
returns in time, so the turn never fails. The browser, however, flipped
the "Daemon Offline" badge immediately on DISCONNECTED, so every pod
restart / brief network blip flashed red even though the user's send
landed fine.

Match the server's grace window on the client: schedule the
setDaemonOnline(false) flip under the same 3-second timer, and cancel it
when DAEMON_MSG.RECONNECTED or session_list (proof the daemon is alive)
arrives first. Browser-side WS drop cancels it too so a stale timer can't
fire through a later reconnect cycle. Projection staleness still flips
immediately — that's just a data-freshness hint, not the user-facing
status badge.
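
A sketch of the client-side grace timer (message names from the text; the
setter is assumed to be React state):

    const RECONNECT_GRACE_MS = 3000; // matches the server's grace window
    let offlineTimer: ReturnType<typeof setTimeout> | null = null;

    function onDaemonDisconnected(setDaemonOnline: (online: boolean) => void): void {
      if (offlineTimer) clearTimeout(offlineTimer);
      // Only flip the badge if the daemon stays gone past the grace window.
      offlineTimer = setTimeout(() => setDaemonOnline(false), RECONNECT_GRACE_MS);
    }

    // Called on DAEMON_MSG.RECONNECTED, on session_list, and on a
    // browser-side WS drop, so a stale timer can't fire later.
    function cancelOfflineFlip(): void {
      if (offlineTimer) {
        clearTimeout(offlineTimer);
        offlineTimer = null;
      }
    }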

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opening a chat via push notification frequently showed "No events yet"
on a session that had plenty of history. Two compounding races:

1. ACTIVE_TIMELINE_REFRESH_EVENT listener in useTimeline re-registered on
   every [sessionId, serverId] change. The notification handler dispatches
   the event synchronously inside the same tick as setActiveSession,
   i.e. while React is still rendering the new session's SessionPane —
   so the listener is in its teardown→re-attach window and the event
   drops on the floor. Fix: attach once with no deps; the handler reads
   the latest sessionId/serverId via fireHttpBackfillRef, and
   fireHttpBackfill itself no-ops when either is unset.

2. Cold mount (no memory cache, no IDB) fired the HTTP backfill under
   MOUNT_BACKFILL_COOLDOWN_MS. A prior cold mount of the same session in
   the same page session stamped the cooldown, so a quick re-mount (e.g.
   toggling between sessions before notification tap) gated the only
   fetch we had, and setEvents([]) stuck around until the next WS event.
   With zero cached data the cooldown is actively harmful — pass 0.

Belt + suspenders in push-notifications.ts: also dispatch the refresh
event after two rAF ticks so a SessionPane that mounts as a direct
result of the deck:navigate → setActiveSession update (cold tab case,
session never previously visited) still catches it. fireHttpBackfill's
200ms debounce coalesces the two dispatches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the app returns from background (push-notification tap, app
switcher, home button), the OS dismisses the keyboard and blurs the
focused input at the native layer — but the WebView doesn't reliably
fire matching focusout / visualViewport resize events. The
`.input-focused` / `.kb-open` classes on <html> persist from the
pre-background state, and styles.css lines 983/989 hide .subcard-bar
while either class is set. Result: user taps a notification, returns
to a chat, and the whole bottom sub-session button row is gone until
they tap the input again.

Add a visibilitychange listener that re-evaluates the real focus state
on resume. If document.activeElement is no longer a text input (which
is the OS-induced blur case), reset the closure flags and recompute —
update() will pull fresh viewport metrics and drop the stale classes.
If focus genuinely survived, leave it alone so active typing isn't
interrupted.
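
A sketch of the resume-time re-check (class names from styles.css above;
the viewport recompute is elided):

    function isTextInput(el: Element | null): boolean {
      return !!el && (el.tagName === 'INPUT' || el.tagName === 'TEXTAREA' ||
        (el as HTMLElement).isContentEditable);
    }

    document.addEventListener('visibilitychange', () => {
      if (document.visibilityState !== 'visible') return;
      // Focus genuinely survived backgrounding: leave active typing alone.
      if (isTextInput(document.activeElement)) return;
      // OS-induced blur: drop the stale classes; update() then recomputes
      // fresh viewport metrics.
      document.documentElement.classList.remove('input-focused', 'kb-open');
    });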

Pair with be0a6b5 (timeline-empty fix) to close out the full
notification-tap regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The supervision prompt block was hardcoded to "Session-specific
supervision instructions from the user:" regardless of where the text
actually came from. A user who set "Always commit and push if asked!" as
a GLOBAL supervisor-defaults rule saw it rendered with the
session-specific heading in every continue/decision prompt, which
misrepresented both the scope and the enforcement role.

Two problems rolled into one:
  1. Scope: global rules were labeled as session-specific.
  2. Semantics: "instructions" framed the block as a free-form chat hint,
     losing the fact that this is a RULE the supervisor enforces —
     readable by both the supervisor judge (to decide continue/complete)
     and the target session (to understand what it must comply with).

Fix:
  - shared/supervision-config.ts: add classifySupervisionCustomInstructions
    + resolveSupervisionCustomInstructionsDetail that return
    { text, source: 'global' | 'session' | 'merged' | 'none' }.
  - supervision-prompts.ts: buildCustomInstructionsSection now accepts the
    detail object and picks one of three headings framed as
    supervision-enforced rules. Global/session/merged each get their own
    explicit scope + "supervision enforces these" clause.
  - supervision-automation.ts: continue-prompt call passes the detail so
    the user-visible continue nudge uses the same correct heading.
  - buildSupervisionContinuePrompt keeps a bare-string overload for
    backward compatibility; bare strings default to the session-specific
    heading, matching historic behavior.
  - Tests updated: 7 custom-instructions tests cover all three source
    paths (global / session / merged) plus the continue-prompt overloads;
    broker test updated for the merged heading.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The supervisor decision contract was returning `{decision, reason,
confidence}` with no requirement that `continue` actually articulate WHAT
the target agent should do next. Combined with an over-aggressive regex
guardrail that matched bare Chinese/English state words ("未提交",
"uncommitted"), the system had two failure modes that both manifested
as "supervision keeps tugging back and forth, 5-6 rounds" on a factual
Q&A turn: (1) supervisor returns a filler continue with no gap/action,
target has nothing new to do, loops; (2) the regex downgrades a correct
`complete` to `continue` because the factual answer mentions "未提交".

Schema upgrade (pre-release, so no back-compat needed):
  - SupervisionDecision gains `gap`, `nextAction`, `extra` fields.
  - `extra` reserved for future metadata without another schema bump.
  - Parser stays permissive on old-shape inputs — downgrade happens in
    the guardrail, not at parse time.

New guardrails (src/daemon/supervision-broker.ts):
  - `continue` without a concrete `nextAction` is force-downgraded to
    `ask_human`. Vague fillers ("keep going", "继续完成任务", anything
    under 12 chars) are rejected via isActionableNextAction() — sketched
    after this list.
  - Regex CONTINUE_SIGNAL_PATTERNS tightened: bare state markers
    removed (uncommitted / not pushed / 未提交 / 没有提交 / 还没提交 …)
    so factual git-status answers no longer flip complete→continue.
    Kept: intent phrases like "如果你要,我可以顺手", "再提一个 commit",
    two-part English patterns.
  - When the regex does override complete→continue, it now fills in a
    fallback `gap` + `nextAction` so the continue prompt is still
    actionable for the target.
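
A sketch of the filler rejection behind the downgrade (the phrase list is
illustrative, not the shipped set):

    const FILLER_NEXT_ACTIONS = new Set(['keep going', '继续完成任务']);

    function isActionableNextAction(nextAction: string | undefined): boolean {
      const text = (nextAction ?? '').trim();
      if (text.length < 12) return false; // too short to name a concrete step
      return !FILLER_NEXT_ACTIONS.has(text.toLowerCase());
    }

    // Broker guardrail: a `continue` that fails this check is
    // force-downgraded to `ask_human` before dispatch.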

Continue prompt (src/daemon/supervision-prompts.ts):
  - buildSupervisionContinuePrompt now leads with `Next action
    required: <nextAction>` + `What's missing: <gap>` when supplied, and
    only then the supervisor reason. This is the direct fix for the
    "agent gets vague continue → rewrites same reply → loop".
  - New SupervisionContinueInstructions shape accepted alongside the
    legacy bare-string signature for test compat.

Decision / repair prompts:
  - Contract example updated to the 5-field shape.
  - Explicit rule: "Prefer ask_human over a vague continue" + list of
    rejected filler phrases.
  - Explicit rule: factual answers to user questions are complete,
    don't treat state reports as proposed work.

Loop cap (src/daemon/supervision-automation.ts):
  - MAX_AUTO_CONTINUE_STEPS 8 → 2. If two concrete nextActions didn't
    close the gap, the supervisor isn't going to resolve it
    autonomously — escalate to the human on the third cycle.
  - dispatchContinue now takes the full {reason, nextAction, gap}
    triple and forwards it to the continue-prompt builder.

Tests:
  - +3 broker tests: continue downgraded when nextAction missing;
    continue downgraded when nextAction is a filler; continue accepted
    when nextAction is concrete (with gap + extra passthrough).
  - +2 continue-prompt tests: nextAction rendered as lead line; no-arg
    forms still work.
  - Full daemon suite: 3580 passing, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…load

User-visible bug: in the sidebar's server → session tree, the colored
"online" dot for some sub-sessions stays gray indefinitely, even though
the session is actually running and reachable. Most reproducible after
a daemon WS reconnect or when the user first opens a previously-unvisited
server.

Root cause — TWO places broadcast sub-session metadata but neither
included the session's current `state`, so the web's state-dot renderer
fell through to the gray fallback:

1. `useSubSessions.ts:75` initializes newly-loaded sub-sessions with
   `state: 'unknown'` and relies entirely on a subsequent daemon
   broadcast to populate the real state. The web's handler
   (`useSubSessions.ts:141,228`) DOES read `m.state` from
   `subsession.sync` / `subsession.created` messages — but:

2. Both daemon emitters of `subsession.sync` omit the state field:
   - `buildSubSessionSync()` in command-handler.ts — used by
     rebuild_all + set_model + restart + describe paths.
   - `lifecycle.ts` 3s post-connect re-sync broadcast, which ALSO
     filtered to only `state === 'running'` sessions, meaning idle
     sub-sessions never received a sync at all → state stayed
     `'unknown'` → gray dot for quiet-but-alive sessions, sometimes
     forever (until the next live state transition, which might never
     come for a genuinely idle session).

Fix:
- `buildSubSessionSync()`: include `state: r?.state ?? null`.
- `lifecycle.ts` post-connect re-sync: include `state` and change the
  filter from `!== 'running'` (exclude everything except running) to
  `=== 'stopped'` (only exclude terminal states). Idle sub-sessions
  now also receive the sync, closing the gray-dot window.

Web side needs no change — the sync handlers already merge `m.state`
when present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Protect against the empty-write regression: if the initial
GET /api/quick-data fails (network flap, CF Worker blip, daemon lag),
`useQuickData` was rendering EMPTY_QUICK_DATA to the UI, then any user
action (e.g. `recordHistory` on session switch) would fire a debounced
PUT with that empty object — silently overwriting the authoritative
server snapshot with `commands: [], phrases: []`.

- `scheduleSave` now takes a `canPersist` gate. Callers flip it true only
  after the GET resolves successfully and populates local state.
- `useQuickData` tracks `hasHydratedFromServer`; all four save sites
  (recordHistory, updateCommands, updatePhrases, updateSessionHistory)
  pass the flag through.
- Two regression tests in `QuickInputPanel.test.tsx`:
  1. GET fails → `recordHistory` must NOT fire PUT.
  2. GET succeeds → `recordHistory` fires PUT with the updated payload.

Verified:
- `npx vitest run web/test/components/QuickInputPanel.test.tsx` green
- `cd web && npx tsc --noEmit` clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
im4codes merged commit 6b1e7e0 into master Apr 21, 2026
38 checks passed