fix(gateway): reconcile sessions stuck running after restart#55823

Closed
VACInc wants to merge 8 commits into openclaw:main from VACInc:fix/stale-running-session-recovery

Conversation

Contributor

@VACInc VACInc commented Mar 27, 2026

Summary

  • Problem: if a run dies or the gateway restarts before a terminal lifecycle event is persisted, the session store can keep status: "running" forever. /stop also persisted abortedLastRun without forcing a terminal state, so aborted sessions could stay wedged too.
  • Why it matters: topics can look permanently busy even when nothing is running, and later gateway/session tooling reads a false active state.
  • What changed: added a shared session-entry helper for terminal killed state, made startup reconciliation convert persisted stale running entries to killed, and made /stop persistence write the same terminal state while preserving abort cutoff metadata.
  • What did NOT change (scope boundary): no auth/runtime-home behavior, no websocket transport behavior, no reset-hook behavior, and no memory transcript indexing behavior. This PR does not overlap the currently open PRs #55796, #53872, #53706, or #20183.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes: N/A
  • Related: N/A
  • This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

  • Root cause: session lifecycle state cleared running only when a later terminal event (end/error) arrived. If the process died or restarted first, sessions.json kept the stale running row. The /stop persistence path had the same gap because it set abort metadata without forcing a terminal session status.
  • Missing detection / guardrail: there was no startup reconciliation pass for persisted running rows and no regression asserting that /stop writes a terminal status.
  • Prior context (git blame, prior PR, issue, or refactor if known): reproduced from a real stuck topic whose transcript stopped at a toolResult, followed by gateway restart logs; the persisted session row still showed status: "running" with no later transcript progress.
  • Why this regressed now: restart/interruption exposed an older assumption that every started run would always emit a later terminal lifecycle event.
  • If unknown, what was ruled out: ruled out the currently open auth/home-dir PR and other open session-reset/prompt PRs because none touch these files or this lifecycle path.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-startup.test.ts, src/auto-reply/reply/abort.test.ts
  • Scenario the test should lock in: stale persisted running sessions are reconciled to killed on startup, and /stop persistence writes terminal killed state together with abort metadata.
  • Why this is the smallest reliable guardrail: the bug lives at the session-store lifecycle seam, not inside a single pure helper or only in a full end-to-end path.
  • Existing test that already covers this (if any): none
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

  • Sessions abandoned by a restart no longer stay stuck in running forever.
  • /stop now leaves the target session in a terminal killed state instead of only tagging abortedLastRun.

Diagram (if applicable)

Before:
[run starts] -> [session store = running] -> [process dies/restarts before end/error] -> [row stays running forever]

After:
[run starts] -> [session store = running] -> [process dies/restarts before end/error] -> [startup reconciliation marks row killed]

/stop path:
[operator sends /stop] -> [abort metadata only] -> [row can still look running]
becomes
[operator sends /stop] -> [abort metadata + killed terminal state] -> [row is no longer wedged]
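
The "After" flow can be sketched as a pure pass over an in-memory store. This is a minimal sketch under stated assumptions, not the code in this PR: the entry shape, the startedAt field, and the name reconcileStaleRunning are hypothetical, and the real pass writes through updateSessionStore rather than mutating a plain object.

```typescript
// Hypothetical, simplified entry shape; the real session entry has more fields.
interface SessionEntrySketch {
  status?: string;
  startedAt?: number; // assumed field, used only to derive runtimeMs
  endedAt?: number;
  updatedAt?: number;
  runtimeMs?: number;
  abortedLastRun?: boolean;
}

type SessionStoreSketch = Record<string, SessionEntrySketch>;

// Converts every entry still marked "running" to the terminal "killed" state
// and returns how many entries were rewritten.
function reconcileStaleRunning(store: SessionStoreSketch, now: number): number {
  let reconciled = 0;
  for (const entry of Object.values(store)) {
    if (entry.status !== "running") continue; // terminal rows stay untouched
    entry.status = "killed";
    entry.endedAt = now;
    entry.updatedAt = now;
    if (typeof entry.startedAt === "number") {
      entry.runtimeMs = Math.max(0, now - entry.startedAt);
    }
    // abortedLastRun is deliberately left as-is: plain restart recovery does
    // not fabricate it, and an already-aborted row keeps its existing flag.
    reconciled += 1;
  }
  return reconciled;
}
```

The pass is idempotent: running it a second time finds no running rows and changes nothing.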

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: local temp clone, no container
  • Model/provider: N/A
  • Integration/channel (if any): Telegram topic session observed in a live gateway state dir; code fix validated in isolated local clone
  • Relevant config (redacted): standard ~/.openclaw/agents/<agent>/sessions/sessions.json session store

Steps

  1. Persist a session row with status: "running".
  2. Simulate a process interruption or restart before the run writes its terminal lifecycle state, or trigger /stop against the session.
  3. Restart gateway startup recovery and inspect the session store.

Expected

  • Stale interrupted sessions reconcile to killed and no longer look actively running.

Actual

  • Before this change, stale rows stayed running indefinitely unless some later lifecycle event happened to clear them.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: reproduced the stale-running session shape from a real topic; confirmed the persisted row could remain running after a gateway restart; ran pnpm test -- src/gateway/server-startup.test.ts -t "reconciles persisted running sessions left behind by an earlier process"; ran pnpm test -- src/auto-reply/reply/abort.test.ts -t "persists abort cutoff metadata on /stop when command and target session match|does not persist cutoff metadata when native /stop targets a different session"; ran pnpm build.
  • Edge cases checked: already-finished sessions stay unchanged; already-aborted stale sessions keep abortedLastRun: true; startup reconciliation does not fabricate abortedLastRun for plain restart recovery.
  • What you did not verify: I did not restart a live gateway from this branch. Also, pnpm test -- src/auto-reply/reply/abort.test.ts still has unrelated pre-existing failures on latest main in subagent cascade assertions outside this PR scope.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: if multiple gateway processes intentionally share one session store, startup reconciliation could mark another writer's in-flight session as killed.
    • Mitigation: the recovery runs in the same startup path that already cleans stale session locks and assumes single-writer ownership of an agent session store.

@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime size: M labels Mar 27, 2026
Contributor

greptile-apps bot commented Mar 27, 2026

Greptile Summary

This PR fixes two related session-lifecycle gaps in the gateway: (1) persisted running sessions that were never cleared when a process died or restarted before writing a terminal lifecycle event, and (2) the /stop path that set abort metadata without forcing a terminal status on the session entry. Both are addressed with a new shared helper (applyKilledSessionEntryState) and a startup reconciliation pass that rewrites any stale running rows to killed.

Key changes:

  • src/config/sessions/run-state.ts – new applyKilledSessionEntryState helper atomically stamps status: "killed", endedAt, updatedAt, and runtimeMs on a session entry; markAbortedLastRun controls whether abortedLastRun: true is added.
  • src/auto-reply/reply/commands-session-store.ts – persistAbortTargetEntry now uses the shared helper instead of manually setting fields; legacy-key cleanup (previously missing in this path) is also applied to both the in-memory and disk update paths.
  • src/gateway/server-startup.ts – reconcilePersistedRunningSessionsOnStartup iterates all agent session store files, identifies stale running entries, and rewrites them atomically via updateSessionStore.
  • Tests – new integration test for startup reconciliation and extended /stop assertions to cover the terminal-state fields.
  • One style note: markAbortedLastRun !== false as the implicit default in applyKilledSessionEntryState means callers who omit the parameter silently receive abortedLastRun: true. Inverting the default to require an explicit true would be safer for future callers that want to kill without abort semantics.

Confidence Score: 5/5

Safe to merge — no regressions introduced; the fix is confined to the session-lifecycle seam and guarded by new integration tests.

All remaining findings are P2 (style/naming). The core reconciliation and abort-persistence logic is correct, single-writer ownership is preserved, timestamps are consistent, and edge cases (already-terminal sessions, already-aborted sessions, empty stores) are tested and handled correctly.

No files require special attention. The implicit markAbortedLastRun !== false default in run-state.ts is worth revisiting before this function grows more callers.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/config/sessions/run-state.ts | New shared helper for terminal killed state; logic is correct, and the only concern is the implicit markAbortedLastRun default. |
| src/auto-reply/reply/commands-session-store.ts | Correctly migrated to applyKilledSessionEntryState; legacy-key cleanup now present in both in-memory and disk paths. |
| src/auto-reply/reply/abort.ts | Simplified by delegating to persistAbortTargetEntry; new setAbortMemory fallback is a minor correctness improvement. |
| src/gateway/server-startup.ts | Startup reconciliation correctly uses updateSessionStore for atomic rewrite of stale running rows; TOCTOU between pre-check and lock is benign. |
| src/gateway/server-startup.test.ts | Covers plain stale running, already-aborted stale running, and already-terminal sessions with thorough assertions. |
| src/auto-reply/reply/abort.test.ts | Extended to verify terminal fields (status, endedAt, updatedAt) written by persistAbortTargetEntry. |
| src/config/sessions.ts | Single-line barrel re-export of the new run-state module. No issues. |

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/config/sessions/run-state.ts
Line: 25-27

Comment:
**Implicit default sets `abortedLastRun: true`**

The condition `params.markAbortedLastRun !== false` means calling `applyKilledSessionEntryState(entry)` with no second argument (or with `{}`) will silently set `abortedLastRun = true`. A future caller who just wants to mark a session as `killed` without any abort semantics (e.g. a timeout handler) would have to pass `markAbortedLastRun: false` explicitly to avoid the unintended flag, which is counter-intuitive — the safe default is typically not to set the flag unless asked.

Consider inverting the default to require an explicit `true`, e.g. `if (params.markAbortedLastRun === true)`, or renaming it to `addAbortedLastRun` with a strict `if (params.addAbortedLastRun)` guard. Either way the existing callers (`persistAbortTargetEntry` with `markAbortedLastRun: true` and startup reconciliation with `markAbortedLastRun: false`) would continue to work correctly with an explicit argument.

How can I resolve this? If you propose a fix, please make it concise.
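
One possible reading of the inverted default, as a hedged sketch rather than the actual helper in run-state.ts: the function name applyKilledStateSketch, the now parameter, and the startedAt field are assumptions made for illustration.

```typescript
// Hypothetical entry shape, reduced to the fields named in the review.
interface KillableEntry {
  status?: string;
  startedAt?: number; // assumed field for deriving runtimeMs
  endedAt?: number;
  updatedAt?: number;
  runtimeMs?: number;
  abortedLastRun?: boolean;
}

interface KillParams {
  now?: number; // injectable clock for tests (assumption)
  markAbortedLastRun?: boolean;
}

// Stamps the terminal "killed" state. The abort flag is set only on an
// explicit `true`, so omitting the parameter never adds abort semantics.
function applyKilledStateSketch(entry: KillableEntry, params: KillParams = {}): KillableEntry {
  const now = params.now ?? Date.now();
  entry.status = "killed";
  entry.endedAt = now;
  entry.updatedAt = now;
  if (typeof entry.startedAt === "number") {
    entry.runtimeMs = Math.max(0, now - entry.startedAt);
  }
  if (params.markAbortedLastRun === true) {
    entry.abortedLastRun = true;
  }
  return entry;
}
```

With this guard, a /stop caller passes markAbortedLastRun: true, while startup reconciliation or a future timeout handler can omit the parameter entirely.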

Reviews (1): Last reviewed commit: "fix: recover stale running sessions afte..."

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9734cda8ca

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

const stateDir = resolveStateDir(process.env);
const sessionDirs = await resolveAgentSessionDirs(stateDir);

P2: Reconcile stores from config, not only stateDir defaults

startGatewaySidecars builds recovery targets from resolveStateDir(process.env) and resolveAgentSessionDirs, but session writes elsewhere use resolveStorePath(cfg.session?.store, ...) and allow custom absolute/template paths. When session.store is configured outside <stateDir>/agents/*/sessions/sessions.json, stale running entries are never scanned, so this restart reconciliation fix is silently skipped for a supported configuration.
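
A minimal sketch of scanning both locations, assuming the configured session.store value has already been expanded to a concrete path (template expansion is left out). collectStorePaths and its parameters are hypothetical names, not functions from this repository.

```typescript
import * as path from "node:path";

// Builds the list of store files to reconcile: one sessions.json per default
// agent session directory, plus an optional configured store path.
// Paths are resolved and de-duplicated so a configured path that points at a
// default location is only scanned once.
function collectStorePaths(sessionDirs: string[], configuredStore?: string): string[] {
  const candidates = sessionDirs.map((dir) => path.join(dir, "sessions.json"));
  if (configuredStore !== undefined && configuredStore.length > 0) {
    candidates.push(configuredStore);
  }
  return Array.from(new Set(candidates.map((candidate) => path.resolve(candidate))));
}
```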

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9907226a04

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

try {
  const stateDir = resolveStateDir(process.env);
  const sessionDirs = await resolveAgentSessionDirs(stateDir);
  for (const sessionsDir of sessionDirs) {
    await cleanStaleLockFiles({

P2: Run stale-session reconciliation even when lock cleanup fails

cleanStaleLockFiles and reconcilePersistedRunningSessionsOnStartup are wrapped in the same try block, so a single cleanup error (for example, one unreadable sessionsDir) short-circuits before reconciliation is attempted. In that case startup leaves stale status: "running" entries untouched across otherwise healthy stores, which defeats the recovery behavior introduced by this change under partial filesystem failures. Keep reconciliation in a separate best-effort block (or isolate per-directory cleanup failures) so lock cleanup issues do not disable session-state repair.
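
The isolation suggested here could look roughly like the following. It is a sketch only: StartupStep and runBestEffort are hypothetical names, and the steps are synchronous to keep the example short, whereas the real startup calls are awaited (the same structure applies with await inside the try block).

```typescript
// Hypothetical step descriptor; the real startup code calls concrete functions.
interface StartupStep {
  name: string;
  run: () => void;
}

// Runs every step in order, giving each its own try/catch so that a failure
// in one step (for example lock cleanup) cannot prevent later steps
// (for example session reconciliation) from running.
function runBestEffort(steps: StartupStep[], warn: (message: string) => void): string[] {
  const completed: string[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step.name);
    } catch (err) {
      warn(`${step.name} failed: ${String(err)}`);
    }
  }
  return completed;
}
```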

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fca753a321

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a85f396e1

let reconciledInStore = 0;
try {
  reconciledInStore = await updateSessionStore(storePath, (nextStore) => {

P1: Handle fresh sessions.json.lock files before startup reconciliation

reconcilePersistedRunningSessionsOnStartup performs its write through updateSessionStore, which acquires sessions.json.lock with a 10s timeout and ~30s stale threshold. After a quick restart from an unclean shutdown, a leftover store lock that is younger than that threshold cannot be reclaimed in time, so this update throws and the store is skipped, leaving stale status: "running" rows unreconciled. The earlier startup lock sweep does not prevent this because cleanStaleLockFiles only targets *.jsonl.lock files, not sessions.json.lock.

Contributor Author

I dug through the current lock path in this branch and this specific failure mode does not reproduce as described. Startup reconciliation writes via updateSessionStore -> withSessionStoreLock -> acquireSessionWriteLock, and on contention the lock code inspects the lock payload and reclaims stale/dead-owner locks immediately. For unclean-shutdown leftovers (dead PID), reclamation does not wait for the 30s threshold. The startup pre-sweep is currently only for jsonl transcript locks, but sessions.json lock reclamation is already handled in the contended acquire path used by reconciliation.
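
To illustrate the dead-owner rule described in this reply, here is a hedged sketch. The lock payload shape (pid, createdAt) and the function names are assumptions; the actual logic lives in acquireSessionWriteLock and may differ. The liveness probe relies on Node's process.kill with signal 0, which tests for a process without sending a signal.

```typescript
// Assumed lock payload shape; the real lock file may store other fields.
interface LockPayloadSketch {
  pid: number;
  createdAt: number; // epoch milliseconds
}

// Signal 0 performs an existence check only. An EPERM error means the
// process exists but belongs to another user, so it still counts as alive.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as { code?: string }).code === "EPERM";
  }
}

// A lock is reclaimable immediately when its owner is dead; otherwise only
// after it has outlived the stale threshold.
function canReclaimLock(
  payload: LockPayloadSketch,
  now: number,
  staleMs: number,
  isAlive: (pid: number) => boolean = isProcessAlive,
): boolean {
  if (!isAlive(payload.pid)) return true;
  return now - payload.createdAt > staleMs;
}
```

Under these assumptions, a one-second-old lock left by a crashed gateway is reclaimed right away, while the same lock held by a live process is honored until the threshold passes.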

@openclaw-barnacle openclaw-barnacle bot added the agents Agent runtime and tooling label Mar 29, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ec5abaf92

},
});
await fs.mkdir(blockedDir, { recursive: true });
await fs.chmod(blockedDir, 0o000);

P1: Gate permission-simulation test on non-Windows platforms

This test assumes chmod 0o000 will make blockedDir inaccessible, but in Windows runners that behavior is not enforced the same way, so blockedStorePath is usually treated as missing (ENOENT) and the warning expectation fails. I checked .github/workflows/ci.yml and checks-windows runs pnpm test, so this can break CI for the whole PR even though the production code is fine; the test should skip on win32 or simulate an inaccessible store in a platform-neutral way.
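
A small guard along these lines could decide whether the chmod-based test is meaningful. This is a sketch with a hypothetical helper name; the root check is an extra assumption beyond the review comment, based on uid 0 bypassing POSIX mode bits.

```typescript
// Returns true when chmod-based permission simulation is expected to work.
// Windows does not enforce POSIX mode bits the same way, and a root user
// (uid 0) bypasses them, so both cases should skip the test.
function canSimulateUnreadableDir(platform: string, uid: number | undefined): boolean {
  if (platform === "win32") return false;
  if (uid === 0) return false;
  return true;
}

// process.getuid is not available on Windows, hence the typeof guard.
const shouldRunPermissionTest = canSimulateUnreadableDir(
  process.platform,
  typeof process.getuid === "function" ? process.getuid() : undefined,
);
```

In a runner that supports conditional skips (Vitest, for example, has it.skipIf), the flag could gate the test; whether this repository's runner offers that is not confirmed here.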

for (const storePath of storePaths) {
  try {
    await fs.access(storePath);

P2: Check read access before loading startup session stores

The pre-check uses fs.access(storePath) with the default F_OK, which only verifies existence; if the file exists but is unreadable (EACCES), loadSessionStore(..., { skipCache: true }) falls back to {} instead of throwing, so recovery silently skips stale running rows and emits no warning. This leaves the exact stuck-state this startup reconciliation is meant to fix whenever ownership/permissions drift on sessions.json.
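
A stricter pre-check might distinguish a missing store from an unreadable one. The sketch below uses the synchronous fs.accessSync with fs.constants.R_OK for brevity; probeStore and its return values are hypothetical, and the startup code in this PR uses the async fs.access instead.

```typescript
import * as fs from "node:fs";

type StoreAccess = "ok" | "missing" | "inaccessible";

// Probes a store file for read access. ENOENT is reported as "missing" so
// callers can skip quietly; any other failure (for example EACCES) is
// reported as "inaccessible" so callers can emit a warning instead of
// silently treating the store as empty.
function probeStore(storePath: string): StoreAccess {
  try {
    fs.accessSync(storePath, fs.constants.R_OK);
    return "ok";
  } catch (err) {
    const code = (err as { code?: string }).code;
    return code === "ENOENT" ? "missing" : "inaccessible";
  }
}
```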

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0976e5b7b

}),
});
try {
{

P1: Restore outer lock-release guard in compaction path

After acquiring sessionLock, this change replaced the outer try/finally with a plain block, so any exception before the inner try at line 733 (for example in repairSessionFileIfNeeded, prewarmSessionFile, resourceLoader.reload, or createAgentSession) exits without calling sessionLock.release(). That leaves a stale write lock behind and can cause subsequent compaction/run attempts on the same session file to block or timeout until stale-lock recovery kicks in.
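
The guard being asked for is the usual acquire, try, finally shape. A minimal sketch follows, assuming an async release() method; withSessionLock and ReleasableLock are hypothetical names rather than this repository's API.

```typescript
// Assumed lock interface; the real sessionLock may expose a different shape.
interface ReleasableLock {
  release(): Promise<void>;
}

// Runs `work` while holding the lock. The finally block covers everything
// after acquisition, so an exception thrown by any setup step inside `work`
// still releases the lock.
async function withSessionLock<T>(
  acquire: () => Promise<ReleasableLock>,
  work: (lock: ReleasableLock) => Promise<T>,
): Promise<T> {
  const lock = await acquire();
  try {
    return await work(lock);
  } finally {
    await lock.release();
  }
}
```

Placing all post-acquisition steps inside work keeps them under the same finally, which is the property the plain block lost.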

@VACInc VACInc closed this Mar 29, 2026

Labels

agents Agent runtime and tooling gateway Gateway runtime size: L
