fix(gateway): reconcile sessions stuck running after restart#55823
fix(gateway): reconcile sessions stuck running after restart#55823VACInc wants to merge 8 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR fixes two related session-lifecycle gaps in the gateway: (1) persisted Confidence Score: 5/5Safe to merge — no regressions introduced; the fix is confined to the session-lifecycle seam and guarded by new integration tests. All remaining findings are P2 (style/naming). The core reconciliation and abort-persistence logic is correct, single-writer ownership is preserved, timestamps are consistent, and edge cases (already-terminal sessions, already-aborted sessions, empty stores) are tested and handled correctly. No files require special attention. The implicit markAbortedLastRun !== false default in run-state.ts is worth revisiting before this function grows more callers.
|
| Filename | Overview |
|---|---|
| src/config/sessions/run-state.ts | New shared helper for terminal killed state — logic is correct; only concern is the implicit markAbortedLastRun default. |
| src/auto-reply/reply/commands-session-store.ts | Correctly migrated to applyKilledSessionEntryState; legacy-key cleanup now present in both in-memory and disk paths. |
| src/auto-reply/reply/abort.ts | Simplified by delegating to persistAbortTargetEntry; new setAbortMemory fallback is a minor correctness improvement. |
| src/gateway/server-startup.ts | Startup reconciliation correctly uses updateSessionStore for atomic rewrite of stale running rows; TOCTOU between pre-check and lock is benign. |
| src/gateway/server-startup.test.ts | Covers plain stale running, already-aborted stale running, and already-terminal sessions with thorough assertions. |
| src/auto-reply/reply/abort.test.ts | Extended to verify terminal fields (status, endedAt, updatedAt) written by persistAbortTargetEntry. |
| src/config/sessions.ts | Single-line barrel re-export of the new run-state module. No issues. |
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/config/sessions/run-state.ts
Line: 25-27
Comment:
**Implicit default sets `abortedLastRun: true`**
The condition `params.markAbortedLastRun !== false` means calling `applyKilledSessionEntryState(entry)` with no second argument (or with `{}`) will silently set `abortedLastRun = true`. A future caller who just wants to mark a session as `killed` without any abort semantics (e.g. a timeout handler) would have to pass `markAbortedLastRun: false` explicitly to avoid the unintended flag, which is counter-intuitive — the safe default is typically not to set the flag unless asked.
Consider inverting the default to require an explicit `true`, e.g. `if (params.markAbortedLastRun === true)`, or renaming it to `addAbortedLastRun` with a strict `if (params.addAbortedLastRun)` guard. Either way the existing callers (`persistAbortTargetEntry` with `markAbortedLastRun: true` and startup reconciliation with `markAbortedLastRun: false`) would continue to work correctly with an explicit argument.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: recover stale running sessions afte..." | Re-trigger Greptile
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9734cda8ca
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
openclaw/src/gateway/server-startup.ts
Lines 138 to 139 in 1e52156
startGatewaySidecars builds recovery targets from resolveStateDir(process.env) and resolveAgentSessionDirs, but session writes elsewhere use resolveStorePath(cfg.session?.store, ...) and allow custom absolute/template paths. When session.store is configured outside <stateDir>/agents/*/sessions/sessions.json, stale running entries are never scanned, so this restart reconciliation fix is silently skipped for a supported configuration.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9907226a04
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
openclaw/src/gateway/server-startup.ts
Lines 137 to 141 in 56e5d9e
cleanStaleLockFiles and reconcilePersistedRunningSessionsOnStartup are wrapped in the same try block, so a single cleanup error (for example, one unreadable sessionsDir) short-circuits before reconciliation is attempted. In that case startup leaves stale status: "running" entries untouched across otherwise healthy stores, which defeats the recovery behavior introduced by this change under partial filesystem failures. Keep reconciliation in a separate best-effort block (or isolate per-directory cleanup failures) so lock cleanup issues do not disable session-state repair.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fca753a321
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9a85f396e1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| let reconciledInStore = 0; | ||
| try { | ||
| reconciledInStore = await updateSessionStore(storePath, (nextStore) => { |
There was a problem hiding this comment.
Handle fresh sessions.json.lock files before startup reconciliation
reconcilePersistedRunningSessionsOnStartup performs its write through updateSessionStore, which acquires sessions.json.lock with a 10s timeout and ~30s stale threshold. After a quick restart from an unclean shutdown, a leftover store lock that is younger than that threshold cannot be reclaimed in time, so this update throws and the store is skipped, leaving stale status: "running" rows unreconciled. The earlier startup lock sweep does not prevent this because cleanStaleLockFiles only targets *.jsonl.lock files, not sessions.json.lock.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I dug through the current lock path in this branch and this specific failure mode does not reproduce as described. Startup reconciliation writes via updateSessionStore -> withSessionStoreLock -> acquireSessionWriteLock, and on contention the lock code inspects the lock payload and reclaims stale/dead-owner locks immediately. For unclean-shutdown leftovers (dead PID), reclamation does not wait for the 30s threshold. The startup pre-sweep is currently only for jsonl transcript locks, but sessions.json lock reclamation is already handled in the contended acquire path used by reconciliation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4ec5abaf92
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| }, | ||
| }); | ||
| await fs.mkdir(blockedDir, { recursive: true }); | ||
| await fs.chmod(blockedDir, 0o000); |
There was a problem hiding this comment.
Gate permission-simulation test on non-Windows platforms
This test assumes chmod 0o000 will make blockedDir inaccessible, but in Windows runners that behavior is not enforced the same way, so blockedStorePath is usually treated as missing (ENOENT) and the warning expectation fails. I checked .github/workflows/ci.yml and checks-windows runs pnpm test, so this can break CI for the whole PR even though the production code is fine; the test should skip on win32 or simulate an inaccessible store in a platform-neutral way.
Useful? React with 👍 / 👎.
|
|
||
| for (const storePath of storePaths) { | ||
| try { | ||
| await fs.access(storePath); |
There was a problem hiding this comment.
Check read access before loading startup session stores
The pre-check uses fs.access(storePath) with the default F_OK, which only verifies existence; if the file exists but is unreadable (EACCES), loadSessionStore(..., { skipCache: true }) falls back to {} instead of throwing, so recovery silently skips stale running rows and emits no warning. This leaves the exact stuck-state this startup reconciliation is meant to fix whenever ownership/permissions drift on sessions.json.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0976e5b7b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| }), | ||
| }); | ||
| try { | ||
| { |
There was a problem hiding this comment.
Restore outer lock-release guard in compaction path
After acquiring sessionLock, this change replaced the outer try/finally with a plain block, so any exception before the inner try at line 733 (for example in repairSessionFileIfNeeded, prewarmSessionFile, resourceLoader.reload, or createAgentSession) exits without calling sessionLock.release(). That leaves a stale write lock behind and can cause subsequent compaction/run attempts on the same session file to block or timeout until stale-lock recovery kicks in.
Useful? React with 👍 / 👎.
Summary
status: "running"forever./stopalso persistedabortedLastRunwithout forcing a terminal state, so aborted sessions could stay wedged too.runningentries tokilled, and made/stoppersistence write the same terminal state while preserving abort cutoff metadata.#55796,#53872,#53706, or#20183.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause / Regression History (if applicable)
runningonly when a later terminal event (end/error) arrived. If the process died or restarted first,sessions.jsonkept the stalerunningrow. The/stoppersistence path had the same gap because it set abort metadata without forcing a terminal session status.runningrows and no regression asserting that/stopwrites a terminal status.git blame, prior PR, issue, or refactor if known): reproduced from a real stuck topic whose transcript stopped at atoolResult, followed by gateway restart logs; the persisted session row still showedstatus: "running"with no later transcript progress.Regression Test Plan (if applicable)
src/gateway/server-startup.test.ts,src/auto-reply/reply/abort.test.tsrunningsessions are reconciled tokilledon startup, and/stoppersistence writes terminal killed state together with abort metadata.User-visible / Behavior Changes
runningforever./stopnow leaves the target session in a terminalkilledstate instead of only taggingabortedLastRun.Diagram (if applicable)
Security Impact (required)
Yes/No): NoYes/No): NoYes/No): NoYes/No): NoYes/No): NoYes, explain risk + mitigation: N/ARepro + Verification
Environment
~/.openclaw/agents/<agent>/sessions/sessions.jsonsession storeSteps
status: "running"./stopagainst the session.Expected
killedand no longer look actively running.Actual
runningindefinitely unless some later lifecycle event happened to clear them.Evidence
Attach at least one:
Human Verification (required)
runningafter a gateway restart; ranpnpm test -- src/gateway/server-startup.test.ts -t "reconciles persisted running sessions left behind by an earlier process"; ranpnpm test -- src/auto-reply/reply/abort.test.ts -t "persists abort cutoff metadata on /stop when command and target session match|does not persist cutoff metadata when native /stop targets a different session"; ranpnpm build.abortedLastRun: true; startup reconciliation does not fabricateabortedLastRunfor plain restart recovery.pnpm test -- src/auto-reply/reply/abort.test.tsstill has unrelated pre-existing failures on latestmainin subagent cascade assertions outside this PR scope.Review Conversations
Compatibility / Migration
Yes/No): YesYes/No): NoYes/No): NoRisks and Mitigations