fix(gateway): reconcile sessions stuck running after restart by VACInc · Pull Request #55823 · openclaw/openclaw

VACInc · 2026-03-27T14:16:32Z

Summary

Problem: if a run dies or the gateway restarts before a terminal lifecycle event is persisted, the session store can keep status: "running" forever. /stop also persisted abortedLastRun without forcing a terminal state, so aborted sessions could stay wedged too.
Why it matters: topics can look permanently busy even when nothing is running, and later gateway/session tooling reads a false active state.
What changed: added a shared session-entry helper for terminal killed state, made startup reconciliation convert persisted stale running entries to killed, and made /stop persistence write the same terminal state while preserving abort cutoff metadata.
What did NOT change (scope boundary): no auth/runtime-home behavior, no websocket transport behavior, no reset-hook behavior, and no memory transcript indexing behavior. This PR does not overlap the currently open PRs #55796, #53872, #53706, or #20183.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes: N/A
Related: N/A
This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

Root cause: session lifecycle state cleared running only when a later terminal event (end/error) arrived. If the process died or restarted first, sessions.json kept the stale running row. The /stop persistence path had the same gap because it set abort metadata without forcing a terminal session status.
Missing detection / guardrail: there was no startup reconciliation pass for persisted running rows and no regression asserting that /stop writes a terminal status.
Prior context (git blame, prior PR, issue, or refactor if known): reproduced from a real stuck topic whose transcript stopped at a toolResult, followed by gateway restart logs; the persisted session row still showed status: "running" with no later transcript progress.
Why this regressed now: restart/interruption exposed an older assumption that every started run would always emit a later terminal lifecycle event.
If unknown, what was ruled out: ruled out the currently open auth/home-dir PR and other open session-reset/prompt PRs because none touch these files or this lifecycle path.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: src/gateway/server-startup.test.ts, src/auto-reply/reply/abort.test.ts
Scenario the test should lock in: stale persisted running sessions are reconciled to killed on startup, and /stop persistence writes terminal killed state together with abort metadata.
Why this is the smallest reliable guardrail: the bug lives at the session-store lifecycle seam, not inside a single pure helper or only in a full end-to-end path.
Existing test that already covers this (if any): none
If no new test is added, why not: N/A

User-visible / Behavior Changes

Sessions abandoned by a restart no longer stay stuck in running forever.
/stop now leaves the target session in a terminal killed state instead of only tagging abortedLastRun.

Diagram (if applicable)

Before:
[run starts] -> [session store = running] -> [process dies/restarts before end/error] -> [row stays running forever]

After:
[run starts] -> [session store = running] -> [process dies/restarts before end/error] -> [startup reconciliation marks row killed]

/stop path:
[operator sends /stop] -> [abort metadata only] -> [row can still look running]
becomes
[operator sends /stop] -> [abort metadata + killed terminal state] -> [row is no longer wedged]
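
The before/after flow above can be sketched in a few lines. This is a minimal illustration, not the PR's actual code: the `SessionEntry` shape, the non-killed status names, and the function names `markKilled` and `reconcileStaleRunning` are assumptions, and the real change routes writes through the session store's locked update path rather than mutating an in-memory object.

```typescript
// Hypothetical shapes for illustration; the real session entry has more fields.
type SessionStatus = "running" | "done" | "failed" | "killed";

interface SessionEntry {
  status: SessionStatus;
  startedAt?: number;
  endedAt?: number;
  updatedAt?: number;
  runtimeMs?: number;
  abortedLastRun?: boolean;
}

type SessionStore = Record<string, SessionEntry>;

// Stamp a terminal "killed" state. abortedLastRun is only set when asked for,
// so plain restart recovery does not fabricate abort metadata.
function markKilled(entry: SessionEntry, now: number, markAbortedLastRun: boolean): void {
  entry.status = "killed";
  entry.endedAt = now;
  entry.updatedAt = now;
  if (entry.startedAt !== undefined) {
    entry.runtimeMs = Math.max(0, now - entry.startedAt);
  }
  if (markAbortedLastRun) {
    entry.abortedLastRun = true;
  }
}

// Startup pass: a row still marked "running" was left behind by an earlier process.
function reconcileStaleRunning(store: SessionStore, now: number): number {
  let reconciled = 0;
  for (const entry of Object.values(store)) {
    if (entry.status !== "running") continue; // already-terminal rows stay unchanged
    markKilled(entry, now, false);
    reconciled += 1;
  }
  return reconciled;
}
```

In this sketch the /stop path would call `markKilled(entry, Date.now(), true)`, while startup recovery passes `false` and leaves any existing `abortedLastRun: true` untouched.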

Security Impact (required)

New permissions/capabilities? (Yes/No): No
Secrets/tokens handling changed? (Yes/No): No
New/changed network calls? (Yes/No): No
Command/tool execution surface changed? (Yes/No): No
Data access scope changed? (Yes/No): No
If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

OS: Linux
Runtime/container: local temp clone, no container
Model/provider: N/A
Integration/channel (if any): Telegram topic session observed in a live gateway state dir; code fix validated in isolated local clone
Relevant config (redacted): standard ~/.openclaw/agents/<agent>/sessions/sessions.json session store

Steps

Persist a session row with status: "running".
Simulate a process interruption or restart before the run writes its terminal lifecycle state, or trigger /stop against the session.
Restart the gateway so that startup recovery runs, then inspect the session store.

Expected

Stale interrupted sessions reconcile to killed and no longer look actively running.

Actual

Before this change, stale rows stayed running indefinitely unless some later lifecycle event happened to clear them.

Evidence

Attach at least one:

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

Verified scenarios: reproduced the stale-running session shape from a real topic; confirmed the persisted row could remain running after a gateway restart; ran pnpm test -- src/gateway/server-startup.test.ts -t "reconciles persisted running sessions left behind by an earlier process"; ran pnpm test -- src/auto-reply/reply/abort.test.ts -t "persists abort cutoff metadata on /stop when command and target session match|does not persist cutoff metadata when native /stop targets a different session"; ran pnpm build.
Edge cases checked: already-finished sessions stay unchanged; already-aborted stale sessions keep abortedLastRun: true; startup reconciliation does not fabricate abortedLastRun for plain restart recovery.
What you did not verify: I did not restart a live gateway from this branch. Also, pnpm test -- src/auto-reply/reply/abort.test.ts still has unrelated pre-existing failures on latest main in subagent cascade assertions outside this PR scope.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? (Yes/No): Yes
Config/env changes? (Yes/No): No
Migration needed? (Yes/No): No
If yes, exact upgrade steps: N/A

Risks and Mitigations

Risk: if multiple gateway processes intentionally share one session store, startup reconciliation could mark another writer's in-flight session as killed.
- Mitigation: the recovery runs in the same startup path that already cleans stale session locks and assumes single-writer ownership of an agent session store.

greptile-apps · 2026-03-27T14:20:35Z

Greptile Summary

This PR fixes two related session-lifecycle gaps in the gateway: (1) persisted running sessions that were never cleared when a process died or restarted before writing a terminal lifecycle event, and (2) the /stop path that set abort metadata without forcing a terminal status on the session entry. Both are addressed with a new shared helper (applyKilledSessionEntryState) and a startup reconciliation pass that rewrites any stale running rows to killed.

Key changes:
- src/config/sessions/run-state.ts – new applyKilledSessionEntryState helper atomically stamps status: "killed", endedAt, updatedAt, and runtimeMs on a session entry; markAbortedLastRun controls whether abortedLastRun: true is added.
- src/auto-reply/reply/commands-session-store.ts – persistAbortTargetEntry now uses the shared helper instead of manually setting fields; legacy-key cleanup (previously missing in this path) is also applied to both the in-memory and disk update paths.
- src/gateway/server-startup.ts – reconcilePersistedRunningSessionsOnStartup iterates all agent session store files, identifies stale running entries, and rewrites them atomically via updateSessionStore.
- Tests – new integration test for startup reconciliation and extended /stop assertions to cover the terminal-state fields.
- One style note: markAbortedLastRun !== false as the implicit default in applyKilledSessionEntryState means callers who omit the parameter silently receive abortedLastRun: true. Inverting the default to require an explicit true would be safer for future callers that want to kill without abort semantics.

Confidence Score: 5/5

Safe to merge — no regressions introduced; the fix is confined to the session-lifecycle seam and guarded by new integration tests.

All remaining findings are P2 (style/naming). The core reconciliation and abort-persistence logic is correct, single-writer ownership is preserved, timestamps are consistent, and edge cases (already-terminal sessions, already-aborted sessions, empty stores) are tested and handled correctly.

No files require special attention. The implicit markAbortedLastRun !== false default in run-state.ts is worth revisiting before this function grows more callers.

Important Files Changed

Filename	Overview
src/config/sessions/run-state.ts	New shared helper for terminal killed state — logic is correct; only concern is the implicit markAbortedLastRun default.
src/auto-reply/reply/commands-session-store.ts	Correctly migrated to applyKilledSessionEntryState; legacy-key cleanup now present in both in-memory and disk paths.
src/auto-reply/reply/abort.ts	Simplified by delegating to persistAbortTargetEntry; new setAbortMemory fallback is a minor correctness improvement.
src/gateway/server-startup.ts	Startup reconciliation correctly uses updateSessionStore for atomic rewrite of stale running rows; TOCTOU between pre-check and lock is benign.
src/gateway/server-startup.test.ts	Covers plain stale running, already-aborted stale running, and already-terminal sessions with thorough assertions.
src/auto-reply/reply/abort.test.ts	Extended to verify terminal fields (status, endedAt, updatedAt) written by persistAbortTargetEntry.
src/config/sessions.ts	Single-line barrel re-export of the new run-state module. No issues.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/config/sessions/run-state.ts
Line: 25-27

Comment:
**Implicit default sets `abortedLastRun: true`**

The condition `params.markAbortedLastRun !== false` means calling `applyKilledSessionEntryState(entry)` with no second argument (or with `{}`) will silently set `abortedLastRun = true`. A future caller who just wants to mark a session as `killed` without any abort semantics (e.g. a timeout handler) would have to pass `markAbortedLastRun: false` explicitly to avoid the unintended flag, which is counter-intuitive — the safe default is typically not to set the flag unless asked.

Consider inverting the default to require an explicit `true`, e.g. `if (params.markAbortedLastRun === true)`, or renaming it to `addAbortedLastRun` with a strict `if (params.addAbortedLastRun)` guard. Either way the existing callers (`persistAbortTargetEntry` with `markAbortedLastRun: true` and startup reconciliation with `markAbortedLastRun: false`) would continue to work correctly with an explicit argument.

How can I resolve this? If you propose a fix, please make it concise.
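
One possible shape of the stricter guard, shown next to the lenient one the comment describes. This is a hedged sketch: `KillOptions` and both function names are hypothetical and only isolate the default-handling logic; they are not the signatures in run-state.ts.

```typescript
// Hypothetical options type; only the flag under discussion is modelled.
interface KillOptions {
  markAbortedLastRun?: boolean;
}

// Lenient default, as described in the review: omitting the flag behaves like `true`.
function shouldMarkAbortedLenient(params: KillOptions = {}): boolean {
  return params.markAbortedLastRun !== false;
}

// Strict default: the flag is applied only when the caller passes an explicit `true`.
function shouldMarkAbortedStrict(params: KillOptions = {}): boolean {
  return params.markAbortedLastRun === true;
}
```

The two variants agree whenever a caller passes an explicit `true` or `false`, which is why the existing callers would be unaffected; they differ only for callers that omit the option.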

Reviews (1): Last reviewed commit: "fix: recover stale running sessions afte..."

src/config/sessions/run-state.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9734cda8ca

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

src/auto-reply/reply/commands-session-store.ts

chatgpt-codex-connector

💡 Codex Review

openclaw/src/gateway/server-startup.ts

Lines 138 to 139 in 1e52156

    
    const stateDir = resolveStateDir(process.env);
    const sessionDirs = await resolveAgentSessionDirs(stateDir);

Reconcile stores from config, not only stateDir defaults

startGatewaySidecars builds recovery targets from resolveStateDir(process.env) and resolveAgentSessionDirs, but session writes elsewhere use resolveStorePath(cfg.session?.store, ...) and allow custom absolute/template paths. When session.store is configured outside <stateDir>/agents/*/sessions/sessions.json, stale running entries are never scanned, so this restart reconciliation fix is silently skipped for a supported configuration.

src/gateway/server-startup.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9907226a04

src/gateway/server-startup.ts

chatgpt-codex-connector

💡 Codex Review

openclaw/src/gateway/server-startup.ts

Lines 137 to 141 in 56e5d9e

    
    try {
      const stateDir = resolveStateDir(process.env);
      const sessionDirs = await resolveAgentSessionDirs(stateDir);
      for (const sessionsDir of sessionDirs) {
        await cleanStaleLockFiles({

Run stale-session reconciliation even when lock cleanup fails

cleanStaleLockFiles and reconcilePersistedRunningSessionsOnStartup are wrapped in the same try block, so a single cleanup error (for example, one unreadable sessionsDir) short-circuits before reconciliation is attempted. In that case startup leaves stale status: "running" entries untouched across otherwise healthy stores, which defeats the recovery behavior introduced by this change under partial filesystem failures. Keep reconciliation in a separate best-effort block (or isolate per-directory cleanup failures) so lock cleanup issues do not disable session-state repair.
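
The isolation this comment asks for could look roughly like the sketch below, where each startup step runs in its own try/catch so that one failure does not skip the next step. `StartupStep` and `runStartupStepsBestEffort` are hypothetical names used only for illustration; the real startup code calls cleanStaleLockFiles and reconcilePersistedRunningSessionsOnStartup directly.

```typescript
// Hypothetical wrapper type: a named startup step that may fail.
type StartupStep = { name: string; run: () => Promise<void> };

// Run every step even if an earlier one throws; report the failures instead of aborting.
async function runStartupStepsBestEffort(
  steps: StartupStep[],
  warn: (message: string) => void,
): Promise<string[]> {
  const failed: string[] = [];
  for (const step of steps) {
    try {
      await step.run();
    } catch (err) {
      failed.push(step.name);
      warn(`${step.name} failed during startup: ${String(err)}`);
    }
  }
  return failed;
}
```

With lock cleanup and reconciliation passed as two separate steps, an unreadable sessions directory during cleanup would be logged and reconciliation would still be attempted.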

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fca753a321

src/gateway/server-startup.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a85f396e1

chatgpt-codex-connector · 2026-03-28T17:16:37Z

src/gateway/server-startup.ts

+
+    let reconciledInStore = 0;
+    try {
+      reconciledInStore = await updateSessionStore(storePath, (nextStore) => {


Handle fresh sessions.json.lock files before startup reconciliation

reconcilePersistedRunningSessionsOnStartup performs its write through updateSessionStore, which acquires sessions.json.lock with a 10s timeout and ~30s stale threshold. After a quick restart from an unclean shutdown, a leftover store lock that is younger than that threshold cannot be reclaimed in time, so this update throws and the store is skipped, leaving stale status: "running" rows unreconciled. The earlier startup lock sweep does not prevent this because cleanStaleLockFiles only targets *.jsonl.lock files, not sessions.json.lock.

VACInc · Mar 28, 2026

I dug through the current lock path in this branch, and this specific failure mode does not reproduce as described. Startup reconciliation writes via updateSessionStore -> withSessionStoreLock -> acquireSessionWriteLock, and on contention the lock code inspects the lock payload and reclaims stale or dead-owner locks immediately. For leftovers from an unclean shutdown (dead PID), reclamation does not wait for the 30s threshold. The startup pre-sweep currently covers only jsonl transcript locks, but sessions.json lock reclamation is already handled in the contended acquire path that reconciliation uses.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ec5abaf92

chatgpt-codex-connector · 2026-03-29T19:06:11Z

src/gateway/server-startup.test.ts

+        },
+      });
+      await fs.mkdir(blockedDir, { recursive: true });
+      await fs.chmod(blockedDir, 0o000);


Gate permission-simulation test on non-Windows platforms

This test assumes chmod 0o000 will make blockedDir inaccessible, but in Windows runners that behavior is not enforced the same way, so blockedStorePath is usually treated as missing (ENOENT) and the warning expectation fails. I checked .github/workflows/ci.yml and checks-windows runs pnpm test, so this can break CI for the whole PR even though the production code is fine; the test should skip on win32 or simulate an inaccessible store in a platform-neutral way.

chatgpt-codex-connector · 2026-03-29T19:06:11Z

src/gateway/server-startup.ts

+
+  for (const storePath of storePaths) {
+    try {
+      await fs.access(storePath);


Check read access before loading startup session stores

The pre-check uses fs.access(storePath) with the default F_OK, which only verifies existence; if the file exists but is unreadable (EACCES), loadSessionStore(..., { skipCache: true }) falls back to {} instead of throwing, so recovery silently skips stale running rows and emits no warning. This leaves the exact stuck-state this startup reconciliation is meant to fix whenever ownership/permissions drift on sessions.json.
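
A read-access probe along the lines suggested here might look like the following. It relies on Node's `fs.access` accepting a mode argument (`constants.R_OK`); the `probeStore` name and the three-way result type are assumptions made for this sketch, not code from the PR.

```typescript
import { constants } from "node:fs";
import { access } from "node:fs/promises";

// Hypothetical classification of a store path before loading it.
type StoreAccess = "readable" | "missing" | "unreadable";

async function probeStore(storePath: string): Promise<StoreAccess> {
  try {
    // R_OK checks read permission, not just existence (the default F_OK).
    await access(storePath, constants.R_OK);
    return "readable";
  } catch (err) {
    const code = (err as NodeJS.ErrnoException).code;
    // A missing store is expected and can be skipped quietly; anything else
    // (for example EACCES) deserves a warning, because stale rows may be hiding behind it.
    return code === "ENOENT" ? "missing" : "unreadable";
  }
}
```

A caller could skip quietly on "missing" and log a warning on "unreadable" instead of treating the store as empty.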

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0976e5b7b

chatgpt-codex-connector · 2026-03-29T22:01:17Z

src/agents/pi-embedded-runner/compact.ts

      }),
    });
-    try {
+    {


Restore outer lock-release guard in compaction path

After acquiring sessionLock, this change replaced the outer try/finally with a plain block, so any exception before the inner try at line 733 (for example in repairSessionFileIfNeeded, prewarmSessionFile, resourceLoader.reload, or createAgentSession) exits without calling sessionLock.release(). That leaves a stale write lock behind and can cause subsequent compaction/run attempts on the same session file to block or timeout until stale-lock recovery kicks in.
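
The guard being asked for is the usual acquire / try / finally pattern. The sketch below is generic: `SessionLock` and `withSessionLock` are hypothetical stand-ins and do not reproduce the compaction code in compact.ts.

```typescript
// Hypothetical minimal lock interface for illustration.
interface SessionLock {
  release(): Promise<void>;
}

// Everything that runs after acquisition sits inside try/finally, so an early
// throw during setup still releases the lock.
async function withSessionLock<T>(
  acquire: () => Promise<SessionLock>,
  work: () => Promise<T>,
): Promise<T> {
  const lock = await acquire();
  try {
    return await work();
  } finally {
    await lock.release();
  }
}
```

Any setup step that can throw (the comment lists repairSessionFileIfNeeded, prewarmSessionFile, resourceLoader.reload, and createAgentSession) would belong inside `work` in this arrangement.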

fix: recover stale running sessions after restart (9734cda)
openclaw-barnacle bot added the gateway (Gateway runtime) and size: M labels · Mar 27, 2026
greptile-apps bot reviewed · Mar 27, 2026 · src/config/sessions/run-state.ts (resolved)
chatgpt-codex-connector bot reviewed · Mar 27, 2026 · src/auto-reply/reply/commands-session-store.ts (outdated, resolved)
Merge branch 'main' into fix/stale-running-session-recovery (1e52156)
chatgpt-codex-connector bot reviewed · Mar 28, 2026 · src/gateway/server-startup.ts (outdated, resolved)
Merge branch 'main' into fix/stale-running-session-recovery (9907226)
chatgpt-codex-connector bot reviewed · Mar 28, 2026 · src/gateway/server-startup.ts (outdated, resolved)
Merge branch 'main' into fix/stale-running-session-recovery (56e5d9e)
chatgpt-codex-connector bot reviewed · Mar 28, 2026
gateway: harden startup stale-session reconciliation (fca753a)
openclaw-barnacle bot added the size: L label and removed the size: M label · Mar 28, 2026
chatgpt-codex-connector bot reviewed · Mar 28, 2026 · src/gateway/server-startup.ts (outdated, resolved)
gateway: include configured agent ids in startup store templates (9a85f39)
chatgpt-codex-connector bot reviewed · Mar 28, 2026
run: always release session lock during teardown (4ec5aba)
openclaw-barnacle bot added the agents (Agent runtime and tooling) label · Mar 29, 2026
chatgpt-codex-connector bot reviewed · Mar 29, 2026
fix(compaction): release lock before compact teardown (f0976e5)
chatgpt-codex-connector bot reviewed · Mar 29, 2026
VACInc closed this · Mar 29, 2026