Skip to content

fix: clear cron session in DB when not cached (cold-cache after restart)#424

Merged
viettranx merged 2 commits intomainfrom
fix/cron-session-reset-cold-cache
Mar 24, 2026
Merged

fix: clear cron session in DB when not cached (cold-cache after restart)#424
viettranx merged 2 commits intomainfrom
fix/cron-session-reset-cold-cache

Conversation

@vanducng
Copy link
Copy Markdown
Contributor

Fixes #365

Problem

Cron-triggered agents get stuck in tool call loops after server restarts. The LLM sees contradictory history from previous runs and enters a loop trying to reconcile.

Root cause: PGSessionStore.Reset() only cleared the in-memory cache. After a restart the cache is empty, so Reset was a silent no-op. The subsequent GetOrCreate loaded the full accumulated history from DB — defeating the reset added in #294.

Flow (before fix)

Server restart → cache wiped
Cron fires → Reset(key) → cache empty → NO-OP
           → Save(key)  → cache empty → NO-OP
           → GetOrCreate(key) → loads bloated history from DB
           → LLM confused by stale context → tool loop → failure

Fix

When the session isn't in cache, Reset now issues a direct DB UPDATE to clear messages and summary:

UPDATE sessions SET messages = '[]', summary = ''
WHERE session_key = $1 AND tenant_id = $2

Warm-cache path is unchanged (early return before DB code).

Impact

All 3 callers of Reset() benefit:

  • Cron handler (gateway_cron.go) — target fix
  • Consumer handler (gateway_consumer_handlers.go) — same improvement
  • WS sessions method (sessions.go) — harmless, UI sessions are always warm

Test plan

  • go build ./... passes
  • go vet ./... passes
  • go test -race ./internal/store/pg/ — all pass
  • 4 unit tests prove the bug and verify the fix:
    • TestReset_ColdCache_FallsBackToDB — proves DB fallback is reached
    • TestReset_WarmCache_ClearsHistory — warm path unchanged
    • TestSave_ColdCache_IsNoOp — acceptable after fix
    • TestResetAfterGetOrCreate_FixVerification — alternative fix path works

Fixes #365

PGSessionStore.Reset() only cleared the in-memory cache. After a server
restart the cache is empty, so Reset was a no-op — the next GetOrCreate
loaded the full accumulated history from DB, causing LLM tool loops from
contradictory context.

Add a DB fallback: when the session isn't in cache, issue a direct
UPDATE to clear messages and summary in PostgreSQL. This ensures cron
sessions always start clean regardless of server restarts.

The #294 fix (Reset+Save before each cron run) already had the right
intent but only worked within the same server lifetime.
@mrgoonie
Copy link
Copy Markdown
Contributor

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Mar 24, 2026

Claude finished @mrgoonie's task in 1m 36s —— View job


PR Review: fix/cron-session-reset-cold-cache

Tasks:

  • Gather context
  • Read changed files
  • Analyze the fix
  • Post review feedback

Summary

The fix is correct and addresses the root cause cleanly. Two issues worth addressing before merge, one of them a real concurrency bug.


Issues

🔴 Lock held during DB call (sessions_ops.go:33–49)

Reset() acquires s.mu.Lock() and holds it across the db.ExecContext() call in the cold-cache path. A DB call under a mutex blocks all concurrent cache reads and writes for the full round-trip duration.

Delete() (same file, lines 52–65) correctly releases the lock before touching the DB. Reset() should follow the same pattern: check + clear cache under the lock, then release before the DB fallback.

func (s *PGSessionStore) Reset(ctx context.Context, key string) {
    s.mu.Lock()
    if data, ok := s.cache[sessionCacheKey(ctx, key)]; ok {
        data.Messages = []providers.Message{}
        data.Summary = ""
        data.Updated = time.Now()
        s.mu.Unlock()
        return
    }
    s.mu.Unlock()

    // Now safe to do the DB call without holding the lock
    tid := tenantIDForInsert(ctx)
    s.db.ExecContext(ctx, ...)
}

Fix this →

🟡 DB error silently discarded (sessions_ops.go:45–49)

Reset() returns nothing (interface-mandated void), but the ExecContext result is not even captured:

s.db.ExecContext(ctx, ...)  // error ignored

Delete() returns and propagates err. Since Reset can't return an error without changing the interface, a warning log would at least make failures observable:

if _, err := s.db.ExecContext(ctx, ...); err != nil {
    slog.Warn("sessions.reset.db_fallback", "key", key, "error", err)
}

Fix this →


Observations (no action required)

Test approach (sessions_reset_test.go:15–40): TestReset_ColdCache_FallsBackToDB verifies the DB path is reached by catching the nil-db panic. It's clever and does prove the fix path is exercised, but it reads oddly — a reader has to understand that "panic = fix verified" is intentional. A brief clarifying comment in the test (or a t.Skip("integration test needed for full DB verify") note) would help future readers. Not a blocker.

TruncateHistory / SetHistory cold-cache gaps: These two methods in sessions_ops.go have the same silent no-op behavior on cold cache, but their callers (loop_history.go) are only invoked during an active agent run where the session is already warm. So they are not affected by this bug — just worth noting for completeness.


ExecContext was running under s.mu.Lock(), blocking all session cache
operations during DB round-trip. Release the lock before the DB call.
Also log ExecContext errors instead of silently discarding them —
silent failure defeats the purpose of the cold-cache fix.
@viettranx viettranx merged commit ebc82d3 into main Mar 24, 2026
2 checks passed
duhd-vnpay added a commit to duhd-vnpay/goclaw that referenced this pull request Mar 24, 2026
…n reliability, storage

Merged 42 upstream commits from nextlevelbuilder/goclaw origin/main.

Key upstream changes included:
- feat(config): system_configs DB table with per-tenant isolation + System Settings modal
- fix: cron reliability — clear session in DB on cold-cache restart (nextlevelbuilder#294, nextlevelbuilder#424)
- refactor(security): remove cross-tenant bypass, enforce strict tenant isolation
- fix(agent): set TenantID on streaming events in runLoop emitRun
- fix: add panic recovery to tool, cron, and summarization goroutines
- fix: check errors in cron store Scan/Unmarshal to prevent data corruption
- feat(teams): smart post-turn task decision + stale detection
- fix(skills): unblock agent read_file on skill paths
- feat(storage): bump upload limit to 50MB, add file upload to Team Workspace
- feat(gateway): version update checker
- fix(channels): propagate tenant_id to pending history

Conflict resolutions:
- gateway_cron.go: adopt upstream sessionMgr.Reset()+Save() pattern; revert Save() to UPDATE-only
- sessions_list.go: revert UPSERT → UPDATE-only (upstream fix covers cold-cache via explicit Save after Reset)
- upgrade/version.go: bump RequiredSchemaVersion 32→33 (adds upstream migration 33_system_configs)
- migrations: renumber upstream 000029_system_configs → 000033_system_configs (avoids conflict with local 000029_kg_entity_embeddings)
- gateway.go: keep local projects wiring + File ACL MediaServeHandler; drop WithCrossTenant on LoadAll per security refactor
- gateway_http_handlers.go: keep local wireHTTP signature with token + projectsH return
- media_serve.go: keep local File ACL implementation (scoped token, rate limiting)
- i18n/index.ts: union both namespace sets; add upstream "system-settings"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@viettranx viettranx deleted the fix/cron-session-reset-cold-cache branch April 2, 2026 05:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cron Session Accumulation Causes Agent Tool Loops

3 participants