fix: clear cron session in DB when not cached (cold-cache after restart) by vanducng · Pull Request #424 · nextlevelbuilder/goclaw

vanducng · 2026-03-24T01:24:05Z

Fixes #365

Problem

Cron-triggered agents get stuck in tool call loops after server restarts. The LLM sees contradictory history from previous runs and enters a loop trying to reconcile.

Root cause: PGSessionStore.Reset() only cleared the in-memory cache. After a restart the cache is empty, so Reset was a silent no-op. The subsequent GetOrCreate loaded the full accumulated history from DB — defeating the reset added in #294.

Flow (before fix)

Server restart → cache wiped
Cron fires → Reset(key) → cache empty → NO-OP
           → Save(key)  → cache empty → NO-OP
           → GetOrCreate(key) → loads bloated history from DB
           → LLM confused by stale context → tool loop → failure

Fix

When the session isn't in cache, Reset now issues a direct DB UPDATE to clear messages and summary:

UPDATE sessions SET messages = '[]', summary = ''
WHERE session_key = $1 AND tenant_id = $2

Warm-cache path is unchanged (early return before DB code).

Impact

All 3 callers of Reset() benefit:

Cron handler (gateway_cron.go) — target fix
Consumer handler (gateway_consumer_handlers.go) — same improvement
WS sessions method (sessions.go) — harmless, UI sessions are always warm

Test plan

go build ./... passes
go vet ./... passes
go test -race ./internal/store/pg/ — all pass
4 unit tests prove the bug and verify the fix:
- TestReset_ColdCache_FallsBackToDB — proves DB fallback is reached
- TestReset_WarmCache_ClearsHistory — warm path unchanged
- TestSave_ColdCache_IsNoOp — acceptable after fix
- TestResetAfterGetOrCreate_FixVerification — alternative fix path works

Fixes #365 PGSessionStore.Reset() only cleared the in-memory cache. After a server restart the cache is empty, so Reset was a no-op — the next GetOrCreate loaded the full accumulated history from DB, causing LLM tool loops from contradictory context. Add a DB fallback: when the session isn't in cache, issue a direct UPDATE to clear messages and summary in PostgreSQL. This ensures cron sessions always start clean regardless of server restarts. The #294 fix (Reset+Save before each cron run) already had the right intent but only worked within the same server lifetime.

mrgoonie · 2026-03-24T08:41:44Z

@claude review

claude · 2026-03-24T08:42:00Z

Claude finished @mrgoonie's task in 1m 36s —— View job

PR Review: fix/cron-session-reset-cold-cache

Tasks:

Gather context
Read changed files
Analyze the fix
Post review feedback

Summary

The fix is correct and addresses the root cause cleanly. Two issues worth addressing before merge, one of them a real concurrency bug.

Issues

🔴 Lock held during DB call (`sessions_ops.go:33–49`)

Reset() acquires s.mu.Lock() and holds it across the db.ExecContext() call in the cold-cache path. A DB call under a mutex blocks all concurrent cache reads and writes for the full round-trip duration.

Delete() (same file, lines 52–65) correctly releases the lock before touching the DB. Reset() should follow the same pattern: check + clear cache under the lock, then release before the DB fallback.

func (s *PGSessionStore) Reset(ctx context.Context, key string) {
    s.mu.Lock()
    if data, ok := s.cache[sessionCacheKey(ctx, key)]; ok {
        data.Messages = []providers.Message{}
        data.Summary = ""
        data.Updated = time.Now()
        s.mu.Unlock()
        return
    }
    s.mu.Unlock()

    // Now safe to do the DB call without holding the lock
    tid := tenantIDForInsert(ctx)
    s.db.ExecContext(ctx, ...)
}

Fix this →

🟡 DB error silently discarded (`sessions_ops.go:45–49`)

Reset() returns nothing (interface-mandated void), but the ExecContext result is not even captured:

s.db.ExecContext(ctx, ...)  // error ignored

Delete() returns and propagates err. Since Reset can't return an error without changing the interface, a warning log would at least make failures observable:

if _, err := s.db.ExecContext(ctx, ...); err != nil {
    slog.Warn("sessions.reset.db_fallback", "key", key, "error", err)
}

Fix this →

Observations (no action required)

Test approach (sessions_reset_test.go:15–40): TestReset_ColdCache_FallsBackToDB verifies the DB path is reached by catching the nil-db panic. It's clever and does prove the fix path is exercised, but it reads oddly — a reader has to understand that "panic = fix verified" is intentional. A brief clarifying comment in the test (or a t.Skip("integration test needed for full DB verify") note) would help future readers. Not a blocker.

TruncateHistory / SetHistory cold-cache gaps: These two methods in sessions_ops.go have the same silent no-op behavior on cold cache, but their callers (loop_history.go) are only invoked during an active agent run where the session is already warm. So they are not affected by this bug — just worth noting for completeness.

ExecContext was running under s.mu.Lock(), blocking all session cache operations during DB round-trip. Release the lock before the DB call. Also log ExecContext errors instead of silently discarding them — silent failure defeats the purpose of the cold-cache fix.

…n reliability, storage Merged 42 upstream commits from nextlevelbuilder/goclaw origin/main. Key upstream changes included: - feat(config): system_configs DB table with per-tenant isolation + System Settings modal - fix: cron reliability — clear session in DB on cold-cache restart (nextlevelbuilder#294, nextlevelbuilder#424) - refactor(security): remove cross-tenant bypass, enforce strict tenant isolation - fix(agent): set TenantID on streaming events in runLoop emitRun - fix: add panic recovery to tool, cron, and summarization goroutines - fix: check errors in cron store Scan/Unmarshal to prevent data corruption - feat(teams): smart post-turn task decision + stale detection - fix(skills): unblock agent read_file on skill paths - feat(storage): bump upload limit to 50MB, add file upload to Team Workspace - feat(gateway): version update checker - fix(channels): propagate tenant_id to pending history Conflict resolutions: - gateway_cron.go: adopt upstream sessionMgr.Reset()+Save() pattern; revert Save() to UPDATE-only - sessions_list.go: revert UPSERT → UPDATE-only (upstream fix covers cold-cache via explicit Save after Reset) - upgrade/version.go: bump RequiredSchemaVersion 32→33 (adds upstream migration 33_system_configs) - migrations: renumber upstream 000029_system_configs → 000033_system_configs (avoids conflict with local 000029_kg_entity_embeddings) - gateway.go: keep local projects wiring + File ACL MediaServeHandler; drop WithCrossTenant on LoadAll per security refactor - gateway_http_handlers.go: keep local wireHTTP signature with token + projectsH return - media_serve.go: keep local File ACL implementation (scoped token, rate limiting) - i18n/index.ts: union both namespace sets; add upstream "system-settings" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

viettranx merged commit ebc82d3 into main Mar 24, 2026
2 checks passed

viettranx deleted the fix/cron-session-reset-cold-cache branch April 2, 2026 05:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: clear cron session in DB when not cached (cold-cache after restart)#424

fix: clear cron session in DB when not cached (cold-cache after restart)#424
viettranx merged 2 commits intomainfrom
fix/cron-session-reset-cold-cache

vanducng commented Mar 24, 2026

Uh oh!

mrgoonie commented Mar 24, 2026

Uh oh!

claude Bot commented Mar 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

vanducng commented Mar 24, 2026

Problem

Flow (before fix)

Fix

Impact

Test plan

Uh oh!

mrgoonie commented Mar 24, 2026

Uh oh!

claude Bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix/cron-session-reset-cold-cache

Summary

Issues

🔴 Lock held during DB call (sessions_ops.go:33–49)

🟡 DB error silently discarded (sessions_ops.go:45–49)

Observations (no action required)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude Bot commented Mar 24, 2026 •

edited

Loading

🔴 Lock held during DB call (`sessions_ops.go:33–49`)

🟡 DB error silently discarded (`sessions_ops.go:45–49`)