Summary
In the self-hosted light variant (main.light.ts / happier-server-light, SQLite backend), SessionCache.flushPendingUpdatesInternal performs db.session.updateMany once per active session in a sequential for-await loop (apps/server/sources/app/presence/sessionCache.ts, around L350). With ≥6 concurrently-active sessions, the flush takes long enough to saturate SQLite's single-writer slot, which then starves unrelated control-plane writes — most visibly the daemon's POST /v1/machines. The daemon's axios call times out at 10s, retries forever, and clients show a permanent "disconnected" badge until the server is manually restarted.
The existing inline comment at the flush call site already names the failure mode:
// On SQLite, concurrent write bursts can trigger busy contention and delay unrelated
// control-plane requests (e.g. machine registration). Flush sequentially to reduce lock pressure.
await db.session.updateMany(...)
Flushing sequentially does not reduce lock pressure — it stretches it across N sequential lock acquires. A single transaction or single bulk UPDATE collapses N writer-lock cycles into 1.
Affected component
- Variant: self-hosted light only (Cloud / main.ts uses MySQL, unaffected)
- File: apps/server/sources/app/presence/sessionCache.ts
- Methods: flushPendingUpdatesInternal (session & machine flush loops, ~L311–L440)
- Verified against dev HEAD 379c9941 (2026-05-15); pattern unchanged since v0.2.0
Reproduction
- Self-host with the light variant on SQLite
- Open ≥6 active sessions across distinct machines (one per project works)
- Drive normal session traffic (websocket activity)
- Observe stack server log:
  - Flushed 0/N session updates repeated
  - PrismaClientKnownRequestError: Socket timeout (the database failed to respond to a query within the configured timeout) at sessionCache.ts:338
  - Same Socket timeout originating from machinesRoutes.ts:297 in the inTx call
- Observe daemon log on the agent host:
  - Repeated POST /v1/machines timeout of 10000ms exceeded (ECONNABORTED)
  - Attempt counter climbs unbounded
- Manual systemctl restart dev.happier.stack.service clears the backlog and registration succeeds again for a window of minutes to hours, then the spiral returns
Evidence captured
From a self-hosted install (CT 107, 6 GB RAM, 2 cores, SQLite 626 MB, WAL mode, HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000):
[server] [07:16:35.600] ERROR: Failed to save usage report: PrismaClientKnownRequestError:
[server] Invalid `db.usageReport.upsert()` invocation in
[server] /home/happier/.happier-stack/workspace/main/apps/server/sources/app/usage/usageReporter.ts:70:38
[server] Socket timeout (the database failed to respond to a query within the configured timeout).
[server] [07:16:40.608] ERROR: Error updating session: PrismaClientKnownRequestError:
[server] Invalid `db.session.updateMany()` invocation in
[server] /home/happier/.happier-stack/workspace/main/apps/server/sources/app/presence/sessionCache.ts:338:38
[server] 335 try {
[server] 336 // On SQLite, concurrent write bursts can trigger busy contention and delay unrelated
[server] 337 // control-plane requests (e.g. machine registration). Flush sequentially to reduce lock pressure.
[server] → 338 await db.session.updateMany(
[server] Socket timeout (the database failed to respond to a query within the configured timeout).
[server] [07:25:46.041] INFO: Flushed 0/9 session updates
[server] [07:27:11.156] INFO: Flushed 0/9 session updates
[server] [07:28:36.266] INFO: Flushed 0/9 session updates
[server] [07:31:16.621] INFO: Flushed 1/9 session updates
Daemon side (happier-daemon on a sibling LXC):
May 18 07:33:45 code happier[533120]: [DAEMON RUN] Machine registration unavailable; retrying {
May 18 07:33:45 code happier[533120]: attempt: 51,
May 18 07:33:45 code happier[533120]: message: "timeout of 10000ms exceeded",
May 18 07:33:45 code happier[533120]: code: "ECONNABORTED",
May 18 07:33:45 code happier[533120]: url: "https://happier.tail6018de.ts.net/v1/machines",
51 consecutive retries × 10 s = ~8.5 min of UI-visible disconnect.
After systemctl restart dev.happier.stack.service:
Socket timeouts since restart: 0
Flushed 1/1 session updates
Flushed 1/1 session updates
Daemon machineRegistered=true, no further /v1/machines failures
Until the next time the in-memory flush queue grows.
Root cause
For N active sessions, every flush cycle issues N sequential db.session.updateMany({ where: { id }, data: { ... } }) calls. Each one:
- Acquires SQLite's writer lock
- Holds it for the duration of one Prisma engine round-trip (well under 1 ms in isolation, but multi-ms under contention)
- Releases
While that chain is running, every other write path (machine registration, usage report, machine presence flush, etc.) is queued behind it. Prisma's per-query internal IPC timeout (the engine-level "Socket timeout", default ~5 s for SQLite) cuts off any caller that doesn't get a writer-lock turn in time, regardless of the user-set PRAGMA busy_timeout=30000. This is what produces the cascade — once the queue depth exceeds the engine timeout window, every queued query fails simultaneously.
Restarting clears the in-memory flushPendingUpdates backlog and resets the queue, which is why manual restarts work — temporarily.
Proposed fix
Replace the per-session sequential loop with a single bulk write per flush tick:
Option A — Single transaction (smallest diff)
// Build one updateMany per pending session, then commit them all in a single transaction.
const updates = [...sessionUpdatesById.entries()].map(([sessionId, update]) =>
  db.session.updateMany({
    where: { id: sessionId },
    data: { lastActiveAt: new Date(update.timestamp), active: true },
  }),
);
await db.$transaction(updates);
One writer-lock acquire instead of N, although the transaction still carries N statements. Drop-in replacement; trivial to roll back.
Option B — Single raw UPDATE with CASE (best)
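// Import path assumes the default generated client location.
import { Prisma } from '@prisma/client';

const ids = [...sessionUpdatesById.keys()];
// Prisma.join throws on an empty array, so skip the write when nothing is pending.
if (ids.length > 0) {
  // One "WHEN <id> THEN <timestamp>" arm per session; every value stays a bound parameter.
  const cases = Prisma.join(
    [...sessionUpdatesById.entries()].map(
      ([id, u]) => Prisma.sql`WHEN ${id} THEN ${new Date(u.timestamp)}`,
    ),
    ' ',
  );
  await db.$executeRaw`
    UPDATE Session
    SET active = 1,
        lastActiveAt = CASE id ${cases} END
    WHERE id IN (${Prisma.join(ids)})
  `;
}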
One SQL statement, one lock acquire, one engine round-trip. Eliminates the entire queue-depth-versus-timeout race.
Apply the equivalent treatment to the machine flush loop directly below (around L395).
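Since the session and machine loops would get the same bulk statement, one way to avoid writing it twice is a small builder that returns the SQL text and its bound parameters. This is a minimal sketch, not code from sessionCache.ts: the helper name buildBulkPresenceUpdate, the PresenceUpdate shape, and the assumption that the Machine table has the same active / lastActiveAt columns as Session are all hypothetical and would need checking against the schema.

```typescript
// Hypothetical helper: names and shapes are illustrative, not taken from sessionCache.ts.
type PresenceUpdate = { id: string; lastActiveAt: Date };

// Only identifiers on this allow-list are ever interpolated into the SQL text.
// ASSUMPTION: both tables expose `active` and `lastActiveAt` columns.
const PRESENCE_TABLES = ["Session", "Machine"] as const;
type PresenceTable = (typeof PRESENCE_TABLES)[number];

function buildBulkPresenceUpdate(
  table: PresenceTable,
  updates: PresenceUpdate[],
): { sql: string; params: unknown[] } | null {
  if (!PRESENCE_TABLES.includes(table)) {
    throw new Error(`unexpected table: ${String(table)}`);
  }
  if (updates.length === 0) return null; // nothing pending, skip the write entirely

  // One "WHEN ? THEN ?" arm and one IN-list placeholder per row.
  const whenClauses = updates.map(() => "WHEN ? THEN ?").join(" ");
  const inList = updates.map(() => "?").join(", ");
  const sql =
    `UPDATE "${table}" SET active = 1, ` +
    `lastActiveAt = CASE id ${whenClauses} END ` +
    `WHERE id IN (${inList})`;

  // Parameter order must match placeholder order: CASE pairs first, then the IN list.
  const params: unknown[] = [];
  for (const u of updates) params.push(u.id, u.lastActiveAt);
  for (const u of updates) params.push(u.id);
  return { sql, params };
}
```

The result could then be handed to Prisma's positional raw API, for example `await db.$executeRawUnsafe(stmt.sql, ...stmt.params)` when stmt is not null. Values stay bound parameters; only the allow-listed table name reaches the SQL string.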
Bookkeeping consideration
The current loop updates each entry's lastUpdateSent / pendingUpdate individually inside the try. After a batch write, that bookkeeping moves into a single post-success pass over sessionUpdatesById. The "all-or-nothing" semantics (versus the current partial-success "Flushed N/M") is a small behavioral change, but under contention the log today is already almost always "Flushed 0/N" or "Flushed N/N", so dropping the partial-success count loses little information.
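The post-success pass described above could look roughly like the following. It is a sketch under stated assumptions: the CacheEntry shape is hypothetical (only the field names lastUpdateSent and pendingUpdate come from this issue, and pendingUpdate is assumed to hold the timestamp still waiting to be persisted), and commitFlushedEntries is an illustrative name.

```typescript
// Hypothetical shape of a cache entry; field names follow the issue text, types are assumed.
type CacheEntry = { lastUpdateSent: number; pendingUpdate: number | null };

// Run this only after the batch write has resolved. If the write throws, skip it,
// so every entry stays pending and is retried on the next flush tick.
function commitFlushedEntries(
  cache: Map<string, CacheEntry>,
  flushed: Map<string, { timestamp: number }>,
): number {
  let committed = 0;
  for (const [id, update] of flushed) {
    const entry = cache.get(id);
    if (!entry) continue; // entry was evicted while the write was in flight
    entry.lastUpdateSent = update.timestamp;
    // Clear the pending marker only if no newer activity arrived during the write.
    if (entry.pendingUpdate !== null && entry.pendingUpdate <= update.timestamp) {
      entry.pendingUpdate = null;
    }
    committed++;
  }
  return committed;
}
```

The returned count can still feed the existing "Flushed N/M" log line; with a batch write it would simply be either 0 (the write failed and this function never ran) or the number of entries still present in the cache.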
Effort estimate
Small: roughly 30–50 lines including the machine-flush twin, plus a few unit / integration test updates. A couple of hours of dev work.
Investigation done
- Read full flushPendingUpdates / flushPendingUpdatesInternal path
- Verified flushInFlight guard at L282 already prevents overlapping flushes; the contention is within one flush, not across them
- Verified HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000 reaches the server child process env (/proc/<pid>/environ confirmed) and is consumed by applySqliteRuntimePragmas. The 5 s engine timeout fires regardless, indicating the engine's IPC timeout supersedes the PRAGMA wait window (see companion issue)
- Compared installed v0.2.0 to current dev HEAD 379c9941; flush pattern is identical
- Ruled out: disk pressure (47% rootfs), RAM (642 MB used / 6 GB), swap (0 B / 512 MB), WAL bloat (5.7 MB pre-checkpoint, 0 B post), CPU (load 2.0 on 2 cores, high but not pinned), DB corruption (PRAGMA integrity_check ok), freelist (0; DB vacuumed today), tailscale (proxy errors are downstream of stack timeouts, not the cause), daemon env drift, missing snapshots, log fill (out.log was 387 MB of Auth success spam but StandardOutput=append: so no blocking pipe)
- Confirmed Cloud variant (main.ts) on MySQL is structurally immune; only flavors/light (SQLite via prisma/sqlite/schema.prisma) is affected
- All commits on dev since v0.2.0 that touch presence / prisma / sqlite reviewed; none address this contention pattern
Why this is worth prioritizing
Manual restart is the only known workaround. The MTBF scales inversely with concurrent session count, so growing usage worsens the symptom. Once it triggers, every client on the deployment shows "disconnected" until the operator notices and restarts the server. This is a top operator-toil item for self-hosters.