SessionCache.flushPendingUpdates: serial per-session write loop saturates SQLite writer lock and starves control-plane writes (self-hosted light)

## Summary

In the self-hosted **light** variant (`main.light.ts` / `happier-server-light`, SQLite backend), `SessionCache.flushPendingUpdatesInternal` calls `db.session.updateMany` once per active session, awaiting each call in a sequential loop (`apps/server/sources/app/presence/sessionCache.ts`, around L350). With ≥6 concurrently active sessions, the flush takes long enough to saturate SQLite's single-writer slot, which then starves unrelated control-plane writes, most visibly the daemon's `POST /v1/machines`. The daemon's axios call times out at 10 s and retries indefinitely, and clients show a permanent "disconnected" badge until the server is manually restarted.

The existing inline comment at the flush call site already names the failure mode:

> ```ts
> // On SQLite, concurrent write bursts can trigger busy contention and delay unrelated
> // control-plane requests (e.g. machine registration). Flush sequentially to reduce lock pressure.
> await db.session.updateMany(...)
> ```

Flushing sequentially does **not** reduce lock pressure. It spreads the same work across N separate writer-lock acquisitions, each with its own engine round-trip. A single transaction or a single bulk UPDATE collapses those N writer-lock cycles into 1.

## Affected component

- Variant: self-hosted light only (Cloud / `main.ts` uses MySQL, unaffected)
- File: `apps/server/sources/app/presence/sessionCache.ts`
- Methods: `flushPendingUpdatesInternal` (session & machine flush loops, ~L311–L440)
- Verified against `dev` HEAD `379c9941` (2026-05-15) — pattern unchanged since v0.2.0

## Reproduction

1. Self-host with the **light** variant on SQLite
2. Open ≥6 active sessions across distinct machines (one per project works)
3. Drive normal session traffic (websocket activity)
4. Observe the stack server log:
   - `Flushed 0/N session updates` repeated
   - `PrismaClientKnownRequestError: Socket timeout (the database failed to respond to a query within the configured timeout)` at `sessionCache.ts:338`
   - Same `Socket timeout` originating from `machinesRoutes.ts:297` in the `inTx` call
5. Observe daemon log on the agent host:
   - Repeated `POST /v1/machines` `timeout of 10000ms exceeded` (`ECONNABORTED`)
   - Attempt counter climbs unbounded
6. Manual `systemctl restart dev.happier.stack.service` clears the backlog and registration succeeds again — for a window of minutes to hours — then the spiral returns

## Evidence captured

From a self-hosted install (CT 107, 6 GB RAM, 2 cores, SQLite 626 MB, WAL mode, `HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000`):

```
[server] [07:16:35.600] ERROR: Failed to save usage report: PrismaClientKnownRequestError:
[server] Invalid `db.usageReport.upsert()` invocation in
[server] /home/happier/.happier-stack/workspace/main/apps/server/sources/app/usage/usageReporter.ts:70:38
[server] Socket timeout (the database failed to respond to a query within the configured timeout).

[server] [07:16:40.608] ERROR: Error updating session: PrismaClientKnownRequestError:
[server] Invalid `db.session.updateMany()` invocation in
[server] /home/happier/.happier-stack/workspace/main/apps/server/sources/app/presence/sessionCache.ts:338:38
[server]   335 try {
[server]   336     // On SQLite, concurrent write bursts can trigger busy contention and delay unrelated
[server]   337     // control-plane requests (e.g. machine registration). Flush sequentially to reduce lock pressure.
[server] → 338     await db.session.updateMany(
[server] Socket timeout (the database failed to respond to a query within the configured timeout).

[server] [07:25:46.041] INFO: Flushed 0/9 session updates
[server] [07:27:11.156] INFO: Flushed 0/9 session updates
[server] [07:28:36.266] INFO: Flushed 0/9 session updates
[server] [07:31:16.621] INFO: Flushed 1/9 session updates
```

Daemon side (`happier-daemon` on a sibling LXC):

```
May 18 07:33:45 code happier[533120]: [DAEMON RUN] Machine registration unavailable; retrying {
May 18 07:33:45 code happier[533120]:   attempt: 51,
May 18 07:33:45 code happier[533120]:   message: "timeout of 10000ms exceeded",
May 18 07:33:45 code happier[533120]:   code: "ECONNABORTED",
May 18 07:33:45 code happier[533120]:   url: "https://happier.tail6018de.ts.net/v1/machines",
```

51 consecutive retries × 10 s = ~8.5 min of UI-visible disconnect.

After `systemctl restart dev.happier.stack.service`:

```
Socket timeouts since restart: 0
Flushed 1/1 session updates
Flushed 1/1 session updates
Daemon machineRegistered=true, no further /v1/machines failures
```

Until the next time the in-memory flush queue grows.
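To spot the stall pattern in a server log like the one captured above, the `Flushed N/M session updates` lines can be counted mechanically. The following is a minimal sketch, not part of the server: `parseFlushLine` and `countStalledFlushes` are hypothetical helper names, and the regex assumes the log wording shown in this issue.

```typescript
// Matches log lines such as "[server] [07:25:46.041] INFO: Flushed 0/9 session updates".
// Assumption: the message wording is exactly as captured in this issue.
const FLUSH_LINE = /Flushed (\d+)\/(\d+) session updates/;

interface FlushSample {
    flushed: number; // sessions actually written in this tick
    total: number;   // sessions that were pending in this tick
}

// Hypothetical helper: extract the counters from one log line, or null if it is not a flush line.
function parseFlushLine(line: string): FlushSample | null {
    const match = FLUSH_LINE.exec(line);
    if (!match) return null;
    return { flushed: Number(match[1]), total: Number(match[2]) };
}

// Hypothetical helper: count flush ticks that wrote fewer sessions than were pending.
function countStalledFlushes(lines: string[]): number {
    let stalled = 0;
    for (const line of lines) {
        const sample = parseFlushLine(line);
        if (sample !== null && sample.flushed < sample.total) stalled += 1;
    }
    return stalled;
}

// Lines taken from the evidence above, plus one healthy post-restart line.
const sampleLog = [
    '[server] [07:25:46.041] INFO: Flushed 0/9 session updates',
    '[server] [07:27:11.156] INFO: Flushed 0/9 session updates',
    '[server] [07:28:36.266] INFO: Flushed 0/9 session updates',
    '[server] [07:31:16.621] INFO: Flushed 1/9 session updates',
    'Flushed 1/1 session updates',
];
console.log(countStalledFlushes(sampleLog)); // 4 stalled ticks, 1 healthy
```

A sustained run of stalled ticks alongside `Socket timeout` errors is the signature described in this issue; an occasional partial flush on its own may not be.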

## Root cause

For N active sessions, every flush cycle issues N sequential `db.session.updateMany({ where: { id }, data: { ... } })` calls. Each one:

1. Acquires SQLite's writer lock
2. Holds it for the duration of one Prisma engine round-trip (well under 1 ms in isolation, but multi-ms under contention)
3. Releases

While that chain is running, every other write path (machine registration, usage report, machine presence flush, etc.) is queued behind it. Prisma's per-query internal IPC timeout (the **engine-level** "Socket timeout", default ~5 s for SQLite) cuts off any caller that doesn't get a writer-lock turn in time, regardless of the user-set `PRAGMA busy_timeout=30000`. This is what produces the cascade — once the queue depth exceeds the engine timeout window, every queued query fails simultaneously.

Restarting clears the in-memory `flushPendingUpdates` backlog and resets the queue, which is why manual restarts work — temporarily.

## Proposed fix

Replace the per-session sequential loop with a **single bulk write** per flush tick:

### Option A — Single transaction (smallest diff)

```ts
const updates = [...sessionUpdatesById.entries()].map(([sessionId, update]) =>
    db.session.updateMany({
        where: { id: sessionId },
        data: { lastActiveAt: new Date(update.timestamp), active: true },
    }),
);
await db.$transaction(updates);
```

One writer-lock acquire instead of N. Drop-in replacement; trivial to roll back.

### Option B — Single raw `UPDATE` with CASE (best)

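```ts
import { Prisma } from '@prisma/client';

const entries = [...sessionUpdatesById.entries()];
// Prisma.join throws on an empty array and `IN ()` is invalid SQL, so skip empty flushes.
if (entries.length > 0) {
    const ids = entries.map(([id]) => id);
    // One `WHEN <id> THEN <timestamp>` arm per session, separated by spaces.
    const cases = Prisma.join(
        entries.map(([id, u]) => Prisma.sql`WHEN ${id} THEN ${new Date(u.timestamp)}`),
        ' ',
    );
    // Before relying on this, verify that a bound Date is stored in the same format
    // Prisma itself writes for DateTime columns on SQLite.
    await db.$executeRaw`
        UPDATE Session
        SET active = 1,
            lastActiveAt = CASE id ${cases} ELSE lastActiveAt END
        WHERE id IN (${Prisma.join(ids)})
    `;
}
```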

One SQL statement, one lock acquire, one engine round-trip. Eliminates the entire queue-depth-versus-timeout race.

Equivalent treatment for the machine flush loop directly below (L395-ish).
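One caveat for Option B: each session contributes bound parameters (three in the statement above: the id and timestamp in the `CASE` arm, plus the id in the `IN` list), and SQLite caps the number of bound parameters per statement (999 by default before SQLite 3.32.0, 32766 from 3.32.0 on). A handful of sessions is nowhere near that limit, but if the flush set can grow large, the rows could be chunked. A minimal sketch, assuming a hypothetical `chunk` helper and a hypothetical `MAX_ROWS_PER_STATEMENT` constant sized for the conservative limit:

```typescript
// Hypothetical helper: split `items` into consecutive chunks of at most `size` elements.
function chunk<T>(items: readonly T[], size: number): T[][] {
    if (!Number.isInteger(size) || size < 1) {
        throw new RangeError(`chunk size must be a positive integer, got ${size}`);
    }
    const chunks: T[][] = [];
    for (let i = 0; i < items.length; i += size) {
        chunks.push(items.slice(i, i + size));
    }
    return chunks;
}

// Assumption: 3 bound parameters per row, kept under the conservative 999-parameter default.
const PARAMS_PER_ROW = 3;
const MAX_ROWS_PER_STATEMENT = Math.floor(999 / PARAMS_PER_ROW); // 333 rows

// Example: 700 pending session ids are split into batches of 333, 333 and 34 rows.
const pendingIds = Array.from({ length: 700 }, (_, i) => `session-${i}`);
const batches = chunk(pendingIds, MAX_ROWS_PER_STATEMENT);
console.log(batches.map((batch) => batch.length).join(', ')); // 333, 333, 34
```

If chunking is needed, running the per-chunk statements inside one `db.$transaction([...])` should keep the flush to a single writer-lock acquisition.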

### Bookkeeping consideration

The current loop updates each entry's `lastUpdateSent` / `pendingUpdate` individually inside the `try`. After a batch write, that bookkeeping moves into a single post-success pass over `sessionUpdatesById`. The resulting all-or-nothing semantics (versus the current partial-success "Flushed N/M") is a small behavioral change. In practice, under contention the log already reads almost exclusively "Flushed 0/N" or "Flushed N/N" (the capture above shows a single `Flushed 1/9`), so giving up the partial-success count loses little information.
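The post-success pass could be sketched as follows. This is an illustration under assumptions, not the existing implementation: the `CacheEntry` shape, the `BatchWriter` type, and `flushBatch` are hypothetical; only the field names `lastUpdateSent` and `pendingUpdate` come from the current code. The writer is injected so that either Option A or Option B can sit behind it.

```typescript
// Hypothetical cache entry shape; timestamps are epoch milliseconds.
interface CacheEntry {
    lastUpdateSent: number;       // timestamp of the last update that reached the DB
    pendingUpdate: number | null; // timestamp waiting to be flushed, or null if none
}

// Hypothetical writer: performs ONE transaction or ONE bulk UPDATE for all given sessions.
type BatchWriter = (updates: Map<string, number>) => Promise<void>;

// Returns the number of sessions flushed: either all pending ones or 0.
async function flushBatch(
    cache: Map<string, CacheEntry>,
    writeBatch: BatchWriter,
): Promise<number> {
    // Snapshot what is about to be written so activity arriving mid-write is not lost.
    const snapshot = new Map<string, number>();
    for (const [id, entry] of cache) {
        if (entry.pendingUpdate !== null) snapshot.set(id, entry.pendingUpdate);
    }
    if (snapshot.size === 0) return 0;

    try {
        await writeBatch(snapshot);
    } catch {
        // All-or-nothing: leave every pendingUpdate in place for the next tick.
        return 0;
    }

    // Single post-success bookkeeping pass.
    for (const [id, sentTimestamp] of snapshot) {
        const entry = cache.get(id);
        if (entry === undefined) continue; // entry was evicted while the write was in flight
        entry.lastUpdateSent = sentTimestamp;
        // Clear the pending marker only if no newer update arrived during the write.
        if (entry.pendingUpdate === sentTimestamp) entry.pendingUpdate = null;
    }
    return snapshot.size;
}
```

Comparing against the snapshot rather than clearing unconditionally is a design choice: a session that reports activity while the write is in flight keeps its newer `pendingUpdate` and is picked up on the next tick.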

## Effort estimate

Small. ~30–50 lines incl. machine-flush twin. A few unit / integration test updates. Couple hours of dev work.

## Investigation done

- Read full `flushPendingUpdates` / `flushPendingUpdatesInternal` path
- Verified `flushInFlight` guard at L282 already prevents overlapping flushes — the contention is **within** one flush, not across them
- Verified `HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000` reaches the server child process env (`/proc/<pid>/environ` confirmed) and is consumed by `applySqliteRuntimePragmas`. The 5 s engine timeout fires regardless, indicating the engine's IPC timeout supersedes the PRAGMA wait window (see companion issue)
- Compared installed v0.2.0 to current dev HEAD `379c9941`; flush pattern is identical
- Ruled out: disk pressure (47% rootfs), RAM (642 MB used / 6 GB), swap (0 B / 512 MB), WAL bloat (5.7 MB pre-checkpoint, 0 B post), CPU (load 2.0 on 2 cores — high but not pinned), DB corruption (`PRAGMA integrity_check` ok), freelist (0 — DB vacuumed today), tailscale (proxy errors are downstream of stack timeouts, not the cause), daemon env drift, missing snapshots, log fill (out.log was 387 MB of `Auth success` spam but `StandardOutput=append:` so no blocking pipe)
- Confirmed Cloud variant (`main.ts`) on MySQL is structurally immune — only `flavors/light` (SQLite via `prisma/sqlite/schema.prisma`) is affected
- All commits on `dev` since v0.2.0 that touch presence / prisma / sqlite reviewed — none address this contention pattern

## Why this is worth prioritizing

Manual restart is the only known workaround. The MTBF scales inversely with concurrent session count, so growing usage worsens the symptom. Once it triggers, every client on the deployment shows "disconnected" until the operator notices and restarts the server. This is a top operator-toil item for self-hosters.

