
Lost-update race: gmail watch serve overwrites state set by gmail watch renew #526

@stozo04

Description


Summary

The gmail watch serve daemon holds an in-memory gmailWatchState initialized at startup and overwrites the entire on-disk JSON state file on every Update() call (which fires for every Pub/Sub push delivery, just to update LastDeliveryStatus / LastDeliveryAtMs). This silently clobbers any concurrent updates to ExpirationMs / ProviderExpirationMs made by independent gmail watch renew invocations — a classic lost-update race.

Symptoms: a long-running serve daemon plus periodic renew cron causes the on-disk state's expiration fields to be repeatedly reverted to whatever they were when the daemon last started. Subsequent gmail watch status reads report the stale value, not the most recent renewal's API response.

Affected version

Verified against main on 2026-04-25. The relevant code paths haven't been refactored recently, per the surrounding git history.

Reproduction

  1. Run gog gmail watch start --topic <t> --label INBOX (one-time setup).
  2. Start the push handler:
    gog gmail watch serve --bind 127.0.0.1 --port 8788 --hook-url http://127.0.0.1:18789/hooks/agent
    
    Leave it running.
  3. Send yourself an email, or otherwise generate a Gmail event that fires a Pub/Sub push to the daemon. Confirm via the daemon log that the push was handled (watch: hook delivered).
  4. From a separate terminal:
    gog gmail watch renew
    
    Note the expirationMs returned in the renewal output — call it T_renew.
  5. Immediately read disk state:
    cat ~/.config/gogcli/state/gmail-watch/<account>.json | jq '.expirationMs, .providerExpirationMs'
    
    Both values match T_renew. ✅
  6. Send another email (or wait for any Pub/Sub push). Confirm the daemon log shows the push was handled.
  7. Read disk state again:
    cat ~/.config/gogcli/state/gmail-watch/<account>.json | jq '.expirationMs, .providerExpirationMs'
    
    Both values have reverted to whatever the daemon held in memory at startup. ❌

Expected behavior

expirationMs and providerExpirationMs should reflect the most recent successful gmail watch renew API response, until the next renewal supersedes them.

Actual behavior

After step 6, the disk state's expiration fields are reset to the daemon's startup-time values. The renewal at step 4 is silently lost.

Evidence (production observation)

In our deployment we have:

  • A long-running gog gmail watch serve PM2 process (uptime ~3 days at observation time, last-loaded state on 2026-04-19)
  • A cron that runs gog gmail watch renew periodically
  • A heartbeat job that also calls renew if it observes expirationMs < 2 days from now

System logs from 2026-04-19 → 2026-04-25 show the same before_exp value (1777241652000 = 2026-04-26T22:14:12Z) on EVERY renewal log row, despite multiple successful renewals producing fresh +7d after_exp values:

Date (UTC)             Trigger      before_exp (gog gmail watch status)   after_exp (gog gmail watch renew)
2026-04-19 22:14:13    cron         1777222465000                         1777241652000  ← daemon loaded this on startup
2026-04-20 18:22:53    cron         1777241652000                         1777314172000
2026-04-21 15:37:51    cron         1777241652000                         1777390670000
2026-04-25 00:03:51    heartbeat    1777241652000                         1777680230000
2026-04-25 00:43:51    heartbeat    1777241652000                         1777682630000

Every fresh after_exp was momentarily on disk after each renewal — verified by directly probing the file in a window where no Pub/Sub pushes were in flight. But the very next push handled by the daemon overwrote the file back to 1777241652000. The frozen value is exactly what the daemon's in-memory s.state.ExpirationMs was set to when it loaded state on startup.

historyId does NOT freeze (the daemon explicitly updates it via updateStateAfterHistory on each push), but expirationMs and providerExpirationMs aren't touched by any code path inside serve, so they retain their startup values forever — and Save() writes those forever-stale values to disk on every push.

Root cause

internal/cmd/gmail_watch_state.go:91-110:

func (s *gmailWatchStore) Update(fn func(*gmailWatchState) error) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if err := fn(&s.state); err != nil {
        return err
    }
    return s.Save()  // writes WHOLE in-memory state
}

func (s *gmailWatchStore) Save() error {
    if s.path == "" {
        return errors.New("missing watch state path")
    }
    payload, err := json.MarshalIndent(s.state, "", "  ")  // serializes whole struct
    if err != nil {
        return err
    }
    return os.WriteFile(s.path, append(payload, '\n'), 0o600)
}

s.state is loaded ONCE at server startup via loadGmailWatchStore and never reloaded. The server only mutates a few fields (LastDeliveryStatus, LastDeliveryAtMs, LastDeliveryStatusNote, HistoryID, LastPushMessageID, UpdatedAtMs) but Save() writes the entire struct. Concurrent writers (the renew CLI) update the file directly, but their changes are clobbered on the next push.

The two writers don't coordinate via a file lock either — gmailWatchStore.mu is an in-process sync.Mutex, so it doesn't protect against a separate gog gmail watch renew process that runs its own loadGmailWatchStore → mutate → Save() cycle.

Proposed fix (any one of three; pick whichever fits the codebase best)

Option 1 — Reload before mutate (smallest diff, recommended)

In Update(), re-read the file from disk before applying the mutation:

func (s *gmailWatchStore) Update(fn func(*gmailWatchState) error) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    // Re-read state from disk before mutating to pick up any concurrent writes.
    if data, err := os.ReadFile(s.path); err == nil {
        var fresh gmailWatchState
        if err := json.Unmarshal(data, &fresh); err == nil {
            s.state = fresh
        }
    } else if !errors.Is(err, os.ErrNotExist) {
        return fmt.Errorf("reload state: %w", err)
    }

    if err := fn(&s.state); err != nil {
        return err
    }
    return s.Save()
}

This makes serve always see the latest on-disk values, so its Save() no longer reverts concurrent writes. Adds one ReadFile per push (~1KB JSON parse, microseconds). The in-process mutex still serializes intra-process callers.

Caveat: still vulnerable to a TOCTOU race between two processes calling Update() simultaneously (use flock if you want strict cross-process serialization). For the observed bug this is fine because renew doesn't go through Update() — it calls Save() directly via *s = updated.
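For reference, the flock variant could look like the following sketch (unix-only; withFileLock and the lock-file path convention are illustrative, not taken from the codebase):

```go
package main

// Sketch: cross-process serialization via flock(2), Linux/BSD only.
// The real store (gmailWatchStore) is not reproduced here.

import (
	"os"
	"syscall"
)

// withFileLock runs fn while holding an exclusive advisory lock on lockPath,
// so a separate `gog gmail watch renew` process blocks until the daemon's
// read-modify-write cycle completes (and vice versa).
func withFileLock(lockPath string, fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return fn()
}
```

Both Update() in the daemon and the renew path would wrap their reload → mutate → Save() sequence in withFileLock, using a sidecar .lock file next to the state file.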

Option 2 — Selective field save

Add a SaveFields(fieldNames ...string) method that does a JSON merge-patch on disk: read existing JSON, mutate only the named keys, write back. Deeper change, but eliminates the "Save() writes everything" footgun entirely.

Option 3 — Split the storage

Move ephemeral delivery-status fields (LastDeliveryStatus, LastDeliveryAtMs, LastDeliveryStatusNote, LastPushMessageID) to a separate <account>.delivery.json sidecar file. The main state file becomes an immutable-after-renew document. No more shared write surface. Biggest refactor but cleanest separation.

I'd recommend Option 1 — it fixes the observed bug with the smallest possible diff.

Workaround for users hitting this now

Restart the serve daemon after every renew so it reloads fresh state into memory:

gog gmail watch renew && systemctl restart gog-watch-serve
# or, in PM2:
gog gmail watch renew && pm2 restart kayley-email-watch

This works but causes a brief Pub/Sub push delivery interruption (~5s) during the restart, and Pub/Sub will retry pushes that arrive during the gap, so it's not lossless.

Why this is sneaky

The bug is invisible if you only ever run serve OR renew, but never both concurrently. It surfaces specifically in deployments that:

  1. Run serve continuously (the standard pattern per docs/watch.md)
  2. Schedule independent renew calls (a cron, a watchdog, or a separate process)

Both patterns are blessed by the official docs, so any production deployment following the recommended setup will hit this.

The auto-recovery aspect makes it worse — many users won't notice the revert loop because the watch stays alive (each renew succeeds, then gets reverted minutes later). The first symptom is usually unexplained CPU/network traffic, or an alert from a monitoring system tracking renewal cadence.

Tests

If accepted, suggest adding:

  • Unit test in gmail_watch_state_more_test.go: open a store, mutate file out-of-band (simulate concurrent renew), call Update() to mutate a different field, assert that out-of-band fields survived.
  • Integration test in gmail_watch_serve_test.go: spin up serve, write fresh expirationMs to disk, simulate a Pub/Sub push, assert disk expirationMs still matches the fresh value.

Happy to contribute either if helpful — this is a great little CLI and we want to keep using it. Thanks for building it. 💛
