
Lost-update race: gmail watch serve overwrites state set by gmail watch renew #526

@stozo04

Description


Summary

The gmail watch serve daemon holds an in-memory gmailWatchState initialized at startup and overwrites the entire on-disk JSON state file on every Update() call (which fires for every Pub/Sub push delivery, just to update LastDeliveryStatus / LastDeliveryAtMs). This silently clobbers any concurrent updates to ExpirationMs / ProviderExpirationMs made by independent gmail watch renew invocations — a classic lost-update race.

Symptoms: a long-running serve daemon plus periodic renew cron causes the on-disk state's expiration fields to be repeatedly reverted to whatever they were when the daemon last started. Subsequent gmail watch status reads report the stale value, not the most recent renewal's API response.

Affected version

Verified against main on 2026-04-25. The relevant code paths haven't been refactored recently, per the surrounding git history.

Reproduction

  1. Run gog gmail watch start --topic <t> --label INBOX (one-time setup).
  2. Start the push handler:
    gog gmail watch serve --bind 127.0.0.1 --port 8788 --hook-url http://127.0.0.1:18789/hooks/agent
    
    Leave it running.
  3. Send yourself an email, or otherwise generate a Gmail event that fires a Pub/Sub push to the daemon. Confirm via the daemon log that the push was handled (watch: hook delivered).
  4. From a separate terminal:
    gog gmail watch renew
    
    Note the expirationMs returned in the renewal output — call it T_renew.
  5. Immediately read disk state:
    cat ~/.config/gogcli/state/gmail-watch/<account>.json | jq '.expirationMs, .providerExpirationMs'
    
    Both values match T_renew. ✅
  6. Send another email (or wait for any Pub/Sub push). Confirm the daemon log shows the push was handled.
  7. Read disk state again:
    cat ~/.config/gogcli/state/gmail-watch/<account>.json | jq '.expirationMs, .providerExpirationMs'
    
    Both values have reverted to whatever the daemon held in memory at startup. ❌

Expected behavior

expirationMs and providerExpirationMs should reflect the most recent successful gmail watch renew API response, until the next renewal supersedes them.

Actual behavior

After step 6, the disk state's expiration fields are reset to the daemon's startup-time values. The renewal at step 4 is silently lost.

Evidence (production observation)

In our deployment we have:

  • A long-running gog gmail watch serve PM2 process (uptime ~3 days at observation time, last-loaded state on 2026-04-19)
  • A cron that runs gog gmail watch renew periodically
  • A heartbeat job that also calls renew if it observes expirationMs < 2 days from now

System logs from 2026-04-19 → 2026-04-25 show the same before_exp value (1777241652000 = 2026-04-26T22:14:12Z) on EVERY renewal log row, despite multiple successful renewals producing fresh +7d after_exp values:

Date (UTC)             Trigger      before_exp (gog gmail watch status)   after_exp (gog gmail watch renew)
2026-04-19 22:14:13    cron         1777222465000                         1777241652000  ← daemon loaded this on startup
2026-04-20 18:22:53    cron         1777241652000                         1777314172000
2026-04-21 15:37:51    cron         1777241652000                         1777390670000
2026-04-25 00:03:51    heartbeat    1777241652000                         1777680230000
2026-04-25 00:43:51    heartbeat    1777241652000                         1777682630000

Every fresh after_exp was momentarily on disk after each renewal — verified by directly probing the file in a window where no Pub/Sub pushes were in flight. But the very next push handled by the daemon overwrote the file back to 1777241652000. The frozen value is exactly what the daemon's in-memory s.state.ExpirationMs was set to when it loaded state on startup.

historyId does NOT freeze (the daemon explicitly updates it via updateStateAfterHistory on each push), but expirationMs and providerExpirationMs aren't touched by any code path inside serve, so they retain their startup values forever — and Save() writes those forever-stale values to disk on every push.

Root cause

internal/cmd/gmail_watch_state.go:91-110:

func (s *gmailWatchStore) Update(fn func(*gmailWatchState) error) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if err := fn(&s.state); err != nil {
        return err
    }
    return s.Save()  // writes WHOLE in-memory state
}

func (s *gmailWatchStore) Save() error {
    if s.path == "" {
        return errors.New("missing watch state path")
    }
    payload, err := json.MarshalIndent(s.state, "", "  ")  // serializes whole struct
    if err != nil {
        return err
    }
    return os.WriteFile(s.path, append(payload, '\n'), 0o600)
}

s.state is loaded ONCE at server startup via loadGmailWatchStore and never reloaded. The server only mutates a few fields (LastDeliveryStatus, LastDeliveryAtMs, LastDeliveryStatusNote, HistoryID, LastPushMessageID, UpdatedAtMs) but Save() writes the entire struct. Concurrent writers (the renew CLI) update the file directly, but their changes are clobbered on the next push.

The two writers don't coordinate via a file lock either — gmailWatchStore.mu is an in-process sync.Mutex, so it doesn't protect against a separate gog gmail watch renew process that runs its own loadGmailWatchStore → mutate → Save() cycle.

Proposed fix (any one of three; pick whichever fits the codebase best)

Option 1 — Reload before mutate (smallest diff, recommended)

In Update(), re-read the file from disk before applying the mutation:

func (s *gmailWatchStore) Update(fn func(*gmailWatchState) error) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    // Re-read state from disk before mutating to pick up any concurrent writes.
    if data, err := os.ReadFile(s.path); err == nil {
        var fresh gmailWatchState
        if err := json.Unmarshal(data, &fresh); err == nil {
            s.state = fresh
        }
    } else if !errors.Is(err, os.ErrNotExist) {
        return fmt.Errorf("reload state: %w", err)
    }

    if err := fn(&s.state); err != nil {
        return err
    }
    return s.Save()
}

This makes serve always see the latest on-disk values, so its Save() no longer reverts concurrent writes. Adds one ReadFile per push (~1KB JSON parse, microseconds). The in-process mutex still serializes intra-process callers.

Caveat: still vulnerable to a TOCTOU race between two processes calling Update() simultaneously (use flock if you want strict cross-process serialization). For the observed bug this is fine because renew doesn't go through Update() — it calls Save() directly via *s = updated.
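For reference, the flock variant could look like the following sketch (unix-only; withFileLock and the lock-file path convention are illustrative, not taken from the codebase):

```go
package main

// Sketch: cross-process serialization via flock(2), Linux/BSD only.
// The real store (gmailWatchStore) is not reproduced here.

import (
	"os"
	"syscall"
)

// withFileLock runs fn while holding an exclusive advisory lock on lockPath,
// so a separate `gog gmail watch renew` process blocks until the daemon's
// read-modify-write cycle completes (and vice versa).
func withFileLock(lockPath string, fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return fn()
}
```

Both Update() in the daemon and the renew path would wrap their reload → mutate → Save() sequence in withFileLock, using a sidecar .lock file next to the state file.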

Option 2 — Selective field save

Add a SaveFields(fieldNames ...string) method that does a JSON merge-patch on disk: read existing JSON, mutate only the named keys, write back. Deeper change, but eliminates the "Save() writes everything" footgun entirely.

Option 3 — Split the storage

Move ephemeral delivery-status fields (LastDeliveryStatus, LastDeliveryAtMs, LastDeliveryStatusNote, LastPushMessageID) to a separate <account>.delivery.json sidecar file. The main state file becomes an immutable-after-renew document. No more shared write surface. Biggest refactor but cleanest separation.

I'd recommend Option 1 — it fixes the observed bug with the smallest possible diff.

Workaround for users hitting this now

Restart the serve daemon after every renew so it reloads fresh state into memory:

gog gmail watch renew && systemctl restart gog-watch-serve
# or, in PM2:
gog gmail watch renew && pm2 restart kayley-email-watch

This works but causes a brief Pub/Sub push delivery interruption (~5s) during the restart, and Pub/Sub will retry pushes that arrive during the gap, so it's not lossless.

Why this is sneaky

The bug is invisible if you only ever run serve OR renew, but never both concurrently. It surfaces specifically in deployments that:

  1. Run serve continuously (the standard pattern per docs/watch.md)
  2. Schedule independent renew calls (a cron, a watchdog, or a separate process)

Both patterns are blessed by the official docs, so any production deployment following the recommended setup will hit this.

The auto-recovery aspect makes it worse — many users won't notice the revert loop because the watch stays alive (each renew succeeds, then gets reverted minutes later). The first symptom is usually unexplained CPU/network traffic, or an alert from a monitoring system tracking renewal cadence.

Tests

If accepted, suggest adding:

  • Unit test in gmail_watch_state_more_test.go: open a store, mutate file out-of-band (simulate concurrent renew), call Update() to mutate a different field, assert that out-of-band fields survived.
  • Integration test in gmail_watch_serve_test.go: spin up serve, write fresh expirationMs to disk, simulate a Pub/Sub push, assert disk expirationMs still matches the fresh value.

Happy to contribute either if helpful — this is a great little CLI and we want to keep using it. Thanks for building it. 💛
