Summary
The gmail watch serve daemon holds an in-memory gmailWatchState initialized at startup and overwrites the entire on-disk JSON state file on every Update() call (which fires for every Pub/Sub push delivery, just to update LastDeliveryStatus / LastDeliveryAtMs). This silently clobbers any concurrent updates to ExpirationMs / ProviderExpirationMs made by independent gmail watch renew invocations — a classic lost-update race.
Symptoms: a long-running serve daemon plus periodic renew cron causes the on-disk state's expiration fields to be repeatedly reverted to whatever they were when the daemon last started. Subsequent gmail watch status reads report the stale value, not the most recent renewal's API response.
Affected version
Verified against main on 2026-04-25. The relevant code paths haven't been refactored recently per the surrounding history.
Reproduction
1. Run `gog gmail watch start --topic <t> --label INBOX` (one-time setup).
2. Start the push handler and leave it running:
   `gog gmail watch serve --bind 127.0.0.1 --port 8788 --hook-url http://127.0.0.1:18789/hooks/agent`
3. Send yourself an email, or otherwise generate a Gmail event that fires a Pub/Sub push to the daemon. Confirm via the daemon log that the push was handled (`watch: hook delivered`).
4. From a separate terminal, trigger a renewal (`gog gmail watch renew`). Note the `expirationMs` returned in the renewal output; call it T_renew.
5. Immediately read the disk state:
   `cat ~/.config/gogcli/state/gmail-watch/<account>.json | jq '.expirationMs, .providerExpirationMs'`
   Both values match T_renew. ✅
6. Send another email (or wait for any Pub/Sub push). Confirm the daemon log shows the push was handled.
7. Read the disk state again:
   `cat ~/.config/gogcli/state/gmail-watch/<account>.json | jq '.expirationMs, .providerExpirationMs'`
   Both values have reverted to whatever the daemon held in memory at startup. ❌
Expected behavior
expirationMs and providerExpirationMs should reflect the most recent successful gmail watch renew API response, until the next renewal supersedes them.
Actual behavior
After step 6, the disk state's expiration fields are reset to the daemon's startup-time values. The renewal at step 4 is silently lost.
Evidence (production observation)
In our deployment we have:
- A long-running `gog gmail watch serve` PM2 process (uptime ~3 days at observation time, last-loaded state on 2026-04-19)
- A cron that runs `gog gmail watch renew` periodically
- A heartbeat job that also calls `renew` if it observes `expirationMs` < 2 days from now
System logs from 2026-04-19 → 2026-04-25 show the same before_exp value (1777241652000 = 2026-04-26T22:14:12Z) on EVERY renewal log row, despite multiple successful renewals producing fresh +7d after_exp values:
| Date (UTC) | before_exp (read from `gog watch status`) | after_exp (returned by `gog watch renew`) |
| --- | --- | --- |
| 2026-04-19 22:14:13 (cron) | 1777222465000 | 1777241652000 ← daemon loaded this on startup |
| 2026-04-20 18:22:53 (cron) | 1777241652000 | 1777314172000 |
| 2026-04-21 15:37:51 (cron) | 1777241652000 | 1777390670000 |
| 2026-04-25 00:03:51 (heartbeat) | 1777241652000 | 1777680230000 |
| 2026-04-25 00:43:51 (heartbeat) | 1777241652000 | 1777682630000 |
Every fresh after_exp was momentarily on disk after each renewal — verified by directly probing the file in a window where no Pub/Sub pushes were in flight. But the very next push handled by the daemon overwrote the file back to 1777241652000. The frozen value is exactly what the daemon's in-memory s.state.ExpirationMs was set to when it loaded state on startup.
historyId does NOT freeze (the daemon explicitly updates it via updateStateAfterHistory on each push), but expirationMs and providerExpirationMs aren't touched by any code path inside serve, so they retain their startup values forever — and Save() writes those forever-stale values to disk on every push.
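For context, the daemon's per-push bookkeeping conceptually amounts to something like the sketch below. This is hypothetical, not the actual `serve` handler; it only reuses the store/field names mentioned above, and the field types are assumptions:

```go
// Hypothetical sketch, not the actual serve handler: the push path only
// assigns delivery bookkeeping fields, yet Update() ends in Save(), which
// persists the entire in-memory struct.
func recordDelivery(store *gmailWatchStore, status, note string) error {
	return store.Update(func(st *gmailWatchState) error {
		now := time.Now().UnixMilli()
		st.LastDeliveryStatus = status
		st.LastDeliveryStatusNote = note
		st.LastDeliveryAtMs = now
		st.UpdatedAtMs = now
		// ExpirationMs / ProviderExpirationMs are never assigned on this path,
		// so Save() rewrites whatever values the daemon loaded at startup.
		return nil
	})
}
```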
Root cause
internal/cmd/gmail_watch_state.go:91-110:
```go
func (s *gmailWatchStore) Update(fn func(*gmailWatchState) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if err := fn(&s.state); err != nil {
		return err
	}
	return s.Save() // writes WHOLE in-memory state
}

func (s *gmailWatchStore) Save() error {
	if s.path == "" {
		return errors.New("missing watch state path")
	}
	payload, err := json.MarshalIndent(s.state, "", " ") // serializes whole struct
	if err != nil {
		return err
	}
	return os.WriteFile(s.path, append(payload, '\n'), 0o600)
}
```
s.state is loaded ONCE at server startup via loadGmailWatchStore and never reloaded. The server only mutates a few fields (LastDeliveryStatus, LastDeliveryAtMs, LastDeliveryStatusNote, HistoryID, LastPushMessageID, UpdatedAtMs) but Save() writes the entire struct. Concurrent writers (the renew CLI) update the file directly, but their changes are clobbered on the next push.
The two writers don't coordinate via a file lock either — gmailWatchStore.mu is an in-process sync.Mutex, so it doesn't protect against a separate process invocation of gog watch renew that runs its own loadGmailWatchStore → mutate → Save() cycle.
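To make the lost-update pattern concrete independently of the gog internals, here is a minimal, self-contained Go sketch (all names invented for the demo) that reproduces the same load-once / write-everything behavior:

```go
// lostupdate_demo.go — minimal illustration of the race described above.
// Nothing here is taken from the gog codebase; names are invented.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type state struct {
	ExpirationMs     int64  `json:"expirationMs"`
	LastDeliveryAtMs int64  `json:"lastDeliveryAtMs"`
	Status           string `json:"lastDeliveryStatus"`
}

func save(path string, st state) {
	b, _ := json.MarshalIndent(st, "", " ")
	_ = os.WriteFile(path, append(b, '\n'), 0o600)
}

func load(path string) state {
	var st state
	if b, err := os.ReadFile(path); err == nil {
		_ = json.Unmarshal(b, &st)
	}
	return st
}

func main() {
	path := "demo-state.json"
	save(path, state{ExpirationMs: 1})

	// "serve": loads the state once at startup and keeps it in memory.
	daemonCopy := load(path)

	// "renew" (separate process): load → bump expiration → save.
	renewed := load(path)
	renewed.ExpirationMs = 2
	save(path, renewed)

	// Next push: the daemon updates only a delivery field, but saves the
	// whole stale struct, clobbering the renewal.
	daemonCopy.Status = "delivered"
	save(path, daemonCopy)

	fmt.Printf("expirationMs on disk: %d (renewal wrote 2)\n", load(path).ExpirationMs)
	// Prints: expirationMs on disk: 1 (renewal wrote 2)
}
```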
Proposed fix (any one of three; pick whichever fits the codebase best)
Option 1 — Reload before mutate (smallest diff, recommended)
In Update(), re-read the file from disk before applying the mutation:
```go
func (s *gmailWatchStore) Update(fn func(*gmailWatchState) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Re-read state from disk before mutating to pick up any concurrent writes.
	if data, err := os.ReadFile(s.path); err == nil {
		var fresh gmailWatchState
		if err := json.Unmarshal(data, &fresh); err == nil {
			s.state = fresh
		}
	} else if !errors.Is(err, os.ErrNotExist) {
		return fmt.Errorf("reload state: %w", err)
	}
	if err := fn(&s.state); err != nil {
		return err
	}
	return s.Save()
}
```
This makes serve always see the latest on-disk values, so its Save() no longer reverts concurrent writes. Adds one ReadFile per push (~1KB JSON parse, microseconds). The in-process mutex still serializes intra-process callers.
Caveat: still vulnerable to a TOCTOU race between two processes calling Update() simultaneously (use flock if you want strict cross-process serialization). For the observed bug this is fine because renew doesn't go through Update() — it calls Save() directly via *s = updated.
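If strict cross-process serialization is ever wanted, an advisory flock around the read-modify-write cycle would close that gap. A rough Unix-only sketch (the helper name and lock-file convention are made up here, not existing gog code; assumes `os` and `syscall` imports):

```go
// Sketch only: serialize cross-process access to the state file with an
// advisory flock on a sibling .lock file. Unix-specific (syscall.Flock).
func withStateLock(statePath string, fn func() error) error {
	f, err := os.OpenFile(statePath+".lock", os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	// fn would run the reload → mutate → Save() cycle while the lock is held.
	return fn()
}
```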
Option 2 — Selective field save
Add a SaveFields(fieldNames ...string) method that does a JSON merge-patch on disk: read existing JSON, mutate only the named keys, write back. Deeper change, but eliminates the "Save() writes everything" footgun entirely.
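A rough sketch of what Option 2 could look like (`SaveFields` does not exist today; this assumes the same file and imports as the snippets above, i.e. `encoding/json`, `errors`, `os`):

```go
// Hypothetical SaveFields sketch (not existing API): read the current JSON
// from disk, overlay only the requested keys from the in-memory state, and
// write the merged document back. Field names are JSON keys,
// e.g. "lastDeliveryStatus".
func (s *gmailWatchStore) SaveFields(fields ...string) error {
	onDisk := map[string]json.RawMessage{}
	if data, err := os.ReadFile(s.path); err == nil {
		if err := json.Unmarshal(data, &onDisk); err != nil {
			return err
		}
	} else if !errors.Is(err, os.ErrNotExist) {
		return err
	}
	mem, err := json.Marshal(s.state)
	if err != nil {
		return err
	}
	inMem := map[string]json.RawMessage{}
	if err := json.Unmarshal(mem, &inMem); err != nil {
		return err
	}
	for _, f := range fields {
		if v, ok := inMem[f]; ok {
			onDisk[f] = v
		}
	}
	payload, err := json.MarshalIndent(onDisk, "", " ")
	if err != nil {
		return err
	}
	return os.WriteFile(s.path, append(payload, '\n'), 0o600)
}
```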
Option 3 — Split the storage
Move ephemeral delivery-status fields (LastDeliveryStatus, LastDeliveryAtMs, LastDeliveryStatusNote, LastPushMessageID) to a separate <account>.delivery.json sidecar file. The main state file becomes an immutable-after-renew document. No more shared write surface. Biggest refactor but cleanest separation.
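For illustration, the sidecar could be as small as this (struct name, field types, and JSON keys are guesses based on the field names above, not the actual gog schema):

```go
// Sketch of the split (names invented): ephemeral delivery bookkeeping moves
// to its own sidecar document, so the main state file is only written by
// start/renew.
type gmailWatchDeliveryState struct {
	LastDeliveryStatus     string `json:"lastDeliveryStatus,omitempty"`
	LastDeliveryStatusNote string `json:"lastDeliveryStatusNote,omitempty"`
	LastDeliveryAtMs       int64  `json:"lastDeliveryAtMs,omitempty"`
	LastPushMessageID      string `json:"lastPushMessageId,omitempty"`
}
```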
I'd recommend Option 1 — it fixes the observed bug with the smallest possible diff.
Workaround for users hitting this now
Restart the `serve` daemon after every `renew` so it reloads fresh state into memory:

```sh
gog gmail watch renew && systemctl restart gog-watch-serve
# or, in PM2:
gog gmail watch renew && pm2 restart kayley-email-watch
```
This works but causes a brief Pub/Sub push delivery interruption (~5s) during the restart, and Pub/Sub will retry pushes that arrive during the gap, so it's not lossless.
Why this is sneaky
The bug is invisible if you only ever run serve OR renew, but never both concurrently. It surfaces specifically in deployments that:
- Run `serve` continuously (the standard pattern per `docs/watch.md`)
- Schedule independent `renew` calls (a cron, a watchdog, or a separate process)
Both patterns are blessed by the official docs, so any production deployment following the recommended setup will hit this.
What makes this worse is that each renewal still succeeds against the API, so the watch stays alive; only the locally recorded expiration gets reverted minutes later, which is why most users won't notice. Meanwhile any watchdog that keys off the frozen, near-expiry `expirationMs` (like our heartbeat) keeps triggering renewals, so the first symptom is usually unexplained CPU/network traffic or an alert from a monitoring system that's tracking renewal cadence.
Tests
If accepted, suggest adding:
- A unit test in `gmail_watch_state_more_test.go`: open a store, mutate the file out-of-band (simulating a concurrent `renew`), call `Update()` to mutate a different field, and assert that the out-of-band fields survived (a rough sketch follows below).
- An integration test in `gmail_watch_serve_test.go`: spin up `serve`, write a fresh `expirationMs` to disk, simulate a Pub/Sub push, and assert the disk `expirationMs` still matches the fresh value.
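Here is a rough sketch of the first test, assuming the reload-before-mutate fix from Option 1 is in place. Constructing `gmailWatchStore` via a struct literal and the field types are assumptions; adapt to the package's existing test helpers and the usual `testing`, `os`, `path/filepath`, and `encoding/json` imports:

```go
// Sketch: an out-of-band write to the state file must survive a subsequent
// Update() that only touches a delivery field.
func TestUpdatePreservesOutOfBandWrites(t *testing.T) {
	path := filepath.Join(t.TempDir(), "acct.json")

	store := &gmailWatchStore{path: path}
	store.state.ExpirationMs = 1 // value the "daemon" loaded at startup
	if err := store.Save(); err != nil {
		t.Fatal(err)
	}

	// Simulate a concurrent `renew` from another process: rewrite the file
	// out-of-band with a fresh expiration.
	fresh := store.state
	fresh.ExpirationMs = 2
	payload, _ := json.MarshalIndent(fresh, "", " ")
	if err := os.WriteFile(path, append(payload, '\n'), 0o600); err != nil {
		t.Fatal(err)
	}

	// The daemon now records a delivery via Update(), touching only a
	// delivery field.
	if err := store.Update(func(st *gmailWatchState) error {
		st.LastDeliveryStatus = "delivered"
		return nil
	}); err != nil {
		t.Fatal(err)
	}

	// With reload-before-mutate, the out-of-band expiration survives.
	var onDisk gmailWatchState
	data, _ := os.ReadFile(path)
	_ = json.Unmarshal(data, &onDisk)
	if onDisk.ExpirationMs != 2 {
		t.Fatalf("expirationMs = %d, want 2 (out-of-band renew was clobbered)", onDisk.ExpirationMs)
	}
}
```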
Happy to contribute either if helpful — this is a great little CLI and we want to keep using it. Thanks for building it. 💛