Watch/WatchTree silently miss all events after a Sentinel failover #7

@Kuppit

Hi,

Cross-posting from traefik/traefik#12965 since the actual fix lives here. cc @nmengin who already has context on the Traefik side.

I spent some time digging into this with Claude Opus 4.7 to figure out exactly what was going on, and the result points pretty clearly at kvtools/redis rather than at Traefik.

Quick summary: in makeStore (redis.go:181), we run CONFIG SET notify-keyspace-events KEA once, at store creation time, on whatever the current master is. That config is local to the Redis instance and is not replicated to replicas. When Sentinel promotes a replica to master after a failover, the new master does not have the config, so it emits no keyspace events, and every Watch / WatchTree consumer ends up subscribed to a silent channel forever (or until the store is recreated).

I reproduced it with a minimal docker-compose setup (master + replica + sentinel + traefik). Before failover, routes written to Redis flow through to Traefik fine. I then stop the master, wait for the replica to be promoted, and write new routes to the new master: they never show up. If I run CONFIG SET notify-keyspace-events KEA by hand on the new master and do nothing else, the routes appear immediately (both the old and the new ones). So that is really the only thing missing.
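
To make the sequence above concrete, here is a toy in-memory model of it. It is not the library's code and does not talk to Redis: `instance`, `cluster`, `write` and `failover` are hypothetical names made up for this illustration. The only thing it models is that the notification setting lives on one instance and does not follow the master role.

```go
package main

import "fmt"

// instance is a toy stand-in for one Redis server. Each instance keeps its
// own notify-keyspace-events value, mirroring the fact that CONFIG SET is
// local to the instance it was sent to.
type instance struct {
	name         string
	notifyEvents string // "" means keyspace notifications are disabled
}

// write simulates a key write and reports whether a keyspace event would be
// emitted for it on this instance.
func (i *instance) write(key string) bool {
	return i.notifyEvents != ""
}

// cluster models a Sentinel-managed pair with a single current master.
type cluster struct {
	master  *instance
	replica *instance
}

// failover promotes the replica. The role moves, the config does not.
func (c *cluster) failover() {
	c.master, c.replica = c.replica, c.master
}

func main() {
	c := &cluster{
		master:  &instance{name: "redis-1"},
		replica: &instance{name: "redis-2"},
	}

	// Store creation: the config is applied once, on the current master only.
	c.master.notifyEvents = "KEA"
	fmt.Println("before failover:", c.master.write("routes/a")) // true

	// Sentinel promotes the replica, which never received the config.
	c.failover()
	fmt.Println("after failover:", c.master.write("routes/b")) // false

	// Manual CONFIG SET on the new master restores the events.
	c.master.notifyEvents = "KEA"
	fmt.Println("after manual fix:", c.master.write("routes/c")) // true
}
```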

Worth noting: contrary to what we initially thought in the Traefik issue, this is not a stuck PubSub problem. The go-redis logs show that the PubSub does reconnect to the new master correctly; it just listens on a channel that emits nothing.

The bug affects every consumer of the lib that uses Watch or WatchTree in Sentinel mode, not just Traefik. Traefik is just the most visible consumer.

Two possible approaches for the fix:

  1. Reapply the CONFIG SET at the start of every Watch and WatchTree call. Since consumers re-call those methods after each connection failure (which is what Traefik does via its retry loop), this guarantees the config is always applied to the current master. It is a few lines per method, simple, and also defensive against the setting being reset out-of-band (for example by a manual CONFIG SET or a master restart).

  2. Use go-redis's OnConnect hook in newClient to reapply the config on every (re)connection. Architecturally cleaner, but a bit more subtle and slightly more code.
