bug(discord): rollout restart causes silent Gateway session conflict — bot stops receiving events #455

@thepagent

Description

When kubectl rollout restart is used to restart an OpenAB deployment, the new pod connects to Discord Gateway while the old pod is still in graceful termination (still holding its Gateway session). Discord sees two concurrent Gateway sessions for the same bot token and silently drops the new one — no error, no disconnect, just zero events delivered.

The new pod logs discord bot connected user=AgentBroker, but the bot never receives any message events. This is extremely hard to diagnose because the logs contain no error or warning.

  ┌─ rollout restart (BROKEN) ──────────────────────────────────────────┐
  │                                                                     │
  │  time ──────────────────────────────────────────────────────►       │
  │                                                                     │
  │  Old Pod  ████████████████████░░░░░░░░░░                            │
  │           │ Gateway A (active) │ terminating │                      │
  │           │ receives events ✅  │ still open  │                      │
  │                                                                     │
  │  New Pod            ░░░░████████████████████████████████            │
  │                     │init│ Gateway B connected                      │
  │                          │ "bot connected" in logs ✅                │
  │                          │ receives events ❌ (silently dropped)     │
  │                                                                     │
  │  Discord   ──────────────┤                                          │
  │  Gateway:  2 sessions    │ same token = drop newer session          │
  │            for same      │ no error sent to client                  │
  │            bot token     │                                          │
  └─────────────────────────────────────────────────────────────────────┘

  ┌─ scale 0 → 1 (WORKS) ──────────────────────────────────────────────┐
  │                                                                     │
  │  time ──────────────────────────────────────────────────────►       │
  │                                                                     │
  │  Old Pod  ████████████░░░                                           │
  │           │ Gateway A  │ terminated                                 │
  │           │            │ session closed ✅                           │
  │                                                                     │
  │                    ← 5s gap →                                       │
  │                                                                     │
  │  New Pod                    ░░░░████████████████████████            │
  │                             │init│ Gateway B connected              │
  │                                  │ only session ✅                   │
  │                                  │ receives events ✅                │
  │                                                                     │
  │  Discord   ──────────────────────┤                                  │
  │  Gateway:  1 session at a time   │ events delivered normally        │
  └─────────────────────────────────────────────────────────────────────┘

Workaround:

kubectl scale deployment/openab-kiro --replicas=0 && sleep 5 && kubectl scale deployment/openab-kiro --replicas=1

Steps to Reproduce

  1. Deploy OpenAB with the Discord adapter on Kubernetes
  2. Run kubectl rollout restart deployment/openab-kiro
  3. Wait for the new pod to show 1/1 Running and for its logs to show discord bot connected
  4. Send @Bot hello in the allowed Discord channel
  5. Observe: no response, and no log entries for the message event

Expected Behavior

After rollout restart, the new pod should receive Discord Gateway events normally. Suggested fixes:

  • Option A (recommended): Add a preStop hook that explicitly closes the Discord Gateway connection (send close frame / shutdown shard) before the pod terminates, so the old session is gone before the new pod connects
  • Option B: Add a startup probe or health check that detects "connected but no events received within N seconds" and forces a Gateway reconnect
  • Option C: Document the scale 0 → 1 workaround in the troubleshooting guide
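
A rough sketch of the detection logic behind Option B, using only the Rust standard library. It is not OpenAB or serenity code: EventWatchdog, Health, and the 30-second grace period are hypothetical names and values chosen for illustration. The assumption is that the bot's Gateway event handler would call on_connected when the session is ready and on_event for every dispatched event, while a periodic task calls check and forces a reconnect on Stalled.

```rust
use std::time::{Duration, Instant};

/// Result of a watchdog check (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
enum Health {
    /// Not connected yet, or connected and still inside the grace period.
    Waiting,
    /// At least one event has arrived since the last connect.
    Healthy,
    /// Connected, but no event arrived within the grace period.
    Stalled,
}

/// Tracks "connected but no events received within N seconds".
struct EventWatchdog {
    connected_at: Option<Instant>,
    last_event_at: Option<Instant>,
    grace: Duration,
}

impl EventWatchdog {
    fn new(grace: Duration) -> Self {
        Self { connected_at: None, last_event_at: None, grace }
    }

    /// Call when the Gateway session reports ready; resets event tracking.
    fn on_connected(&mut self, now: Instant) {
        self.connected_at = Some(now);
        self.last_event_at = None;
    }

    /// Call for every Gateway event delivered to the bot.
    fn on_event(&mut self, now: Instant) {
        self.last_event_at = Some(now);
    }

    /// Decide whether the current session looks alive at time `now`.
    fn check(&self, now: Instant) -> Health {
        match (self.connected_at, self.last_event_at) {
            (None, _) => Health::Waiting,
            (Some(_), Some(_)) => Health::Healthy,
            (Some(connected), None) => {
                if now.saturating_duration_since(connected) >= self.grace {
                    Health::Stalled
                } else {
                    Health::Waiting
                }
            }
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut watchdog = EventWatchdog::new(Duration::from_secs(30));

    watchdog.on_connected(start);
    println!("{:?}", watchdog.check(start + Duration::from_secs(10))); // Waiting
    println!("{:?}", watchdog.check(start + Duration::from_secs(31))); // Stalled

    watchdog.on_event(start + Duration::from_secs(32));
    println!("{:?}", watchdog.check(start + Duration::from_secs(40))); // Healthy
}
```

One caveat with this approach: a bot in a genuinely quiet server may see no events for longer than the grace period, so the threshold and the choice of which events count as proof of life would need tuning to avoid needless reconnects.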

Environment

  • OpenAB v0.7.8-beta.5 (ghcr.io/openabdev/openab:0.7.8-beta.5)
  • Kubernetes: OrbStack (local k3s)
  • Deployment strategy: Recreate (PVC-backed)
  • Discord library: serenity 0.12.x
  • Observed with AgentBroker (kiro-cli agent) — AgentDealer on the same cluster with a different bot token was unaffected
