Add ALB idle timeout configuration guidance for WebSocket workloads #8929

@sadohert

Summary

The AWS ALB default idle timeout is 60 seconds. The Mattermost server sends WebSocket ping frames every 60 seconds. Because the two intervals are equal, there is a race: the ALB's idle timer can expire just as the next ping is due, so the ALB drops the WebSocket connection and the client sees a ~10s TCP timeout on its next write.
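
As a rough illustration of the timing argument, the sketch below classifies the margin between a proxy idle timeout and the ping interval. `margin_status` is a hypothetical helper written for this issue, not part of any Mattermost or AWS tooling, and it models only the simple rule that the idle timeout should be strictly greater than the ping interval.

```shell
# Sketch only: margin_status is a hypothetical helper.
# $1 = WebSocket ping interval (seconds), $2 = proxy idle timeout (seconds)
margin_status() {
  if [ "$2" -le "$1" ]; then
    # The idle timer can fire before, or at the same moment as, the next ping.
    echo "at-risk"
  else
    echo "ok (margin $(( $2 - $1 ))s)"
  fi
}

margin_status 60 60    # ALB default vs. 60s ping      -> at-risk
margin_status 60 90    # nginx proxy_read_timeout 90s  -> ok (margin 30s)
margin_status 60 300   # recommended ALB setting       -> ok (margin 240s)
```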

Evidence

A lab load test with the default 60s ALB idle timeout showed a P99 latency of 8.9s and a 61% timeout error rate, which prevented the coordinator from scaling past ~600 users. The same workload through nginx (proxy_read_timeout 90s) ran to 1,500 users without issue. The Mattermost WebSocket client source confirms that the server-side ping interval is exactly 60s.

Recommendation

Add a configuration note to the AWS deployment / load balancer documentation recommending ALB idle timeout be set to 300 seconds (5 minutes):

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <alb-arn> \
  --attributes Key=idle_timeout.timeout_seconds,Value=300

Or via AWS Console: EC2 → Load Balancers → select ALB → Attributes → Edit → Idle timeout → 300.
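
To confirm the change took effect, the current value can be read back with `describe-load-balancer-attributes`. This is a minimal sketch; `show_idle_timeout` is a hypothetical wrapper name, and the ARN passed to it is whatever ARN was used in the modify command.

```shell
# Sketch only: show_idle_timeout is a hypothetical wrapper around the AWS CLI.
# $1 = load balancer ARN
show_idle_timeout() {
  aws elbv2 describe-load-balancer-attributes \
    --load-balancer-arn "$1" \
    --query "Attributes[?Key=='idle_timeout.timeout_seconds'].Value" \
    --output text
}
```

Calling `show_idle_timeout <alb-arn>` after the update should print 300 if the attribute was applied.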

Additional Investigation Note

Pull CloudWatch metrics during next high-load event (e.g. 5am EST spike):

  • ActiveConnectionCount
  • ConsumedLCUs
  • RejectedConnectionCount
  • TargetResponseTime

If RejectedConnectionCount > 0 or ConsumedLCUs is near limits, an NLB may be a better long-term fit for this WebSocket-heavy workload.
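
One possible way to pull these metrics from the CLI is sketched below. `alb_metric` is a hypothetical wrapper; the `LoadBalancer` dimension value (of the form `app/<name>/<id>`, taken from the end of the ALB ARN) and the time window are placeholders to fill in, and the 300-second period is an assumption rather than a recommendation.

```shell
# Sketch only: alb_metric is a hypothetical wrapper around the AWS CLI.
# $1 = metric name, $2 = statistic (e.g. Sum, Average),
# $3 = LoadBalancer dimension value, $4 = start time, $5 = end time (ISO 8601, UTC)
alb_metric() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name "$1" \
    --statistics "$2" \
    --dimensions "Name=LoadBalancer,Value=$3" \
    --start-time "$4" \
    --end-time "$5" \
    --period 300
}
```

For example, `alb_metric RejectedConnectionCount Sum <lb-dimension> <start> <end>` returns the datapoints for the chosen window; `Sum` suits the count and LCU metrics, while `Average` is the more useful statistic for TargetResponseTime.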

Workstream

Implementation & Onboarding — validated during Midmarket HA Postgres load testing.
