
Info: Everything we do on 1.0.6 to minimise 503s #12183

Closed
@Stono

I've had several chats with people recently, so I'm putting this here to capture everything we do on 1.0.6 to deal with 503s, and then people can tell me what shouldn't be required as of 1.1.x.

With the combination of all of these things, we see very few 503s in the mesh, and basically none at the edge.

On the VirtualService for every service, we configure the Envoy retry headers:

spec:
  http:
  - appendHeaders:
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
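
For context, here is a minimal sketch of a complete VirtualService carrying those headers; the service name and namespace are hypothetical:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service                 # hypothetical name
spec:
  hosts:
  - my-service.my-namespace.svc.cluster.local
  http:
  - appendHeaders:
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
    route:
    - destination:
        host: my-service.my-namespace.svc.cluster.local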

On the DestinationRule for high-QPS applications (3k req/sec over 6 pods), we configure outlier detection; during a pod restart this took us from 400-500 errors down to 2-5.

spec:
  trafficPolicy:
    outlierDetection:
      maxEjectionPercent: 50
      baseEjectionTime: 30s
      consecutiveErrors: 5
      interval: 30s
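
For completeness, a sketch of the full DestinationRule (the name and host are hypothetical):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service                 # hypothetical name
spec:
  host: my-service.my-namespace.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50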

On the pods, we configure the application container with a preStop sleep, which gives the unready state of the pod (during termination) time to propagate to the other Envoys and for traffic to drain:

    lifecycle:
      preStop:
        exec:
          command:
          - sleep
          - "10"

On the Envoy sidecar, we have a custom preStop hook that waits for the primary application to stop listening:

#!/bin/bash
set -e

# If Envoy isn't running there's nothing to drain, so exit immediately.
if ! pidof envoy &>/dev/null; then
  exit 0
fi

# Likewise, if pilot-agent isn't running, just exit.
if ! pidof pilot-agent &>/dev/null; then
  exit 0
fi

# Block until nothing other than Envoy is listening on a port,
# i.e. the primary application has shut down its listeners.
while [ $(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs) -ne 0 ]; do
  sleep 3;
done
exit 0
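
That script runs as the preStop hook on the istio-proxy sidecar, roughly like this (the script path is hypothetical):

- name: istio-proxy
  lifecycle:
    preStop:
      exec:
        command:
        - /bin/bash
        - -c
        - /usr/local/bin/wait-for-app.sh   # hypothetical path to the script above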

In the Istio mesh config, we set policyCheckFailOpen: true, so that policy checks fail open (requests are allowed through) if the Mixer policy service is unavailable.
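
For reference, that lives in the mesh section of the istio ConfigMap (a sketch, assuming a default istio-system install):

apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |-
    # Fail open rather than rejecting requests when the Mixer
    # policy service is unreachable or times out.
    policyCheckFailOpen: true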
