Description
I've had several chats with people recently, so I'm putting this here to capture everything we do on 1.0.6 to deal with 503s. People can then tell me what shouldn't be required as of 1.1.x.
With all of these things combined, we see very few 503s in the mesh and basically none at the edge.
On the VirtualService for every service, we configure the envoy retry headers:
spec:
  http:
  - appendHeaders:
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
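For anyone copying this, here is a minimal sketch of where that fragment sits in a complete 1.0-style VirtualService. The name and host `my-service` are placeholders, not taken from the setup described here, and the single route is only an assumption to make the manifest complete.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service            # placeholder name
spec:
  hosts:
  - my-service                # placeholder host
  http:
  - appendHeaders:
      # Read by the Envoy router: retry up to 10 times...
      x-envoy-max-retries: "10"
      # ...on 502/503/504, connection failures and refused streams.
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
    route:
    - destination:
        host: my-service      # placeholder destination
```

The headers are appended per route, so a VirtualService with several `http` entries would need them on each one.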
On the DestinationRule for high-QPS applications (3k/sec over 6 pods), we configure outlier detection. This reduced the errors during a pod restart from 400-500 to 2-5:
spec:
  trafficPolicy:
    outlierDetection:
      maxEjectionPercent: 50
      baseEjectionTime: 30s
      consecutiveErrors: 5
      interval: 30s
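As a rough sketch, the same settings in a complete DestinationRule look like the following. The name and host `my-service` are placeholders. For scale, 3k/sec over 6 pods is about 500 requests/sec per pod, and a 50% cap on 6 pods means at most 3 can be ejected at once.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service            # placeholder name
spec:
  host: my-service            # placeholder host
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5    # errors in a row before a pod is ejected
      interval: 30s           # how often the ejection analysis runs
      baseEjectionTime: 30s   # minimum time a pod stays ejected
      maxEjectionPercent: 50  # never eject more than half the pods (3 of 6)
```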
On the pods, we configure the application container to have a preStop sleep, which gives time for the unready state of the pod (during termination) to propagate to the other Envoys and for the traffic to drain:
lifecycle:
  preStop:
    exec:
      command:
      - sleep
      - "10"
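To show where this goes, here is a minimal sketch of a Pod spec carrying the hook. The pod name, container name and image are hypothetical. The one thing worth checking is `terminationGracePeriodSeconds`: the preStop time counts against it, so it has to be comfortably longer than the sleep (30s is the Kubernetes default).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                          # hypothetical name
spec:
  # The grace period covers the preStop sleep plus the app's own shutdown.
  terminationGracePeriodSeconds: 30
  containers:
  - name: app                           # hypothetical container name
    image: example.com/my-app:1.0       # hypothetical image
    lifecycle:
      preStop:
        exec:
          # Requires a `sleep` binary in the container image.
          command:
          - sleep
          - "10"
```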
On envoy, we have a custom pre-stop hook that waits for the primary application to stop listening:
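#!/bin/bash
set -e
# Nothing to wait for if the sidecar processes are already gone.
if ! pidof envoy &>/dev/null; then
  exit 0
fi
if ! pidof pilot-agent &>/dev/null; then
  exit 0
fi
# Block while any process other than envoy still has a TCP listener open.
while [ "$(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs)" -ne 0 ]; do
  sleep 3
done
exit 0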
In the Istio mesh config we set policyCheckFailOpen: true, so that requests are allowed through when the Mixer policy service cannot be reached.
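A sketch of where that setting lives, assuming a default install in which the mesh config is stored under the `mesh` key of the `istio` ConfigMap in `istio-system`. This is a fragment: the real `mesh` block contains other settings that must be kept alongside it.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio                 # assumed default ConfigMap name
  namespace: istio-system     # assumed default namespace
data:
  mesh: |-
    # Allow traffic when the Mixer policy check cannot be reached.
    policyCheckFailOpen: true
```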