Info: Everything we do on 1.0.6 to minimise 503s #12183

Closed
Stono opened this issue Mar 1, 2019 · 20 comments

6 participants
@Stono (Contributor) commented Mar 1, 2019

I've had several chats with people recently, so I'm putting this here to capture everything we do on 1.0.6 to deal with 503s, and then people can tell me what shouldn't be required as of 1.1.x.

With the combination of all these things we see very few 503s in the mesh, and basically none at the edge.

On the VirtualService for every service, we configure the envoy retry headers:

spec:
  http:
  - appendHeaders:
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
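
For anyone wondering where that fragment sits, here is a minimal sketch of a complete VirtualService (the service name, namespace and destination host are placeholders, not from this thread):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service            # placeholder name
  namespace: my-namespace     # placeholder namespace
spec:
  hosts:
  - my-service                # placeholder host
  http:
  - appendHeaders:
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
    route:
    - destination:
        host: my-service      # placeholder destination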

On the DestinationRule for high-QPS applications (3k req/sec over 6 pods), we configure outlier detection; this took us from 400-500 errors during a pod restart down to 2-5.

spec:
  trafficPolicy:
    outlierDetection:
      maxEjectionPercent: 50
      baseEjectionTime: 30s
      consecutiveErrors: 5
      interval: 30s
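
As a minimal sketch of the full resource this fragment would live in (the name and host are placeholders):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service            # placeholder name
spec:
  host: my-service            # placeholder host
  trafficPolicy:
    outlierDetection:
      maxEjectionPercent: 50
      baseEjectionTime: 30s
      consecutiveErrors: 5
      interval: 30s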

On the pods, we configure the application container to have a preStop sleep, which gives time for the unready state of the pod (during termination) to propagate to other envoys and for traffic to drain:

    lifecycle:
      preStop:
        exec:
          command:
          - sleep
          - "10"

On envoy, we have a custom pre-stop hook that waits for the primary application to stop listening:

#!/bin/bash
set -e

# If envoy or pilot-agent aren't running in this pod, there's nothing to wait for.
if ! pidof envoy &>/dev/null; then
  exit 0
fi

if ! pidof pilot-agent &>/dev/null; then
  exit 0
fi

# Block until no process other than envoy is still listening on a TCP port,
# i.e. the primary application has finished shutting down.
while [ $(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs) -ne 0 ]; do
  sleep 3;
done
exit 0
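
The thread doesn't say how this hook is attached to the proxy container. One way it could plausibly be done on 1.0.x (an assumption on my part, not a confirmed recipe) is to bake the script into a custom proxy image and add a lifecycle hook to the istio-proxy container template in the istio-sidecar-injector ConfigMap, along these lines:

lifecycle:
  preStop:
    exec:
      command:
      - /usr/local/bin/wait-for-app-shutdown.sh   # hypothetical path to the script above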

In the Istio mesh config we set policyCheckFailOpen: true.
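
For reference, in a default install policyCheckFailOpen lives in the mesh configuration, i.e. the mesh entry of the istio ConfigMap in istio-system (or the global.policyCheckFailOpen Helm value). A minimal sketch:

apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    # allow traffic through if the Mixer policy check cannot be completed
    policyCheckFailOpen: true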

@stefanprodan commented Mar 2, 2019

How do you add the pre-stop hook to Envoy? Can it be done in the injection webhook config?

@Stono (Contributor Author) commented Mar 2, 2019 (comment minimized)

@Stocco commented Mar 2, 2019

Hello @Stono, thanks for the info. We are wary of adding the retry policy to all services because we're not sure our apps would handle duplicate transactions if Envoy decides to resend a request for whatever reason.
However, on Istio 1.0.5 we are no longer seeing the 503s we saw on versions <= 1.0.2; we only get 503/404 errors when Pilot starts to have problems.
Now for clarification: what does the outlierDetection part mean? Does it affect Envoy's circuit-breaking definitions?

@Stono (Contributor Author) commented Mar 2, 2019 (comment minimized)

@mumoshu commented Mar 22, 2019

@Stono Hey! Thanks for sharing this. It really helped me shape the general direction I should go in.

> On the VirtualService for every service, we configure the envoy retry headers:

Sounds great. If I may add one thing: we should ensure that only idempotent requests are retried.

Instead of:

spec:
  http:
  - appendHeaders:
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream

We should probably do:

spec:
  http:
  - match:
    - method:
        exact: GET
    appendHeaders:
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
    route:
    - destination:
        host: ratings.prod.svc.cluster.local
  - route:
    - destination:
        host: ratings.prod.svc.cluster.local

Sorry if you've already done this, or if my suggestion doesn't work for some reason. I just couldn't help leaving a note :)

@Stono (Contributor Author) commented Mar 22, 2019

@mumoshu I actually think for gateway-error,connect-failure,refused-stream you're safe to retry POST, PUT etc too, as those errors mean the request never made it to the target application anyhoo.

It's all a bit moot in 1.1.0 anyway as this retry stuff happens transparently!

@mumoshu commented Mar 22, 2019

Also, I agree that the prestop sleep is a must-have, and as far as I can tell it isn't actually Istio or workload specific. I even hope we can extend Istio's Sidecar resource to accept a prestop sleep length.

Envoy, and hence Istio, is eventually consistent. To stop our Envoy/Istio sidecars and ingress gateways from accidentally sending traffic to "just-stopped" endpoints, we need special care.

(I did have to add such prestop sleeps to my envoy upstreams when using a vanilla Envoy with STRICT_DNS as a front proxy, for the same reason.)

The "special care" here seems to be a "grace period" that prevents the envoy sidecar and the app container from exiting, after your pod receives SIGTERM, until the other envoys stop sending traffic to it.

Probably the only way to implement this "grace period" today is to build your own sidecar image and add the prestop sleep?
Then I think the Sidecar resource had better be enhanced to accept the sleep length, for ease of use. Just my two cents.

@mumoshu commented Mar 22, 2019

@Stono Thanks for clarifying. Maybe you're correct!

But my impression was that, if gateway-error stands for 502, Envoy could still receive it after it has transferred your POST request to the upstream, the upstream has processed it and prepared a response, and only then the connection between Envoy and the upstream failed. We wouldn't want Envoy to retry the POST request in that case, would we?

I hope I'm overthinking this and it doesn't happen in reality :)

@iandyh commented Mar 22, 2019

@mumoshu Hello. I've sent a PR to add the preStop to the gateway deployment.
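
For anyone wanting to do this by hand before such a PR lands, the idea is presumably something like the following patch to the istio-ingressgateway Deployment (the sleep length is an example from the discussion below, not taken from the PR):

spec:
  template:
    spec:
      containers:
      - name: istio-proxy
        lifecycle:
          preStop:
            exec:
              command:
              - sleep
              - "30"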

@mumoshu commented Mar 22, 2019

@iandyh Hey! Just took a quick look at your PR #12342 - I think we need to consider two cases.

One is sidecar -> gateway, and the other is gateway -> sidecar and sidecar -> sidecar.

Your workaround of adding sleep 30 to the gateway should work, but only in the sense that all the other sidecars that want to talk to the gateway get a 30-second grace period before they stop sending traffic to it.

That is, we may also need to add prestop sleeps to the sidecars, for the gateway -> sidecar and sidecar -> sidecar cases.

@iandyh commented Mar 22, 2019

@mumoshu the sidecar actually does not talk to the gateway, since the gateway only routes traffic into the cluster.

We are actually going to add the preStop hook to all the deployments in our cluster, because regardless of whether they use Istio or not, there is a chance of dropped requests during a rolling update of these deployments.

@mumoshu commented Mar 22, 2019

@iandyh Thanks for confirming! You're probably talking about the sidecar -> gateway case, right? I believe you're correct on that then 👍

I'd rephrase my idea as: for cases like gateway -> sidecar and sidecar -> sidecar, we need prestop sleeps.

@mumoshu commented Mar 22, 2019

> We are actually going to add the preStop hook to all the deployments in our cluster, because regardless of whether they use Istio or not, there is a chance of dropped requests during a rolling update of these deployments.

@iandyh I haven't encountered it yet myself, but to confirm - you're talking about the case where kube-proxy + iptables + Service endpoints don't catch up with pod terminations fast enough?

@iandyh commented Mar 23, 2019

> > We are actually going to add the preStop hook to all the deployments in our cluster, because regardless of whether they use Istio or not, there is a chance of dropped requests during a rolling update of these deployments.
>
> @iandyh I haven't encountered it yet myself, but to confirm - you're talking about the case where kube-proxy + iptables + Service endpoints don't catch up with pod terminations fast enough?

The issue we are discussing here is actually entirely separate from the main thread. You can read the details here: https://freecontent.manning.com/handling-client-requests-properly-with-kubernetes/

@mumoshu commented Mar 25, 2019

@iandyh Thanks! I agree with you that what you're considering is what's described in https://freecontent.manning.com/handling-client-requests-properly-with-kubernetes/.

Can we agree that not only your app containers but also the Istio sidecar containers need a prestop sleep? 😃

My understanding is that (1) all the containers in a pod, hence your app and the sidecar, receive SIGTERM concurrently, and (2) envoy (without a prestop sleep) stopping before the app container (with a proper prestop sleep) still results in intermittent errors in the following architecture:

your app pod A ---> sidecar for A ---> sidecar for B ---> your app pod B

@iandyh commented Mar 25, 2019

@mumoshu yes, the only correct way of doing this is for Envoy (the sidecar) to keep serving traffic even while it is being killed by the cluster (while, at the same time, no new requests are sent to this Envoy), and to drain for some time - either a perfect drain or a timed drain (what Lyft does). After draining, SIGKILL.
According to this PR: #11630, Envoy will still reject new connections, which means we still need a preStop hook.

@mumoshu commented Mar 25, 2019

@iandyh Fully understood.

If you don't mind me paraphrasing - yes, I agree that even after #11630 our sidecars need a preStop sleep, so that they won't reject new requests that arrive after SIGTERM but before the other envoys actually stop sending traffic.

Thanks for sharing your insight ☺️

@Stono (Contributor Author) commented Mar 27, 2019

For those following this thread: we've removed pretty much everything we were doing apart from the lifecycle preStop (so no retry headers or outlier detection now), implemented the Sidecar resource to massively limit the scope of pushes (our push durations have dropped from 30s to 2s), and on 1.1.1 we're not seeing any 503s.
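
For anyone curious what "implemented the Sidecar resource" means in practice, here is a minimal sketch of the 1.1 Sidecar resource (the namespace and host list are placeholders, not Stono's actual config); restricting each workload's egress visibility like this is what shrinks Pilot's push scope:

apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: default
  namespace: my-namespace        # placeholder namespace
spec:
  egress:
  - hosts:
    - "./*"                      # only services in this namespace
    - "istio-system/*"           # plus the control plane / gateways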

@Stono (Contributor Author) commented Apr 3, 2019

Gonna close this with the summary of "Update to 1.1.1".

Stono closed this Apr 3, 2019

@dbyron0 commented Apr 5, 2019

@Stono can I ask for a bit of clarification? Sorry if this is clear to everyone else, but there are enough moving parts that I'd like to make sure I get it right. When you say you did the lifecycle preStop, does that mean you added the preStop to all of your app containers? Or does it go somewhere else?
