503 errors when scaling down, or rolling out a new application version #7665

@Stono

Description


Describe the bug
Hey,
We are noticing blips in services under load during Kubernetes rollouts. We observe a handful of 503 errors from istio-proxy on the pod being removed (either because of a rollout or a scale down). This screenshot is from three separate "scale downs":

[Screenshot (2018-08-06 15:29): 503 spikes during three separate scale downs]

When scaling down, this is the sequence of events we observe:

  • pod goes into TERMINATING state and is removed from kubernetes endpoints
  • A handful of the last requests to the pod are reported by istio-proxy as 503
  • those requests are also logged in the upstream calling service as 503
  • application exits
  • istio-proxy exits

As you can see here:

[Screenshot (2018-08-06 15:28): istio-proxy access logs showing 503s on the terminating pod]

At the moment, our only saving grace is that we have configured a retry policy, which means our end users experience a slightly slow request rather than a failure - however, relying on a retry mechanism in this scenario doesn't feel right.
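For context, the retry policy we rely on is the standard Istio `VirtualService` retry mechanism. A minimal sketch of that kind of configuration (the service name `my-service` and the attempt/timeout values here are hypothetical, not our actual settings):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service        # hypothetical service name
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
    retries:
      attempts: 3         # retry a failed request up to 3 times
      perTryTimeout: 2s   # per-attempt timeout
```

This masks the 503s from end users, but each masked failure still shows up as added latency.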

Expected behavior
The istio-proxy on the application being scaled down should not receive any requests after it has entered a TERMINATING state.

Steps to reproduce the bug
As above; I can get on Hangouts and show you this in detail.

The application itself gracefully handles SIGTERM and drains connections; I have confirmed this with load tests without istio-proxy in play. I have also added a preStop hook to the application with istio, to ensure the app doesn't receive a SIGTERM until well after istio-proxy shuts down.
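A sketch of the kind of preStop hook described, assuming a simple sleep-based delay on the application container (the 10-second value is illustrative, not the exact delay used):

```yaml
# In the application container spec of the Deployment:
lifecycle:
  preStop:
    exec:
      # Delay SIGTERM to the app so in-flight requests can drain
      # and endpoint removal can propagate before shutdown begins.
      command: ["/bin/sh", "-c", "sleep 10"]
```

Kubernetes runs the preStop hook before sending SIGTERM to the container, so this buys drain time - but it does not stop Envoy from routing new requests to the terminating pod, which is the behaviour reported here.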

Version
gke 1.10.5, istio 1.0

Is Istio Auth enabled or not?
Yes

Environment
GKE
