ingress with simple routes periodically 503ing (was 404) #1038

Closed
ldemailly opened this issue Oct 5, 2017 · 46 comments

Comments

@ldemailly (Contributor Author) commented Oct 5, 2017

For 404 vs 503: envoyproxy/envoy#1820

For the 404 itself, Matt kindly pointed me at https://envoyproxy.github.io/envoy/configuration/http_conn_man/route_config/route_config.html#config-http-conn-man-route-table
and "validate_clusters". I want to add a test first and then see if that indeed fixes it.

@ldemailly (Contributor Author) commented Oct 6, 2017

I have it reproduced in #1041

mandarjog pushed a commit to mandarjog/istio that referenced this issue Oct 30, 2017
mandarjog pushed a commit that referenced this issue Oct 31, 2017
mandarjog pushed a commit that referenced this issue Oct 31, 2017
* Initial version of deb files for agent and discovery (raw VM discovery WIP)

* Fix build

* Update descriptions

* Use files instead of srcs

* Simplify the PR - for 0.2 we only need agent, standalone discovery needs more work
@ldemailly ldemailly added this to the Istio 0.3 milestone Nov 2, 2017
@kdillane (Contributor) commented Nov 22, 2017

What is the ongoing story here? How close is the team to identifying a root cause? We can help however possible to isolate the issue.

We have begun running some performance tests against applications running in our cluster, but are seeing a stream of 404s, up to 50% of requests at 450 tps. During these tests, we have enabled Horizontal Pod Autoscaling.

When we disabled Horizontal Pod Autoscaling (locking in at 1 Pod), we saw very different results. While the error rate was still non-zero, those errors were likely not caused by the same 404 issue.

Initially we saw a mixture of 404s and 5xx errors. When we enabled liveness and readiness probes, the 5xx status codes dropped significantly.
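For reference, the autoscaler in play was along these lines (a sketch only; the target name and thresholds below are placeholders rather than our real values):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: echo                     # placeholder target Deployment name
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: echo
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80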

@kyessenov (Contributor) commented Nov 22, 2017

We know the root cause of this behavior. There are ongoing efforts:

  1. The next generation of the Envoy configuration protocol should fix this fundamentally. You can read more about this here. Unfortunately, this won't be ready soon since there's a lot of plumbing to do in both Envoy and the management plane.

  2. There are other efforts to enable probes and use them to address the 404s in the shorter term. I'm not involved in that; @andraxylia or @costinm could probably elaborate.

@ldemailly (Contributor Author) commented Nov 22, 2017

There is also a test reproducing the issue: #1041

@kdillane (Contributor) commented Nov 22, 2017

Are there expected timelines for both 1 and 2? At this point we're trying to understand if using Istio is a viable solution for our own timelines.

Would downgrading to 0.1.6 solve the problem?

@kdillane (Contributor) commented Nov 22, 2017

Just a follow-up: we have run another test with 5 Pods, but no Horizontal Pod Autoscaler. For around 5 minutes we saw the consistent 404s, but eventually hit a steady state of 450 tps spread across the 5 Pods. This problem does not appear to be related to throughput, though.

@kdillane (Contributor) commented Nov 27, 2017

Bump, hoping to get an answer to the questions I posted. We are super excited about using Istio, but want to make sure it's possible to use it without triggering (or working around) the 404s. It sounds like this issue surfaces when the Envoy configuration changes, which would be caused by route-rule updates or by the state of individual Pods changing behind a Service (both very common and potentially uncontrollable cases).

Is this only affecting the 0.2 release of Istio? Has anyone invested effort to determine if 0.1.6 is affected?

@andraxylia (Contributor) commented Nov 28, 2017

Are you affected by the Ingress on startup or by the route rule config update? What is your timeline?

For 2), we are working on a few things; the tentative target is end of December (release 0.4):

Envoy will be fixed to return 503 instead of 404 when routing rules change (envoyproxy/data-plane-api#246). We are also investigating configuring an empty cluster.

For Ingress, in the short term we are looking at a combination of readiness probes plus a snapshot of the config on disk to deal with Envoy restarts by the pilot agent.

We have not investigated 0.1.6, but it likely does not have the 404 on route-rule changes, because the 404 is caused by the introduction of the Envoy LDS API in 0.2.
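As a sketch of the readiness-probe idea above (the container name, path, and port here are placeholders, not the actual Istio ingress manifest):

containers:
  - name: ingress-proxy          # placeholder container name
    readinessProbe:
      httpGet:
        path: /healthz           # hypothetical health endpoint
        port: 15000              # placeholder port
      initialDelaySeconds: 1
      periodSeconds: 2
      failureThreshold: 3

The intent is that the gateway pod is not marked Ready, and so receives no traffic, until the proxy has a config it can serve.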

@kdillane (Contributor) commented Nov 28, 2017

> Are you affected by the Ingress on startup or by the route rule config update?

We seem to be affected upon scaling of a Deployment (typically quickly up or down). We can test a routing rule change, but haven't specifically tested that.

> Envoy will be fixed to return 503 instead of 404 when routing rules change (envoyproxy/data-plane-api#246). We are also investigating configuring an empty cluster.

What is the behavior this is fixing in terms of Istio?

> For Ingress, in the short term we are looking at a combination of readiness probes plus a snapshot of the config on disk to deal with Envoy restarts by the pilot agent.

Is this the fix that will be provided by end of December?

@andraxylia (Contributor) commented Nov 28, 2017

Thanks for clarifying that the trigger is the scaling of the Deployment. We need to root-cause this separately from the routing rules change. Can you please post your relevant routing rules and ingress config? What is your overall scale in terms of number of pods and services, and how many pods per service do you have at the beginning and at the end?

For the Envoy fix: this is making Istio return 503 instead of 404 when the RDS response refers to clusters that do not exist. Because Envoy calls Pilot's RDS and CDS APIs separately, in any order, an RDS response may arrive before the corresponding CDS response. Longer term, the fix will be to use the new ADS Envoy v2 API (Aggregated Discovery Service), which does not have this problem.

We hope to get a new Envoy build containing the above fix before the end of December, plus the rest of the fixes (readiness probes, etc.). The release cadence is monthly; if it is not December it will be January.
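For reference, the ADS mode mentioned above is configured in the Envoy v2 bootstrap roughly as follows (a sketch; the xds-grpc cluster name is an assumption standing in for whatever cluster points at Pilot):

dynamic_resources:
  ads_config:
    api_type: GRPC
    grpc_services:
      - envoy_grpc:
          cluster_name: xds-grpc   # assumed name of the cluster pointing at Pilot
  cds_config:
    ads: {}                        # clusters delivered over the single ADS stream
  lds_config:
    ads: {}                        # listeners (and their routes) over the same stream

Because everything arrives over one ordered stream, Pilot can send clusters before the routes that reference them, avoiding the RDS-before-CDS race described above.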

@kdillane (Contributor) commented Nov 29, 2017

I've attached a repro case using an echo container image.
repro.yaml.zip

@mclarke47 commented Dec 7, 2017

Is this still being targeted for 0.3.x?

@kdillane (Contributor) commented Dec 16, 2017

For the repro case I provided, it looks like this may have been in part caused by a bug in Kubernetes (kubernetes/kubernetes#53690). With Horizontal Pod Autoscaling enabled, it was possible for the desired number of replicas to exceed the max number, causing the number of replicas to jump up and down. This did help us to uncover the issue with route change propagation, but the specific repro should stop working as of Kubernetes 1.9.

@ldemailly ldemailly modified the milestones: Istio 0.3, Istio 0.5 Dec 16, 2017
@kdillane (Contributor) commented Jan 15, 2018

Is there some way to follow progress on the effort to address/solve this Issue?

@rshriram rshriram assigned rshriram and unassigned rshriram Jun 12, 2018
@rshriram (Member) commented Jun 12, 2018

@sakshigoel12 we have ingress gateway tests for the 503 and they were passing, until we had to disable the test due to mixer filter enablement.
I think @ldemailly wants somebody else to port his old PoC test to use the new networking APIs. I would focus more on re-enabling the existing tests.

@ldemailly (Contributor Author) commented Jun 12, 2018

AFAIK we still have 503s, and my understanding was that the newer tests only passed some of the time because they aren't deterministic. In other words, the status is still that a customer making changes will see 503s when they shouldn't - @andraxylia knows more details.

@ldemailly (Contributor Author) commented Jun 19, 2018

@sakshigoel12 any reason you removed the 1.0 label? Do we really think it's OK for Istio to generate spurious errors?

@rshriram (Member) commented Jun 19, 2018

Please produce a repeatable test showing that 503s are still occurring with the gateways when updating virtual services/destination rules in the "prescribed" manner (add subsets to destination rules, then add weights to virtual services; remove weights from virtual services, then remove subsets from destination rules).
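For clarity, the prescribed order looks roughly like this, using the bookinfo reviews service purely as an illustration (not a definitive manifest):

# Step 1: add the new subset to the DestinationRule first
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2                   # new subset added before any route refers to it
      labels:
        version: v2
---
# Step 2: only once the subsets exist, shift weight in the VirtualService
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 80
        - destination:
            host: reviews
            subset: v2
          weight: 20

To roll back, reverse the order: remove the weights from the VirtualService first, then remove the subsets from the DestinationRule.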

@andraxylia (Contributor) commented Jun 19, 2018

I will take care of this.

@sakshigoel12 (Contributor) commented Jun 19, 2018

We are using the milestone (and pipeline) rather than a label to indicate 1.0 requirements. It is set up correctly.

@ldemailly (Contributor Author) commented Jun 19, 2018

Yes, thanks, never mind; it looks like the label was completely deleted, I didn't know.

@louiscryan (Contributor) commented Jun 22, 2018

@andraxylia can you provide an update here on the current known state? Is this bug too broad at this point?

@louiscryan louiscryan modified the milestones: 1.1, 1.0 Jun 22, 2018
@andraxylia (Contributor) commented Jun 22, 2018

@louiscryan we have a test for this, but I wanted to do some more manual tests before we claim it is fixed. The bug is not too broad; it covers the scenarios of route rule changes.

@jasminejaksic (Member) commented Jun 26, 2018

@andraxylia Can you please test it manually with 0.8 and confirm whether this still exists?

@andraxylia (Contributor) commented Jun 26, 2018

There are still some issues with the 0.8 image, though not in all scenarios. I tried with the bookinfo app; changing the VirtualService as below produces traffic loss.

istioctl replace -f samples/bookinfo/routing/route-rule-reviews-80-20.yaml

to

istioctl replace -f samples/bookinfo/routing/route-rule-all-v1.yaml

This can be easily reproduced by installing bookinfo, running fortio continuously from a laptop, and playing with the existing traffic routing rules defined in samples/bookinfo/routing/route-rule*. The fortio command is:

fortio load -loglevel warning -c 1 -qps 8 -t 0 <GW_IP>/productpage

This is the error message:
15:04:32 W http_client.go:581> Parsed non ok code 503 (HTTP/1.1 503)

I will try with the latest image that has RDS.

@jasminejaksic (Member) commented Jul 2, 2018

@douglas-reid Per our discussion, please verify that the 503 scenario you encountered is still reproducible. If not, let's close this issue tomorrow.

@andraxylia (Contributor) commented Jul 2, 2018

This is fixed now by the RDS changes.

@andraxylia andraxylia closed this Jul 2, 2018
@andraxylia (Contributor) commented Jul 3, 2018

As discussed with @douglas-reid, he was seeing 503s with stable traffic, which is covered by #3295, and with a qps of 20.

@ldemailly (Contributor Author) commented Jul 3, 2018

But that other issue is also claimed to be fixed, so what is the explanation for what Doug observed?

@andraxylia (Contributor) commented Jul 3, 2018

It's bookinfo not handling concurrent requests well. If I run fortio with qps 20 and 4 concurrent threads, I see some 503s from time to time, but if I run it with 1 thread and 1000 qps there are no 503s.

fortio load -loglevel warning -c 1 -qps 1000 -t 0 35.232.86.12/productpage
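The concurrent run described above is the same command with the concurrency and qps changed, e.g. (same flags and gateway address as above):

fortio load -loglevel warning -c 4 -qps 20 -t 0 35.232.86.12/productpage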

@costinm ran 1000 qps with fortio on the backends and concurrency 4, for about a month, without seeing any 503s. I saw the stress test dashboard with 100% success.

@jasminejaksic (Member) commented Jul 3, 2018

Added #6826 for documenting the known issue

@ldemailly (Contributor Author) commented Jul 3, 2018

Are you saying you see 503s with bookinfo without Istio?
