Retry on Get Revision error? #1558

Closed
akyyy opened this issue Jul 10, 2018 · 6 comments

@akyyy akyyy commented Jul 10, 2018

/area API
/kind bug

Expected Behavior

The Get API should be reliable.

Actual Behavior

We saw this error a few times:
Unable to get revision: Get https://10.35.240.1:443/apis/serving.knative.dev/v1alpha1/namespaces/default/revisions/configuration-example-00001: unexpected EOF
In this case, one possible cause is that the master was down.

Additional Info

@akyyy akyyy commented Jul 10, 2018

@mdemirhan saw this as well.

@mdemirhan mdemirhan commented Jul 10, 2018

In the reconciliation case, this might not be an issue, because another attempt will happen within 30 seconds and will likely succeed.

In the activator's case, though, a transient failure like this causes the call to be dropped. We should instead retry a couple of times before dropping the call. @akyyy, can you please open a tracking item for the activator to handle this case?

@akyyy akyyy commented Jul 11, 2018

The activator-specific issue is #1573.

@mattmoor mattmoor commented Jul 12, 2018

Given a distinct activator issue for this, I'm not sure what the scope of this issue is?

tl;dr Without HA masters I think that this is just a reality of the world in which we live.

The availability of our control plane is tied to master availability, which can be low (~99.5%?).

We should strive to maximize the availability of our data plane. Ideally the data plane would be distinct from the control plane, but a dependency creeps in once you start to scale based on data plane metrics. I think we maximize data plane availability by minimizing the data plane's hard dependencies on the control plane.

I believe the only place with a truly hard dependency is the activator.

Autoscaling is clearly affected as well, but apart from the 0->1 case, its success doesn't block request routing.

@mdemirhan mdemirhan commented Jul 12, 2018

Given that we have #1573 for activator, I propose that we close this one. @akyyy @mattmoor WDYT?

@akyyy akyyy commented Jul 12, 2018

sgtm

@akyyy akyyy closed this Jul 12, 2018