Get API should be reliable.
We saw this error a few times:
In the reconciliation case this might not be an issue, because another attempt will happen in 30 seconds and will likely succeed.
In the activator's case, though, a transient failure like this causes the call to be dropped. We should instead retry a couple of times before dropping the call. @akyyy can you please open a tracking item for the activator to handle this case?
Given that there is a distinct activator issue for this, I'm not sure what the scope of this issue is.
tl;dr Without HA masters I think that this is just a reality of the world in which we live.
The availability of our control plane is tied to master availability, which can be low (~99.5%?).
We should strive to maximize the availability of our data plane. Ideally the data plane would be distinct from the control plane, but a dependency creeps in once you start to scale based on data plane metrics. I think we maximize data plane availability by minimizing the data plane's hard dependencies on the control plane.
I believe the only place with a truly hard dependency is the activator.
Autoscaling is clearly affected as well, but apart from 0->1 scaling, its success doesn't block request routing.