Update for new grpc library #2789
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: perdasilva

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 001403e to 5e4bbd4
Force-pushed from 5e4bbd4 to 5484d83
Force-pushed from 60959e5 to 64bdba0
@anik120 could you please review this? I'm a bit worried about the metrics. This update changes the way grpc connection states change: connections now remain in IDLE until there's an RPC call, and only then do they transition to READY. Any idea how this could affect SRE?
@perdasilva could you expand a bit more on

> This update changes the way grpc connection states change. Basically they remain on idle until there's an rpc call before they transition to ready.

From the code changes it looks like the idle and ready states are treated the same now, whereas before there was a distinction between the two. If that's accurate, what's the motivation for that? (It isn't very clear to me how the `grpc_health_probe`-related CVEs relate to this change in how we treat the different states.)
```diff
@@ -474,6 +474,8 @@ func (o *Operator) syncSourceState(state grpc.SourceState) {
 	metrics.RegisterCatalogSourceState(state.Key.Name, state.Key.Namespace, state.State)

 	switch state.State {
+	case connectivity.Idle:
+		fallthrough
```
Why are we falling through here exactly?

Also, could you elaborate on what the `Idle` connectivity state means?
Good questions all around. So, in turn:

1. We need to bump the grpc_health_probe version to satisfy the CVE, and that leads to a grpc library version bump.
2. The latest grpc library seems to have affected the connection state. Before, the connection state would be READY once the CatalogSource pod was up. Now it sits in IDLE until there's an RPC call down the pipe. The transition from READY -> IDLE happens when there's "No RPC activity on channel for IDLE_TIMEOUT" (doc). See the sketch below.
3. Let me refresh myself on the code base to give you a better explanation about the fallthrough.
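To make the state behavior concrete, here's a minimal standalone sketch (mine, not code from this PR) using grpc-go's connectivity API; the target address is made up. The channel sits in IDLE until it's explicitly kicked (an RPC would have the same effect):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Hypothetical CatalogSource service address, for illustration only.
	conn, err := grpc.Dial("catalog-source.example:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Per the behavior described above, the channel can sit in IDLE
	// until there is RPC activity.
	fmt.Println("initial state:", conn.GetState())

	// Connect() kicks the channel out of IDLE without issuing an RPC.
	conn.Connect()

	// Watch the transitions: IDLE -> CONNECTING -> READY (or TRANSIENT_FAILURE).
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	for s := conn.GetState(); s != connectivity.Ready; s = conn.GetState() {
		if !conn.WaitForStateChange(ctx, s) {
			fmt.Println("gave up in state:", s)
			return
		}
		fmt.Println("state now:", conn.GetState())
	}
}
```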
I see. `fallthrough` here (and in another place below) just means "if connectivity.Idle, do whatever you were supposed to do for connectivity.Ready". We probably want to confirm that that's the right thing to do, though.

I.e., with this change in the library, when connectivity is Idle, it feels like the intuitive thing would be for the controller to "sit idle" too; instead, with `fallthrough`, it's going to follow through with 1. invalidate cache, 2. add to resolve queue, etc. Is that what we want?
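For reference, a self-contained toy example (not OLM code) of what `fallthrough` does here: the Idle case executes the Ready case's body.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/connectivity"
)

func handle(state connectivity.State) {
	switch state {
	case connectivity.Idle:
		fallthrough // execution continues straight into the Ready body below
	case connectivity.Ready:
		fmt.Println("treated as healthy: invalidate cache, queue resolve, ...")
	default:
		fmt.Println("no action for state:", state)
	}
}

func main() {
	handle(connectivity.Idle)     // takes the "healthy" branch via fallthrough
	handle(connectivity.Shutdown) // takes the default branch
}
```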
Also, this would be considered a breaking change for our consumers, so either we find a way to set the CatalogSource `.status.GRPCConnectionState.LastObservedState` to `Ready` for both `Idle` and `Ready`, or we need to send out communications to our community to alert them about the change.
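A hedged sketch of the first option (the helper name is hypothetical, not OLM's actual code): translate the observed state before writing it to the status, so API consumers never see Idle.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/connectivity"
)

// lastObservedState maps Idle to Ready so consumers of
// .status.GRPCConnectionState.LastObservedState see no behavior change.
// The function name and mapping are illustrative only.
func lastObservedState(s connectivity.State) string {
	if s == connectivity.Idle {
		return connectivity.Ready.String() // mask Idle as Ready for API compatibility
	}
	return s.String()
}

func main() {
	fmt.Println(lastObservedState(connectivity.Idle))  // READY
	fmt.Println(lastObservedState(connectivity.Ready)) // READY
}
```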
I spent yesterday afternoon poking around grpc and playing with keepalive, changing both server and client settings, but I couldn't really get it to return to the previous behavior. I think leaving it on IDLE could have the adverse effect of only finding out the CatalogSource is down when you actually try to talk to it.

I removed the IDLE handling code above and did a test with a custom CatalogSource created in the default namespace. I observed that the connection state stays on IDLE even if you kill the backing pod, presumably because the content hadn't been cached yet (i.e. no gRPC calls). From the cluster administration perspective, I'm not sure that this is desirable. On the one hand, we reduce network use and CatalogSource server resources, improving scalability; on the other, ops would only get notified of a bad CatalogSource at the time of use.
Another alternative, though it's a bit long-winded: decouple the connection state from the registry server health (the service can be up, but broken), expose metrics for both connection state and registry server health, and implement the gRPC health checks in the `RegistryReconciler`s (exp branch, very hacky).
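A rough sketch of that decoupling idea, assuming the registry serves the standard gRPC health service (grpc.health.v1); the address and function name here are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// checkRegistryHealth probes the registry's health endpoint directly,
// independent of the channel's connectivity state. An empty service name
// asks about overall server health.
func checkRegistryHealth(addr string) (bool, error) {
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return false, err
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		return false, err
	}
	return resp.GetStatus() == healthpb.HealthCheckResponse_SERVING, nil
}

func main() {
	healthy, err := checkRegistryHealth("catalog-source.example:50051")
	fmt.Println("healthy:", healthy, "err:", err)
}
```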
For now, I've updated the code to create a connection in the case of the IDLE state. This brings the behavior back in line with what we had before.
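The shape of that workaround is roughly this (illustrative, not the PR's exact code):

```go
package catalog

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// reconnectIfIdle restores the pre-upgrade behavior: if the channel is
// sitting in IDLE, ask it to connect so a dead CatalogSource is noticed
// before anyone issues an RPC. (Package and function names are mine.)
func reconnectIfIdle(conn *grpc.ClientConn) {
	if conn.GetState() == connectivity.Idle {
		conn.Connect() // exits IDLE without waiting for RPC traffic
	}
}
```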
Also, we probably want to set …
Force-pushed from 70e785b to a72a7a5
Force-pushed from dbbd723 to 8563067
Force-pushed from d49a14c to ec2ecf2
Signed-off-by: perdasilva <perdasilva@redhat.com>
Force-pushed from ec2ecf2 to 1ab67ce
/lgtm
/retest-required

Please review the full test history for this PR and help us cut down flakes.
/hold
Closing this as it's stale.
Description of the change:
Updates OLM to use the latest operator-registry release.
Motivation for the change:
It's been a while, and the latest version contains the latest version of the grpc library with security fixes needed by the downstream. See operator-framework/operator-registry#959
Reviewer Checklist
- Docs updated or added to /doc
- Tests identified as [FLAKE] are truly flaky
- Tests removed from the [FLAKE] tag are no longer flaky