Update for new grpc library #2789
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: perdasilva

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 001403e to 5e4bbd4
Force-pushed from 5e4bbd4 to 5484d83
Force-pushed from 60959e5 to 64bdba0
@anik120 could you please review this? I'm a bit worried about the metrics. This update changes the way grpc connection states change: connections now remain in IDLE until there's an RPC call, and only then do they transition to READY. Any idea how this could affect SRE?
@perdasilva could you expand a bit more on

> This update changes the way grpc connection states change. Basically they remain on idle until there's an rpc call before they transition to ready.

From the code changes it looks like the idle and ready states are treated the same now, whereas before there was a distinction between the two. If that's accurate, what's the motivation for that? (It isn't very clear to me how the `grpc_health_probe`-related CVEs relate to this change in how we treat the different states.)
```diff
@@ -474,6 +474,8 @@ func (o *Operator) syncSourceState(state grpc.SourceState) {
 	metrics.RegisterCatalogSourceState(state.Key.Name, state.Key.Namespace, state.State)

 	switch state.State {
+	case connectivity.Idle:
+		fallthrough
```
Why are we falling through here exactly?

Also, could you elaborate on what the `Idle` connectivity state means?
Good questions all around. So, in turn:

1. We need to bump the grpc_health_probe version to satisfy the CVE, and that leads to a grpc library version bump.
2. The latest grpc library seems to have affected the connection state. Before, the connection state would be READY once the CatalogSource pod was up. Now it sits in IDLE until there's an RPC call down the pipe. The transition from READY -> IDLE happens when there's "No RPC activity on channel for IDLE_TIMEOUT" (doc). See the sketch below.
3. Let me refresh myself on the code base to give you a better explanation about the fallthrough.
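To make the state behavior concrete, here's a minimal standalone sketch (mine, not code from this PR) using grpc-go's connectivity API; the target address is made up. The channel sits in IDLE until it's explicitly kicked (an RPC would have the same effect):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Hypothetical CatalogSource service address, for illustration only.
	conn, err := grpc.Dial("catalog-source.example:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Per the behavior described above, the channel can sit in IDLE
	// until there is RPC activity.
	fmt.Println("initial state:", conn.GetState())

	// Connect() kicks the channel out of IDLE without issuing an RPC.
	conn.Connect()

	// Watch the transitions: IDLE -> CONNECTING -> READY (or TRANSIENT_FAILURE).
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	for s := conn.GetState(); s != connectivity.Ready; s = conn.GetState() {
		if !conn.WaitForStateChange(ctx, s) {
			fmt.Println("gave up in state:", s)
			return
		}
		fmt.Println("state now:", conn.GetState())
	}
}
```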
I see. `fallthrough` here (and in another place below) just means "if connectivity.Idle, do whatever you were supposed to do for connectivity.Ready". We probably want to confirm that that's the right thing to do, though.

I.e., with this change in the library, when connectivity is Idle, it feels like the intuitive thing would be for the controller to "sit idle" too; instead, with `fallthrough`, it's going to follow through with 1. invalidate cache, 2. add to resolve queue, etc. Is that what we want?
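For reference, a self-contained toy example (not OLM code) of what `fallthrough` does here: the Idle case executes the Ready case's body.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/connectivity"
)

func handle(state connectivity.State) {
	switch state {
	case connectivity.Idle:
		fallthrough // execution continues straight into the Ready body below
	case connectivity.Ready:
		fmt.Println("treated as healthy: invalidate cache, queue resolve, ...")
	default:
		fmt.Println("no action for state:", state)
	}
}

func main() {
	handle(connectivity.Idle)     // takes the "healthy" branch via fallthrough
	handle(connectivity.Shutdown) // takes the default branch
}
```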
Also, this would be considered a breaking change for our consumers, so either we find a way to set the CatalogSource `.status.GRPCConnectionState.LastObservedState` to `Ready` for both `Idle` and `Ready`, or we need to send out communications to our community to alert them about the change.
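A hedged sketch of the first option (the helper name is hypothetical, not OLM's actual code): translate the observed state before writing it to the status, so API consumers never see Idle.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/connectivity"
)

// lastObservedState maps Idle to Ready so consumers of
// .status.GRPCConnectionState.LastObservedState see no behavior change.
// The function name and mapping are illustrative only.
func lastObservedState(s connectivity.State) string {
	if s == connectivity.Idle {
		return connectivity.Ready.String() // mask Idle as Ready for API compatibility
	}
	return s.String()
}

func main() {
	fmt.Println(lastObservedState(connectivity.Idle))  // READY
	fmt.Println(lastObservedState(connectivity.Ready)) // READY
}
```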
I spent yesterday afternoon poking around grpc and playing with keepalive, changing both server and client settings, but I couldn't really get it to return to the previous behavior. I think leaving it on IDLE could have the adverse effect of only finding out the CatalogSource is down when you actually try to talk to it.

I removed the IDLE handling code above and did a test with a custom CatalogSource created in the default namespace. I observed that the connection state stays on IDLE even if you kill the backing pod, presumably because the content hadn't been cached yet (i.e. no gRPC calls). From the cluster administration perspective, I'm not sure that this is desirable. On the one hand, we reduce network use and CatalogSource server resources, improving scalability; on the other, ops would only get notified of a bad CatalogSource at the time of use.
Another alternative, though it's a bit long-winded: decouple the connection state from the registry server health (the service can be up, but broken), expose metrics for both connection state and registry server health, and implement the gRPC health checks in the `RegistryReconciler`s (exp branch, very hacky).
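A rough sketch of that decoupling idea, assuming the registry serves the standard gRPC health service (grpc.health.v1); the address and function name here are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// checkRegistryHealth probes the registry's health endpoint directly,
// independent of the channel's connectivity state. An empty service name
// asks about overall server health.
func checkRegistryHealth(addr string) (bool, error) {
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return false, err
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		return false, err
	}
	return resp.GetStatus() == healthpb.HealthCheckResponse_SERVING, nil
}

func main() {
	healthy, err := checkRegistryHealth("catalog-source.example:50051")
	fmt.Println("healthy:", healthy, "err:", err)
}
```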
For now, I've updated the code to create a connection in the case of the IDLE state. This brings the behavior back in line with what we had before.
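The shape of that workaround is roughly this (illustrative, not the PR's exact code):

```go
package catalog

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// reconnectIfIdle restores the pre-upgrade behavior: if the channel is
// sitting in IDLE, ask it to connect so a dead CatalogSource is noticed
// before anyone issues an RPC. (Package and function names are mine.)
func reconnectIfIdle(conn *grpc.ClientConn) {
	if conn.GetState() == connectivity.Idle {
		conn.Connect() // exits IDLE without waiting for RPC traffic
	}
}
```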
Also, we probably want to set …
Force-pushed from 70e785b to a72a7a5
Force-pushed from dbbd723 to 8563067
Force-pushed from d49a14c to ec2ecf2
Signed-off-by: perdasilva <perdasilva@redhat.com>
Force-pushed from ec2ecf2 to 1ab67ce
/lgtm
/retest-required

Please review the full test history for this PR and help us cut down flakes.
/hold
Closing this as it's stale.
Description of the change:
Updates OLM to use the latest operator-registry release.
Motivation for the change:
It's been a while, and the latest version contains the latest version of the grpc library with security fixes needed by the downstream. See operator-framework/operator-registry#959
Reviewer Checklist
- Docs updated or added to /doc
- Tests identified as [FLAKE] are truly flaky
- Tests removed from the [FLAKE] tag are no longer flaky