
Update for new grpc library #2789

Conversation

perdasilva (Contributor)

Description of the change:
Updates OLM to use the latest operator-registry release.

Motivation for the change:
It's been a while, and the latest release contains the latest version of the grpc library, with security fixes needed by the downstream. See operator-framework/operator-registry#959

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky
  • Tests that remove the [FLAKE] tag are no longer flaky

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 1, 2022

openshift-ci bot commented Jun 1, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: perdasilva

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 1, 2022
@perdasilva perdasilva force-pushed the update_operator_registry_version branch 4 times, most recently from 001403e to 5e4bbd4 Compare June 1, 2022 14:28
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 7, 2022
@perdasilva perdasilva force-pushed the update_operator_registry_version branch from 5e4bbd4 to 5484d83 Compare June 9, 2022 08:14
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2022
@perdasilva perdasilva force-pushed the update_operator_registry_version branch 2 times, most recently from 60959e5 to 64bdba0 Compare June 9, 2022 09:01
@perdasilva
Contributor Author

@anik120 could you please review this? I'm a bit worried about the metrics. This update changes the way grpc connection states transition: connections remain in IDLE until there's an RPC call, and only then transition to READY. Any idea how this could affect SRE?

@perdasilva perdasilva changed the title [WIP] Update operator registry version Update operator registry version Jun 9, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2022
@anik120 (Contributor) left a comment

@perdasilva could you expand a bit more on

This update changes the way grpc connection states change. Basically they remain on idle until there's an rpc call before they transition to ready.

From the code changes it looks like the idle and ready states are now treated the same, whereas before there was a distinction between idle and ready. If that's accurate, what's the motivation for it? (It isn't very clear to me how the grpc_health_probe-related CVEs relate to this change in how we treat the different states.)

@@ -474,6 +474,8 @@ func (o *Operator) syncSourceState(state grpc.SourceState) {
 	metrics.RegisterCatalogSourceState(state.Key.Name, state.Key.Namespace, state.State)
 
 	switch state.State {
+	case connectivity.Idle:
+		fallthrough
Contributor

Why are we falling through here, exactly?
Also, could you elaborate on what the Idle connectivity state means?

Contributor Author

Good questions all around. So, in turn:
We need to bump the grpc_health_probe version to satisfy the CVE -> this leads to a grpc library version bump.
The latest grpc library seems to have changed the connection-state behavior. Before, the connection state would be READY once the CatalogSource pod was up. Now it sits in IDLE until there's an RPC call down the pipe. The transition to IDLE happens when there is "No RPC activity on channel for IDLE_TIMEOUT" (see the connectivity-semantics doc).

Let me refresh myself on the code base to give you a better explanation of the fallthrough.
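The state flow described above can be sketched as a toy model. Note this is a simplified, self-contained stand-in for `google.golang.org/grpc/connectivity` and the channel's state machine, not the real library; the names and the assumption that the dial succeeds are illustrative.

```go
package main

import "fmt"

// State mimics google.golang.org/grpc/connectivity.State; this local copy is
// an assumption made to keep the sketch self-contained and runnable.
type State int

const (
	Idle State = iota
	Connecting
	Ready
)

func (s State) String() string {
	return [...]string{"IDLE", "CONNECTING", "READY"}[s]
}

// channel models the post-upgrade behavior discussed above: the channel
// stays IDLE until the first RPC forces it to connect.
type channel struct{ state State }

func (c *channel) invokeRPC() {
	if c.state == Idle {
		c.state = Connecting
		c.state = Ready // assume the dial to the CatalogSource pod succeeds
	}
}

func main() {
	ch := &channel{state: Idle}
	fmt.Println("before first RPC:", ch.state) // IDLE even though the pod is up
	ch.invokeRPC()
	fmt.Println("after first RPC:", ch.state) // READY
}
```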

Contributor

I see. fallthrough here (and in one more place below) just means "if connectivity.Idle, do whatever you were supposed to do for connectivity.Ready". We probably want to confirm that that's the right thing to do too.

i.e. with this change in the library, when connectivity is Idle, the intuitive thing would be for the controller to "sit idle" too; instead, with fallthrough it's going to follow through with 1. invalidate the cache, 2. add to the resolve queue, etc. Is that what we want?

cc: @benluddy @njhale
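The fallthrough pattern under review can be sketched as follows. This is an illustrative stand-in, not the actual OLM handler: the `State` type is a simplified local model of `grpc/connectivity.State`, and the return strings merely label which branch ran.

```go
package main

import "fmt"

// State stands in for grpc/connectivity.State (simplified local model).
type State int

const (
	Idle State = iota
	Ready
	TransientFailure
)

// syncSourceState is an illustrative stand-in for the handler being reviewed:
// fallthrough makes the Idle case execute exactly the branch written for
// Ready, so both states trigger the same resync work.
func syncSourceState(s State) string {
	switch s {
	case Idle:
		fallthrough
	case Ready:
		return "invalidate cache, requeue subscriptions"
	default:
		return "report catalog as unhealthy"
	}
}

func main() {
	fmt.Println("Idle:", syncSourceState(Idle))
	fmt.Println("Ready:", syncSourceState(Ready))
	fmt.Println("TransientFailure:", syncSourceState(TransientFailure))
}
```

This is exactly the behavior being questioned: with the fallthrough, an Idle catalog is handled as if it were Ready instead of the controller "sitting idle" with it.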

Contributor

Also, this would be considered a breaking change for our consumers, so either we find a way to set the CatalogSource .status.GRPCConnectionState.LastObservedState to Ready for both Idle and Ready, or we need to send out communications to alert our community about the change.

Contributor Author

I spent yesterday afternoon poking around grpc and playing with keepalive, changing both server and client settings. I couldn't really get it to return to the previous behavior. I think leaving it in IDLE could have the adverse effect of only finding out the CatalogSource is down when you actually try to talk to it. I removed the IDLE-handling code above and tested with a custom CatalogSource created in the default namespace. I observed that the connection state stays in IDLE even if you kill the backing pod, presumably because the content hadn't been cached yet (i.e. no gRPC calls had been made). From the cluster-administration perspective, I'm not sure this is desirable. On the one hand, we reduce network use and CatalogSource server resources, improving scalability. On the other hand, ops would only get notified of a bad CatalogSource at the time of use.

Another alternative, though it's a bit long-winded: decouple the connection state from the registry server health (the service can be up but broken), have separate metrics for connection state and registry server health, and implement the grpc health checks in the RegistryReconcilers (exp branch, very hacky).

For now, I've updated the code to create a connection in case of the IDLE state. This brings the behavior back in line with what we had before.
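The workaround described here can be sketched as below. `*grpc.ClientConn` really does expose a `Connect()` method that moves an IDLE channel to CONNECTING; everything else in this sketch (the `connector` interface, function names, string-valued states, the fake connection) is an illustrative assumption to keep it self-contained.

```go
package main

import "fmt"

// connector models the single method of *grpc.ClientConn used here.
// (Connect() exists on the real type; this local interface is an assumption
// so the sketch runs without the grpc module.)
type connector interface {
	Connect()
}

// kickIfIdle sketches the workaround: when the observed connectivity state is
// IDLE, proactively kick the channel so it dials the CatalogSource again and
// reaches READY without waiting for the next RPC. Returns true if it kicked.
func kickIfIdle(state string, conn connector) bool {
	if state == "IDLE" {
		conn.Connect() // IDLE -> CONNECTING on a real channel
		return true
	}
	return false
}

// fakeConn records whether Connect was called, for demonstration only.
type fakeConn struct{ connected bool }

func (f *fakeConn) Connect() { f.connected = true }

func main() {
	conn := &fakeConn{}
	fmt.Println("kicked:", kickIfIdle("IDLE", conn), "connect called:", conn.connected)
}
```

The design trade-off mirrors the discussion above: kicking the channel restores the old "READY when the pod is up" signal at the cost of keeping connections warm.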

@anik120
Contributor

anik120 commented Jun 9, 2022

Also, we probably want to set catalogSourceReady to 1 in both the Idle and Ready states here too.
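The suggestion reads, as a sketch, something like the following. The function name, string-valued states, and gauge values are illustrative assumptions, not the actual OLM metrics code.

```go
package main

import "fmt"

// catalogSourceReadyValue sketches the suggestion above: report the readiness
// gauge as 1 for both the Idle and Ready connectivity states (since an idle
// channel is now normal for a healthy catalog), and 0 for everything else.
func catalogSourceReadyValue(state string) float64 {
	switch state {
	case "IDLE", "READY":
		return 1
	default:
		return 0
	}
}

func main() {
	for _, s := range []string{"IDLE", "READY", "TRANSIENT_FAILURE"} {
		fmt.Printf("%s -> %v\n", s, catalogSourceReadyValue(s))
	}
}
```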

@perdasilva perdasilva force-pushed the update_operator_registry_version branch 4 times, most recently from 70e785b to a72a7a5 Compare June 10, 2022 13:45
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2022
@perdasilva perdasilva force-pushed the update_operator_registry_version branch from dbbd723 to 8563067 Compare June 20, 2022 13:30
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2022
@perdasilva perdasilva force-pushed the update_operator_registry_version branch 3 times, most recently from d49a14c to ec2ecf2 Compare June 20, 2022 14:30
@perdasilva perdasilva changed the title Update operator registry version Update for new grpc library Jun 20, 2022
Signed-off-by: perdasilva <perdasilva@redhat.com>
Signed-off-by: perdasilva <perdasilva@redhat.com>
@perdasilva perdasilva force-pushed the update_operator_registry_version branch from ec2ecf2 to 1ab67ce Compare June 20, 2022 15:53
@dinhxuanvu (Member) left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 20, 2022
@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@perdasilva
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 20, 2022
@perdasilva
Contributor Author

closing this as it's stale

@perdasilva perdasilva closed this Feb 28, 2023