
controller/operators: label ConfigMaps, don't assume they are #3119

Conversation

stevekuznetsov (Member)

In the past, OLM moved to using a label selector to filter the informers that track ConfigMaps in the cluster. However, when this was done, previous ConfigMaps on the cluster that already existed were not labelled. Therefore, on old clusters there is a mix of data - ConfigMaps that OLM created and managed but has now forgotten since they are missing labels, and conformant objects with the label.

We use ConfigMaps to track whether Jobs should be labelled - if a Job has an OwnerReference to a ConfigMap and the ConfigMap has an OwnerReference to an OLM GVK, we know that the Job was created and is managed by OLM.
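
A minimal sketch of that two-hop check, using simplified stand-in types (the real `OwnerReference` type lives in k8s.io/apimachinery, and `lookupConfigMap` stands in for the informer lister; the exact GVK match here is illustrative, not the production predicate):

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for the Kubernetes metadata type.
type OwnerReference struct {
	APIVersion string
	Kind       string
	Name       string
}

// Object is a simplified stand-in for any Kubernetes object with owners.
type Object struct {
	Name   string
	Owners []OwnerReference
}

// olmManagedJob performs the two-hop lookup: a Job is OLM-managed if it is
// owned by a ConfigMap that is in turn owned by an OLM API object.
func olmManagedJob(job Object, lookupConfigMap func(name string) (Object, bool)) bool {
	for _, ref := range job.Owners {
		if ref.Kind != "ConfigMap" {
			continue
		}
		cm, found := lookupConfigMap(ref.Name)
		if !found {
			// A label-filtered informer dead-ends here for unlabelled ConfigMaps.
			continue
		}
		for _, cmRef := range cm.Owners {
			// OLM GVKs live in the operators.coreos.com API group.
			if cmRef.APIVersion == "operators.coreos.com/v1alpha1" {
				return true
			}
		}
	}
	return false
}

func main() {
	configMaps := map[string]Object{
		"unpack-cm": {Name: "unpack-cm", Owners: []OwnerReference{
			{APIVersion: "operators.coreos.com/v1alpha1", Kind: "CatalogSource", Name: "catsrc"},
		}},
	}
	lookup := func(name string) (Object, bool) { o, ok := configMaps[name]; return o, ok }

	job := Object{Name: "unpack-job", Owners: []OwnerReference{
		{APIVersion: "v1", Kind: "ConfigMap", Name: "unpack-cm"},
	}}
	fmt.Println(olmManagedJob(job, lookup)) // true
}
```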

During runtime, the two-hop lookup described above is done with a ConfigMap informer, so we're light on client calls during the labelling phase of startup. Before the recent labelling work went in, though, the ConfigMap informer was *already* filtered by label, so our lookups were dead ends for the few old ConfigMaps that had never been labelled. On startup, however, we use live clients to determine whether there are unlabelled objects we need to handle, so we end up in a state where the live lookup can detect the errant Jobs but the informer-based labellers can't see them as needing labels.

This commit is technically a performance regression, as it reverts the unconditional ConfigMap informer filtering - we see all ConfigMaps on the cluster during startup, but continue to filter as expected once everything has labels.
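
As a rough illustration of the startup behaviour this change restores - assuming a hypothetical `olm.managed` label key, since the real key isn't quoted here - the unfiltered startup pass can label old OLM-owned objects that a label-filtered informer would otherwise never see:

```go
package main

import "fmt"

const olmLabel = "olm.managed" // assumed label key, for illustration only

// ConfigMap is a simplified stand-in for the Kubernetes object.
type ConfigMap struct {
	Name   string
	Labels map[string]string
}

// matchesSelector mirrors what a label-selector-filtered informer sees:
// only objects carrying the label are visible to steady-state lookups.
func matchesSelector(cm ConfigMap) bool {
	_, ok := cm.Labels[olmLabel]
	return ok
}

// startupPass mimics the unfiltered startup behaviour: list everything with
// a live client and label the OLM-owned objects that the filtered informer
// would otherwise never see.
func startupPass(all []ConfigMap, olmOwned func(ConfigMap) bool) {
	for i := range all {
		if olmOwned(all[i]) && !matchesSelector(all[i]) {
			if all[i].Labels == nil {
				all[i].Labels = map[string]string{}
			}
			all[i].Labels[olmLabel] = "true"
		}
	}
}

func main() {
	cms := []ConfigMap{
		{Name: "old-olm-cm"}, // OLM-owned, created before the labelling work
		{Name: "new-olm-cm", Labels: map[string]string{olmLabel: "true"}},
		{Name: "user-cm"}, // not ours; stays unlabelled
	}
	owned := func(cm ConfigMap) bool { return cm.Name != "user-cm" }

	startupPass(cms, owned)
	for _, cm := range cms {
		fmt.Println(cm.Name, matchesSelector(cm))
	}
}
```

Once the pass completes, the informer can safely go back to filtering by the label, which is why the regression is confined to startup.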

Ideally, we can come up with some policies for cleanup of things like these Jobs and ConfigMaps in the future; at a minimum all of the OLM objects should be labelled and visible to the OLM operators from here on out.

@awgreene (Member)

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 21, 2023
@dinhxuanvu (Member) left a comment

First of all, the reason behind the label filter is to reduce memory usage, which can be a problem for OLM since it caches every ConfigMap on the cluster - and that can be a lot. In practice, OLM doesn't care much about ConfigMaps except for the ones it creates for the bundle unpack operation.

As you pointed out, it is possible for older ConfigMaps not to have that label. However, there is code to handle that situation by adding the label to those ConfigMaps on a cache miss. I wonder if there is a bug somewhere that prevents that logic from working. Did you observe this issue recently?

Either way, I have no problem with this fix as long as the filtering is applied again later, so OLM doesn't end up caching every ConfigMap.
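
A minimal sketch of the label-on-cache-miss fallback described above, with stand-in functions for the real client-go plumbing (`cacheGet`, `liveGet`, and `addLabel` are hypothetical names, not OLM's actual helpers):

```go
package main

import "fmt"

// getConfigMap consults the (label-filtered) informer cache first, and on a
// miss falls back to a live client read, labelling the object so that future
// cache lookups hit.
func getConfigMap(
	name string,
	cacheGet func(string) (string, bool),
	liveGet func(string) (string, bool),
	addLabel func(string),
) (string, bool) {
	if cm, ok := cacheGet(name); ok {
		return cm, true
	}
	cm, ok := liveGet(name)
	if !ok {
		return "", false // genuinely absent
	}
	addLabel(name) // once labelled, the filtered informer will see it
	return cm, true
}

func main() {
	labelled := map[string]bool{}
	cache := func(n string) (string, bool) {
		if labelled[n] {
			return n, true
		}
		return "", false
	}
	live := func(n string) (string, bool) { return n, n == "old-cm" }
	label := func(n string) { labelled[n] = true }

	if cm, ok := getConfigMap("old-cm", cache, live, label); ok {
		fmt.Println("found", cm, "labelled:", labelled["old-cm"])
	}
}
```

The bug being discussed would be any path that reaches the cache miss but skips the live-client fallback, leaving the object permanently invisible to the filtered informer.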

@stevekuznetsov (Member, Author)

@dinhxuanvu yes, there must be a bug in the code to label on cache miss, as we're seeing this in the wild.

ncdc previously approved these changes Nov 27, 2023
@ncdc ncdc enabled auto-merge November 27, 2023 16:10
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
ncdc previously approved these changes Nov 29, 2023
ConfigMaps provided for the internal source type are user-created and
won't have our labels, so we need to use a live client to fetch them.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
ncdc previously approved these changes Nov 29, 2023
@joelanford (Member)

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 29, 2023
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 29, 2023
openshift-ci bot commented Nov 29, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awgreene, ncdc, stevekuznetsov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@joelanford (Member)

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 29, 2023
@ncdc ncdc added this pull request to the merge queue Nov 29, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
@stevekuznetsov (Member, Author)

The flakes are so horrendous now, holy crap. A new one:

• [FAILED] [322.441 seconds]
Subscription [It] creation in case of transferring providedAPIs
/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:2028

 Captured StdOut/StdErr Output >>
 waiting 102.764586ms for catalog pod subscription-e2e-kkl2v/catsrcdbz5n to be available (for sync) - NO_CONNECTION
 waiting 401.219237ms for catalog pod subscription-e2e-kkl2v/catsrcdbz5n to be available (for sync) - IDLE
 waiting 1.198903226s for catalog pod subscription-e2e-kkl2v/catsrcdbz5n to be available (for sync) - TRANSIENT_FAILURE
 waiting 9.692966566s for catalog pod subscription-e2e-kkl2v/catsrcdbz5n to be available (for sync) - READY
 probing catalog catsrcdbz5n pod with address catsrcdbz5n.subscription-e2e-kkl2v.svc:50051
 skipping health check
 waiting 198.473035ms for catalog pod subscription-e2e-kkl2v/catsrcdbz5n to be available (for sync) - READY
 probing catalog catsrcdbz5n pod with address catsrcdbz5n.subscription-e2e-kkl2v.svc:50051
 skipping health check
 waited 185.8506ms for catalog pod catsrcdbz5n to be available (after catalog update) - READY
 Using the kubectl kubectl binary
 Using the artifacts/subscription-e2e-kkl2v output directory
 Storing the test artifact output in the artifacts/subscription-e2e-kkl2v directory
 Collecting get catalogsources -o yaml output...
 Collecting get subscriptions -o yaml output...
 Collecting get operatorgroups -o yaml output...
 Collecting get clusterserviceversions -o yaml output...
 Collecting get installplans -o yaml output...
 Collecting get pods -o wide output...
 Collecting get events --sort-by .lastTimestamp output...
 << Captured StdOut/StdErr Output

 Timeline >>
 created the subscription-e2e-kkl2v testing namespace
 created the subscription-e2e-kkl2v/subscription-e2e-kkl2v-operatorgroup operator group
 Creating catalog source catsrcdbz5n in namespace subscription-e2e-kkl2v...
 Catalog source catsrcdbz5n created
 03:38:18.4768: subscription subscription-e2e-kkl2v/sub-s7vc6 state: UpgradePending (csv nginx-b): installPlanRef: &v1.ObjectReference{Kind:"InstallPlan", Namespace:"subscription-e2e-kkl2v", Name:"install-6znmr", UID:"05ddba6a-7ff2-46e5-a291-703d1fcaf2ac", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"5483", FieldPath:""}
 waiting 1.200432009s for subscription subscription-e2e-kkl2v/sub-s7vc6 to have state AtLatestKnown: has state UpgradePending
 03:38:24.2762: subscription subscription-e2e-kkl2v/sub-s7vc6 state: AtLatestKnown (csv nginx-b): installPlanRef: &v1.ObjectReference{Kind:"InstallPlan", Namespace:"subscription-e2e-kkl2v", Name:"install-6znmr", UID:"05ddba6a-7ff2-46e5-a291-703d1fcaf2ac", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"5483", FieldPath:""}
 waiting 5.799400416s for subscription subscription-e2e-kkl2v/sub-s7vc6 to have state AtLatestKnown: has state AtLatestKnown
 waiting 199.390965ms for installplan subscription-e2e-kkl2v/install-6znmr to be phases [Complete], in phase Complete
 waiting for CSV subscription-e2e-kkl2v/nginx-a to reach condition
 waited 199.673267ms for csv subscription-e2e-kkl2v/nginx-a - Succeeded (InstallSucceeded): install strategy completed with no errors
 waiting for CSV subscription-e2e-kkl2v/nginx-b to reach condition
 waited 201.395196ms for csv subscription-e2e-kkl2v/nginx-b - Succeeded (InstallSucceeded): install strategy completed with no errors
 Deleting config map catsrcdbz5n-configmap...
 Deleting catalog source catsrcdbz5n...
 waiting for the catalog source catsrcdbz5n-rv9bq pod to be deleted...
 [FAILED] in [It] - /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:3571 @ 11/30/23 03:43:26.89
 collecting the subscription-e2e-kkl2v namespace artifacts as the 'creation in case of transferring providedAPIs' test case failed
 collecting logs in the ./artifacts/ artifacts directory
 tearing down the subscription-e2e-kkl2v namespace
 resetting e2e kube client
 deleting subscription-e2e-kkl2v/subscription-e2e-kkl2v-operatorgroup
 deleting <global>/subscription-e2e-kkl2v
 garbage collecting CRDs
 deleting crd ins9v2jd.cluster.com
 deleting crd insv7swp.cluster.com
 << Timeline

 [FAILED] 
 	Error Trace:	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:3571
 	            				/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:2126
 	            				/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/github.com/onsi/ginkgo/v2/internal/node.go:463
 	            				/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/github.com/onsi/ginkgo/v2/internal/suite.go:865
 	            				/opt/hostedtoolcache/go/1.20.11/x64/src/runtime/asm_amd64.s:1598
 	Error:      	Received unexpected error:
 	            	failed to wati for catalog source to reach intended state: timed out waiting for the condition
 	Test:       	Subscription creation in case of transferring providedAPIs
 
 In [It] at: /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:3571 @ 11/30/23 03:43:26.89

 Full Stack Trace
   github.com/stretchr/testify/require.NoError({0x7f1b34946088, 0xc001221dd0}, {0x4298c60, 0xc002bdc2c0}, {0x0, 0x0, 0x0})
   	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/github.com/stretchr/testify/require/require.go:1357 +0x96
   github.com/operator-framework/operator-lifecycle-manager/test/e2e.updateInternalCatalog({0x7f1b34945fd8?, 0xc001221dd0}, {0x42f9b40, 0xc0009765a0}, {0x42c7200, 0xc000976e70}, {0xc0013de400, 0xb}, {0xc001908870, 0x16}, ...)
   	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:3571 +0x897
   github.com/operator-framework/operator-lifecycle-manager/test/e2e.glob..func25.18()
   	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:2126 +0x1cd8

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov (Member, Author)

Our good friend `[FAIL] Subscription [It] creation manual approval`

@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov (Member, Author)

Wow! three!

 Summarizing 3 Failures:
  [FAIL] Install Plan with CRD schema change Test [It] existing version is present in new CRD (deprecated field)
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:3571
  [FAIL] Metrics are generated for OLM managed resources Given an OperatorGroup that supports all namespaces when a CSV spec does not include Install Mode [It] generates csv_abnormal metric for OLM pod
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/metrics_e2e_test.go:89
  [FAIL] Install Plan [It] creation with permissions
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:2905

@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov (Member, Author)

 • [FAILED] [322.376 seconds]
Subscription [It] creation manual approval
/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:322

  Captured StdOut/StdErr Output >>
  waiting 198.501233ms for catalog pod subscription-e2e-4gzsb/mock-ocs to be available (for sync) - NO_CONNECTION
  waiting 1.600270137s for catalog pod subscription-e2e-4gzsb/mock-ocs to be available (for sync) - IDLE
  waiting 800.031208ms for catalog pod subscription-e2e-4gzsb/mock-ocs to be available (for sync) - TRANSIENT_FAILURE
  waiting 14.200389462s for catalog pod subscription-e2e-4gzsb/mock-ocs to be available (for sync) - READY
  probing catalog mock-ocs pod with address mock-ocs.subscription-e2e-4gzsb.svc:50051
  skipping health check
  Using the kubectl kubectl binary
  Using the artifacts/subscription-e2e-4gzsb output directory
  Storing the test artifact output in the artifacts/subscription-e2e-4gzsb directory
  Collecting get catalogsources -o yaml output...
  Collecting get subscriptions -o yaml output...
  Collecting get operatorgroups -o yaml output...
  Collecting get clusterserviceversions -o yaml output...
  Collecting get installplans -o yaml output...
  Collecting get pods -o wide output...
  Collecting get events --sort-by .lastTimestamp output...
  << Captured StdOut/StdErr Output

  Timeline >>
  created the subscription-e2e-4gzsb testing namespace
  created the subscription-e2e-4gzsb/subscription-e2e-4gzsb-operatorgroup operator group
  created configmap subscription-e2e-4gzsb/mock-ocs
  created catalog source subscription-e2e-4gzsb/mock-ocs
  created subscription subscription-e2e-4gzsb/manual-subscription
  15:17:52.8924: subscription subscription-e2e-4gzsb/manual-subscription state: UpgradePending (csv myapp-stable): installPlanRef: &v1.ObjectReference{Kind:"InstallPlan", Namespace:"subscription-e2e-4gzsb", Name:"install-bp29g", UID:"214d0ae8-1cf5-426f-b231-1045df187dae", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"1595", FieldPath:""}
  waiting 1.598995709s for subscription subscription-e2e-4gzsb/manual-subscription to have state UpgradePending: has state UpgradePending
  waiting 199.853197ms for installplan subscription-e2e-4gzsb/install-bp29g to be phases [RequiresApproval], in phase RequiresApproval
  [FAILED] in [It] - /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:358 @ 11/30/23 15:22:53.3
  collecting the subscription-e2e-4gzsb namespace artifacts as the 'creation manual approval' test case failed
  collecting logs in the ./artifacts/ artifacts directory
  tearing down the subscription-e2e-4gzsb namespace
  resetting e2e kube client
  deleting subscription-e2e-4gzsb/subscription-e2e-4gzsb-operatorgroup
  deleting <global>/subscription-e2e-4gzsb
  garbage collecting CRDs
  << Timeline

  [FAILED] Timed out after 300.001s.
  Expected
      <bool>: false
  to be true
  In [It] at: /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:358 @ 11/30/23 15:22:53.3

  Full Stack Trace
    github.com/operator-framework/operator-lifecycle-manager/test/e2e.glob..func25.7()
    	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:358 +0x82d

@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
Merged via the queue into operator-framework:master with commit e2b3768 Nov 30, 2023
16 checks passed
@stevekuznetsov (Member, Author)

wow!
