
controller/operators: label ConfigMaps, don't assume they are #3119

Conversation

stevekuznetsov (Member)

In the past, OLM moved to using a label selector to filter the informers that track ConfigMaps in the cluster. However, when this was done, previous ConfigMaps on the cluster that already existed were not labelled. Therefore, on old clusters there is a mix of data - ConfigMaps that OLM created and managed but has now forgotten since they are missing labels, and conformant objects with the label.

We use ConfigMaps to track whether Jobs should be labelled - if a Job has an OwnerReference to a ConfigMap and the ConfigMap has an OwnerReference to an OLM GVK, we know that the Job was created and is managed by OLM.
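
A minimal sketch of that two-hop check, using simplified stand-in types (the real `OwnerReference` type lives in k8s.io/apimachinery, and `lookupConfigMap` stands in for the informer lister; the exact GVK match here is illustrative, not the production predicate):

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for the Kubernetes metadata type.
type OwnerReference struct {
	APIVersion string
	Kind       string
	Name       string
}

// Object is a simplified stand-in for any Kubernetes object with owners.
type Object struct {
	Name   string
	Owners []OwnerReference
}

// olmManagedJob performs the two-hop lookup: a Job is OLM-managed if it is
// owned by a ConfigMap that is in turn owned by an OLM API object.
func olmManagedJob(job Object, lookupConfigMap func(name string) (Object, bool)) bool {
	for _, ref := range job.Owners {
		if ref.Kind != "ConfigMap" {
			continue
		}
		cm, found := lookupConfigMap(ref.Name)
		if !found {
			// A label-filtered informer dead-ends here for unlabelled ConfigMaps.
			continue
		}
		for _, cmRef := range cm.Owners {
			// OLM GVKs live in the operators.coreos.com API group.
			if cmRef.APIVersion == "operators.coreos.com/v1alpha1" {
				return true
			}
		}
	}
	return false
}

func main() {
	configMaps := map[string]Object{
		"unpack-cm": {Name: "unpack-cm", Owners: []OwnerReference{
			{APIVersion: "operators.coreos.com/v1alpha1", Kind: "CatalogSource", Name: "catsrc"},
		}},
	}
	lookup := func(name string) (Object, bool) { o, ok := configMaps[name]; return o, ok }

	job := Object{Name: "unpack-job", Owners: []OwnerReference{
		{APIVersion: "v1", Kind: "ConfigMap", Name: "unpack-cm"},
	}}
	fmt.Println(olmManagedJob(job, lookup)) // true
}
```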

During runtime, the two-hop lookup described above is done with a ConfigMap informer, so we're light on client calls during the labelling phase of startup. Before the recent labelling work went in, though, the ConfigMap informer was *already* filtered by label, so our lookups were dead ends for the few old ConfigMaps that had never been labelled. On startup, however, we use live clients to determine whether there are unlabelled objects we need to handle, so we end up in a state where the live lookup can detect the errant Jobs but the informer-based labellers can't see them as needing labels.

This commit is technically a performance regression, as it reverts the unconditional ConfigMap informer filtering - we see all ConfigMaps on the cluster during startup, but continue to filter as expected once everything has labels.
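
As a rough illustration of the startup behaviour this change restores - assuming a hypothetical `olm.managed` label key, since the real key isn't quoted here - the unfiltered startup pass can label old OLM-owned objects that a label-filtered informer would otherwise never see:

```go
package main

import "fmt"

const olmLabel = "olm.managed" // assumed label key, for illustration only

// ConfigMap is a simplified stand-in for the Kubernetes object.
type ConfigMap struct {
	Name   string
	Labels map[string]string
}

// matchesSelector mirrors what a label-selector-filtered informer sees:
// only objects carrying the label are visible to steady-state lookups.
func matchesSelector(cm ConfigMap) bool {
	_, ok := cm.Labels[olmLabel]
	return ok
}

// startupPass mimics the unfiltered startup behaviour: list everything with
// a live client and label the OLM-owned objects that the filtered informer
// would otherwise never see.
func startupPass(all []ConfigMap, olmOwned func(ConfigMap) bool) {
	for i := range all {
		if olmOwned(all[i]) && !matchesSelector(all[i]) {
			if all[i].Labels == nil {
				all[i].Labels = map[string]string{}
			}
			all[i].Labels[olmLabel] = "true"
		}
	}
}

func main() {
	cms := []ConfigMap{
		{Name: "old-olm-cm"}, // OLM-owned, created before the labelling work
		{Name: "new-olm-cm", Labels: map[string]string{olmLabel: "true"}},
		{Name: "user-cm"}, // not ours; stays unlabelled
	}
	owned := func(cm ConfigMap) bool { return cm.Name != "user-cm" }

	startupPass(cms, owned)
	for _, cm := range cms {
		fmt.Println(cm.Name, matchesSelector(cm))
	}
}
```

Once the pass completes, the informer can safely go back to filtering by the label, which is why the regression is confined to startup.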

Ideally, we can come up with some policies for cleanup of things like these Jobs and ConfigMaps in the future; at a minimum all of the OLM objects should be labelled and visible to the OLM operators from here on out.

@awgreene (Member)

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 21, 2023
@dinhxuanvu (Member) left a comment

First of all, the reason behind the label filter is to reduce memory usage, which can be a problem for OLM since it caches every ConfigMap on the cluster - and that can be a lot. In practice, OLM doesn't care much about ConfigMaps except for the ones it creates for the bundle unpack operation.

As you pointed out, it is possible for older ConfigMaps not to have that label. However, there is code to handle that situation by adding the label to those ConfigMaps on a cache miss. I wonder if there is a bug somewhere that prevents that logic from working. Did you observe this issue recently?

Either way, I have no problem with this fix as long as the filtering is applied again later, so OLM doesn't end up caching every ConfigMap.
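
A minimal sketch of the label-on-cache-miss fallback described above, with stand-in functions for the real client-go plumbing (`cacheGet`, `liveGet`, and `addLabel` are hypothetical names, not OLM's actual helpers):

```go
package main

import "fmt"

// getConfigMap consults the (label-filtered) informer cache first, and on a
// miss falls back to a live client read, labelling the object so that future
// cache lookups hit.
func getConfigMap(
	name string,
	cacheGet func(string) (string, bool),
	liveGet func(string) (string, bool),
	addLabel func(string),
) (string, bool) {
	if cm, ok := cacheGet(name); ok {
		return cm, true
	}
	cm, ok := liveGet(name)
	if !ok {
		return "", false // genuinely absent
	}
	addLabel(name) // once labelled, the filtered informer will see it
	return cm, true
}

func main() {
	labelled := map[string]bool{}
	cache := func(n string) (string, bool) {
		if labelled[n] {
			return n, true
		}
		return "", false
	}
	live := func(n string) (string, bool) { return n, n == "old-cm" }
	label := func(n string) { labelled[n] = true }

	if cm, ok := getConfigMap("old-cm", cache, live, label); ok {
		fmt.Println("found", cm, "labelled:", labelled["old-cm"])
	}
}
```

The bug being discussed would be any path that reaches the cache miss but skips the live-client fallback, leaving the object permanently invisible to the filtered informer.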

@stevekuznetsov (Member, Author)

@dinhxuanvu yes, there must be a bug in the code to label on cache miss, as we're seeing this in the wild.

ncdc previously approved these changes Nov 27, 2023
@ncdc ncdc enabled auto-merge November 27, 2023 16:10
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
ncdc previously approved these changes Nov 29, 2023
ConfigMaps provided for the internal source type are user-created and
won't have our labels, so we need to use a live client to fetch them.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
ncdc previously approved these changes Nov 29, 2023
@joelanford (Member)

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 29, 2023
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 29, 2023
openshift-ci bot commented Nov 29, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awgreene, ncdc, stevekuznetsov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@joelanford (Member)

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 29, 2023
@ncdc ncdc added this pull request to the merge queue Nov 29, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
@stevekuznetsov (Member, Author)

The flakes are so horrendous now, holy crap. A new one:

• [FAILED] [322.441 seconds]
Subscription [It] creation in case of transferring providedAPIs
/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:2028

 Captured StdOut/StdErr Output >>
 waiting 102.764586ms for catalog pod subscription-e2e-kkl2v/catsrcdbz5n to be available (for sync) - NO_CONNECTION
 waiting 401.219237ms for catalog pod subscription-e2e-kkl2v/catsrcdbz5n to be available (for sync) - IDLE
 waiting 1.198903226s for catalog pod subscription-e2e-kkl2v/catsrcdbz5n to be available (for sync) - TRANSIENT_FAILURE
 waiting 9.692966566s for catalog pod subscription-e2e-kkl2v/catsrcdbz5n to be available (for sync) - READY
 probing catalog catsrcdbz5n pod with address catsrcdbz5n.subscription-e2e-kkl2v.svc:50051
 skipping health check
 waiting 198.473035ms for catalog pod subscription-e2e-kkl2v/catsrcdbz5n to be available (for sync) - READY
 probing catalog catsrcdbz5n pod with address catsrcdbz5n.subscription-e2e-kkl2v.svc:50051
 skipping health check
 waited 185.8506ms for catalog pod catsrcdbz5n to be available (after catalog update) - READY
 Using the kubectl kubectl binary
 Using the artifacts/subscription-e2e-kkl2v output directory
 Storing the test artifact output in the artifacts/subscription-e2e-kkl2v directory
 Collecting get catalogsources -o yaml output...
 Collecting get subscriptions -o yaml output...
 Collecting get operatorgroups -o yaml output...
 Collecting get clusterserviceversions -o yaml output...
 Collecting get installplans -o yaml output...
 Collecting get pods -o wide output...
 Collecting get events --sort-by .lastTimestamp output...
 << Captured StdOut/StdErr Output

 Timeline >>
 created the subscription-e2e-kkl2v testing namespace
 created the subscription-e2e-kkl2v/subscription-e2e-kkl2v-operatorgroup operator group
 Creating catalog source catsrcdbz5n in namespace subscription-e2e-kkl2v...
 Catalog source catsrcdbz5n created
 03:38:18.4768: subscription subscription-e2e-kkl2v/sub-s7vc6 state: UpgradePending (csv nginx-b): installPlanRef: &v1.ObjectReference{Kind:"InstallPlan", Namespace:"subscription-e2e-kkl2v", Name:"install-6znmr", UID:"05ddba6a-7ff2-46e5-a291-703d1fcaf2ac", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"5483", FieldPath:""}
 waiting 1.200432009s for subscription subscription-e2e-kkl2v/sub-s7vc6 to have state AtLatestKnown: has state UpgradePending
 03:38:24.2762: subscription subscription-e2e-kkl2v/sub-s7vc6 state: AtLatestKnown (csv nginx-b): installPlanRef: &v1.ObjectReference{Kind:"InstallPlan", Namespace:"subscription-e2e-kkl2v", Name:"install-6znmr", UID:"05ddba6a-7ff2-46e5-a291-703d1fcaf2ac", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"5483", FieldPath:""}
 waiting 5.799400416s for subscription subscription-e2e-kkl2v/sub-s7vc6 to have state AtLatestKnown: has state AtLatestKnown
 waiting 199.390965ms for installplan subscription-e2e-kkl2v/install-6znmr to be phases [Complete], in phase Complete
 waiting for CSV subscription-e2e-kkl2v/nginx-a to reach condition
 waited 199.673267ms for csv subscription-e2e-kkl2v/nginx-a - Succeeded (InstallSucceeded): install strategy completed with no errors
 waiting for CSV subscription-e2e-kkl2v/nginx-b to reach condition
 waited 201.395196ms for csv subscription-e2e-kkl2v/nginx-b - Succeeded (InstallSucceeded): install strategy completed with no errors
 Deleting config map catsrcdbz5n-configmap...
 Deleting catalog source catsrcdbz5n...
 waiting for the catalog source catsrcdbz5n-rv9bq pod to be deleted...
 [FAILED] in [It] - /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:3571 @ 11/30/23 03:43:26.89
 collecting the subscription-e2e-kkl2v namespace artifacts as the 'creation in case of transferring providedAPIs' test case failed
 collecting logs in the ./artifacts/ artifacts directory
 tearing down the subscription-e2e-kkl2v namespace
 resetting e2e kube client
 deleting subscription-e2e-kkl2v/subscription-e2e-kkl2v-operatorgroup
 deleting <global>/subscription-e2e-kkl2v
 garbage collecting CRDs
 deleting crd ins9v2jd.cluster.com
 deleting crd insv7swp.cluster.com
 << Timeline

 [FAILED] 
 	Error Trace:	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:3571
 	            				/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:2126
 	            				/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/github.com/onsi/ginkgo/v2/internal/node.go:463
 	            				/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/github.com/onsi/ginkgo/v2/internal/suite.go:865
 	            				/opt/hostedtoolcache/go/1.20.11/x64/src/runtime/asm_amd64.s:1598
 	Error:      	Received unexpected error:
 	            	failed to wati for catalog source to reach intended state: timed out waiting for the condition
 	Test:       	Subscription creation in case of transferring providedAPIs
 
 In [It] at: /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:3571 @ 11/30/23 03:43:26.89

 Full Stack Trace
   github.com/stretchr/testify/require.NoError({0x7f1b34946088, 0xc001221dd0}, {0x4298c60, 0xc002bdc2c0}, {0x0, 0x0, 0x0})
   	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/vendor/github.com/stretchr/testify/require/require.go:1357 +0x96
   github.com/operator-framework/operator-lifecycle-manager/test/e2e.updateInternalCatalog({0x7f1b34945fd8?, 0xc001221dd0}, {0x42f9b40, 0xc0009765a0}, {0x42c7200, 0xc000976e70}, {0xc0013de400, 0xb}, {0xc001908870, 0x16}, ...)
   	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:3571 +0x897
   github.com/operator-framework/operator-lifecycle-manager/test/e2e.glob..func25.18()
   	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:2126 +0x1cd8

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov (Member, Author)

Our good friend `[FAIL] Subscription [It] creation manual approval`

@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov (Member, Author)

Wow! three!

 Summarizing 3 Failures:
  [FAIL] Install Plan with CRD schema change Test [It] existing version is present in new CRD (deprecated field)
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:3571
  [FAIL] Metrics are generated for OLM managed resources Given an OperatorGroup that supports all namespaces when a CSV spec does not include Install Mode [It] generates csv_abnormal metric for OLM pod
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/metrics_e2e_test.go:89
  [FAIL] Install Plan [It] creation with permissions
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:2905

@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 30, 2023
@stevekuznetsov (Member, Author)

 • [FAILED] [322.376 seconds]
Subscription [It] creation manual approval
/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:322

  Captured StdOut/StdErr Output >>
  waiting 198.501233ms for catalog pod subscription-e2e-4gzsb/mock-ocs to be available (for sync) - NO_CONNECTION
  waiting 1.600270137s for catalog pod subscription-e2e-4gzsb/mock-ocs to be available (for sync) - IDLE
  waiting 800.031208ms for catalog pod subscription-e2e-4gzsb/mock-ocs to be available (for sync) - TRANSIENT_FAILURE
  waiting 14.200389462s for catalog pod subscription-e2e-4gzsb/mock-ocs to be available (for sync) - READY
  probing catalog mock-ocs pod with address mock-ocs.subscription-e2e-4gzsb.svc:50051
  skipping health check
  Using the kubectl kubectl binary
  Using the artifacts/subscription-e2e-4gzsb output directory
  Storing the test artifact output in the artifacts/subscription-e2e-4gzsb directory
  Collecting get catalogsources -o yaml output...
  Collecting get subscriptions -o yaml output...
  Collecting get operatorgroups -o yaml output...
  Collecting get clusterserviceversions -o yaml output...
  Collecting get installplans -o yaml output...
  Collecting get pods -o wide output...
  Collecting get events --sort-by .lastTimestamp output...
  << Captured StdOut/StdErr Output

  Timeline >>
  created the subscription-e2e-4gzsb testing namespace
  created the subscription-e2e-4gzsb/subscription-e2e-4gzsb-operatorgroup operator group
  created configmap subscription-e2e-4gzsb/mock-ocs
  created catalog source subscription-e2e-4gzsb/mock-ocs
  created subscription subscription-e2e-4gzsb/manual-subscription
  15:17:52.8924: subscription subscription-e2e-4gzsb/manual-subscription state: UpgradePending (csv myapp-stable): installPlanRef: &v1.ObjectReference{Kind:"InstallPlan", Namespace:"subscription-e2e-4gzsb", Name:"install-bp29g", UID:"214d0ae8-1cf5-426f-b231-1045df187dae", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"1595", FieldPath:""}
  waiting 1.598995709s for subscription subscription-e2e-4gzsb/manual-subscription to have state UpgradePending: has state UpgradePending
  waiting 199.853197ms for installplan subscription-e2e-4gzsb/install-bp29g to be phases [RequiresApproval], in phase RequiresApproval
  [FAILED] in [It] - /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:358 @ 11/30/23 15:22:53.3
  collecting the subscription-e2e-4gzsb namespace artifacts as the 'creation manual approval' test case failed
  collecting logs in the ./artifacts/ artifacts directory
  tearing down the subscription-e2e-4gzsb namespace
  resetting e2e kube client
  deleting subscription-e2e-4gzsb/subscription-e2e-4gzsb-operatorgroup
  deleting <global>/subscription-e2e-4gzsb
  garbage collecting CRDs
  << Timeline

  [FAILED] Timed out after 300.001s.
  Expected
      <bool>: false
  to be true
  In [It] at: /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:358 @ 11/30/23 15:22:53.3

  Full Stack Trace
    github.com/operator-framework/operator-lifecycle-manager/test/e2e.glob..func25.7()
    	/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/subscription_e2e_test.go:358 +0x82d

@stevekuznetsov stevekuznetsov added this pull request to the merge queue Nov 30, 2023
Merged via the queue into operator-framework:master with commit e2b3768 Nov 30, 2023
16 checks passed
@stevekuznetsov (Member, Author)

wow!
