OPRUN-4445: add lifecycle-controller and lifecycle-server #1281
fgiudici wants to merge 7 commits into openshift:main
Conversation
Add openshift/controller-runtime-common for shared TLS controller support, and update openshift/library-go crypto package. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the lifecycle-server component that serves File-Based Catalog content over HTTPS. The server watches for catalog updates and serves them to OLM for operator discovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the lifecycle-controller that reconciles CatalogSource resources to manage lifecycle-server pods. Includes dynamic TLS configuration via controller-runtime-common, NetworkPolicy management, leader election, and server-side apply for owned resources. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add deployment manifests for lifecycle-controller and lifecycle-server including RBAC, NetworkPolicy, Service, and IBM Cloud managed variants. Update Dockerfile with lifecycle binary targets, Makefile with build rules, CRD manifests with lifecycle annotations, and the manifest generation script for downstream handling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add comprehensive unit tests covering controller reconciliation, TLS configuration, FBC catalog building, and HTTP server behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skipping CI for Draft Pull Request.
@fgiudici: This pull request references OPRUN-4445 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: fgiudici. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Walkthrough

Adds two new CLIs (lifecycle-controller, lifecycle-server), implements a controller and HTTP server for serving lifecycle FBC data with dynamic TLS and auth, introduces Kubernetes manifests (Deployment/Service/NetworkPolicy/RBAC), updates build/image flow, and simplifies CRD manifest generation.
Sequence Diagram(s)

sequenceDiagram
participant Admin as Admin
participant K8sAPI as Kubernetes API
participant Manager as Controller-Runtime Manager
participant Controller as Lifecycle Controller
participant CatalogPod as Catalog Pod
participant TLSProvider as TLSConfig Provider
Admin->>K8sAPI: Create/Update CatalogSource
K8sAPI->>Manager: Event/watch
Manager->>Controller: Reconcile request
Controller->>K8sAPI: Get CatalogSource
Controller->>CatalogPod: Query pods for image digest & node info
CatalogPod-->>Controller: Pod metadata
Controller->>TLSProvider: Get TLS config
TLSProvider-->>Controller: TLS config + unsupported ciphers
Controller->>K8sAPI: Create/Update Deployment, Service, NetworkPolicy, ServiceAccount
Controller->>K8sAPI: Update ClusterRoleBinding subjects
K8sAPI-->>Controller: Resources applied
Controller-->>Manager: Reconciliation complete
sequenceDiagram
participant Client as Client
participant Health as Healthz Endpoint
participant Proxy as Auth/Authz Middleware
participant API as Lifecycle API Handler
participant FBC as FBC Index
Client->>Health: GET /healthz
Health-->>Client: 200 OK
Client->>Proxy: GET /api/v1/lifecycles/pkg
Proxy->>Proxy: Validate token (TokenReview / SAR)
alt Authz succeeds
Proxy->>API: Forward request
API->>FBC: Lookup version/package
alt Found
FBC-->>API: json.RawMessage blob
API->>Client: 200 OK + application/json
else Not found
API->>Client: 404 Not Found
end
else Authz fails
Proxy-->>Client: 403 Forbidden
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 3 warnings)
✅ Passed checks (8 passed)
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/generate_crds_manifests.sh (1)
850-865: ⚠️ Potential issue | 🟠 Major

Microshift kustomization will pull in the TechPreview lifecycle manifests.

files_to_ignore filters out ibm-cloud-managed, psm-operator, collect-profiles, pprof, etc., but does not include lifecycle. As written, 0000_50_olm_08-lifecycle-controller.* and 0000_50_olm_09-lifecycle-server.rbac.yaml will be copied into microshift-manifests/ and appended to kustomization.yaml. Since these resources rely on:

- release.openshift.io/feature-set: TechPreviewNoUpgrade (a no-op in microshift, where it is always active), and
- config.openshift.io/apiservers (does not exist in microshift; the ClusterRole references it at lines 728-730),

this will either ship an always-on TechPreview component on microshift or fail at runtime when the controller tries to watch apiservers. Please either add "lifecycle" (or the specific filenames) to files_to_ignore, or confirm microshift is an intended delivery target.

📝 Proposed fix

-files_to_ignore=("ibm-cloud-managed.yaml" "kustomization.yaml" "psm-operator" "removed" "collect-profiles" "pprof" "prometheus" "service-monitor" "operatorstatus")
+files_to_ignore=("ibm-cloud-managed.yaml" "kustomization.yaml" "psm-operator" "removed" "collect-profiles" "pprof" "prometheus" "service-monitor" "operatorstatus" "lifecycle-controller" "lifecycle-server")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/generate_crds_manifests.sh` around lines 850 - 865, The microshift kustomization generation loop is currently missing "lifecycle" in the files_to_ignore list, so lifecycle controller/server manifests (e.g., 0000_50_olm_08-lifecycle-controller.* and 0000_50_olm_09-lifecycle-server.rbac.yaml) are being included; update the files_to_ignore array used by the microshift_manifests_files loop to add "lifecycle" (or the specific lifecycle filenames) so the for-loop that appends entries into microshift-manifests/kustomization.yaml will skip those lifecycle manifests, ensuring they are not added to the generated kustomization.yaml.
🧹 Nitpick comments (17)
scripts/generate_crds_manifests.sh (1)
677-791: RBAC/NetworkPolicy consistency nits for the lifecycle-controller manifests.

A couple of smaller things worth a second look, not blockers:

- NetPol (lines 693-697): egress to 6443/TCP is unscoped by to:, so it allows egress to any pod/service/external endpoint on port 6443. Consider constraining it with a namespaceSelector for the default namespace (kube-apiserver) to follow least privilege, as collect-profiles does above.
- ClusterRole leases (line 769): verbs are get, list, watch, create, update, delete, but the leader-election library used by controller-runtime only needs get, create, update (and optionally list, watch when ReleaseOnCancel is off). delete in particular is not required. Trimming unused verbs matches upstream guidance.
- ClusterRole for services (line 746) includes delete, but the controller is doing server-side apply for owned resources. Verify delete is actually exercised for garbage collection (e.g., on CatalogSource deletion) vs. relying on ownerReferences. If the latter, delete can be dropped.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/generate_crds_manifests.sh` around lines 677 - 791, The NetworkPolicy for lifecycle-controller currently allows egress to port 6443/TCP without scoping; update the manifest created in the heredoc (manifest name 0000_50_olm_08-lifecycle-controller.networkpolicy.yaml) to add a to: with an appropriate namespaceSelector (e.g., select the kube-apiserver/default namespace) to restrict egress to the API server; in the ClusterRole (kind: ClusterRole name: operator-lifecycle-manager-lifecycle-controller in 0000_50_olm_08-lifecycle-controller.rbac.yaml) remove the unnecessary "delete" verb from the coordination.k8s.io/leases rule and trim verbs to only ["get","create","update"] (optionally keep "list","watch" if ReleaseOnCancel is disabled); finally, review the ClusterRole rule for services (apiGroups: [""], resources: ["services"]) and if the controller relies on ownerReferences/server-side-apply for GC, remove the "delete" verb from that rule as well (or retain it only if explicit deletions are exercised).

pkg/lifecycle-controller/tls_test.go (1)
75-89: Optional: strengthen GetCertificatePreserved by checking invocation count.

The current test relies on a boolean that flips on first call. To guarantee the injected callback is the one invoked via the cloned *tls.Config, consider asserting a counter increments exactly once (or that the callback pointer survives a Get() clone).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lifecycle-controller/tls_test.go` around lines 75 - 89, Replace the boolean flag in TestTLSConfigProvider_GetCertificatePreserved with an integer counter so you can assert the injected callback is invoked the expected number of times: change getCert to increment a counter (e.g., calls++), create the provider via NewTLSConfigProvider and obtain cfg via p.Get(), call cfg.GetCertificate(nil), then assert the counter equals 1 (or the exact expected count) instead of checking a bool; reference the getCert closure, NewTLSConfigProvider, p.Get, and cfg.GetCertificate to locate the test and update expectations.
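A minimal sketch of the counter-based variant; the NewTLSConfigProvider constructor shape and the two-value Get() return are assumed from the rest of this PR, and the assertions use testify as the existing tests do:

```go
import (
	"crypto/tls"
	"testing"

	"github.com/stretchr/testify/require"
)

// Sketch only: assumes NewTLSConfigProvider takes the GetCertificate callback
// and that Get() returns (*tls.Config, unsupportedCiphers) as elsewhere in the PR.
func TestTLSConfigProvider_GetCertificatePreserved(t *testing.T) {
	calls := 0
	getCert := func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		calls++
		return &tls.Certificate{}, nil
	}

	p := NewTLSConfigProvider(getCert)
	cfg, _ := p.Get()
	require.NotNil(t, cfg.GetCertificate)

	_, err := cfg.GetCertificate(nil)
	require.NoError(t, err)
	require.Equal(t, 1, calls, "injected callback should be invoked exactly once via the cloned config")
}
```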
manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml (1)

16-23: Consider tightening ingress/egress to namespace or CIDR selectors.

Both rules omit from/to peers, so ingress on 8443 is reachable from any pod in the cluster and egress on 6443 can dial any address (not just the kube-apiserver). The metrics endpoint is TLS-protected, so this is typically acceptable, but you may want to narrow ingress to the monitoring namespace (e.g., openshift-monitoring) and scope egress for 6443 if you want defense-in-depth consistent with the rest of OLM's hardening. Not a blocker.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml` around lines 16 - 23, Ingress on port 8443 and egress on port 6443 currently have no peer selectors so they allow traffic from/to anywhere; update the NetworkPolicy stanza that defines ingress: - ports: - port: 8443 and egress: - ports: - port: 6443 to include appropriate peer selectors (e.g., for ingress add a from: with namespaceSelector matching the monitoring namespace such as matchLabels: {name: openshift-monitoring} or a podSelector limiting to monitoring pods, and for egress add a to: with namespaceSelector/podSelector or an ipBlock.cidr that targets the kube-apiserver IPs) so the rules are scoped rather than global.

pkg/lifecycle-controller/TODO.md (1)
1-42: Nit: consider tracking these in a Jira/GitHub issue rather than an in-tree TODO.md.

In-tree TODO files tend to drift and be forgotten. Since this PR is already tied to OPRUN-4445, filing a follow-up Jira that links back to this content would make the e2e coverage plan discoverable and actionable outside the repository. The file itself is fine to keep as a checklist in the meantime.

Happy to draft the follow-up Jira description if useful.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lifecycle-controller/TODO.md` around lines 1 - 42, Move the in-tree TODO items into a tracked external ticket and link it from the file: create a Jira/GitHub issue (referencing OPRUN-4445) that contains the e2e test checklist from pkg/lifecycle-controller/TODO.md, include expected test scenarios (happy-path and targeted tests) and acceptance criteria, then update pkg/lifecycle-controller/TODO.md to replace the full checklist with a short pointer to the newly created issue (and the issue ID) and optionally note that the file is retained as a temporary local checklist; also update the PR description to reference the new issue so the work is discoverable and actionable outside the repo.

Makefile (1)
102-106: Optional: consider embedding version/commit and dropping unused build tags.

Two small improvements worth considering:

- Unlike $(OLM_CMDS) and $(REGISTRY_CMDS), these binaries lack -ldflags injection of GitCommit/Version. While lifecycle-controller reads a RELEASE_VERSION environment variable, matching the OLM convention of version-flagged binaries would improve production log visibility.
- The build tags -tags "json1,containers_image_openpgp" are unused by either binary; neither imports mattn/go-sqlite3 or containers/image. Dropping them avoids pulling in unnecessary CGO for json1.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 102 - 106, Update the Makefile build rules for LIFECYCLE_CONTROLLER_CMD and LIFECYCLE_SERVER_CMD to embed version metadata and remove unnecessary build tags: add -ldflags settings to inject GitCommit/Version (matching the OLM_CMDS/REGISTRY_CMDS convention) into the go build invocation for the targets LIFECYCLE_CONTROLLER_CMD and LIFECYCLE_SERVER_CMD, and drop the unused -tags "json1,containers_image_openpgp" from those build lines so CGO/sqlite-related tags are not pulled in for these binaries.

pkg/lifecycle-server/fbc.go (3)
108-115: Sort ListVersions output for deterministic logs.

Map iteration order in Go is intentionally randomized, so the version list in start.go's startup log (log.Info("loaded lifecycle data", ..., "versions", data.ListVersions())) will change between restarts. Sorting produces deterministic diagnostic output.

♻️ Suggested refactor

 // ListVersions returns the list of versions available in the index
 func (index LifecycleIndex) ListVersions() []string {
 	versions := make([]string, 0, len(index))
 	for v := range index {
 		versions = append(versions, v)
 	}
+	sort.Strings(versions)
 	return versions
 }

Add "sort" to the imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lifecycle-server/fbc.go` around lines 108 - 115, ListVersions currently returns versions in map iteration order which is non-deterministic; modify LifecycleIndex.ListVersions to sort the versions before returning them (use sort.Strings on the slice) so logs (e.g., the startup log in start.go that calls data.ListVersions()) are deterministic, and add the "sort" import to the file.
42-87: Silent error skip on per-file walk errors loses debuggability.

At lines 53-54, any error from WalkMetasFS on an individual file (malformed JSON, I/O error, unreadable entry) is silently dropped. Given this data is consumed by a public API and a single bad file can drop a lifecycle blob for a package, surfacing these would help operators diagnose a missing package. Consider accepting a logr.Logger (or returning a slice of non-fatal warnings) so callers can log per-file problems without aborting the walk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lifecycle-server/fbc.go` around lines 42 - 87, LoadLifecycleData currently swallows per-file errors inside the declcfg.WalkMetasFS callback, losing debuggability; modify LoadLifecycleData to accept a logger (e.g., logr.Logger) or a pointer to a slice of warning errors, and in the WalkMetasFS callback, instead of returning nil on per-file errors (the err parameter) or other parse/validation issues, record them by logging via the provided logger or appending to the warnings slice (but continue walking), keeping the existing behavior of returning a single terminal error only if WalkMetasFS itself fails; update the function signature (LoadLifecycleData) and callers accordingly and reference the meta/err handling inside the anonymous callback where schemaVersionRegex, meta.Package, meta.Blob and the result map are used so individual file problems are surfaced without aborting the overall walk.
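A sketch of the logger-threading variant; the LoadLifecycleData and LifecycleIndex names follow this PR, while the context-taking WalkMetasFS signature and its (path, meta, err) callback are assumed from recent operator-registry declcfg versions:

```go
import (
	"context"
	"io/fs"

	"github.com/go-logr/logr"
	"github.com/operator-framework/operator-registry/alpha/declcfg"
)

// Sketch: surface per-file problems instead of dropping them, while keeping the
// walk itself non-fatal. Only a WalkMetasFS-level failure aborts loading.
func LoadLifecycleData(ctx context.Context, fsys fs.FS, log logr.Logger) (LifecycleIndex, error) {
	result := LifecycleIndex{}
	walkErr := declcfg.WalkMetasFS(ctx, fsys, func(path string, meta *declcfg.Meta, err error) error {
		if err != nil {
			// Non-fatal: record the problem and keep walking the catalog.
			log.V(1).Info("skipping unreadable FBC entry", "path", path, "error", err.Error())
			return nil
		}
		// ... existing schema filtering and blob indexing goes here ...
		_ = meta
		return nil
	})
	if walkErr != nil {
		return nil, walkErr
	}
	return result, nil
}
```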
70-77: Consider logging or rejecting duplicate (version, package) entries.

A later-walked FBC file silently overwrites an earlier blob at line 76 with no indication. This is surprising if a catalog accidentally ships two lifecycle blobs for the same package/version, and whichever is walked last wins (order is not guaranteed by WalkMetasFS). A debug log line here (or an explicit error-on-duplicate option) would make the behavior observable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lifecycle-server/fbc.go` around lines 70 - 77, When inserting into result under schemaVersion and meta.Package (the block using mu.Lock()/mu.Unlock() that assigns result[schemaVersion][meta.Package] = meta.Blob), check whether an entry already exists and either log a debug/warn message with the existing vs new blob identity or return an error to reject duplicates; specifically, inside that critical section look up result[schemaVersion][meta.Package] before assignment, and if non-nil call the package’s logger or lifecycle server logger with schemaVersion and meta.Package and indicate a duplicate (optionally include a short blob fingerprint/length) or propagate an error instead of silently overwriting; this makes duplicate (version, package) entries from WalkMetasFS observable and avoids silent last-wins behavior.
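As a sketch, the overwrite could be funneled through a small helper so the duplicate case is visible; the map shape (version → package → json.RawMessage) is inferred from this comment, and the logger parameter is an assumed addition:

```go
import (
	"encoding/json"

	"github.com/go-logr/logr"
)

// Sketch: record a blob and report when it displaces an existing entry.
func (index LifecycleIndex) add(log logr.Logger, version, pkg string, blob json.RawMessage) {
	if _, ok := index[version]; !ok {
		index[version] = map[string]json.RawMessage{}
	}
	if existing, ok := index[version][pkg]; ok {
		log.Info("duplicate lifecycle blob; keeping the last one walked",
			"version", version,
			"package", pkg,
			"previousBytes", len(existing),
			"newBytes", len(blob))
	}
	index[version][pkg] = blob
}
```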
pkg/lifecycle-controller/controller_test.go (1)

125-129: Edge case: empty name produces invalid DNS label.

resourceName("") returns "-lifecycle-server", which starts with a hyphen and is not a valid DNS-1123 label; any attempt to create a resource with this name will be rejected by the API server. In practice CatalogSource.Name is validated by Kubernetes and can't be empty, so this is defensive only; consider either removing this test case or having resourceName return an error / panic for empty input to make the contract explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lifecycle-controller/controller_test.go` around lines 125 - 129, The test shows resourceName("") returns "-lifecycle-server", producing an invalid DNS-1123 label; update the code/tests so the contract is explicit: either remove the "empty name" test case from controller_test.go, or change resourceName to validate input and return an error or panic for empty/invalid names (e.g., check in resourceName for "" and handle accordingly). Locate the function resourceName and the test case named "empty name" to implement the chosen fix and ensure callers (e.g., any CatalogSource name handling) are updated to handle the new error/behavior.

cmd/lifecycle-controller/start.go (2)
124-135: LIFECYCLE_SERVER_IMAGE required check runs before TLS cert existence check; fine, but consider consolidated validation.

Not a bug, but loadStartConfig mixes validation (env vars, cert file load, selectors) with construction (REST config, scheme, leader election). Extracting a validateFlags() step that runs first would produce a single clear error message per invalid input rather than failing on the first one. Optional cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/lifecycle-controller/start.go` around lines 124 - 135, The loadStartConfig function currently intermixes environment/flag validation with setup logic; extract a new validateStartConfig (or validateFlags) helper and move all input-only checks (NAMESPACE presence when !disableLeaderElection, LIFECYCLE_SERVER_IMAGE non-empty, TLS cert existence and any selector validation) into that function so loadStartConfig only constructs and returns the startConfig and runtime objects (REST config, scheme, leader election flags); call validateStartConfig early from the startup path before invoking loadStartConfig, keep the same error messages, and reference the existing symbols loadStartConfig and disableLeaderElection to locate where to split validation from construction.
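A sketch of the split, assuming the NAMESPACE and LIFECYCLE_SERVER_IMAGE environment variables named in this file (errors.Join requires Go 1.20+):

```go
import (
	"errors"
	"os"
)

// Sketch: run all input-only checks up front so a bad invocation reports every
// problem at once, before any REST config / scheme / manager construction.
func validateStartConfig(disableLeaderElection bool) error {
	var errs []error
	if !disableLeaderElection && os.Getenv("NAMESPACE") == "" {
		errs = append(errs, errors.New("NAMESPACE must be set when leader election is enabled"))
	}
	if os.Getenv("LIFECYCLE_SERVER_IMAGE") == "" {
		errs = append(errs, errors.New("LIFECYCLE_SERVER_IMAGE must be set"))
	}
	return errors.Join(errs...)
}
```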
251-256: healthz check is a no-op stub.

The /healthz handler always returns nil, so it will stay green even if the cache hasn't synced, the leader hasn't been elected, or the TLS provider hasn't been initialized. Controller-runtime exposes mgr.GetCache().WaitForCacheSync and similar hooks that could back a meaningful readiness check. Low priority since most OLM deployments have historically used this pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/lifecycle-controller/start.go` around lines 251 - 256, The current healthz handler added via mgr.AddHealthzCheck("healthz", ...) is a no-op; replace it with a readiness check that returns an error until required subsystems are ready: use mgr.GetCache().WaitForCacheSync(ctx) to ensure caches are synced, verify leader election/leadership if applicable (e.g., check a stored elected flag or manager/leader utilities), and confirm the TLS provider has been initialized (check the TLS provider variable/state used elsewhere in start.go). Implement the closure to perform these checks (using the appropriate context available in start.go), return nil when all pass and an error describing the failing condition otherwise, and keep the AddHealthzCheck call signature the same.
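One hedged sketch of a readiness gate built on cache sync (it uses AddReadyzCheck rather than the existing AddHealthzCheck; mgr and ctx are the manager and root context already in run(); needs net/http, sync/atomic, errors, and fmt):

```go
// Sketch: report ready only after informer caches have synced, instead of a
// constant-nil check. Leader election or TLS-provider state could be gated
// the same way if desired.
synced := &atomic.Bool{}
go func() {
	if mgr.GetCache().WaitForCacheSync(ctx) {
		synced.Store(true)
	}
}()
if err := mgr.AddReadyzCheck("cache-sync", func(_ *http.Request) error {
	if !synced.Load() {
		return errors.New("informer caches not yet synced")
	}
	return nil
}); err != nil {
	return fmt.Errorf("unable to add readyz check: %w", err)
}
```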
microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml (1)

78-90: Consider adding a startupProbe given initialDelaySeconds: 30.

With both liveness and readiness probes using initialDelaySeconds: 30 and default periodSeconds/failureThreshold, a slow cold start (e.g., leader-election contention, API server warm-up) could trigger premature liveness failures and restart loops. A startupProbe with a generous failureThreshold would decouple cold-start tolerance from steady-state liveness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml` around lines 78 - 90, Add a startupProbe to the container alongside the existing livenessProbe and readinessProbe to tolerate slow cold starts: define startupProbe with an httpGet to path /healthz on port "health" and scheme HTTP, set a modest initialDelaySeconds (e.g., 10), a larger periodSeconds (e.g., 10) and a generous failureThreshold (e.g., 30) so startup failures don’t trigger liveness restarts; update the same container spec that contains livenessProbe, readinessProbe and terminationMessagePolicy to include this startupProbe stanza.

manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml (2)
52-55: Consider narrowing clusterrolebindings permissions to the managed CRB name.

The controller manages exactly one CRB (clusterRoleBindingName, per pkg/lifecycle-controller/controller.go), but this rule grants full CRUD on every CRB in the cluster. Since create can't be scoped by resourceNames, split the rule so that write verbs are scoped to the specific CRB to limit blast radius:

🛡️ Proposed RBAC hardening

-  # Manage lifecycle-server clusterrolebindings
-  - apiGroups: ["rbac.authorization.k8s.io"]
-    resources: ["clusterrolebindings"]
-    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
+  # Manage lifecycle-server clusterrolebindings
+  - apiGroups: ["rbac.authorization.k8s.io"]
+    resources: ["clusterrolebindings"]
+    verbs: ["get", "list", "watch", "create"]
+  - apiGroups: ["rbac.authorization.k8s.io"]
+    resources: ["clusterrolebindings"]
+    resourceNames: ["operator-lifecycle-manager-lifecycle-server"]
+    verbs: ["update", "patch", "delete"]

Note: the Trivy KSV-0056 hits on services/networkpolicies appear to be false positives; these are required for the controller's core responsibility of reconciling per-CatalogSource lifecycle-server resources.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml` around lines 52 - 55, The ClusterRole currently grants full CRUD on all clusterrolebindings; narrow write access by splitting the rule: keep a broad read rule for "clusterrolebindings" with verbs ["get","list","watch"], and add a separate rule for write verbs ["create","update","patch","delete"] that includes resourceNames set to the specific ClusterRoleBinding name used by the controller (clusterRoleBindingName as referenced in pkg/lifecycle-controller/controller.go); update the RBAC manifest to reflect these two rules so only the managed CRB can be modified while reads remain cluster-wide.
63-69: Consider adding a resourceNames field to narrow the RBAC permissions to the specific lease and configmap resources used by the controller.

Verification confirms the controller uses "operator-lifecycle-manager-lifecycle-server" as both the cluster role and binding name (in pkg/lifecycle-controller/controller.go lines 62-63). To follow least-privilege principles, the lease and configmap rules should be restricted using resourceNames: ["operator-lifecycle-manager-lifecycle-server"].

Example narrowing

# Leader election
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  resourceNames: ["operator-lifecycle-manager-lifecycle-server"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["operator-lifecycle-manager-lifecycle-server"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml` around lines 63 - 69, The RBAC rules for leader election currently grant broad access to leases and configmaps; narrow them by adding resourceNames: ["operator-lifecycle-manager-lifecycle-server"] to both rules so only the specific lease and configmap used by the lifecycle controller are allowed (match the controller name referenced in pkg/lifecycle-controller/controller.go as operator-lifecycle-manager-lifecycle-server).

pkg/lifecycle-controller/controller.go (3)
339-348: Two naming/path hygiene nits.

- resourceName (lines 339-348): truncation to 63 chars is DNS-safe in length but doesn't guard against collisions between two CatalogSources whose first 63 characters of <csName>-lifecycle-server match. CatalogSource names are usually short so the risk is low, but if you want to be defensive, append a short hash of the full name before truncating (e.g., truncate to 57 chars, plus "-", plus a 5-char sha256 prefix).
- fbcPath (line 396): catalogMountPath + catalogDir relies on catalogDir starting with /. The default "/configs" is fine, but a user-supplied cs.Spec.GrpcPodConfig.ExtractContent.CatalogDir without a leading slash (e.g., "configs") would produce "/catalogconfigs". Use path.Join(catalogMountPath, catalogDir) to normalize.

🔧 fbcPath fix

-	const catalogMountPath = "/catalog"
-	fbcPath := catalogMountPath + catalogDir
+	const catalogMountPath = "/catalog"
+	fbcPath := path.Join(catalogMountPath, catalogDir)

Also applies to: 390-396
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lifecycle-controller/controller.go` around lines 339 - 348, resourceName currently truncates to 63 chars which can collide; change it to compute a short hash (e.g., sha256) of the full csName, take a 5-character hex prefix, then truncate the human-readable portion to 57 chars and append "-" + the 5-char prefix before final trim/lowercase (update function resourceName to produce "<truncated-57>-<5hex>" and still enforce '-' trimming and lowercase). For fbcPath construction, stop concatenating strings and use path.Join(catalogMountPath, catalogDir) (ensure you import "path") so user-supplied catalogDir with or without leading slash is normalized; update the code that builds fbcPath (using cs.Spec.GrpcPodConfig.ExtractContent.CatalogDir and catalogMountPath) accordingly.
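A sketch of the hash-suffix variant of resourceName, assuming the current suffix and 63-character cap (imports: crypto/sha256, encoding/hex, strings):

```go
// Sketch: keep short names unchanged, and make truncated names collision-resistant
// by appending a 5-character digest of the full name ("<truncated-57>-<5hex>").
func resourceName(csName string) string {
	name := csName + "-lifecycle-server"
	if len(name) <= 63 {
		return name
	}
	sum := sha256.Sum256([]byte(name))
	suffix := hex.EncodeToString(sum[:])[:5]
	base := strings.TrimRight(name[:57], "-") // avoid a stray hyphen before the digest
	return base + "-" + suffix
}
```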
174-208: Reconcile-driven cleanup is fragile without OwnerReferences.

ensureResources applies SA/Service/Deployment/NetworkPolicy via SSA without any OwnerReferences to the CatalogSource. Cleanup depends entirely on the controller observing a delete/no-match event via cleanupResources. If the controller is down when a CatalogSource is deleted, the tombstone event can be missed (no relist delivers a DeleteEvent for an object already gone), and the per-catalog resources orphan indefinitely.

Since the resources are in the same namespace as the CatalogSource, setting an OwnerReference on each applied object lets the Kubernetes garbage collector clean them up for free. This also simplifies cleanupResources (it becomes mostly unnecessary for the common deletion path).

If cross-namespace ownership ever becomes a requirement, a finalizer on CatalogSource is the robust alternative.

Also applies to: 270-336
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lifecycle-controller/controller.go` around lines 174 - 208, ensureResources currently applies SA/Service/Deployment/NetworkPolicy without OwnerReferences, which can orphan resources if the controller misses delete events; update the resource builders (buildServiceAccount, buildService, buildDeployment, buildNetworkPolicy) or immediately before calling Apply in ensureResources to set a controller OwnerReference to the CatalogSource (use controllerutil.SetControllerReference or populate metav1.OwnerReference with cs as the owner) so the Kubernetes GC will remove per-catalog resources when the CatalogSource is deleted (ensure Namespace matches the CatalogSource namespace); you may then simplify cleanupResources since GC will handle the common delete path.
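A sketch of stamping ownership before the server-side apply; the CatalogSource type and the controller's client/scheme are taken as given from the PR, the field-owner string is illustrative, and each object is assumed to carry its TypeMeta as SSA requires:

```go
import (
	"context"
	"fmt"

	operatorsv1alpha1 "github.com/operator-framework/api/pkg/operators/v1alpha1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// applyOwned is a sketch, not the PR's code: it sets the CatalogSource as the
// controller owner of each namespaced object before applying it, so the garbage
// collector removes the per-catalog resources when the CatalogSource goes away.
func applyOwned(ctx context.Context, c client.Client, scheme *runtime.Scheme,
	cs *operatorsv1alpha1.CatalogSource, objs ...client.Object) error {
	for _, obj := range objs {
		if err := controllerutil.SetControllerReference(cs, obj, scheme); err != nil {
			return fmt.Errorf("setting owner reference on %s: %w", obj.GetName(), err)
		}
		if err := c.Patch(ctx, obj, client.Apply,
			client.FieldOwner("lifecycle-controller"), client.ForceOwnership); err != nil {
			return fmt.Errorf("applying %s: %w", obj.GetName(), err)
		}
	}
	return nil
}
```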
210-239: reconcileClusterRoleBinding lists every CatalogSource cluster-wide and re-applies the CRB on every reconcile.

Two small but worthwhile improvements:

- Line 214 lists all CatalogSources in the cluster without a selector, then re-filters client-side with matchesCatalogSource. If the cache for CatalogSources isn't already filtered, this is O(N) per reconcile and scales poorly with the total number of catalogs. Consider using client.MatchingLabelsSelector{Selector: r.CatalogSourceLabelSelector} as a List option to narrow the server/cache scan.
- The CRB is re-applied unconditionally on every reconcile of any CatalogSource (and pod, SA, service, NP event that enqueues one). In clusters with many matching catalogs, this becomes a hot write path. Server-side apply is a no-op when the desired state matches existing managed fields, but the round-trip cost still adds up. An easy mitigation is to compute the subject set and skip r.Apply when it equals the current CRB's subjects.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lifecycle-controller/controller.go` around lines 210 - 239, reconcileClusterRoleBinding currently calls r.List without a selector and always re-applies the CRB; change the List call to use client.MatchingLabelsSelector{Selector: r.CatalogSourceLabelSelector} so only matching CatalogSources are returned (refer to reconcileClusterRoleBinding and matchesCatalogSource/CatalogSourceLabelSelector), and after building the subjects slice (from resourceName and ServiceAccounts) fetch the existing ClusterRoleBinding (via r.Get) and compare its current subjects to the computed subjects; if they are identical, skip the r.Apply call to avoid unnecessary server-side apply round-trips, otherwise proceed to r.Apply as before.
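A sketch of both changes together; buildSubjects is a hypothetical helper, clusterRoleBindingName and r.CatalogSourceLabelSelector are the PR's existing names, and the equality check uses k8s.io/apimachinery/pkg/api/equality with k8s.io/api/rbac/v1:

```go
// Sketch: list only matching CatalogSources and skip the apply when the
// computed subjects already match the live ClusterRoleBinding.
var csList operatorsv1alpha1.CatalogSourceList
if err := r.List(ctx, &csList,
	client.MatchingLabelsSelector{Selector: r.CatalogSourceLabelSelector}); err != nil {
	return err
}
subjects := buildSubjects(csList.Items) // hypothetical helper wrapping resourceName()

var existing rbacv1.ClusterRoleBinding
if err := r.Get(ctx, client.ObjectKey{Name: clusterRoleBindingName}, &existing); err == nil &&
	equality.Semantic.DeepEqual(existing.Subjects, subjects) {
	return nil // no change needed; skip another server-side apply round-trip
}
// fall through to the existing server-side apply of the ClusterRoleBinding
```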
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/lifecycle-controller/start.go`:
- Around line 189-199: getInitialTLSProfile currently swallows errors from
tlsutil.FetchAPIServerTLSProfile and silently falls back to the default, which
hides startup RBAC/API errors and disables the watcher; modify
getInitialTLSProfile so that it logs the error (using the setupLog passed down
from loadStartConfig) and returns the error to the caller unless the failure is
a clear "resource not found" case, in which case you may continue to return the
default profile and a false flag; thread setupLog through loadStartConfig into
getInitialTLSProfile and update callers to handle the returned error so
transient API/RBAC failures surface instead of being masked.
- Around line 89-93: The deferred close of tlsProfileChan in
setupTLSProfileWatcher can race with blocking sends from the OnProfileChange
callback and cause panics or leaked goroutines; remove the defer
close(tlsProfileChan) and instead modify the OnProfileChange callback (which
receives a context) to gate its send on tlsProfileChan with a select that checks
ctx.Done() (drop the send if ctx is done) so sends never block or attempt to
send on a closed channel; leave the channel unclosed and let it be GC'd when no
longer referenced.
In `@manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`:
- Around line 62-67: Replace the incorrect tiny Go memory cap by increasing the
GOMEMLIMIT and the container memory request for the lifecycle-controller
deployment: update the env var GOMEMLIMIT from "5MiB" to a realistic value (e.g.
"50MiB" or match the lifecycle-server setting used in
pkg/lifecycle-controller/controller.go), and raise resources.requests.memory
(and CPU if needed) under the same container spec (resources.requests.memory,
resources.requests.cpu). Apply the same changes to the corresponding manifest
for IBM Cloud Managed
(0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml) so both
deployments use the higher GOMEMLIMIT and appropriate memory request.
In `@microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml`:
- Around line 4715-4718: The schema for the CRD includes a contradictory default
for the required field "ip": remove the line setting default: "" for the "ip"
property so the "ip" field remains required without a default; locate the "ip"
property in the CRD schema (the block containing x-kubernetes-list-map-keys: -
ip and x-kubernetes-list-type: map) and delete the default entry while leaving
the required list and other metadata intact.
In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`:
- Around line 62-67: The GOMEMLIMIT environment variable is misconfigured as
"5MiB" and the pod resource requests are far too low; update the Deployment for
lifecycle-controller to remove or set GOMEMLIMIT to a realistic Go heap cap (for
example "1Gi" or a value tuned from observed usage) and increase
resources.requests.cpu and resources.requests.memory from 10m/10Mi to sensible
defaults (for example 100m CPU and 200Mi memory) and ensure you make the same
change for the other manifest copy that contains the same GOMEMLIMIT and
resource request entries.
In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml`:
- Around line 16-19: The NetworkPolicy's ingress rule currently exposes port
8443 without a from: selector (ingress -> ports -> port: 8443), relying on
kube-rbac-proxy and the lifecycle-server RBAC for access control; if you later
want to restrict callers, update this NetworkPolicy to add a from: block (e.g.,
ingress.from with podSelector or namespaceSelector) that limits traffic to the
monitoring/metrics pods or a specific namespace/service account, ensuring the
rule explicitly selects only allowed sources instead of permitting cluster-wide
ingress to port 8443.
In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml`:
- Around line 52-72: Add the missing "bind" verb to the ClusterRole that grants
permissions related to operator-lifecycle-manager-lifecycle-server (i.e., the
rule covering apiGroups: ["authentication.k8s.io"] resources: ["tokenreviews"]
and apiGroups: ["authorization.k8s.io"] resources: ["subjectaccessreviews"]) so
the controller service account can create ClusterRoleBindings without future
privilege-escalation breaks; also remove the unused configmaps rule (apiGroups:
[""], resources: ["configmaps"]) and drop the "delete" verb from the leases rule
(apiGroups: ["coordination.k8s.io"], resources: ["leases"]) and from any
remaining configmaps/events verbs to enforce least privilege and reflect that
leader election uses LeasesResourceLock only.
In `@pkg/lifecycle-controller/controller_test.go`:
- Around line 826-887: The table-driven test declares and checks the expectErr
field but never sets it; either remove the unused expectErr field and the
conditional branch that checks tc.expectErr in the test, or add a dedicated test
case that sets expectErr=true and constructs a failing client to force
getCatalogPodInfo to return an error. Locate the test table and loop in
TestGetCatalogPodInfo (the tt slice and the t.Run loop), the tc.expectErr check,
and the call to r.getCatalogPodInfo(ctx, cs); then either delete the expectErr
field from the tt struct and the if tc.expectErr branch, or add a new entry that
triggers an error (e.g., create a client that returns an error when listing
Pods) and set expectErr:true so the error path is exercised.
In `@pkg/lifecycle-controller/tls_test.go`:
- Around line 91-122: In TestTLSConfigProvider_ConcurrentAccess, avoid calling
require.NotNil from spawned goroutines; change the in-goroutine checks to
assert.NotNil(t, cfg) and keep require.NotNil only on the main test goroutine
after wg.Wait(); also add the import "github.com/stretchr/testify/assert" to the
test file so assert is available. Ensure the references are the require.NotNil
call inside the anonymous reader goroutine in
TestTLSConfigProvider_ConcurrentAccess and not the final require.NotNil after
wg.Wait().
In `@scripts/generate_crds_manifests.sh`:
- Around line 657-675: The Service for lifecycle-controller exposes TLS metrics
on port 8443 but no ServiceMonitor is emitted; add a ServiceMonitor manifest for
the lifecycle-controller that targets selector app: olm-lifecycle-controller in
namespace openshift-operator-lifecycle-manager, with spec.endpoints containing
port: metrics, scheme: https, and tlsConfig that references the serving cert
secret (secretName: lifecycle-controller-serving-cert) so Prometheus uses the CA
bundle (and set insecureSkipVerify: false); ensure labels/namespace match the
Service and follow the pattern used for the PSM ServiceMonitor (use the same
endpoint name, selector, and namespace); alternatively, if monitoring is
intentionally out-of-scope, add a TODO comment in the script near the
lifecycle-controller Service block documenting the omission.
- Around line 589-608: The lifecycle-controller pod spec currently hardcodes a
single CatalogSource selector and an unsafe GOMEMLIMIT; update the deployment to
make the catalog selector configurable (do not hardcode
"metadata.namespace=openshift-marketplace,metadata.name=redhat-operators" in the
args passed to /bin/lifecycle-controller) by wiring a config value/flag in your
cluster-olm-operator reconciler (expose a field or feature gate that populates
the --catalog-source-field-selector arg or leave it unset) and add a clarifying
comment if this is intentionally a tech‑preview guardrail; also remove or raise
the environment GOMEMLIMIT value (currently "5MiB") in the pod env so it is at
least in line with the container memory request (or omit the variable to use Go
defaults) to avoid GC thrashing—update the lifecycle-controller args and the env
entry for GOMEMLIMIT (and any related config fields like RELEASE_VERSION /
LIFECYCLE_SERVER_IMAGE if you surface the selector via config) accordingly.
- Around line 49-50: The current copy step in scripts/generate_crds_manifests.sh
simply copies staged CRDs from "${ROOT_DIR}/staging/api/crds" to "${crddir}/"
but lacks documentation about update triggers and why controller-gen/yq fixups
were removed; update the script header comments to state how staging/api/crds is
updated (e.g., via upstream sync cadence or manual cherry-pick) and reference
the existing commit governance check (verify_commits.sh) that enforces
single-upstream commit traceability, and add a short note explaining that
controller-gen regeneration and yq mutations were removed because CRD manifests
are consumed verbatim from upstream and thus no local schema fixes are required;
mention the variables/commands to look at (cp, ROOT_DIR, crddir,
verify_commits.sh) so reviewers can locate the relevant logic.
---
Outside diff comments:
In `@scripts/generate_crds_manifests.sh`:
- Around line 850-865: The microshift kustomization generation loop is currently
missing "lifecycle" in the files_to_ignore list, so lifecycle controller/rl
roles (e.g., 0000_50_olm_08-lifecycle-controller.* and
0000_50_olm_09-lifecycle-server.rbac.yaml) are being included; update the
files_to_ignore array used by the microshift_manifests_files loop to add
"lifecycle" (or the specific lifecycle filenames) so the for-loop that appends
entries into microshift-manifests/kustomization.yaml will skip those lifecycle
manifests, ensuring they are not added to the generated kustomization.yaml.
---
Nitpick comments:
In `@cmd/lifecycle-controller/start.go`:
- Around line 124-135: The loadStartConfig function currently intermixes
environment/flag validation with setup logic; extract a new validateStartConfig
(or validateFlags) helper and move all input-only checks (NAMESPACE presence
when !disableLeaderElection, LIFECYCLE_SERVER_IMAGE non-empty, TLS cert
existence and any selector validation) into that function so loadStartConfig
only constructs and returns the startConfig and runtime objects (REST config,
scheme, leader election flags); call validateStartConfig early from the startup
path before invoking loadStartConfig, keep the same error messages, and
reference the existing symbols loadStartConfig and disableLeaderElection to
locate where to split validation from construction.
- Around line 251-256: The current healthz handler added via
mgr.AddHealthzCheck("healthz", ...) is a no-op; replace it with a readiness
check that returns an error until required subsystems are ready: use
mgr.GetCache().WaitForCacheSync(ctx) to ensure caches are synced, verify leader
election/leadership if applicable (e.g., check a stored elected flag or
manager/leader utilities), and confirm the TLS provider has been initialized
(check the TLS provider variable/state used elsewhere in start.go). Implement
the closure to perform these checks (using the appropriate context available in
start.go), return nil when all pass and an error describing the failing
condition otherwise, and keep the AddHealthzCheck call signature the same.
In `@Makefile`:
- Around line 102-106: Update the Makefile build rules for
LIFECYCLE_CONTROLLER_CMD and LIFECYCLE_SERVER_CMD to embed version metadata and
remove unnecessary build tags: add -ldflags settings to inject GitCommit/Version
(matching the OLM_CMDS/REGISTRY_CMDS convention) into the go build invocation
for the targets LIFECYCLE_CONTROLLER_CMD and LIFECYCLE_SERVER_CMD, and drop the
unused -tags "json1,containers_image_openpgp" from those build lines so
CGO/sqlite-related tags are not pulled in for these binaries.
In `@manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml`:
- Around line 16-23: Ingress on port 8443 and egress on port 6443 currently have
no peer selectors so they allow traffic from/to anywhere; update the
NetworkPolicy stanza that defines ingress: - ports: - port: 8443 and egress: -
ports: - port: 6443 to include appropriate peer selectors (e.g., for ingress add
a from: with namespaceSelector matching the monitoring namespace such as
matchLabels: {name: openshift-monitoring} or a podSelector limiting to
monitoring pods, and for egress add a to: with namespaceSelector/podSelector or
an ipBlock.cidr that targets the kube-apiserver IPs) so the rules are scoped
rather than global.
In `@manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml`:
- Around line 52-55: The ClusterRole currently grants full CRUD on all
clusterrolebindings; narrow write access by splitting the rule: keep a broad
read rule for "clusterrolebindings" with verbs ["get","list","watch"], and add a
separate rule for write verbs ["create","update","patch","delete"] that includes
resourceNames set to the specific ClusterRoleBinding name used by the controller
(clusterRoleBindingName as referenced in
pkg/lifecycle-controller/controller.go); update the RBAC manifest to reflect
these two rules so only the managed CRB can be modified while reads remain
cluster-wide.
- Around line 63-69: The RBAC rules for leader election currently grant broad
access to leases and configmaps; narrow them by adding resourceNames:
["operator-lifecycle-manager-lifecycle-server"] to both rules so only the
specific lease and configmap used by the lifecycle controller are allowed (match
the controller name referenced in pkg/lifecycle-controller/controller.go as
operator-lifecycle-manager-lifecycle-server).
In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`:
- Around line 78-90: Add a startupProbe to the container alongside the existing
livenessProbe and readinessProbe to tolerate slow cold starts: define
startupProbe with an httpGet to path /healthz on port "health" and scheme HTTP,
set a modest initialDelaySeconds (e.g., 10), a larger periodSeconds (e.g., 10)
and a generous failureThreshold (e.g., 30) so startup failures don’t trigger
liveness restarts; update the same container spec that contains livenessProbe,
readinessProbe and terminationMessagePolicy to include this startupProbe stanza.
In `@pkg/lifecycle-controller/controller_test.go`:
- Around line 125-129: The test shows resourceName("") returns
"-lifecycle-server", producing an invalid DNS-1123 label; update the code/tests
so the contract is explicit: either remove the "empty name" test case from
controller_test.go, or change resourceName to validate input and return an error
or panic for empty/invalid names (e.g., check in resourceName for "" and handle
accordingly). Locate the function resourceName and the test case named "empty
name" to implement the chosen fix and ensure callers (e.g., any CatalogSource
name handling) are updated to handle the new error/behavior.
In `@pkg/lifecycle-controller/controller.go`:
- Around line 339-348: resourceName currently truncates to 63 chars which can
collide; change it to compute a short hash (e.g., sha256) of the full csName,
take a 5-character hex prefix, then truncate the human-readable portion to 57
chars and append "-" + the 5-char prefix before final trim/lowercase (update
function resourceName to produce "<truncated-57>-<5hex>" and still enforce '-'
trimming and lowercase). For fbcPath construction, stop concatenating strings
and use path.Join(catalogMountPath, catalogDir) (ensure you import "path") so
user-supplied catalogDir with or without leading slash is normalized; update the
code that builds fbcPath (using cs.Spec.GrpcPodConfig.ExtractContent.CatalogDir
and catalogMountPath) accordingly.
- Around line 174-208: ensureResources currently applies
SA/Service/Deployment/NetworkPolicy without OwnerReferences, which can orphan
resources if the controller misses delete events; update the resource builders
(buildServiceAccount, buildService, buildDeployment, buildNetworkPolicy) or
immediately before calling Apply in ensureResources to set a controller
OwnerReference to the CatalogSource (use controllerutil.SetControllerReference
or populate metav1.OwnerReference with cs as the owner) so the Kubernetes GC
will remove per-catalog resources when the CatalogSource is deleted (ensure
Namespace matches the CatalogSource namespace); you may then simplify
cleanupResources since GC will handle the common delete path.
- Around line 210-239: reconcileClusterRoleBinding currently calls r.List
without a selector and always re-applies the CRB; change the List call to use
client.MatchingLabelsSelector{Selector: r.CatalogSourceLabelSelector} so only
matching CatalogSources are returned (refer to reconcileClusterRoleBinding and
matchesCatalogSource/CatalogSourceLabelSelector), and after building the
subjects slice (from resourceName and ServiceAccounts) fetch the existing
ClusterRoleBinding (via r.Get) and compare its current subjects to the computed
subjects; if they are identical, skip the r.Apply call to avoid unnecessary
server-side apply round-trips, otherwise proceed to r.Apply as before.
In `@pkg/lifecycle-controller/tls_test.go`:
- Around line 75-89: Replace the boolean flag in
TestTLSConfigProvider_GetCertificatePreserved with an integer counter so you can
assert the injected callback is invoked the expected number of times: change
getCert to increment a counter (e.g., calls++), create the provider via
NewTLSConfigProvider and obtain cfg via p.Get(), call cfg.GetCertificate(nil),
then assert the counter equals 1 (or the exact expected count) instead of
checking a bool; reference the getCert closure, NewTLSConfigProvider, p.Get, and
cfg.GetCertificate to locate the test and update expectations.
In `@pkg/lifecycle-controller/TODO.md`:
- Around line 1-42: Move the in-tree TODO items into a tracked external ticket
and link it from the file: create a Jira/GitHub issue (referencing OPRUN-4445)
that contains the e2e test checklist from pkg/lifecycle-controller/TODO.md,
include expected test scenarios (happy-path and targeted tests) and acceptance
criteria, then update pkg/lifecycle-controller/TODO.md to replace the full
checklist with a short pointer to the newly created issue (and the issue ID) and
optionally note that the file is retained as a temporary local checklist; also
update the PR description to reference the new issue so the work is discoverable
and actionable outside the repo.
In `@pkg/lifecycle-server/fbc.go`:
- Around line 108-115: ListVersions currently returns versions in map iteration
order which is non-deterministic; modify LifecycleIndex.ListVersions to sort the
versions before returning them (use sort.Strings on the slice) so logs (e.g.,
the startup log in start.go that calls data.ListVersions()) are deterministic,
and add the "sort" import to the file.
- Around line 42-87: LoadLifecycleData currently swallows per-file errors inside
the declcfg.WalkMetasFS callback, losing debuggability; modify LoadLifecycleData
to accept a logger (e.g., logr.Logger) or a pointer to a slice of warning
errors, and in the WalkMetasFS callback, instead of returning nil on per-file
errors (the err parameter) or other parse/validation issues, record them by
logging via the provided logger or appending to the warnings slice (but continue
walking), keeping the existing behavior of returning a single terminal error
only if WalkMetasFS itself fails; update the function signature
(LoadLifecycleData) and callers accordingly and reference the meta/err handling
inside the anonymous callback where schemaVersionRegex, meta.Package, meta.Blob
and the result map are used so individual file problems are surfaced without
aborting the overall walk.
- Around line 70-77: When inserting into result under schemaVersion and
meta.Package (the block using mu.Lock()/mu.Unlock() that assigns
result[schemaVersion][meta.Package] = meta.Blob), check whether an entry already
exists and either log a debug/warn message with the existing vs new blob
identity or return an error to reject duplicates; specifically, inside that
critical section look up result[schemaVersion][meta.Package] before assignment,
and if non-nil call the package’s logger or lifecycle server logger with
schemaVersion and meta.Package and indicate a duplicate (optionally include a
short blob fingerprint/length) or propagate an error instead of silently
overwriting—this makes duplicate (version, package) entries from WalkMetasFS
observable and avoid silent last-wins behavior.
In `@scripts/generate_crds_manifests.sh`:
- Around line 677-791: The NetworkPolicy for lifecycle-controller currently
allows egress to port 6443/TCP without scoping; update the manifest created in
the heredoc (manifest name
0000_50_olm_08-lifecycle-controller.networkpolicy.yaml) to add a to: with an
appropriate namespaceSelector (e.g., select the kube-apiserver/default
namespace) to restrict egress to the API server; in the ClusterRole (kind:
ClusterRole name: operator-lifecycle-manager-lifecycle-controller in
0000_50_olm_08-lifecycle-controller.rbac.yaml) remove the unnecessary "delete"
verb from the coordination.k8s.io/leases rule and trim verbs to only
["get","create","update"] (optionally keep "list","watch" if ReleaseOnCancel is
disabled); finally, review the ClusterRole rule for services (apiGroups: [""],
resources: ["services"]) and if the controller relies on
ownerReferences/server-side-apply for GC, remove the "delete" verb from that
rule as well (or retain it only if explicit deletions are exercised).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a7892ad0-ead2-4a1e-a85a-8ff58999794f
⛔ Files ignored due to path filters (6)
- go.sum is excluded by !**/*.sum
- vendor/github.com/openshift/controller-runtime-common/LICENSE is excluded by !**/vendor/**, !vendor/**
- vendor/github.com/openshift/controller-runtime-common/pkg/tls/controller.go is excluded by !**/vendor/**, !vendor/**
- vendor/github.com/openshift/controller-runtime-common/pkg/tls/tls.go is excluded by !**/vendor/**, !vendor/**
- vendor/github.com/openshift/library-go/pkg/crypto/crypto.go is excluded by !**/vendor/**, !vendor/**
- vendor/modules.txt is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (36)
- Makefile
- cmd/lifecycle-controller/main.go
- cmd/lifecycle-controller/start.go
- cmd/lifecycle-controller/util.go
- cmd/lifecycle-server/main.go
- cmd/lifecycle-server/start.go
- cmd/lifecycle-server/util.go
- go.mod
- manifests/0000_50_olm_00-catalogsources.crd.yaml
- manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
- manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml
- manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
- manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml
- manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml
- manifests/0000_50_olm_08-lifecycle-controller.service.yaml
- manifests/0000_50_olm_09-lifecycle-server.rbac.yaml
- microshift-manifests/0000_50_olm_00-catalogsources.crd.yaml
- microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
- microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml
- microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
- microshift-manifests/0000_50_olm_08-lifecycle-controller.networkpolicy.yaml
- microshift-manifests/0000_50_olm_08-lifecycle-controller.rbac.yaml
- microshift-manifests/0000_50_olm_08-lifecycle-controller.service.yaml
- microshift-manifests/0000_50_olm_09-lifecycle-server.rbac.yaml
- microshift-manifests/kustomization.yaml
- operator-lifecycle-manager.Dockerfile
- pkg/lifecycle-controller/TODO.md
- pkg/lifecycle-controller/controller.go
- pkg/lifecycle-controller/controller_test.go
- pkg/lifecycle-controller/tls.go
- pkg/lifecycle-controller/tls_test.go
- pkg/lifecycle-server/fbc.go
- pkg/lifecycle-server/fbc_test.go
- pkg/lifecycle-server/server.go
- pkg/lifecycle-server/server_test.go
- scripts/generate_crds_manifests.sh
tlsProfileChan, err := setupTLSProfileWatcher(mgr, cfg)
if err != nil {
	return fmt.Errorf("unable to setup TLS profile watcher: %v", err)
}
defer close(tlsProfileChan)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm whether SecurityProfileWatcher.OnProfileChange can be invoked after mgr.Start returns
# and whether the upstream implementation treats ctx cancellation for the callback.
rg -nP --type=go -C5 '\bSecurityProfileWatcher\b' vendor/ 2>/dev/null || \
rg -nP --type=go -C5 '\bSecurityProfileWatcher\b'
rg -nP --type=go -C3 'OnProfileChange' vendor/ 2>/dev/null || \
rg -nP --type=go -C3 'OnProfileChange'
Repository: openshift/operator-framework-olm
Length of output: 9224
🏁 Script executed:
cat -n cmd/lifecycle-controller/start.go | sed -n '1,100p'
Repository: openshift/operator-framework-olm
Length of output: 4200
🏁 Script executed:
cat -n cmd/lifecycle-controller/start.go | sed -n '270,295p'
Repository: openshift/operator-framework-olm
Length of output: 1405
🏁 Script executed:
cat -n cmd/lifecycle-controller/start.go | sed -n '180,210p'
Repository: openshift/operator-framework-olm
Length of output: 1719
🏁 Script executed:
# Find all usages of tlsProfileChan to understand the data flow
rg -n 'tlsProfileChan' cmd/lifecycle-controller/start.go -A2 -B2
Repository: openshift/operator-framework-olm
Length of output: 912
🏁 Script executed:
# Find setupTLSProfileWatcher function definition and channel creation
rg -n 'func setupTLSProfileWatcher' cmd/lifecycle-controller/start.go -A30
Repository: openshift/operator-framework-olm
Length of output: 1414
🏁 Script executed:
# Find where tlsChangeChan is created (make call)
rg -n 'make.*tlsChangeChan|make.*event.TypedGenericEvent' cmd/lifecycle-controller/start.go -B5 -A2
Repository: openshift/operator-framework-olm
Length of output: 377
🏁 Script executed:
# Find where tlsChangeChan is consumed/read from
rg -n 'for.*tlsProfileChan|<-.*tlsProfileChan' cmd/ -r
Repository: openshift/operator-framework-olm
Length of output: 490
🏁 Script executed:
# Find where tlsProfileChan is consumed/read from
rg -n 'for.*range.*tlsProfileChan|<-.*tlsProfileChan' cmd/ -B3 -A3
Repository: openshift/operator-framework-olm
Length of output: 58
🏁 Script executed:
# Also check SetupWithManager in the LifecycleServerReconciler
rg -n 'SetupWithManager' pkg/lifecycle-controller/ -A20 | head -80
Repository: openshift/operator-framework-olm
Length of output: 3235
Potential panic / goroutine leak: defer close(tlsProfileChan) races with blocking sends from OnProfileChange.
Two failure modes:
- If `OnProfileChange` fires while the manager is shutting down, the unbuffered send at line 282 blocks (no consumer — controller-runtime has stopped draining). `mgr.Start` returns, `defer close(tlsProfileChan)` runs, and the send then panics with "send on closed channel".
- Even without a close, a blocked send holds the watcher goroutine indefinitely, leaking it past `run()` returning.
The OnProfileChange callback already receives a context.Context (line 271) — gate the send on ctx.Done() and drop the defer close (leaving it to GC is safe and avoids the race entirely):
🔒️ Suggested fix
```diff
-	defer close(tlsProfileChan)
+	// Intentionally not closed: producer (SecurityProfileWatcher) may still be mid-send
+	// when mgr.Start returns; rely on GC after run() exits.
```

```diff
 		OnProfileChange: func(ctx context.Context, oldTLSProfileSpec, newTLSProfileSpec configv1.TLSProfileSpec) {
 			cfg.TLSConfigProvider.UpdateProfile(newTLSProfileSpec)
 			log.Info("applying new TLS profile spec",
 				"minVersion", newTLSProfileSpec.MinTLSVersion,
 				"cipherSuites", newTLSProfileSpec.Ciphers,
 			)
 			_, unsupportedCiphers := cfg.TLSConfigProvider.Get()
 			if len(unsupportedCiphers) > 0 {
 				log.Info("ignoring unsupported ciphers found in TLS profile", "unsupportedCiphers", unsupportedCiphers)
 			}
-			tlsChangeChan <- event.TypedGenericEvent[configv1.TLSProfileSpec]{Object: newTLSProfileSpec}
+			select {
+			case tlsChangeChan <- event.TypedGenericEvent[configv1.TLSProfileSpec]{Object: newTLSProfileSpec}:
+			case <-ctx.Done():
+				return
+			}
 		},
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/lifecycle-controller/start.go` around lines 89 - 93, The deferred close
of tlsProfileChan in setupTLSProfileWatcher can race with blocking sends from
the OnProfileChange callback and cause panics or leaked goroutines; remove the
defer close(tlsProfileChan) and instead modify the OnProfileChange callback
(which receives a context) to gate its send on tlsProfileChan with a select that
checks ctx.Done() (drop the send if ctx is done) so sends never block or attempt
to send on a closed channel; leave the channel unclosed and let it be GC'd when
no longer referenced.
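For reference, a minimal, self-contained sketch of the guarded-send pattern the suggestion relies on. The `string` events and toy consumer are illustrative stand-ins for the real watcher channel and controller-runtime event types, not the PR's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sendOrBail forwards an event unless the context is cancelled first, so a
// producer can neither block forever nor race a later close of the channel.
func sendOrBail(ctx context.Context, ch chan<- string, ev string) bool {
	select {
	case ch <- ev:
		return true
	case <-ctx.Done():
		return false
	}
}

func main() {
	events := make(chan string) // unbuffered, like the TLS profile channel

	// While a consumer is draining, the send goes through.
	go func() { fmt.Println("handled:", <-events) }()
	fmt.Println("sent:", sendOrBail(context.Background(), events, "profile-update"))
	time.Sleep(10 * time.Millisecond)

	// After shutdown (cancelled context, no consumer left), the guarded send
	// returns promptly instead of blocking or panicking on a closed channel.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println("sent:", sendOrBail(ctx, events, "late-update"))
}
```

The design point is the same as in the suggested diff: the producer never assumes a consumer exists, which makes an explicit `close` of the channel unnecessary.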
```go
func getInitialTLSProfile(ctx context.Context, restConfig *rest.Config, sch *runtime.Scheme) (configv1.TLSProfileSpec, bool, error) {
	cl, err := client.New(restConfig, client.Options{Scheme: sch})
	if err != nil {
		return configv1.TLSProfileSpec{}, false, fmt.Errorf("failed to create client: %v", err)
	}
	initialTLSProfileSpec, err := tlsutil.FetchAPIServerTLSProfile(ctx, cl)
	if err != nil {
		return *configv1.TLSProfiles[crypto.DefaultTLSProfileType], false, nil
	}
	return initialTLSProfileSpec, true, nil
}
```
Silently swallowing FetchAPIServerTLSProfile error masks startup problems.
When tlsutil.FetchAPIServerTLSProfile fails (e.g., transient API server error, missing RBAC), this returns the default profile, disables the TLS profile watcher for the entire process lifetime, and returns nil — there's no log line and the operator has no way to distinguish "no APIServer config" from "RBAC was missing at startup". At minimum, log the error; ideally, bubble it up unless the failure is specifically "resource not found":
🔒️ Suggested fix
```diff
-func getInitialTLSProfile(ctx context.Context, restConfig *rest.Config, sch *runtime.Scheme) (configv1.TLSProfileSpec, bool, error) {
+func getInitialTLSProfile(ctx context.Context, log logr.Logger, restConfig *rest.Config, sch *runtime.Scheme) (configv1.TLSProfileSpec, bool, error) {
 	cl, err := client.New(restConfig, client.Options{Scheme: sch})
 	if err != nil {
 		return configv1.TLSProfileSpec{}, false, fmt.Errorf("failed to create client: %v", err)
 	}
 	initialTLSProfileSpec, err := tlsutil.FetchAPIServerTLSProfile(ctx, cl)
 	if err != nil {
+		log.Error(err, "failed to fetch APIServer TLS profile; falling back to default and disabling TLS profile watcher")
 		return *configv1.TLSProfiles[crypto.DefaultTLSProfileType], false, nil
 	}
 	return initialTLSProfileSpec, true, nil
 }
```

(and thread `setupLog` through from `loadStartConfig`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/lifecycle-controller/start.go` around lines 189 - 199,
getInitialTLSProfile currently swallows errors from
tlsutil.FetchAPIServerTLSProfile and silently falls back to the default, which
hides startup RBAC/API errors and disables the watcher; modify
getInitialTLSProfile so that it logs the error (using the setupLog passed down
from loadStartConfig) and returns the error to the caller unless the failure is
a clear "resource not found" case, in which case you may continue to return the
default profile and a false flag; thread setupLog through loadStartConfig into
getInitialTLSProfile and update callers to handle the returned error so
transient API/RBAC failures surface instead of being masked.
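As a sketch of the shape this prompt asks for, assuming the standard apimachinery `IsNotFound` helper. `initialProfile` and its `fetchProfile` parameter are placeholders standing in for the real function and `tlsutil.FetchAPIServerTLSProfile`, not the actual signatures in the PR:

```go
package lifecyclecontroller

import (
	"context"
	"fmt"

	"github.com/go-logr/logr"
	configv1 "github.com/openshift/api/config/v1"
	"github.com/openshift/library-go/pkg/crypto"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// initialProfile: only a "not found" APIServer config falls back to the
// default profile (with a log line); every other failure is surfaced so
// startup problems such as missing RBAC are not silently masked.
func initialProfile(ctx context.Context, log logr.Logger,
	fetchProfile func(context.Context) (configv1.TLSProfileSpec, error)) (configv1.TLSProfileSpec, bool, error) {

	spec, err := fetchProfile(ctx)
	if err == nil {
		return spec, true, nil
	}
	if apierrors.IsNotFound(err) {
		log.Info("APIServer TLS profile not found, using default and disabling watcher", "reason", err.Error())
		return *configv1.TLSProfiles[crypto.DefaultTLSProfileType], false, nil
	}
	return configv1.TLSProfileSpec{}, false, fmt.Errorf("fetching APIServer TLS profile: %w", err)
}
```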
```yaml
        - name: GOMEMLIMIT
          value: "5MiB"
        resources:
          requests:
            cpu: 10m
            memory: 10Mi
```
GOMEMLIMIT: 5MiB will starve the Go runtime — almost certainly a typo.
A 5MiB soft-memory limit is below the baseline heap of a controller-runtime binary (manager + multiple informers for CatalogSources/Pods/Deployments/Services/SAs/NetworkPolicies + library-go crypto + apply configurations). The runtime will force back-to-back GC cycles as it approaches the limit, causing severe CPU thrashing and, in practice, startup failure or effective deadlock before any reconcile runs.
For comparison, the lifecycle-server container you create in pkg/lifecycle-controller/controller.go (line 450-451) uses GOMEMLIMIT=50MiB. That value — or higher — is appropriate here too, alongside a correspondingly raised memory request (the current 10Mi also looks too low for a reconciler of this scope).
🔧 Proposed fix
```diff
         resources:
           requests:
             cpu: 10m
-            memory: 10Mi
+            memory: 80Mi
 ...
           - name: GOMEMLIMIT
-            value: "5MiB"
+            value: "80MiB"
```

The same values should be applied to manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
        - name: GOMEMLIMIT
          value: "80MiB"
        resources:
          requests:
            cpu: 10m
            memory: 80Mi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml` around lines
62 - 67, Replace the incorrect tiny Go memory cap by increasing the GOMEMLIMIT
and the container memory request for the lifecycle-controller deployment: update
the env var GOMEMLIMIT from "5MiB" to a realistic value (e.g. "50MiB" or match
the lifecycle-server setting used in pkg/lifecycle-controller/controller.go),
and raise resources.requests.memory (and CPU if needed) under the same container
spec (resources.requests.memory, resources.requests.cpu). Apply the same changes
to the corresponding manifest for IBM Cloud Managed
(0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml) so both
deployments use the higher GOMEMLIMIT and appropriate memory request.
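Independent of the value chosen, a cheap way to catch this class of typo is to log the limit the Go runtime is actually honoring at startup; `debug.SetMemoryLimit(-1)` reads the current limit without changing it. A minimal sketch, with an arbitrary 32 MiB floor picked only for illustration:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// A negative argument leaves the limit untouched and returns the value
	// currently in effect (math.MaxInt64 when GOMEMLIMIT is unset).
	limit := debug.SetMemoryLimit(-1)
	fmt.Printf("effective GOMEMLIMIT: %d bytes (%.1f MiB)\n", limit, float64(limit)/(1<<20))

	// Assumed sanity floor; a 5 MiB cap would trip this immediately.
	const minReasonable = 32 << 20
	if limit < minReasonable {
		fmt.Printf("warning: GOMEMLIMIT below %d bytes, expect GC thrashing\n", minReasonable)
	}
}
```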
```yaml
                  default: ""
              x-kubernetes-list-map-keys:
              - ip
              x-kubernetes-list-type: map
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether `ip` is already required in this schema object and find similar patterns.
set -euo pipefail
echo "== Locate list-map keys using ip =="
rg -n -C4 'x-kubernetes-list-map-keys:\n\s*-\s*ip' microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
echo
echo "== Locate defaults on ip fields in CRDs =="
rg -n -C3 '^\s*ip:\s*$|^\s*default:\s*""\s*$' microshift-manifests/*.yaml
echo
echo "== Locate required blocks mentioning ip in same file =="
rg -n -C3 '^\s*required:\s*$|^\s*-\s*ip\s*$' microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
Repository: openshift/operator-framework-olm
Length of output: 322
🏁 Script executed:
#!/bin/bash
# Verify the ip field and list-map key configuration in the CRD
set -euo pipefail
# Convert to relative path and check file
FILE="microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml"
echo "=== Checking lines around 4715-4718 ==="
sed -n '4710,4725p' "$FILE" | cat -n
echo
echo "=== Search for ip field definitions with defaults ==="
rg -U -n 'ip:.*?\n.*?default' "$FILE" | head -20
echo
echo "=== Search for x-kubernetes-list-map-keys patterns ==="
rg -n 'x-kubernetes-list-map-keys' "$FILE"
echo
echo "=== Check context around line 4715 (50 lines before and after) ==="
sed -n '4665,4765p' "$FILE" | cat -n
Repository: openshift/operator-framework-olm
Length of output: 1333
🏁 Script executed:
#!/bin/bash
FILE="microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml"
echo "=== Broader context: lines 4680-4730 ==="
sed -n '4680,4730p' "$FILE" | cat -n
echo
echo "=== Search for 'required:' with following lines to identify scope ==="
sed -n '4660,4750p' "$FILE" | grep -n -A5 'required:' | head -30
Repository: openshift/operator-framework-olm
Length of output: 4771
🏁 Script executed:
#!/bin/bash
FILE="microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml"
# Count total occurrences of this contradictory pattern in the file
echo "=== Fields marked as required AND having defaults (potential issues) ==="
python3 << 'PYEOF'
import re
with open(FILE, 'r') as f:
content = f.read()
# Find all object schemas with both required and default patterns
lines = content.split('\n')
issues = []
for i in range(len(lines) - 50):
# Look for required blocks
if re.search(r'^\s*required:\s*$', lines[i]):
# Check next 40 lines for fields with defaults
obj_section = '\n'.join(lines[i:i+50])
if 'default:' in obj_section:
issues.append({
'line': i + 1,
'context': lines[max(0, i-2):min(len(lines), i+15)]
})
if issues:
print(f"Found {len(issues)} objects with required AND default fields")
for issue in issues[:3]: # Show first 3
print(f"\nAround line {issue['line']}:")
for line in issue['context']:
print(f" {line}")
else:
print("No mixed required+default patterns found")
PYEOF
Repository: openshift/operator-framework-olm
Length of output: 278
Remove contradictory default from required ip field
The ip field is already marked as required (line 4724) but also has default: "" (line 4715). Required fields should not have defaults—this contradiction can cause unexpected merge and validation behavior. Remove the default value.
Correct schema fix
```diff
             required:
             - ip
             properties:
               hostnames:
                 description: Hostnames for the above IP address.
                 type: array
                 items:
                   type: string
                 x-kubernetes-list-type: atomic
               ip:
                 description: IP address of the host file entry.
                 type: string
-                default: ""
           x-kubernetes-list-map-keys:
           - ip
           x-kubernetes-list-type: map
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
              x-kubernetes-list-map-keys:
              - ip
              x-kubernetes-list-type: map
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml` around
lines 4715 - 4718, The schema for the CRD includes a contradictory default for
the required field "ip": remove the line setting default: "" for the "ip"
property so the "ip" field remains required without a default; locate the "ip"
property in the CRD schema (the block containing x-kubernetes-list-map-keys: -
ip and x-kubernetes-list-type: map) and delete the default entry while leaving
the required list and other metadata intact.
```yaml
        - name: GOMEMLIMIT
          value: "5MiB"
        resources:
          requests:
            cpu: 10m
            memory: 10Mi
```
GOMEMLIMIT: "5MiB" is almost certainly a misconfiguration.
Five mebibytes is not enough memory for the Go runtime itself, let alone a controller-runtime manager with informers, caches, and REST clients. The controller will either OOM at startup or spend most of its time in GC. The memory request of 10Mi is likewise far too low for this workload.
Note: the same values appear in microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml (lines 63-64, 67-68); both should be fixed together.
🐛 Suggested correction (tune to observed usage)
```diff
       - name: GOMEMLIMIT
-        value: "5MiB"
+        value: "200MiB"
       resources:
         requests:
           cpu: 10m
-          memory: 10Mi
+          memory: 100Mi
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
        - name: GOMEMLIMIT
          value: "200MiB"
        resources:
          requests:
            cpu: 10m
            memory: 100Mi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`
around lines 62 - 67, The GOMEMLIMIT environment variable is misconfigured as
"5MiB" and the pod resource requests are far too low; update the Deployment for
lifecycle-controller to remove or set GOMEMLIMIT to a realistic Go heap cap (for
example "1Gi" or a value tuned from observed usage) and increase
resources.requests.cpu and resources.requests.memory from 10m/10Mi to sensible
defaults (for example 100m CPU and 200Mi memory) and ensure you make the same
change for the other manifest copy that contains the same GOMEMLIMIT and
resource request entries.
```go
	tt := []struct {
		name          string
		pods          []*corev1.Pod
		expectedImage string
		expectedNode  string
		expectErr     bool
	}{
		{
			name:          "no pods returns empty",
			pods:          nil,
			expectedImage: "",
			expectedNode:  "",
		},
		{
			name: "running pod with digest",
			pods: []*corev1.Pod{
				catalogPod("test-catalog", "test-ns", "worker-1", "sha256:abc123", corev1.PodRunning),
			},
			expectedImage: "sha256:abc123",
			expectedNode:  "worker-1",
		},
		{
			name: "pending pod is skipped",
			pods: []*corev1.Pod{
				catalogPod("test-catalog", "test-ns", "worker-1", "sha256:abc123", corev1.PodPending),
			},
			expectedImage: "",
			expectedNode:  "",
		},
		{
			name: "running pod with empty imageID is skipped",
			pods: []*corev1.Pod{
				catalogPod("test-catalog", "test-ns", "worker-1", "", corev1.PodRunning),
			},
			expectedImage: "",
			expectedNode:  "",
		},
	}

	for _, tc := range tt {
		t.Run(tc.name, func(t *testing.T) {
			builder := fake.NewClientBuilder().WithScheme(scheme)
			for _, p := range tc.pods {
				builder = builder.WithObjects(p)
			}
			cl := builder.Build()

			r := testReconciler(cl)
			cs := newCatalogSource("test-catalog", "test-ns", nil)

			image, node, err := r.getCatalogPodInfo(ctx, cs)
			if tc.expectErr {
				require.Error(t, err)
				return
			}
			require.NoError(t, err)
			require.Equal(t, tc.expectedImage, image)
			require.Equal(t, tc.expectedNode, node)
		})
	}
}
```
Unused expectErr field in TestGetCatalogPodInfo.
The expectErr field is declared (line 832) and branched on (line 878-881), but no case in tt sets it to true, so the error path is never exercised. Either add a case that triggers an error (e.g., make the client fail) or remove the field to keep the table honest.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/lifecycle-controller/controller_test.go` around lines 826 - 887, The
table-driven test declares and checks the expectErr field but never sets it;
either remove the unused expectErr field and the conditional branch that checks
tc.expectErr in the test, or add a dedicated test case that sets expectErr=true
and constructs a failing client to force getCatalogPodInfo to return an error.
Locate the test table and loop in TestGetCatalogPodInfo (the tt slice and the
t.Run loop), the tc.expectErr check, and the call to r.getCatalogPodInfo(ctx,
cs); then either delete the expectErr field from the tt struct and the if
tc.expectErr branch, or add a new entry that triggers an error (e.g., create a
client that returns an error when listing Pods) and set expectErr:true so the
error path is exercised.
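One way to cover the error path without removing the field is an extra test backed by a fake client whose List always fails. This sketch assumes controller-runtime's interceptor support (`WithInterceptorFuncs`, available in recent releases) and reuses the `scheme`, `testReconciler`, and `newCatalogSource` helpers already present in the test file:

```go
// Sketch only: add alongside the existing table-driven test.
func TestGetCatalogPodInfo_ListError(t *testing.T) {
	cl := fake.NewClientBuilder().
		WithScheme(scheme).
		WithInterceptorFuncs(interceptor.Funcs{
			// Fail every List call so getCatalogPodInfo's pod lookup errors out.
			List: func(ctx context.Context, c client.WithWatch, list client.ObjectList, opts ...client.ListOption) error {
				return errors.New("simulated pod list failure")
			},
		}).
		Build()

	r := testReconciler(cl)
	cs := newCatalogSource("test-catalog", "test-ns", nil)

	_, _, err := r.getCatalogPodInfo(context.Background(), cs)
	require.Error(t, err)
}
```

The extra imports this would need are `errors`, `sigs.k8s.io/controller-runtime/pkg/client`, and `sigs.k8s.io/controller-runtime/pkg/client/interceptor`.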
```go
func TestTLSConfigProvider_ConcurrentAccess(t *testing.T) {
	p := NewTLSConfigProvider(dummyGetCertificate, tls12Profile())

	const goroutines = 50
	var wg sync.WaitGroup
	wg.Add(goroutines * 2)

	// Half the goroutines read, half update
	for i := range goroutines {
		go func() {
			defer wg.Done()
			cfg, _ := p.Get()
			require.NotNil(t, cfg)
		}()
		go func(i int) {
			defer wg.Done()
			var profile configv1.TLSProfileSpec
			if i%2 == 0 {
				profile = tls12Profile()
			} else {
				profile = tls13Profile()
			}
			p.UpdateProfile(profile)
		}(i)
	}

	wg.Wait()

	// Provider should still be functional after concurrent access
	cfg, _ := p.Get()
	require.NotNil(t, cfg)
}
```
🧩 Analysis chain
🏁 Script executed:
cd /dev/null && find . -path "*pkg/lifecycle-controller/tls_test.go" 2>/dev/null | head -1
Repository: openshift/operator-framework-olm
Length of output: 126
🏁 Script executed:
git ls-files | grep -i "tls_test.go"
Repository: openshift/operator-framework-olm
Length of output: 113
🏁 Script executed:
fd tls_test.go
Repository: openshift/operator-framework-olm
Length of output: 113
🏁 Script executed:
git ls-files | head -30
Repository: openshift/operator-framework-olm
Length of output: 736
🏁 Script executed:
cat -n pkg/lifecycle-controller/tls_test.go | head -130
Repository: openshift/operator-framework-olm
Length of output: 3894
🏁 Script executed:
head -130 pkg/lifecycle-controller/tls_test.go | tail -50
Repository: openshift/operator-framework-olm
Length of output: 1223
🏁 Script executed:
sed -n '1,15p' pkg/lifecycle-controller/tls_test.go
Repository: openshift/operator-framework-olm
Length of output: 341
🏁 Script executed:
web_search
Go testing t.FailNow from goroutine anti-pattern
Repository: openshift/operator-framework-olm
Length of output: 168
🌐 Web query:
Go testing FailNow must only be called from test goroutine
💡 Result:
Yes, in Go's testing package, t.FailNow (and equivalent methods like t.Fatal, t.Fatalf) must only be called from the goroutine running the test or benchmark function, not from other goroutines created during the test. Calling it from a non-test goroutine does not properly stop the test execution as intended and can lead to data races (detectable by the race detector), hangs, or skipped cleanup. This is explicitly stated in the official documentation for (*T).FailNow: "FailNow marks the function as having failed and stops its execution by calling runtime.Goexit (which then runs all deferred calls in the current goroutine). Execution will continue at the next test or benchmark. FailNow must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test. Calling FailNow does not stop those other goroutines." The same restriction applies to benchmarks (*B).FailNow and fuzzers (*F).FailNow (added in Go 1.18). Instead, from worker goroutines, use t.Error or t.Errorf to report failures without stopping execution, combined with synchronization primitives like sync.WaitGroup to wait for goroutines to complete before the test ends. Example of incorrect usage: func TestBad(t *testing.T) { go func { t.FailNow // Wrong: called from non-test goroutine } } Correct approach: func TestGood(t *testing.T) { var wg sync.WaitGroup wg.Add(1) go func { defer wg.Done // Do work... if err != nil { t.Error("worker error:", err) // Reports failure but doesn't stop goroutine } } wg.Wait } Tools like Go's vet analyzer (GoVetFailNowInNotTestGoroutine) and IDE inspections (e.g., GoLand) detect this issue statically. The race detector may also flag data races when violated dynamically. This rule has been consistent across Go versions, including the latest (Go 1.25.6 as of pkg.go.dev docs).
Citations:
- 1: testing: detect t.FailNow() called from wrong goroutine golang/go#24678
- 2: https://go.dev/issue/24678
- 3: https://pkg.go.dev/testing@go1.25.6
- 4: Calling Called() in a goroutine can result in a call to t.FailNow() stretchr/testify#1701
- 5: https://www.jetbrains.com/help/inspectopedia/GoVetFailNowInNotTestGoroutine.html
- 6: https://pkg.go.dev/testing
🏁 Script executed:
rg "require\." pkg/lifecycle-controller/tls_test.goRepository: openshift/operator-framework-olm
Length of output: 627
🏁 Script executed:
# Check if assert is imported anywhere in the file
grep -n "assert" pkg/lifecycle-controller/tls_test.goRepository: openshift/operator-framework-olm
Length of output: 58
🌐 Web query:
stretchr testify assert vs require difference goroutine
💡 Result:
In the Go testing library stretchr/testify, assert and require provide assertion functions with different behaviors: - assert functions log failures but allow the test to continue, returning a boolean indicating success (true) or failure (false). This enables conditional subsequent assertions, e.g., if assert.NotNil(t, obj) { assert.Equal(t, "expected", obj.Value) }. Multiple failures are reported. - require functions perform the same checks but call t.FailNow on failure, immediately terminating the current test (skipping remaining code). They do not return a value. Key difference with goroutines: require functions must be called from the goroutine running the test function (main test goroutine), not from other goroutines created during the test. Calling require from a non-test goroutine can cause race conditions because t.FailNow is unsafe outside the test goroutine. assert functions are safer in goroutines as they only log failures without terminating. Use assert for non-fatal checks or when testing concurrent code; use require for fatal preconditions in the main test goroutine. Always use sync.WaitGroup or similar to wait for goroutines before assertions.
Citations:
- 1: https://medium.com/freethreads/golang-testify-require-vs-assert-b3bbfb4e0b8f
- 2: https://www.yellowduck.be/posts/assert-vs-require-in-testify
- 3: https://github.com/stretchr/testify/blob/master/README.md
- 4: https://github.com/stretchr/testify/blob/master/require/require.go
- 5: Detect and fail on `require.*` methods called from a wrong goroutine stretchr/testify#1499
- 6: Are assertions thread-safe? stretchr/testify#1340
- 7: https://github.com/stretchr/testify
🏁 Script executed:
# Verify the exact behavior difference and check if there are any other similar issues in the file
rg "go func" pkg/lifecycle-controller/tls_test.go -A 5Repository: openshift/operator-framework-olm
Length of output: 297
Avoid calling require.* from non-test goroutines.
require.NotNil calls t.FailNow(), which per Go's testing documentation must only be called from the goroutine running the test function. Calling it from a spawned goroutine can lead to missed or incorrect failure reporting. Use assert.* inside goroutines, then retain require.* only on the test goroutine after wg.Wait().
Update line 103 to use assert.NotNil(t, cfg) and add "github.com/stretchr/testify/assert" to the imports.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/lifecycle-controller/tls_test.go` around lines 91 - 122, In
TestTLSConfigProvider_ConcurrentAccess, avoid calling require.NotNil from
spawned goroutines; change the in-goroutine checks to assert.NotNil(t, cfg) and
keep require.NotNil only on the main test goroutine after wg.Wait(); also add
the import "github.com/stretchr/testify/assert" to the test file so assert is
available. Ensure the references are the require.NotNil call inside the
anonymous reader goroutine in TestTLSConfigProvider_ConcurrentAccess and not the
final require.NotNil after wg.Wait().
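A minimal illustration of the split this comment describes, independent of the provider type under test: non-fatal `assert` inside worker goroutines, fatal `require` only once the test goroutine has called `wg.Wait()`.

```go
package example

import (
	"sync"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestWorkersReportWithoutFailNow(t *testing.T) {
	results := make([]int, 8)
	var wg sync.WaitGroup

	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// assert only marks the test failed; it never calls t.FailNow,
			// so it is safe to use from goroutines the test spawns.
			assert.GreaterOrEqual(t, i, 0)
			results[i] = i * i
		}(i)
	}

	wg.Wait()

	// Back on the test goroutine, fatal assertions are fine again.
	require.Len(t, results, 8)
}
```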
```bash
# Copy upstream CRDs directly instead of regenerating with controller-gen
cp "${ROOT_DIR}"/staging/api/crds/*.yaml "${crddir}/"
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the upstream CRD source is wired into a CI update path and not stale.
fd -t f '.yaml$' staging/api/crds | head -20
echo "---"
# Look for any CI/Makefile target that refreshes staging/api/crds
rg -nP -C2 'staging/api/crds|crds/\*\.yaml' Makefile scripts/ 2>/dev/null | head -40
Repository: openshift/operator-framework-olm
Length of output: 929
🏁 Script executed:
# Check if staging/api is a git submodule
cat .gitmodules 2>/dev/null || echo "No .gitmodules file"
Repository: openshift/operator-framework-olm
Length of output: 96
🏁 Script executed:
# Search for CI workflows that might handle CRD validation or updates
fd -t f '\.(yml|yaml)$' .github/workflows 2>/dev/null | xargs grep -l 'staging/api\|crds\|controller-gen' 2>/dev/null
Repository: openshift/operator-framework-olm
Length of output: 60
🏁 Script executed:
# Look for any recent changes to the generate_crds_manifests.sh script showing removed yq mutations
git log --oneline -20 scripts/generate_crds_manifests.sh 2>/dev/null | head -10
Repository: openshift/operator-framework-olm
Length of output: 153
🏁 Script executed:
# Check the actual structure of staging/api to understand its role
ls -la staging/ 2>/dev/null | head -20
Repository: openshift/operator-framework-olm
Length of output: 383
🏁 Script executed:
# Search for any documentation or comments explaining staging/api
rg -i 'staging.*api|upstream.*crd' --type=md README* docs/ 2>/dev/null | head -20
Repository: openshift/operator-framework-olm
Length of output: 862
🏁 Script executed:
# Search the entire Makefile and scripts for all CRD-related targets
rg -i 'crd|controller-gen|staging' Makefile scripts/*.sh 2>/dev/null | grep -E '(target|^[^:]*:)' | head -30
Repository: openshift/operator-framework-olm
Length of output: 3115
🏁 Script executed:
# Check git history of CRD files to see when they were last touched
git log --oneline -- staging/api/crds/ 2>/dev/null | head -10
Repository: openshift/operator-framework-olm
Length of output: 153
🏁 Script executed:
# Look for controller-gen in the codebase
rg 'controller-gen' --type=sh --type=go --type=mod 2>/dev/null | head -20
Repository: openshift/operator-framework-olm
Length of output: 58
🏁 Script executed:
# Check if there's a vendor sync or update script
fd -t f -i 'vendor|sync|update' scripts/ | xargs ls -la 2>/dev/null
Repository: openshift/operator-framework-olm
Length of output: 156
🏁 Script executed:
# Search for documentation about how staging/api is maintained
rg -i 'staging.*api|vendor.*api' docs/ 2>/dev/null | head -20
Repository: openshift/operator-framework-olm
Length of output: 975
🏁 Script executed:
# Read the actual generate_crds_manifests.sh around line 50 more carefully
sed -n '45,55p' scripts/generate_crds_manifests.sh
Repository: openshift/operator-framework-olm
Length of output: 474
🏁 Script executed:
# Check what CRDs are being referenced
sed -n '30,70p' scripts/generate_crds_manifests.sh
Repository: openshift/operator-framework-olm
Length of output: 1596
🏁 Script executed:
# Look at verify_commits.sh to understand how it validates staging
head -50 scripts/verify_commits.sh
Repository: openshift/operator-framework-olm
Length of output: 1685
🏁 Script executed:
# Check check-staging-vendor.sh
cat scripts/check-staging-vendor.sh
Repository: openshift/operator-framework-olm
Length of output: 666
🏁 Script executed:
# Look for any git history showing the removal of controller-gen or yq mutations
git log -p --all -- scripts/generate_crds_manifests.sh 2>/dev/null | grep -A5 -B5 'controller-gen\|yq' | head -80
Repository: openshift/operator-framework-olm
Length of output: 58
Staged CRD copy relies on upstream commit traceability rather than automated freshness checks.
Direct copying from staging/api/crds does couple the manifest to upstream schema. The codebase does enforce commit-level governance via verify_commits.sh, which validates that all staging changes must reference an upstream commit from a single upstream repository. This provides traceability and prevents accidental drifts, but it is not automatic synchronization.
However, confirm:
- When `staging/api/crds` is updated, what triggers that update? Is it part of a regular upstream sync cadence, or manual cherry-pick?
- The removed `controller-gen` regeneration and `yq` mutations (referenced in the comment) — document why these fixups are no longer needed, since the CRD schema is now taken verbatim from upstream.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/generate_crds_manifests.sh` around lines 49 - 50, The current copy
step in scripts/generate_crds_manifests.sh simply copies staged CRDs from
"${ROOT_DIR}/staging/api/crds" to "${crddir}/" but lacks documentation about
update triggers and why controller-gen/yq fixups were removed; update the script
header comments to state how staging/api/crds is updated (e.g., via upstream
sync cadence or manual cherry-pick) and reference the existing commit governance
check (verify_commits.sh) that enforces single-upstream commit traceability, and
add a short note explaining that controller-gen regeneration and yq mutations
were removed because CRD manifests are consumed verbatim from upstream and thus
no local schema fixes are required; mention the variables/commands to look at
(cp, ROOT_DIR, crddir, verify_commits.sh) so reviewers can locate the relevant
logic.
```yaml
        command:
          - /bin/lifecycle-controller
        args:
          - start
          - --catalog-source-field-selector=metadata.namespace=openshift-marketplace,metadata.name=redhat-operators
          - --tls-cert=/var/run/secrets/serving-cert/tls.crt
          - --tls-key=/var/run/secrets/serving-cert/tls.key
        image: quay.io/operator-framework/olm@sha256:de396b540b82219812061d0d753440d5655250c621c753ed1dc67d6154741607
        imagePullPolicy: IfNotPresent
        env:
          - name: RELEASE_VERSION
            value: "0.0.1-snapshot"
          - name: NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: LIFECYCLE_SERVER_IMAGE
            value: quay.io/operator-framework/olm@sha256:de396b540b82219812061d0d753440d5655250c621c753ed1dc67d6154741607
          - name: GOMEMLIMIT
            value: "5MiB"
```
Hardcoded single-catalog field selector and tight GOMEMLIMIT.
Two items on the lifecycle-controller deployment args/env:
- Line 593 pins the controller to reconcile exactly one CatalogSource: `metadata.namespace=openshift-marketplace,metadata.name=redhat-operators`. Any other catalog (certified-operators, community-operators, custom CatalogSources) is ignored. If this scoping is just a Tech Preview guardrail, add a comment; if it's meant to be configurable (per-cluster operator flag, feature gate, etc.), wire it through cluster-olm-operator / a config field rather than hardcoding here.
- Line 608 sets `GOMEMLIMIT=5MiB` against a 10Mi request. Go runtime + controller-runtime caches for CatalogSources, Pods, Deployments, Services, SAs, NetworkPolicies, CRBs, APIServers cluster-wide typically exceed 5MiB resident, which will push GC into a thrash/soft-OOM loop. Please benchmark and raise this (or drop the env var).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/generate_crds_manifests.sh` around lines 589 - 608, The
lifecycle-controller pod spec currently hardcodes a single CatalogSource
selector and an unsafe GOMEMLIMIT; update the deployment to make the catalog
selector configurable (do not hardcode
"metadata.namespace=openshift-marketplace,metadata.name=redhat-operators" in the
args passed to /bin/lifecycle-controller) by wiring a config value/flag in your
cluster-olm-operator reconciler (expose a field or feature gate that populates
the --catalog-source-field-selector arg or leave it unset) and add a clarifying
comment if this is intentionally a tech‑preview guardrail; also remove or raise
the environment GOMEMLIMIT value (currently "5MiB") in the pod env so it is at
least in line with the container memory request (or omit the variable to use Go
defaults) to avoid GC thrashing—update the lifecycle-controller args and the env
entry for GOMEMLIMIT (and any related config fields like RELEASE_VERSION /
LIFECYCLE_SERVER_IMAGE if you surface the selector via config) accordingly.
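If the selector is meant to stay operator-configurable, one possible shape (an assumption about the wiring, not the PR's actual code) is to keep the existing `--catalog-source-field-selector` flag optional and scope only the CatalogSource informer when it is set:

```go
package main

import (
	"flag"
	"log"

	operatorsv1alpha1 "github.com/operator-framework/api/pkg/operators/v1alpha1"
	"k8s.io/apimachinery/pkg/fields"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	var selectorFlag string
	flag.StringVar(&selectorFlag, "catalog-source-field-selector", "",
		"optional field selector limiting which CatalogSources are reconciled (empty = all)")
	flag.Parse()

	cacheOpts := cache.Options{}
	if selectorFlag != "" {
		sel, err := fields.ParseSelector(selectorFlag)
		if err != nil {
			log.Fatalf("invalid --catalog-source-field-selector: %v", err)
		}
		// Scope only the CatalogSource informer; other informers stay unrestricted.
		cacheOpts.ByObject = map[client.Object]cache.ByObject{
			&operatorsv1alpha1.CatalogSource{}: {Field: sel},
		}
	}

	// Scheme registration for CatalogSource is omitted in this sketch.
	if _, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Cache: cacheOpts}); err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}
}
```

An empty flag then means "watch every CatalogSource", which keeps the manifest free of the hardcoded redhat-operators value.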
```bash
cat << EOF > manifests/0000_50_olm_08-lifecycle-controller.service.yaml
apiVersion: v1
kind: Service
metadata:
  name: lifecycle-controller
  namespace: openshift-operator-lifecycle-manager
  annotations:
    release.openshift.io/feature-set: "TechPreviewNoUpgrade"
    service.beta.openshift.io/serving-cert-secret-name: lifecycle-controller-serving-cert
spec:
  ports:
    - name: metrics
      port: 8443
      protocol: TCP
      targetPort: metrics
  selector:
    app: olm-lifecycle-controller
  type: ClusterIP
EOF
```
Metrics endpoint is TLS but no ServiceMonitor is generated.
The Service annotates service.beta.openshift.io/serving-cert-secret-name and the Deployment passes --tls-cert / --tls-key for port 8443, but unlike the PSM equivalent (lines 287‑308), this script does not emit a ServiceMonitor for lifecycle-controller. Metrics will be scraped by nobody unless Prometheus is configured out-of-band. If monitoring is in-scope for this feature, add a matching servicemonitor.yaml with scheme: https and the serving-cert CA bundle; if it's intentionally deferred, add a TODO.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/generate_crds_manifests.sh` around lines 657 - 675, The Service for
lifecycle-controller exposes TLS metrics on port 8443 but no ServiceMonitor is
emitted; add a ServiceMonitor manifest for the lifecycle-controller that targets
selector app: olm-lifecycle-controller in namespace
openshift-operator-lifecycle-manager, with spec.endpoints containing port:
metrics, scheme: https, and tlsConfig that references the serving cert secret
(secretName: lifecycle-controller-serving-cert) so Prometheus uses the CA bundle
(and set insecureSkipVerify: false); ensure labels/namespace match the Service
and follow the pattern used for the PSM ServiceMonitor (use the same endpoint
name, selector, and namespace); alternatively, if monitoring is intentionally
out-of-scope, add a TODO comment in the script near the lifecycle-controller
Service block documenting the omission.
Co-authored-by: Per Goncalves da Silva <perdasilva@redhat.com>
Actionable comments posted: 1
♻️ Duplicate comments (3)
scripts/generate_crds_manifests.sh (3)
593-608: ⚠️ Potential issue | 🟠 Major

The controller is still pinned to one namespace and still has an extremely tight memory cap.

`--catalog-source-field-selector=metadata.namespace=openshift-marketplace` keeps this deployment from reconciling catalogs outside that namespace, and `GOMEMLIMIT=5MiB` is still well below the container's `10Mi` memory request. If the namespace scoping is intentional, please make that explicit in config or comments; otherwise leave the selector unset. I'd also drop the custom `GOMEMLIMIT` or raise it to something realistic for controller-runtime.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/generate_crds_manifests.sh` around lines 593 - 608, Remove or make the hard-coded namespace selector and the tiny memory cap configurable: remove the argument --catalog-source-field-selector=metadata.namespace=openshift-marketplace from the container args (or expose it as a configuration/ENV so namespace scoping is explicit), and either remove the GOMEMLIMIT environment entry or raise it to a realistic value (e.g., >= the container memory request such as 100Mi) so it does not undercut the pod’s requested memory; update the NAMESPACE / comments to document intentional namespace scoping if you choose to keep the selector.
657-675: ⚠️ Potential issue | 🟡 Minor

TLS metrics are exposed, but there is still no matching ServiceMonitor.

This Service serves HTTPS metrics with a serving cert, but unlike the package-server-manager block at Lines 287-308, the script still does not emit a ServiceMonitor for lifecycle-controller. If monitoring is intentionally deferred, a TODO would help; otherwise I'd add the monitor here to keep observability consistent.

🤖 Prompt for AI Agents
ServiceMonitorforlifecycle-controller. If monitoring is intentionally deferred, a TODO would help; otherwise I’d add the monitor here to keep observability consistent.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/generate_crds_manifests.sh` around lines 657 - 675, The lifecycle-controller Service exposes HTTPS metrics but no ServiceMonitor is generated; either add a ServiceMonitor manifest for the lifecycle-controller or add a clear TODO comment. Locate the lifecycle-controller Service block (resource name: lifecycle-controller, selector app: olm-lifecycle-controller) in scripts/generate_crds_manifests.sh and emit a ServiceMonitor resource (match selector app: olm-lifecycle-controller, endpoints -> port: metrics, scheme: https, appropriate namespaceSelector and TLS/bearer token settings) analogous to the package-server-manager ServiceMonitor block to ensure Prometheus can scrape it, or add a TODO explaining why monitoring is intentionally omitted.
49-50:⚠️ Potential issue | 🟡 MinorPlease document the new CRD source-of-truth flow.
This still switches downstream CRDs to verbatim copies from
staging/api/crds, but the script does not explain how that staging snapshot is refreshed or why the old localcontroller-gen/yqfixups are no longer needed. A short note pointing maintainers toscripts/verify_commits.shwould make this much easier to reason about next time.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/generate_crds_manifests.sh` around lines 49 - 50, Add a short explanatory comment or README entry describing the new CRD source-of-truth flow: state that scripts/generate_crds_manifests.sh now copies verbatim CRDs from staging/api/crds (the authoritative snapshot), explain how that staging snapshot is refreshed (refer to the commit/automation verified by scripts/verify_commits.sh), and note why the previous controller-gen/yq post-processing is no longer required; include these notes adjacent to the cp "${ROOT_DIR}"/staging/api/crds/*.yaml "${crddir}/" line so maintainers can quickly find the rationale and the refresh verification script.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/generate_crds_manifests.sh`:
- Around line 547-556: The TechPreview manifests (e.g.,
manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml) are being copied
unconditionally into microshift-manifests; update the copy/packaging logic to
skip files annotated with release.openshift.io/feature-set:
"TechPreviewNoUpgrade" or explicitly exclude filename patterns 0000_50_olm_08-*
and 0000_50_olm_09-* when populating microshift-manifests so these lifecycle
components remain preview-only; modify the script section that performs the
manifest copy (the microshift-manifests population loop) to filter out those
patterns or check the annotation before copying.
---
Duplicate comments:
In `@scripts/generate_crds_manifests.sh`:
- Around line 593-608: Remove or make the hard-coded namespace selector and the
tiny memory cap configurable: remove the argument
--catalog-source-field-selector=metadata.namespace=openshift-marketplace from
the container args (or expose it as a configuration/ENV so namespace scoping is
explicit), and either remove the GOMEMLIMIT environment entry or raise it to a
realistic value (e.g., >= the container memory request such as 100Mi) so it does
not undercut the pod’s requested memory; update the NAMESPACE / comments to
document intentional namespace scoping if you choose to keep the selector.
- Around line 657-675: The lifecycle-controller Service exposes HTTPS metrics
but no ServiceMonitor is generated; either add a ServiceMonitor manifest for the
lifecycle-controller or add a clear TODO comment. Locate the
lifecycle-controller Service block (resource name: lifecycle-controller,
selector app: olm-lifecycle-controller) in scripts/generate_crds_manifests.sh
and emit a ServiceMonitor resource (match selector app:
olm-lifecycle-controller, endpoints -> port: metrics, scheme: https, appropriate
namespaceSelector and TLS/bearer token settings) analogous to the
package-server-manager ServiceMonitor block to ensure Prometheus can scrape it,
or add a TODO explaining why monitoring is intentionally omitted.
- Around line 49-50: Add a short explanatory comment or README entry describing
the new CRD source-of-truth flow: state that scripts/generate_crds_manifests.sh
now copies verbatim CRDs from staging/api/crds (the authoritative snapshot),
explain how that staging snapshot is refreshed (refer to the commit/automation
verified by scripts/verify_commits.sh), and note why the previous
controller-gen/yq post-processing is no longer required; include these notes
adjacent to the cp "${ROOT_DIR}"/staging/api/crds/*.yaml "${crddir}/" line so
maintainers can quickly find the rationale and the refresh verification script.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 26720ce2-1274-47f8-9dc4-dbc658a0d3f2
📒 Files selected for processing (1)
scripts/generate_crds_manifests.sh
```bash
cat << EOF > manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lifecycle-controller
  namespace: openshift-operator-lifecycle-manager
  labels:
    app: olm-lifecycle-controller
  annotations:
    release.openshift.io/feature-set: "TechPreviewNoUpgrade"
```
TechPreview manifests become unconditional in MicroShift.
These new resources are tagged TechPreviewNoUpgrade, but later this script copies every manifest into microshift-manifests and the ignore list does not exclude the new 0000_50_olm_08-* / 0000_50_olm_09-* files. Because MicroShift does not have CVO feature-set gating here, this ships the lifecycle components by default there instead of keeping them preview-only.
Possible follow-up
```diff
 files_to_ignore=("ibm-cloud-managed.yaml" "kustomization.yaml" "psm-operator" "removed" "collect-profiles" "pprof" "prometheus" "service-monitor" "operatorstatus")
+files_to_ignore+=(
+  "0000_50_olm_08-lifecycle-controller"
+  "0000_50_olm_09-lifecycle-server"
+)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/generate_crds_manifests.sh` around lines 547 - 556, The TechPreview
manifests (e.g., manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml)
are being copied unconditionally into microshift-manifests; update the
copy/packaging logic to skip files annotated with
release.openshift.io/feature-set: "TechPreviewNoUpgrade" or explicitly exclude
filename patterns 0000_50_olm_08-* and 0000_50_olm_09-* when populating
microshift-manifests so these lifecycle components remain preview-only; modify
the script section that performs the manifest copy (the microshift-manifests
population loop) to filter out those patterns or check the annotation before
copying.
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml (1)
62-67: ⚠️ Potential issue | 🔴 Critical

Increase heap cap and memory request; current values are likely to destabilize the pod.

Line 63 sets `GOMEMLIMIT` to `5MiB`, and Line 67 requests only `10Mi` memory. For a Go controller-runtime process, this is likely to cause startup failures or heavy GC thrashing.

🐛 Proposed fix

```diff
       - name: GOMEMLIMIT
-        value: "5MiB"
+        value: "200MiB"
       resources:
         requests:
-          cpu: 10m
-          memory: 10Mi
+          cpu: 100m
+          memory: 200Mi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml` around lines 62 - 67, The GOMEMLIMIT env var is set far too low and the pod memory request is insufficient; update the GOMEMLIMIT value (env name GOMEMLIMIT) to a realistic heap cap (e.g., increase from 5MiB to a larger value appropriate for the controller, e.g., hundreds of MiB) and raise resources.requests.memory (and adjust cpu if needed) in the deployment spec so the Go controller-runtime process can start and run without extreme GC/ooms; locate the env block containing GOMEMLIMIT and the resources.requests section and raise those values consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml`:
- Around line 92-99: The pod is pinned to nodeSelector
node-role.kubernetes.io/control-plane but only tolerates the master taint key
node-role.kubernetes.io/master, so it can remain Pending on clusters that taint
control-plane nodes with node-role.kubernetes.io/control-plane: add a new
toleration entry under tolerations that matches key
node-role.kubernetes.io/control-plane with operator: Exists and effect:
NoSchedule (and optionally effect: NoExecute if you want eviction tolerance) so
the Deployment’s tolerations cover both master and control-plane taints.
---
Duplicate comments:
In `@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml`:
- Around line 62-67: The GOMEMLIMIT env var is set far too low and the pod
memory request is insufficient; update the GOMEMLIMIT value (env name
GOMEMLIMIT) to a realistic heap cap (e.g., increase from 5MiB to a larger value
appropriate for the controller, e.g., hundreds of MiB) and raise
resources.requests.memory (and adjust cpu if needed) in the deployment spec so
the Go controller-runtime process can start and run without extreme GC/ooms;
locate the env block containing GOMEMLIMIT and the resources.requests section
and raise those values consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9e60de6e-01c8-400d-8d42-d0bd42ccb5c0
📒 Files selected for processing (20)
- manifests/0000_50_olm_00-catalogsources.crd.yaml
- manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
- manifests/0000_50_olm_00-installplans.crd.yaml
- manifests/0000_50_olm_00-olmconfigs.crd.yaml
- manifests/0000_50_olm_00-operatorconditions.crd.yaml
- manifests/0000_50_olm_00-operatorgroups.crd.yaml
- manifests/0000_50_olm_00-operators.crd.yaml
- manifests/0000_50_olm_00-subscriptions.crd.yaml
- manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml
- manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
- microshift-manifests/0000_50_olm_00-catalogsources.crd.yaml
- microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
- microshift-manifests/0000_50_olm_00-installplans.crd.yaml
- microshift-manifests/0000_50_olm_00-olmconfigs.crd.yaml
- microshift-manifests/0000_50_olm_00-operatorconditions.crd.yaml
- microshift-manifests/0000_50_olm_00-operatorgroups.crd.yaml
- microshift-manifests/0000_50_olm_00-operators.crd.yaml
- microshift-manifests/0000_50_olm_00-subscriptions.crd.yaml
- microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml
- microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
✅ Files skipped from review due to trivial changes (16)
- microshift-manifests/0000_50_olm_00-olmconfigs.crd.yaml
- manifests/0000_50_olm_00-olmconfigs.crd.yaml
- manifests/0000_50_olm_00-operatorgroups.crd.yaml
- manifests/0000_50_olm_00-operators.crd.yaml
- microshift-manifests/0000_50_olm_00-installplans.crd.yaml
- microshift-manifests/0000_50_olm_00-operatorconditions.crd.yaml
- microshift-manifests/0000_50_olm_00-operatorgroups.crd.yaml
- manifests/0000_50_olm_00-subscriptions.crd.yaml
- microshift-manifests/0000_50_olm_00-operators.crd.yaml
- manifests/0000_50_olm_00-installplans.crd.yaml
- microshift-manifests/0000_50_olm_00-catalogsources.crd.yaml
- manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml
- manifests/0000_50_olm_00-operatorconditions.crd.yaml
- manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml
- microshift-manifests/0000_50_olm_00-subscriptions.crd.yaml
- manifests/0000_50_olm_00-catalogsources.crd.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
- microshift-manifests/0000_50_olm_00-clusterserviceversions.crd.yaml
```yaml
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify how control-plane scheduling is handled across manifests.
rg -n --glob '*.yaml' 'node-role\.kubernetes\.io/control-plane' microshift-manifests -C2
rg -n --glob '*.yaml' 'key:\s*node-role\.kubernetes\.io/control-plane' microshift-manifests -C2
rg -n 'nodeSelector:|tolerations:|node-role\.kubernetes\.io/(master|control-plane)' \
  microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml -C3
Repository: openshift/operator-framework-olm
Length of output: 1229
🏁 Script executed:
cat -n microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml | sed -n '92,105p'
Repository: openshift/operator-framework-olm
Length of output: 630
🏁 Script executed:
cat -n microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.yaml | sed -n '91,110p'
Repository: openshift/operator-framework-olm
Length of output: 818
Add control-plane taint toleration to avoid Pending pods on control-plane-tainted clusters.
The nodeSelector pins scheduling to node-role.kubernetes.io/control-plane, but the tolerations only include node-role.kubernetes.io/master. On clusters using the standard node-role.kubernetes.io/control-plane:NoSchedule taint on control-plane nodes, this pod will remain unschedulable.
Suggested patch
tolerations:
+ - effect: NoSchedule
+ key: node-role.kubernetes.io/control-plane
+ operator: Exists
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@microshift-manifests/0000_50_olm_08-lifecycle-controller.deployment.ibm-cloud-managed.yaml`
around lines 92 - 99, The pod is pinned to nodeSelector
node-role.kubernetes.io/control-plane but only tolerates the master taint key
node-role.kubernetes.io/master, so it can remain Pending on clusters that taint
control-plane nodes with node-role.kubernetes.io/control-plane: add a new
toleration entry under tolerations that matches key
node-role.kubernetes.io/control-plane with operator: Exists and effect:
NoSchedule (and optionally effect: NoExecute if you want eviction tolerance) so
the Deployment’s tolerations cover both master and control-plane taints.
PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Summary by CodeRabbit
New Features
Security
Manifests
Tests
Chores