CNTRLPLANE-3324: Enable gocyclo linter and reduce cyclomatic complexity#8309
📝 Walkthrough

This PR refactors many large functions across the repository into smaller unexported helper methods, reorganizing control flow while preserving external behavior and public signatures. Areas changed include cluster create flows, AWS/Azure infra orchestration, install validation, multiple controllers (control-plane, hypershift, ignition), metrics/scheduler collectors, and extensive unit tests.

Sequence diagram:

sequenceDiagram
participant CLI as CLI
participant CreateInfra as CreateInfra
participant SessionMgr as AWS Session Manager
participant EC2 as AWS EC2
participant R53 as AWS Route53
participant VPC as VPC Resources
participant Proxy as Proxy Helper
CLI->>CreateInfra: start CreateInfra
CreateInfra->>SessionMgr: init AWSSessions
SessionMgr->>EC2: init per-account clients
CreateInfra->>VPC: createVPCResources
VPC-->>CreateInfra: VPC info
CreateInfra->>EC2: createPerZoneResources (subnets, routes, endpoints)
CreateInfra->>R53: LookupPublicZone
R53-->>CreateInfra: hosted zone info
CreateInfra->>CreateInfra: shareSubnets if vpc-owner session exists
CreateInfra->>Proxy: createProxyResources (or return if disabled)
Proxy-->>EC2: create SG / endpoints (wrap SG errors)
CreateInfra-->>CLI: return output / errors
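The refactor pattern the walkthrough describes (splitting one high-complexity function into small unexported helpers so the gocyclo linter passes) looks roughly like this. A generic sketch, not code from the PR; all names and the default region are illustrative:

```go
package main

import "fmt"

// Before the refactor, one function with nested branching trips the gocyclo
// threshold. After, each concern moves into a small helper, so the top-level
// flow is linear and each helper's cyclomatic complexity stays low.

// validateName holds one branch of the original function.
func validateName(name string) error {
	if name == "" {
		return fmt.Errorf("name must not be empty")
	}
	return nil
}

// defaultRegion holds another branch; "us-east-1" is a hypothetical default.
func defaultRegion(region string) string {
	if region == "" {
		return "us-east-1"
	}
	return region
}

// createInfra is now a linear composition of helpers.
func createInfra(name, region string) (string, error) {
	if err := validateName(name); err != nil {
		return "", err
	}
	region = defaultRegion(region)
	return fmt.Sprintf("%s/%s", region, name), nil
}

func main() {
	id, err := createInfra("demo", "")
	fmt.Println(id, err)
}
```

Behavior is unchanged; only the branching is redistributed, which is why the PR can keep public signatures intact.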
🚥 Pre-merge checks: ✅ 10 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 5
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go (1)
1256-1265: ⚠️ Potential issue | 🟠 Major

Populate `TypeMeta` in manifest constructors to fix capability-gating lookups.

The manifest constructors in `rbac.go` initialize only `ObjectMeta`, not `TypeMeta`. When `getKey()` calls `obj.GetObjectKind().GroupVersionKind()` on these freshly constructed objects, the Kind field is empty because TypeMeta was never populated. This produces malformed keys (e.g., `"name/"` instead of `"name/ClusterRole"`), causing silent mismatches in the `RbacCapabilityMap` lookup at line 1331. RBAC capability gating is then bypassed entirely. Each manifest constructor must explicitly set TypeMeta with the appropriate Kind and APIVersion. For example:

```go
func IngressToRouteControllerClusterRole() *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		TypeMeta: metav1.TypeMeta{
			Kind:       "ClusterRole",
			APIVersion: rbacv1.SchemeGroupVersion.String(),
		},
		ObjectMeta: metav1.ObjectMeta{
			Name: "system:openshift:openshift-controller-manager:ingress-to-route-controller",
		},
	}
}
```

Apply this fix to all manifest constructors in control-plane-operator/hostedclusterconfigoperator/controllers/resources/manifests/rbac.go and any other manifest files that feed into `getKey()`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go` around lines 1256 - 1265, getKey() relies on objects' TypeMeta but many manifest constructors (e.g., IngressToRouteControllerClusterRole and other constructors in manifests/rbac.go) only set ObjectMeta, leaving Kind/APIVersion empty and producing malformed keys that break the RbacCapabilityMap lookup; fix each manifest constructor that returns Kubernetes types (ClusterRole, Role, ClusterRoleBinding, RoleBinding, etc.) to populate TypeMeta with the correct Kind and APIVersion (use the corresponding SchemeGroupVersion.String() for APIVersion and the literal Kind name) so obj.GetObjectKind().GroupVersionKind() returns valid values and getKey()/RbacCapabilityMap lookups work correctly.

hypershift-operator/controllers/scheduler/aws/autoscaler.go (1)
420-432: ⚠️ Potential issue | 🟠 Major

Don't mark the empty pair label as taken.

This now inserts `""` for any hosted-cluster node that lacks `OSDFleetManagerPairedNodesLabel`. Downstream, `takenPairLabels.Has(machineSetPairLabel(ms))` will then reject every unpaired node/machineset, which can block future scale-ups for the unpaired case entirely.

Suggested fix:

```diff
 for _, n := range nodes {
 	if n.Labels[hyperv1.HostedClusterLabel] != "" {
-		takenPairLabels.Insert(n.Labels[OSDFleetManagerPairedNodesLabel])
+		if pairLabel := n.Labels[OSDFleetManagerPairedNodesLabel]; pairLabel != "" {
+			takenPairLabels.Insert(pairLabel)
+		}
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/scheduler/aws/autoscaler.go` around lines 420 - 432, collectTakenPairLabels currently inserts an empty string when a hosted-cluster node lacks OSDFleetManagerPairedNodesLabel, causing downstream has checks (e.g., machineSetPairLabel comparisons) to incorrectly reject unpaired resources; update collectTakenPairLabels so it only inserts n.Labels[OSDFleetManagerPairedNodesLabel] when that label exists and is non-empty (i.e., check the paired-label string != "" before calling takenPairLabels.Insert), keeping the existing pod check via podPairLabel as-is.

hypershift-operator/controllers/nodepool/capi.go (1)
520-543: ⚠️ Potential issue | 🟠 Major

Removed NodePool labels will stick on existing Machines.

This helper only writes the current `Spec.NodeLabels`, so previously propagated `labelManagedPrefix.*` keys are never pruned from existing Machines. Label removals on the NodePool therefore never fully take effect, which can leave stale scheduling selectors on Machines/Nodes indefinitely.

Suggested fix:

```diff
 if result, err := controllerutil.CreateOrPatch(ctx, c.Client, &machine, func() error {
 	if machine.Labels == nil {
 		machine.Labels = make(map[string]string)
 	}
 	if machine.Annotations == nil {
 		machine.Annotations = make(map[string]string)
 	}
+
+	for k := range machine.Labels {
+		if strings.HasPrefix(k, labelManagedPrefix+".") {
+			delete(machine.Labels, k)
+		}
+	}
 	for k, v := range nodePool.Spec.NodeLabels {
 		labelKey := fmt.Sprintf("%s.%s", labelManagedPrefix, k)
 		machine.Labels[labelKey] = v
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/nodepool/capi.go` around lines 520 - 543, The current CreateOrPatch patcher only adds/updates labelManagedPrefix.* keys from nodePool.Spec.NodeLabels but never removes managed keys that were deleted from the NodePool; update the mutation function passed to controllerutil.CreateOrPatch to prune stale managed labels: scan machine.Labels for keys with the prefix labelManagedPrefix and delete any whose suffix key is not present in nodePool.Spec.NodeLabels, then add/update all keys from nodePool.Spec.NodeLabels as before; also ensure the globalPSNodeLabel key (constructed via fmt.Sprintf("%s.%s", labelManagedPrefix, globalPSNodeLabel)) is set when Platform.Type is AWS/Azure and removed when it is not; leave taints handling unchanged (taintsToJSON and nodePoolAnnotationTaints).
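The prune-then-set pattern the comment above asks for can be exercised in isolation. A minimal sketch with hypothetical names (`syncManagedLabels` and the `managed` prefix are illustrative, not from the PR):

```go
package main

import (
	"fmt"
	"strings"
)

// syncManagedLabels deletes every key under the managed prefix first, then
// re-applies the desired set, so label removals on the source object
// actually take effect on the target instead of sticking forever.
func syncManagedLabels(target, desired map[string]string, prefix string) map[string]string {
	if target == nil {
		target = map[string]string{}
	}
	// Prune stale managed keys; unmanaged keys are untouched.
	for k := range target {
		if strings.HasPrefix(k, prefix+".") {
			delete(target, k)
		}
	}
	// Re-apply the current desired set under the managed prefix.
	for k, v := range desired {
		target[fmt.Sprintf("%s.%s", prefix, k)] = v
	}
	return target
}

func main() {
	target := map[string]string{
		"managed.old": "stale", // propagated earlier, no longer desired
		"unmanaged":   "keep",  // not under the prefix, must survive
	}
	got := syncManagedLabels(target, map[string]string{"tier": "gpu"}, "managed")
	fmt.Println(len(got), got["managed.tier"], got["unmanaged"])
}
```

The key property is that pruning is scoped to the prefix, which is what keeps user-applied labels on the Machine safe.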
🟠 Major comments (18)
test/e2e/util/dump/journals.go-37-45 (1)
37-45: ⚠️ Potential issue | 🟠 Major

Temporary SSH key/script artifacts are not cleaned up.

Line 37 and Line 42 create temp artifacts, but `DumpJournals` never removes them. This leaves private key material and temp scripts on disk after use.

💡 Proposed fix:

```diff
 privateKeyFile, err := setupSSHKey(ctx, hc)
 if err != nil {
 	return err
 }
+defer os.RemoveAll(filepath.Dir(privateKeyFile))
 copyJournalFile, err := writeCopyJournalScript()
 if err != nil {
 	return err
 }
+defer os.Remove(copyJournalFile)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/util/dump/journals.go` around lines 37 - 45, DumpJournals creates temporary artifacts (privateKeyFile from setupSSHKey and copyJournalFile from writeCopyJournalScript) but never removes them; update DumpJournals to remove these temp files when done and on error: after each successful creation defer removal (or add a cleanup block at function exit) that calls os.Remove for privateKeyFile and copyJournalFile, and ensure deferred removals are set up immediately after each successful call so error paths also clean up; reference the variables privateKeyFile, copyJournalFile and the function DumpJournals when applying the change.

test/e2e/util/dump/journals.go-243-246 (1)
243-246: ⚠️ Potential issue | 🟠 Major

Avoid defaulting SSH ingress to world-open CIDR.

Line 245 falls back to `0.0.0.0/0`, which opens SSH access globally if `SSH_SOURCE_CIDR` is unset.

💡 Proposed fix:

```diff
 sourceCIDR := os.Getenv("SSH_SOURCE_CIDR")
 if sourceCIDR == "" {
-	sourceCIDR = "0.0.0.0/0"
+	return fmt.Errorf("SSH_SOURCE_CIDR must be set; refusing to authorize SSH from 0.0.0.0/0")
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/util/dump/journals.go` around lines 243 - 246, Do not default SSH_SOURCE_CIDR to the world-open "0.0.0.0/0"; instead make sourceCIDR (the SSH_SOURCE_CIDR env var used in test/e2e/util/dump/journals.go) required or default to a safe narrow CIDR (e.g. "127.0.0.1/32" or a private CIDR) and fail fast if unset. Update the code where sourceCIDR is read so it validates the env var, returns an error (or calls t.Fatal / os.Exit in the test helper) when the value is missing/invalid, or uses the safe narrow default, and ensure any place that consumes sourceCIDR handles the error/validated value.

test/e2e/util/dump/journals.go-199-217 (1)
199-217: ⚠️ Potential issue | 🟠 Major

`sgID` is never captured for private-IP path.

Line 210 only initializes `sgID` when `usePublicIPs` is true. For private-IP flow, `authorizeSSHAccess` receives empty `sgID` and becomes a no-op.

💡 Proposed fix:

```diff
 for _, reservation := range result.Reservations {
 	for _, instance := range reservation.Instances {
 		if isBastionInstance(instance, hc.Spec.InfraID) {
 			continue
 		}
 		if string(instance.State.Name) != "running" {
 			continue
 		}
+		if sgID == "" && len(instance.SecurityGroups) > 0 {
+			sgID = aws.ToString(instance.SecurityGroups[0].GroupId)
+		}
 		if usePublicIPs {
 			if aws.ToString(instance.PublicIpAddress) != "" {
-				if sgID == "" && len(instance.SecurityGroups) > 0 {
-					sgID = aws.ToString(instance.SecurityGroups[0].GroupId)
-				}
 				machineIPs = append(machineIPs, aws.ToString(instance.PublicIpAddress))
 			}
 		} else {
 			machineIPs = append(machineIPs, aws.ToString(instance.PrivateIpAddress))
 		}
 		machineInstances = append(machineInstances, instance)
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/util/dump/journals.go` around lines 199 - 217, The loop over EC2 instances only sets sgID inside the usePublicIPs branch so authorizeSSHAccess gets an empty sgID for private-IP flows; move or duplicate the sgID assignment so that whenever sgID == "" and instance.SecurityGroups has entries you set sgID = aws.ToString(instance.SecurityGroups[0].GroupId) regardless of usePublicIPs (inside the same for _, instance range before appending IPs), ensuring authorizeSSHAccess receives the captured security group ID; reference variables/functions: sgID, machineIPs, authorizeSSHAccess, isBastionInstance.

test/e2e/util/dump/journals.go-164-173 (1)
164-173: ⚠️ Potential issue | 🟠 Major

Bastion teardown timing moved too late in lifecycle.

Line 164 registers bastion destruction with `t.Cleanup`, but `DumpJournals` is already called from cleanup (test/e2e/util/fixture.go, line 410). This can delay teardown ordering and run with a stale/canceled `ctx`, causing bastion leaks.

💡 Proposed fix:

```diff
-func setupBastion(t *testing.T, ctx context.Context, hc *hyperv1.HostedCluster, awsCreds string, createLogger, destroyLogger *zap.Logger) (string, error) {
+func setupBastion(t *testing.T, ctx context.Context, hc *hyperv1.HostedCluster, awsCreds string, createLogger, destroyLogger *zap.Logger) (string, func(), error) {
 	if hc.Annotations[hyperv1.AWSMachinePublicIPs] == "true" {
-		return "", nil
+		return "", nil, nil
 	}
 	createBastion := bastionaws.CreateBastionOpts{
 		Namespace:          hc.Namespace,
 		Name:               hc.Name,
 		AWSCredentialsFile: awsCreds,
 		Wait:               true,
 	}
 	_, bastionIP, err := createBastion.Run(ctx, zapr.NewLoggerWithOptions(createLogger))
 	if err != nil {
-		return "", err
+		return "", nil, err
 	}
-	t.Cleanup(func() {
+	cleanup := func() {
 		destroyBastion := bastionaws.DestroyBastionOpts{
 			Namespace:          hc.Namespace,
 			Name:               hc.Name,
 			AWSCredentialsFile: awsCreds,
 		}
-		if err := destroyBastion.Run(ctx, zapr.NewLoggerWithOptions(destroyLogger)); err != nil {
+		if err := destroyBastion.Run(context.Background(), zapr.NewLoggerWithOptions(destroyLogger)); err != nil {
 			t.Logf("error destroying bastion: %v", err)
 		}
-	})
-	return bastionIP, nil
+	}
+	return bastionIP, cleanup, nil
 }
```

```diff
-	bastionIP, err := setupBastion(t, ctx, hc, awsCreds, createLogger, destroyLogger)
+	bastionIP, cleanupBastion, err := setupBastion(t, ctx, hc, awsCreds, createLogger, destroyLogger)
 	if err != nil {
 		return err
 	}
+	if cleanupBastion != nil {
+		defer cleanupBastion()
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/util/dump/journals.go` around lines 164 - 173, The bastion destruction is being deferred via t.Cleanup which runs after DumpJournals (registered elsewhere) and may use a canceled ctx; instead, execute the teardown synchronously or register it earlier so it runs before DumpJournals. Locate the bastionaws.DestroyBastionOpts creation (destroyBastion) in DumpJournals/journals.go and replace the t.Cleanup wrapper with an immediate call to destroyBastion.Run(ctx, zapr.NewLoggerWithOptions(destroyLogger)) (handling/logging the error), or move the t.Cleanup registration to execute before DumpJournals is registered (ensure the destroy registration occurs earlier in the test lifecycle) so the bastion is torn down while ctx is still valid.

control-plane-operator/controllers/azureprivatelinkservice/controller.go-960-974 (1)
960-974: ⚠️ Potential issue | 🟠 Major

Don't tear down the shared base-domain VNet link before checking for siblings.

`reconcileDNSZone` already handles the case where only one sibling CR actually owns the base-domain VNet link. Here we always try to delete `baseDomainVNetLinkName(azPLS.Name)` before `hasSiblingCR()` runs, so if the deleting CR is the one that created the real link, the surviving sibling loses name resolution even though we keep the zone. Please gate shared-link teardown on the sibling check, or otherwise resolve and delete the actual shared link only when the last sibling is gone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/azureprivatelinkservice/controller.go` around lines 960 - 974, The code deletes the base-domain VNet link (deleteVNetLink with baseDomainVNetLinkName(azPLS.Name)) before checking for sibling CRs (hasSiblingCR), which can break name resolution for surviving siblings; move or gate the shared-link teardown so it only runs after confirming there are no sibling CRs (call hasSiblingCR first and only invoke deleteVNetLink/deleteDNSZone when hasSiblings is false), or alternatively resolve the actual shared link owner and delete it only when the last sibling is gone (adjust reconcileDNSZone logic to perform the shared-link deletion after the sibling check).

control-plane-operator/main.go-181-223 (1)
181-223: ⚠️ Potential issue | 🟠 Major

Preserve the caller's context in `newOperatorImageLookup`.

Line 187 switches these API reads to `context.Background()`, so the image lookup no longer honors the startup/shutdown context or the 30s `PollUntilContextTimeout` budget. A stuck apiserver read can now block this path past the caller's timeout. Please thread the caller's `ctx` into the helper instead of creating a background context here.

Suggested fix:

```diff
-func newOperatorImageLookup(mgr ctrl.Manager, namespace, deploymentName string) func(string) (string, error) {
+func newOperatorImageLookup(ctx context.Context, mgr ctrl.Manager, namespace, deploymentName string) func(string) (string, error) {
 	return func(userSpecifiedImage string) (string, error) {
 		if len(userSpecifiedImage) > 0 {
 			setupLog.Info("using image from arguments", "image", userSpecifiedImage)
 			return userSpecifiedImage, nil
 		}
-		ctx := context.Background()
 		if podName := os.Getenv("POD_NAME"); podName != "" {
 			me := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: podName}}
 			if err := mgr.GetAPIReader().Get(ctx, crclient.ObjectKeyFromObject(me), me); err != nil {
 				return "", fmt.Errorf("failed to get operator pod %s: %w", crclient.ObjectKeyFromObject(me), err)
 			}
@@
-	lookupOperatorImage := newOperatorImageLookup(mgr, namespace, deploymentName)
+	lookupOperatorImage := newOperatorImageLookup(ctx, mgr, namespace, deploymentName)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/main.go` around lines 181 - 223, The returned lookup currently creates a background context and ignores the caller's timeout; change newOperatorImageLookup to accept a context from the caller by altering its returned function signature to func(ctx context.Context, userSpecifiedImage string) (string, error) and replace ctx := context.Background() with the provided ctx so all mgr.GetAPIReader().Get calls (and the pod/deployment lookups in the closure) honor the caller's cancellation/deadline (references: newOperatorImageLookup, the closure's Get calls, the podName branch, and the deployment lookup).

ignition-server/controllers/local_ignitionprovider.go-544-550 (1)
544-550: ⚠️ Potential issue | 🟠 Major

Propagate early `machine-config-server` exits instead of timing out.

The goroutine only logs `CombinedOutput()`. If `machine-config-server` dies before serving `/config/master`, the poll loop keeps retrying until the outer context expires, so callers lose the real stderr and get a misleading timeout. Feed the subprocess result back into the poll loop and fail as soon as the process exits unexpectedly.

Also applies to: 556-590
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ignition-server/controllers/local_ignitionprovider.go` around lines 544 - 550, The goroutine that starts machine-config-server currently only logs cmd.CombinedOutput() and doesn't propagate process exit to the poll loop; change it to send the subprocess result back so the poll loop can fail fast. Create a channel (e.g., cmdErrChan) and in the goroutine run out, err := cmd.CombinedOutput(); send a wrapped error (including stdout/stderr) on cmdErrChan (or nil on success), then in the poll loop select on ctx.Done(), time.After(...), and cmdErrChan so that if machine-config-server exits early the function returns the subprocess error immediately. Apply the same change for the duplicate block covering the other machine-config-server start (the section around lines 556-590).

ignition-server/controllers/local_ignitionprovider.go-352-363 (1)
352-363: ⚠️ Potential issue | 🟠 Major

Close the extracted binary file on every error path.

If Line 356 or Line 359 fails, `file.Close()` is never reached. On repeated retries that can leak descriptors in the controller process. Wrap this block so the file is closed via `defer` immediately after `os.Create`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ignition-server/controllers/local_ignitionprovider.go` around lines 352 - 363, The created file handle from os.Create (variable file) can leak on the error paths in the block around file.Chmod and p.ImageFileCache.extractImageFile; immediately after the os.Create call in the same scope, add a deferred close (defer func() { _ = file.Close() }()) so the file is always closed on any early return, and then remove or adapt the explicit file.Close() later (or check and return any Close error) to avoid double-close handling; keep the reframing local to this block around os.Create, file.Chmod, p.ImageFileCache.extractImageFile and file.Close.

ignition-server/controllers/local_ignitionprovider.go-266-267 (1)
266-267: ⚠️ Potential issue | 🟠 Major

Restrict permissions on the kubeconfig and pull secret files.

Lines 266-267 and 404-405 write credentials with `0644`, which makes them readable to any user that can access the workdir. These files only need to be readable by the current process and its child processes, so they should be `0600`.

Suggested fix:

```diff
-	if err := os.WriteFile(filepath.Join(dirs.mcsDir, "kubeconfig"), bootstrapKubeConfig, 0644); err != nil {
+	if err := os.WriteFile(filepath.Join(dirs.mcsDir, "kubeconfig"), bootstrapKubeConfig, 0600); err != nil {
 		return fmt.Errorf("failed to write bootstrap kubeconfig: %w", err)
 	}
```

```diff
-	if err = os.WriteFile(fmt.Sprintf("%s/pull-secret.yaml", dirs.configDir), []byte(serializedPullSecret), 0644); err != nil {
+	if err = os.WriteFile(fmt.Sprintf("%s/pull-secret.yaml", dirs.configDir), []byte(serializedPullSecret), 0600); err != nil {
 		return fmt.Errorf("failed to write pull secret to config dir: %w", err)
 	}
```

Also applies to: 404-405
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ignition-server/controllers/local_ignitionprovider.go` around lines 266 - 267, Change the file write permissions for the credential files written with os.WriteFile to be owner-only readable/writable (0600) instead of 0644: update the os.WriteFile call that writes filepath.Join(dirs.mcsDir, "kubeconfig") with bootstrapKubeConfig to use file mode 0600, and likewise update the other os.WriteFile call (the pull secret write at the second occurrence around lines 404-405) to use 0600 so both kubeconfig and pull secret are only accessible to the current user/process.

control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go-578-586 (1)
578-586: ⚠️ Potential issue | 🟠 Major

Don't return from the whole helper on a registry `Get` failure.

A non-`NotFound` error at Line 585 exits `reconcileRegistryAndIngress` before the additional trusted CA sync, ingress reconciliation, and `reconcileAuthOIDC` run. The surrounding refactor moved toward aggregating sub-step failures, so this now blocks unrelated resources behind a registry read error.

Suggested fix:

```diff
 func (r *reconciler) reconcileRegistryAndIngress(ctx context.Context, hcp *hyperv1.HostedControlPlane, log logr.Logger) []error {
 	var errs []error
+	registryConfigReadable := true
 	registryConfig := manifests.Registry()
 	var registryConfigExists bool
 	if err := r.client.Get(ctx, client.ObjectKeyFromObject(registryConfig), registryConfig); err != nil {
 		if !apierrors.IsNotFound(err) {
-			return []error{fmt.Errorf("failed to get registry config: %w", err)}
+			errs = append(errs, fmt.Errorf("failed to get registry config: %w", err))
+			registryConfigReadable = false
 		}
 	} else {
 		registryConfigExists = true
 	}
-	if capabilities.IsImageRegistryCapabilityEnabled(hcp.Spec.Capabilities) {
+	if registryConfigReadable && capabilities.IsImageRegistryCapabilityEnabled(hcp.Spec.Capabilities) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go` around lines 578 - 586, The helper reconcileRegistryAndIngress currently returns early when r.client.Get for registryConfig errors (non-NotFound), which prevents later steps (trusted CA sync, ingress reconciliation, reconcileAuthOIDC) from running; instead, change the error handling for r.client.Get to append the formatted error to errs (e.g., errs = append(errs, fmt.Errorf("failed to get registry config: %w", err))) and continue execution so subsequent reconciliation steps still run, while preserving the registryConfigExists boolean semantics (set true only when Get succeeds) and leaving NotFound treated as a non-error.

hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go-88-107 (1)
88-107: ⚠️ Potential issue | 🟠 Major

Propagate annotation-repair patch failures instead of continuing on stale state.

Line 101 and Line 122 currently log patch failures and continue. That can preserve corrupted annotations and cause repeated snapshots for the same restart progression.

🔧 Suggested fix:

```diff
-func (r *SnapshotReconciler) getLastObservedRestartCount(ctx context.Context, pod *corev1.Pod, log logr.Logger) int32 {
+func (r *SnapshotReconciler) getLastObservedRestartCount(ctx context.Context, pod *corev1.Pod, log logr.Logger) (int32, error) {
 	val, ok := pod.Annotations[lastObservedRestartCountAnnotation]
 	if !ok {
-		return 0
+		return 0, nil
 	}
 	count, err := parseInt32(val)
 	if err != nil {
 		log.V(1).Info("Failed to parse last observed restart count annotation, resetting to 0", "annotationValue", val, "error", err)
 		podCopy := pod.DeepCopy()
@@
-		if patchErr := r.client.Patch(ctx, podCopy, client.MergeFrom(pod)); patchErr != nil {
-			log.Error(patchErr, "Failed to reset corrupted annotation")
-		}
-		return 0
+		if patchErr := r.client.Patch(ctx, podCopy, client.MergeFrom(pod)); patchErr != nil {
+			return 0, fmt.Errorf("failed to reset corrupted %s annotation: %w", lastObservedRestartCountAnnotation, patchErr)
+		}
+		return 0, nil
 	}
-	return count
+	return count, nil
 }

-func (r *SnapshotReconciler) checkSnapshotInterval(ctx context.Context, pod *corev1.Pod, spec *auditlogpersistencev1alpha1.AuditLogPersistenceConfigSpec, log logr.Logger) (shouldSnapshot bool, skipReconcile bool) {
+func (r *SnapshotReconciler) checkSnapshotInterval(ctx context.Context, pod *corev1.Pod, spec *auditlogpersistencev1alpha1.AuditLogPersistenceConfigSpec, log logr.Logger) (shouldSnapshot bool, skipReconcile bool, err error) {
@@
-	if patchErr := r.client.Patch(ctx, podCopy, client.MergeFrom(pod)); patchErr != nil {
-		log.Error(patchErr, "Failed to remove corrupted last snapshot time annotation")
-	}
-	return true, false
+	if patchErr := r.client.Patch(ctx, podCopy, client.MergeFrom(pod)); patchErr != nil {
+		return false, false, fmt.Errorf("failed to remove corrupted %s annotation: %w", lastSnapshotTimeAnnotation, patchErr)
+	}
+	return true, false, nil
 }
@@
-	return true, false
+	return true, false, nil
 }
 if time.Since(lastSnapshotTime) >= minInterval {
-	return true, false
+	return true, false, nil
 }
-return false, true
+return false, true, nil
```

```diff
-	lastObservedRestartCount := r.getLastObservedRestartCount(ctx, pod, log)
+	lastObservedRestartCount, err := r.getLastObservedRestartCount(ctx, pod, log)
+	if err != nil {
+		return ctrl.Result{}, err
+	}
@@
-	shouldSnapshot, skipReconcile := r.checkSnapshotInterval(ctx, pod, spec, log)
+	shouldSnapshot, skipReconcile, err := r.checkSnapshotInterval(ctx, pod, spec, log)
+	if err != nil {
+		return ctrl.Result{}, err
+	}
```

Also applies to: 109-136
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go` around lines 88 - 107, The getLastObservedRestartCount function (SnapshotReconciler.getLastObservedRestartCount) currently swallows r.client.Patch errors and continues, leaving corrupted annotations; change its signature to return (int32, error), attempt the annotation reset as now, and if r.client.Patch(ctx, podCopy, client.MergeFrom(pod)) fails return 0 and the patch error (wrap with context) instead of only logging; update all callers to handle/propagate the error (or requeue) so the reconcile loop can retry; apply the same change/behavior to the other similar block handling the lastObservedRestartCountAnnotation in the file (lines ~109-136), ensuring parseInt32 parsing errors lead to a patch attempt and any patch failure is returned upward.

test/e2e/util/util.go-2493-2510 (1)
2493-2510: ⚠️ Potential issue | 🟠 Major

Move the fallback out of the service loop.

The `default` branch runs for every non-`APIServer` service, so it can overwrite a valid destination found earlier. It also means an `APIServer` entry with an empty hostname never falls back unless some other service happens to be present later in the slice.

Proposed fix:

```diff
 func findExternalNameDestination(t *testing.T, ctx context.Context, g Gomega, mgmtClient crclient.Client, hc *hyperv1.HostedCluster, hcpNamespace string) string {
-	var dest string
 	for _, service := range hc.Spec.Services {
-		fmt.Printf("service: %+v\n", service)
-		switch service.Service {
-		case hyperv1.APIServer:
-			dest = apiServerExternalName(service)
-		default:
-			t.Log("service custom DNS name not found, using the control plane endpoint")
-			hcp := &hyperv1.HostedControlPlane{}
-			err := mgmtClient.Get(ctx, types.NamespacedName{Namespace: hcpNamespace, Name: hc.Name}, hcp)
-			g.Expect(err).NotTo(HaveOccurred(), "failed to get updated HostedControlPlane")
-			g.Expect(hcp.Status.ControlPlaneEndpoint.Host).NotTo(BeEmpty(), "failed to get the control plane endpoint")
-			dest = hcp.Status.ControlPlaneEndpoint.Host
-		}
+		if service.Service != hyperv1.APIServer {
+			continue
+		}
+		if dest := apiServerExternalName(service); dest != "" {
+			return dest
+		}
+		break
 	}
-	g.Expect(dest).NotTo(BeEmpty(), "failed to get the external name destination")
-	return dest
+
+	t.Log("service custom DNS name not found, using the control plane endpoint")
+	hcp := &hyperv1.HostedControlPlane{}
+	err := mgmtClient.Get(ctx, types.NamespacedName{Namespace: hcpNamespace, Name: hc.Name}, hcp)
+	g.Expect(err).NotTo(HaveOccurred(), "failed to get updated HostedControlPlane")
+	g.Expect(hcp.Status.ControlPlaneEndpoint.Host).NotTo(BeEmpty(), "failed to get the control plane endpoint")
+	return hcp.Status.ControlPlaneEndpoint.Host
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/util/util.go` around lines 2493 - 2510, The fallback that reads the HostedControlPlane and sets dest should be moved out of the loop in findExternalNameDestination so it does not run for every non-APIServer service and overwrite a previously found value; iterate hc.Spec.Services, set dest only when service.Service == hyperv1.APIServer (using apiServerExternalName(service)) and break when a non-empty dest is found, then after the loop, if dest is empty perform the mgmtClient.Get(...) on HostedControlPlane and set dest = hcp.Status.ControlPlaneEndpoint.Host with the same g.Expect checks.

hypershift-operator/controllers/hostedcluster/etcd_recovery.go-48-55 (1)
48-55: ⚠️ Potential issue | 🟠 Major

Don't fall through while a recovery Job is still running.

Line 49 only logs "waiting" for an unfinished Job, but Line 54 still re-enters detection/trigger logic. The outer reconcile loop only keeps polling when this method returns a `requeueAfter`, so an in-flight recovery can stop being watched as soon as ETCD looks healthy, leaving cleanup and `EtcdRecoveryActive=False` to chance.

Also applies to: 57-60
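The early-return-with-requeue shape can be sketched with minimal stand-in types (the `result` struct and intervals here are hypothetical, not the controller's real API):

```go
package main

import (
	"fmt"
	"time"
)

type result struct {
	requeueAfter time.Duration
}

// reconcileRecovery returns a requeue whenever a recovery job is still
// in flight, instead of falling through to detection/trigger logic; the
// outer loop then keeps watching until the job actually finishes.
func reconcileRecovery(jobRunning, etcdHealthy bool) result {
	if jobRunning {
		// Still waiting on the job: requeue, do not re-enter detection,
		// even if etcd already looks healthy again.
		return result{requeueAfter: 30 * time.Second}
	}
	if etcdHealthy {
		return result{} // steady state: nothing to do
	}
	// Detection/trigger logic would run here.
	return result{requeueAfter: 10 * time.Second}
}

func main() {
	r := reconcileRecovery(true, true)
	fmt.Println("keeps watching in-flight job:", r.requeueAfter > 0)
}
```

The important branch is the first one: an in-flight job always wins over "etcd looks healthy", so cleanup and the condition flip are never left to chance.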
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/etcd_recovery.go` around lines 48 - 55, When a recovery Job is still running you must not proceed into detection/trigger logic; after calling handleExistingEtcdRecoveryJob (and when jobStatus.exists is true and indicates in-flight) return immediately with a requeue result so the outer reconcile continues watching instead of calling detectAndTriggerEtcdRecovery; apply the same early-return/requeue behavior to the other analogous block (lines 57-60) that currently falls through. Ensure you detect the "running"/in-flight state from jobStatus and return a requeue rather than calling detectAndTriggerEtcdRecovery or further logic.

hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go-169-170 (1)
169-170: ⚠️ Potential issue | 🟠 Major

Keep the requeue in the “no-op” transition paths.

Line 425 and Line 465 return &noOpAction precisely when the delay/window has not elapsed but the conditions already match. That sentinel carries no requeueAfter, and Line 342/Line 349 then suppress it entirely, so the controller can stop reconciling before the pending resize is actually allowed to proceed.

Suggested direction
- // Conditions already match status; no update needed but delay has not elapsed, so stop processing.
- return &noOpAction
+ // Conditions already match status; no update needed, but we still must wake up
+ // when the delay expires so the transition can continue.
+ return &action{requeueAfter: delayStart.Add(delay).Sub(r.now())}

- if result == &noOpAction {
-     return nil, nil
- }
  return result, nil

Do the same for the concurrency-limit path by returning a plain action{requeueAfter: durationUntilChanges} even when no status patch is needed.

Also applies to: 341-352, 381-425, 428-465
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go` around lines 169 - 170, The controller returns the sentinel noOpAction (var noOpAction = action{}) on transition paths where a delay/window hasn't elapsed, which carries no requeueAfter and prevents the reconcile from being requeued; instead, in those paths (including the concurrency-limit branch) return an action with requeueAfter set to the computed duration (e.g. return action{requeueAfter: durationUntilChanges}) rather than &noOpAction so the controller will requeue until the delay/window has passed; update all places that currently return &noOpAction when the only reason to stop is timing or concurrency (keep using the existing action struct and durationUntilChanges variable name) so downstream logic still patches nothing but schedules a requeue.

cmd/cluster/core/create.go-467-476 (1)
467-476: ⚠️ Potential issue | 🟠 Major

--pausedUntil=true is ignored by the new helper.

Line 468 treats "true" the same as “unset” and returns before assigning cluster.Spec.PausedUntil. That drops the documented “pause until the field is removed” behavior for the boolean form.

Suggested fix
 func applyPausedUntil(cluster *hyperv1.HostedCluster, opts *CreateOptions) error {
-	if len(opts.PausedUntil) == 0 || opts.PausedUntil == "true" {
+	if len(opts.PausedUntil) == 0 {
 		return nil
 	}
+	if opts.PausedUntil == "true" {
+		cluster.Spec.PausedUntil = &opts.PausedUntil
+		return nil
+	}
 	_, err := time.Parse(time.RFC3339, opts.PausedUntil)
 	if err != nil {
 		return fmt.Errorf("invalid pausedUntil value, should be \"true\" or a valid RFC3339 date format: %w", err)
 	}
 	cluster.Spec.PausedUntil = &opts.PausedUntil
 	return nil
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cluster/core/create.go` around lines 467 - 476, The helper currently treats opts.PausedUntil == "true" as unset and returns early, dropping the documented boolean behavior; change applyPausedUntil so that when opts.PausedUntil == "true" you explicitly set cluster.Spec.PausedUntil to a pointer to an empty string (e.g. empty := ""; cluster.Spec.PausedUntil = &empty) and return nil, keep the existing len(opts.PausedUntil) == 0 early return for unset, and only run time.Parse when opts.PausedUntil is neither empty nor "true".

hypershift-operator/controllers/nodepool/aws.go-154-175 (1)
154-175: ⚠️ Potential issue | 🟠 Major

Keep the default EBS type in the nil RootVolume path.

buildAWSRootVolume only applies EC2VolumeDefaultType when nodePool.Spec.Platform.AWS.RootVolume is present. If the field is omitted entirely, this helper now returns a volume with Type == "", so NodePools relying on the controller default no longer get the same machine template.

Suggested fix
 func buildAWSRootVolume(nodePool *hyperv1.NodePool) *capiaws.Volume {
 	rootVolume := &capiaws.Volume{
-		Size: EC2VolumeDefaultSize,
+		Size: EC2VolumeDefaultSize,
+		Type: capiaws.VolumeType(EC2VolumeDefaultType),
 	}
 	if nodePool.Spec.Platform.AWS.RootVolume == nil {
 		return rootVolume
 	}
 	if nodePool.Spec.Platform.AWS.RootVolume.Type != "" {
 		rootVolume.Type = capiaws.VolumeType(nodePool.Spec.Platform.AWS.RootVolume.Type)
-	} else {
-		rootVolume.Type = capiaws.VolumeType(EC2VolumeDefaultType)
 	}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/nodepool/aws.go` around lines 154 - 175, buildAWSRootVolume currently returns a rootVolume with Size set to EC2VolumeDefaultSize but leaves Type empty when nodePool.Spec.Platform.AWS.RootVolume is nil; update the function (buildAWSRootVolume) to set rootVolume.Type to capiaws.VolumeType(EC2VolumeDefaultType) in the nil RootVolume path (or initialize Type when creating rootVolume) so the default EBS type is applied consistently even when RootVolume is omitted.

control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go-799-816 (1)
799-816: ⚠️ Potential issue | 🟠 Major

Availability stops checking component health after the first success.

Once HostedControlPlaneAvailable becomes true, controlPlaneComponentsAvailable is never called again. A later component regression will therefore keep Status.Ready/HostedControlPlaneAvailable green even though the control plane is no longer fully available.

Suggested fix
-	availableCondition := meta.FindStatusCondition(hostedControlPlane.Status.Conditions, string(hyperv1.HostedControlPlaneAvailable))
-	alreadyAvailable := availableCondition != nil && availableCondition.Status == metav1.ConditionTrue
-	if !alreadyAvailable {
-		componentsNotAvailableMsg, componentsErr = r.controlPlaneComponentsAvailable(ctx, hostedControlPlane)
-	}
+	componentsNotAvailableMsg, componentsErr = r.controlPlaneComponentsAvailable(ctx, hostedControlPlane)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go` around lines 799 - 816, The code skips calling controlPlaneComponentsAvailable once HostedControlPlaneAvailable is true, so regressions aren't detected; always invoke controlPlaneComponentsAvailable (instead of conditioning it on alreadyAvailable) and feed its returned componentsNotAvailableMsg and componentsErr into reconcileAvailabilityStatus (the functions and symbols to update are controlPlaneComponentsAvailable, reconcileAvailabilityStatus, and the use of hostedControlPlane.Status.Conditions/HostedControlPlaneAvailable) so component health is rechecked every reconciliation and the HostedControlPlaneAvailable/Status.Ready can transition back to false when needed.

test/e2e/util/reqserving/verifycp.go-238-267 (1)
238-267: ⚠️ Potential issue | 🟠 Major

The timeout fallback can silently pass or panic on missing requests.

If the expected container is absent, this loop appends nothing and verifyResourceRequests can return nil after a timeout. Also, unlike containerResourcesMatch, the fallback path calls Memory().Cmp(...)/Cpu().Cmp(...) without nil checks, so a missing request can panic instead of producing a useful assertion error.

Suggested fix
 	for _, effect := range effects.ResourceRequests {
 		workloadKind := "deployment"
 		if effect.DeploymentName == "etcd" {
 			workloadKind = "statefulset"
 		}
 		containers, getErr := getContainersForEffect(ctx, client, cpNamespace, effect)
 		if getErr != nil {
 			errs = append(errs, fmt.Errorf("failed to get %s %s: %w", workloadKind, effect.DeploymentName, getErr))
 			continue
 		}
+		foundContainer := false
 		for _, container := range containers {
 			if container.Name != effect.ContainerName {
 				continue
 			}
+			foundContainer = true
 			if effect.Memory != nil {
-				if container.Resources.Requests.Memory().Cmp(*effect.Memory) != 0 {
+				if mr := container.Resources.Requests.Memory(); mr == nil || mr.Cmp(*effect.Memory) != 0 {
 					errs = append(errs, fmt.Errorf("%s %s has memory request set to %v, expected %v", workloadKind, effect.DeploymentName, container.Resources.Requests.Memory(), *effect.Memory))
 				}
 			}
 			if effect.CPU != nil {
-				if container.Resources.Requests.Cpu().Cmp(*effect.CPU) != 0 {
+				if cr := container.Resources.Requests.Cpu(); cr == nil || cr.Cmp(*effect.CPU) != 0 {
 					errs = append(errs, fmt.Errorf("%s %s has cpu request set to %v, expected %v", workloadKind, effect.DeploymentName, container.Resources.Requests.Cpu(), *effect.CPU))
 				}
 			}
 			break
 		}
+		if !foundContainer {
+			errs = append(errs, fmt.Errorf("%s %s does not contain container %s", workloadKind, effect.DeploymentName, effect.ContainerName))
+		}
 	}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/util/reqserving/verifycp.go` around lines 238 - 267, reportResourceMismatches currently can silently succeed or panic when the expected container or resource requests are missing; update reportResourceMismatches to (1) detect when no matching container was found for an effect and append a descriptive error (so verifyResourceRequests won’t return nil on timeout), and (2) guard access to container.Resources.Requests memory/cpu before calling Cmp — if a request is absent, append a clear error instead of calling Memory().Cmp or Cpu().Cmp. Reference the reportResourceMismatches function and the effect loop (effects.ResourceRequests / effect.ContainerName) and make the error messages mirror the style used by containerResourcesMatch so assertions remain informative.
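The nil-guard pattern the fix relies on can be shown without the Kubernetes types; *int64 stands in for *resource.Quantity and checkRequest is a hypothetical helper, not the test code itself:

```go
package main

import (
	"errors"
	"fmt"
)

// checkRequest compares an actual (possibly missing) resource request
// against an expected value, returning an error instead of panicking
// when the request is absent.
func checkRequest(kind, name string, actual *int64, expected int64) error {
	if actual == nil {
		return fmt.Errorf("%s %s has no request set, expected %d", kind, name, expected)
	}
	if *actual != expected {
		return fmt.Errorf("%s %s has request %d, expected %d", kind, name, *actual, expected)
	}
	return nil
}

func main() {
	var errs []error
	mem := int64(512)
	// Mismatched value: recorded as an error, not ignored.
	errs = append(errs, checkRequest("deployment", "etcd", &mem, 1024))
	// Missing request: an assertion error, not a nil-pointer panic.
	errs = append(errs, checkRequest("deployment", "etcd", nil, 1024))
	fmt.Println(errors.Join(errs...) != nil)
}
```

Accumulating errors this way keeps the timeout fallback honest: an absent container or request surfaces as a failure message rather than a silent pass or a panic.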
🟡 Minor comments (10)
cmd/infra/aws/create.go-259-259 (1)
259-259: ⚠️ Potential issue | 🟡 Minor

Remove endpointRouteTableIds from the return signature of createPerZoneResources or use it in the caller.

The endpointRouteTableIds return value is computed and returned by createPerZoneResources (line 382) but immediately discarded by the caller at line 259. While the value is used within the helper function itself (passed to CreateVPCS3Endpoint), there's no reason to return it if the caller doesn't need it. Either remove it from the return signature to clean up the API, or, if there's a specific reason for returning it, use it in the CreateInfra method.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/infra/aws/create.go` at line 259, The helper createPerZoneResources currently builds and returns endpointRouteTableIds which the caller in CreateInfra ignores; remove endpointRouteTableIds from createPerZoneResources' return signature and from its internal returned tuple, and update the CreateInfra call site to stop discarding that value (remove the placeholder assignment). Ensure CreateVPCS3Endpoint still receives the routeTableIds inside createPerZoneResources where it’s needed and update any related variable names/returns accordingly so no unused value is returned.

hypershift-operator/controllers/nodepool/metrics/metrics.go-430-433 (1)
430-433: ⚠️ Potential issue | 🟡 Minor

Same nil pointer risk as collectMachineSetReplicas.

MachineDeployment.Spec.Replicas is also a pointer; apply the same guard.

🛡️ Suggested fix
 	for k := range machineDeployments.Items {
 		md := &machineDeployments.Items[k]
-		result[md.Namespace+"/"+md.Name] = *md.Spec.Replicas
+		replicas := int32(0)
+		if md.Spec.Replicas != nil {
+			replicas = *md.Spec.Replicas
+		}
+		result[md.Namespace+"/"+md.Name] = replicas
 	}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/nodepool/metrics/metrics.go` around lines 430 - 433, The loop that sets result[md.Namespace+"/"+md.Name] = *md.Spec.Replicas dereferences md.Spec.Replicas without checking for nil (same risk as in collectMachineSetReplicas); update the loop in metrics.go (the machineDeployments iteration where md := &machineDeployments.Items[k]) to guard against nil by checking md.Spec != nil && md.Spec.Replicas != nil before dereferencing, and only assign to the result map when the pointer is non-nil (otherwise skip or set a safe default).

hypershift-operator/controllers/nodepool/metrics/metrics.go-417-420 (1)
417-420: ⚠️ Potential issue | 🟡 Minor

Potential nil pointer dereference on Spec.Replicas.

MachineSet.Spec.Replicas is a pointer that could be nil (CAPI defaults it but doesn't guarantee it). Dereferencing without a nil check may cause a panic.

🛡️ Suggested fix
 	for k := range machineSets.Items {
 		machineSet := &machineSets.Items[k]
-		result[machineSet.Namespace+"/"+machineSet.Name] = *machineSet.Spec.Replicas
+		replicas := int32(0)
+		if machineSet.Spec.Replicas != nil {
+			replicas = *machineSet.Spec.Replicas
+		}
+		result[machineSet.Namespace+"/"+machineSet.Name] = replicas
 	}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/nodepool/metrics/metrics.go` around lines 417 - 420, The loop dereferences MachineSet.Spec.Replicas without a nil check which can panic; update the loop that iterates machineSets.Items (the block referencing machineSet := &machineSets.Items[k] and result[...]=*machineSet.Spec.Replicas) to safely handle nil by first checking machineSet.Spec and machineSet.Spec.Replicas, defaulting to a safe value (e.g., int32(0) or another appropriate default) when nil, and then assigning that value into the result map.

test/e2e/util/dump/journals.go-274-277 (1)
274-277: ⚠️ Potential issue | 🟡 Minor

Close dumpJournalsLog to avoid descriptor leaks.

Line [274] opens a file and never closes it. Add a deferred close after successful creation.
💡 Proposed fix
 	dumpJournalsLog, err := os.Create(dumpJournalsLogFile)
 	if err != nil {
 		return fmt.Errorf("failed to create dumpJournals log: %w", err)
 	}
+	defer dumpJournalsLog.Close()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/util/dump/journals.go` around lines 274 - 277, The created file handle dumpJournalsLog opened with os.Create is never closed, causing a file descriptor leak; after the successful creation in the block that sets dumpJournalsLog, add a deferred close (e.g., defer dumpJournalsLog.Close()) immediately after the os.Create succeeds so the file is closed when the surrounding function returns (optionally handle/ignore the Close() error).

hypershift-operator/controllers/hostedcluster/network_policies.go-244-249 (1)
244-249: ⚠️ Potential issue | 🟡 Minor

Duplicate error messages hinder debugging.

Both line 242 and line 248 use the same error message "failed to reconcile konnectivity nodeport network policy". When this error surfaces, it's unclear whether NodePortKonnectivityNetworkPolicy or NodePortKonnectivityKASNetworkPolicy failed.

Proposed fix for distinct error messages
 	policy = networkpolicy.NodePortKonnectivityKASNetworkPolicy(controlPlaneNamespaceName)
 	if _, err := createOrUpdate(ctx, r.Client, policy, func() error {
 		return reconcileNodePortKonnectivityKASNetworkPolicy(policy, hcluster)
 	}); err != nil {
-		return fmt.Errorf("failed to reconcile konnectivity nodeport network policy: %w", err)
+		return fmt.Errorf("failed to reconcile konnectivity KAS nodeport network policy: %w", err)
 	}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around lines 244 - 249, The error message is ambiguous because both NodePortKonnectivityNetworkPolicy and NodePortKonnectivityKASNetworkPolicy use the same text; update the createOrUpdate error return in the block that calls NodePortKonnectivityKASNetworkPolicy and reconcileNodePortKonnectivityKASNetworkPolicy to use a distinct message (e.g., include "KAS" or "konnectivity KAS nodeport network policy") so it clearly identifies the failing resource when createOrUpdate (and reconcileNodePortKonnectivityKASNetworkPolicy) errors.

ignition-server/controllers/local_ignitionprovider.go-566-580 (1)
566-580: ⚠️ Potential issue | 🟡 Minor

Close res.Body before retrying non-200 responses.

On the non-200 OK path, the function returns before the deferred close is registered, so each retry leaks a response body/connection. Move the close immediately after httpclient.Do(req) succeeds, then handle the status code.

Suggested fix
 	res, err := httpclient.Do(req)
 	if err != nil {
 		log.Error(err, "mcs request failed")
 		return false, nil
 	}
+	defer func() {
+		if err := res.Body.Close(); err != nil {
+			log.Error(err, "failed to close mcs response body")
+		}
+	}()
 	if res.StatusCode != http.StatusOK {
-		log.Error(err, "mcs returned unexpected response code", "code", res.StatusCode)
+		log.Info("mcs returned unexpected response code", "code", res.StatusCode)
 		return false, nil
 	}
-
-	defer func() {
-		if err := res.Body.Close(); err != nil {
-			log.Error(err, "failed to close mcs response body")
-		}
-	}()

 	body, err := io.ReadAll(res.Body)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ignition-server/controllers/local_ignitionprovider.go` around lines 566 - 580, The HTTP response body from httpclient.Do(req) is leaked on non-200 paths because the function returns before res.Body.Close() is registered; move the body close immediately after a successful httpclient.Do(req) (e.g., call defer or explicit Close right after res is non-nil) so the body is closed on all return paths, and update the non-200 log in the status-checking branch to not reference the previous err variable (use res.StatusCode and include the response or error text if needed).

control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go-530-549 (1)
530-549: ⚠️ Potential issue | 🟡 Minor

Set the recovery condition Reason inside the finished/not-finished branches.

Reason is initialized to hyperv1.RecoveryFinishedReason before finished is known, so the condition can report a “finished” reason while Line 541 still marks recovery incomplete and requeues. That makes the status misleading for consumers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go` around lines 530 - 549, The condition Reason is set before knowing finished state which can misreport completion; move or overwrite the Reason inside the two branches so the condition's Reason reflects the actual state: in reconcile logic that checks finished from reconcileRestoredCluster(ctx, hcp), set condition.Reason = hyperv1.RecoveryInProgressReason and condition.Status = metav1.ConditionFalse (with an appropriate Message) in the !finished branch, and set condition.Reason = hyperv1.RecoveryFinishedReason and condition.Status = metav1.ConditionTrue (with the finished Message) in the else branch; update the same metav1.Condition object (named condition) so only the correct Reason/Status/Message are emitted for hyperv1.HostedClusterRestoredFromBackup.

control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go-2728-2730 (1)
2728-2730: ⚠️ Potential issue | 🟡 Minor

This helper no longer verifies that the script still PATCHes the ConfigMap.

Line 2728 uses &&, but "curl" is already asserted above. If PATCH disappears from the script while curl remains, this check still passes and the regression goes unnoticed.

Suggested fix
-	if !strings.Contains(script, "curl") && !strings.Contains(script, "PATCH") {
+	if !strings.Contains(script, "PATCH") {
 		t.Error("Check script should PATCH the ConfigMap on success")
 	}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go` around lines 2728 - 2730, The current test uses if !strings.Contains(script, "curl") && !strings.Contains(script, "PATCH") which only fails if BOTH "curl" and "PATCH" are missing; since "curl" is already asserted elsewhere the test will not catch a missing "PATCH". Update the condition to if !strings.Contains(script, "curl") || !strings.Contains(script, "PATCH") or, better, add a dedicated assertion that strings.Contains(script, "PATCH") to ensure the script explicitly contains the PATCH operation (referencing the test variable script in resources_test.go).

hypershift-operator/controllers/platform/aws/controller.go-636-643 (1)
636-643: ⚠️ Potential issue | 🟡 Minor

Stop before the success-status path when the current CR was just deleted.

When Line 641 returns after deleting the obsolete AWSEndpointService, the outer Reconcile still sets AWSEndpointServiceAvailable=True and may try to update status on a resource that is already terminating. Bubble this state up so the caller exits before the “healthy” status update.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/platform/aws/controller.go` around lines 636 - 643, When deleteObsoleteEndpointService reports it deleted the CR (done == true), instead of returning nil from reconcileAWSEndpointServiceStatus, return a sentinel error (e.g. var errAWSEndpointServiceDeleted = errors.New("aws endpoint service deleted")) and update the outer Reconcile to check for that sentinel and exit immediately without setting AWSEndpointServiceAvailable=True or updating status; modify reconcileAWSEndpointServiceStatus to return errAWSEndpointServiceDeleted when done==true, add the sentinel error definition near the top of the file, and update the caller (Reconcile) to handle that sentinel by returning early.

test/e2e/util/reqserving/verifyenv.go-114-117 (1)
114-117: ⚠️ Potential issue | 🟡 Minor

Verify if zone count is derived from configuration or should be hardcoded.

The check len(actualCount) < 3 expects exactly 3 zones with non-request serving nodes. While setup.go validates that the cluster has at least 3 AWS availability zones and explicitly uses only the first 3 (commonZones = commonZones[:3]), this verification function hardcodes the requirement without sourcing it from the ClusterSizingConfiguration object it receives. If the zone configuration changes in the future, this check won't adapt.

Consider deriving the expected zone count from the zones actually configured for non-request serving machinesets, or document why this must remain hardcoded to 3.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/util/reqserving/verifyenv.go` around lines 114 - 117, The hardcoded check using len(actualCount) < 3 should instead derive the expected zone count from the cluster sizing configuration passed into the verification routine: read the configured non-request-serving zones (the same source used in setup.go which slices commonZones to 3) from the ClusterSizingConfiguration object supplied to the verify function (or compute expectedZones by inspecting the non-request-serving machineset definitions) and replace the literal 3 with len(expectedZones); if the codebase intentionally limits to the first 3 zones, add a clear comment referencing setup.go's commonZones = commonZones[:3] and derive expectedZones by applying the same slicing logic so the verification stays consistent with setup.
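Deriving the requirement from configuration instead of hardcoding 3 is a small amount of code; expectedZoneCount below is a hypothetical helper (a set built over whatever zones the machinesets are actually configured with), not the test's real API:

```go
package main

import "fmt"

// expectedZoneCount derives the zone requirement from the zones actually
// configured for the machinesets, deduplicating repeats, rather than
// hardcoding the literal 3.
func expectedZoneCount(machineSetZones []string) int {
	seen := map[string]struct{}{}
	for _, z := range machineSetZones {
		seen[z] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Mirrors setup.go's commonZones[:3] selection; one zone repeats.
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c", "us-east-1a"}
	fmt.Println(expectedZoneCount(zones))
}
```

The verification then compares len(actualCount) against this derived value, so the check adapts automatically if the setup ever configures a different number of zones.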
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 04d271b2-18e0-4ed4-a175-ace4e25c63a0
📒 Files selected for processing (38)
.golangci.yml
cmd/cluster/core/create.go
cmd/fix/dr_oidc_iam.go
cmd/infra/aws/create.go
cmd/infra/azure/create.go
cmd/install/assets/hypershift_operator.go
cmd/install/install.go
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
control-plane-operator/controllers/azureprivatelinkservice/controller.go
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
control-plane-operator/main.go
hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
hypershift-operator/controllers/etcdbackup/reconciler.go
hypershift-operator/controllers/hostedcluster/etcd_recovery.go
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
hypershift-operator/controllers/hostedcluster/metrics/metrics.go
hypershift-operator/controllers/hostedcluster/network_policies.go
hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go
hypershift-operator/controllers/nodepool/aws.go
hypershift-operator/controllers/nodepool/capi.go
hypershift-operator/controllers/nodepool/metrics/metrics.go
hypershift-operator/controllers/nodepool/nodepool_controller.go
hypershift-operator/controllers/platform/aws/controller.go
hypershift-operator/controllers/scheduler/aws/autoscaler.go
hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes.go
hypershift-operator/main.go
ignition-server/controllers/local_ignitionprovider.go
test/e2e/util/dump/journals.go
test/e2e/util/reqserving/verifycp.go
test/e2e/util/reqserving/verifyenv.go
test/e2e/util/util.go
Actionable comments posted: 17
🧹 Nitpick comments (10)
control-plane-operator/controllers/azureprivatelinkservice/controller_test.go (1)
1704-1759: Consider adding t.Parallel() for consistency.

The other new tests in this PR (TestBaseDomainVNetLinkName, TestDNSZoneConfigErrMsgQualifier) use t.Parallel() at both the test and subtest level. This test and the following ones (TestDeleteBaseDomainARecords, TestMapHCPToAzurePLS) do not. Since each subtest creates its own fake client, parallel execution should be safe.

♻️ Suggested change
 func TestRecordNamesForCR(t *testing.T) {
+	t.Parallel()
 	tests := []struct {

And in the subtest:
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
 			g := NewGomegaWithT(t)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go` around lines 1704 - 1759, Add parallelization to the test TestRecordNamesForCR by calling t.Parallel() at the start of the top-level test and inside each t.Run subtest; update the TestRecordNamesForCR function and each anonymous subtest closure so they invoke t.Parallel(), keeping the existing setup that creates a fresh fake client and uses AzurePrivateLinkServiceReconciler.recordNamesForCR to ensure safe concurrent execution.

hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)
671-700: Extract shared test fixtures to reduce repeated setup drift.

Scheme/client/reconciler/callback setup is duplicated across many new tests. A small fixture helper would make this file easier to maintain and less error-prone.
Also applies to: 776-807, 901-940, 1049-1084, 1128-1159, 1203-1234, 1278-1309
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go` around lines 671 - 700, Extract the repeated test setup into a small helper that builds the scheme, registers corev1 and configv1, constructs the fake client with optional objects, and returns the ready-to-use HostedClusterReconciler (and optionally fakeClient) so individual tests can call it instead of duplicating the block in tests that exercise HostedClusterReconciler.getManagementClusterNetwork; update the tests (the blocks around getManagementClusterNetwork) to call the helper and pass tc.objects and tc.capabilities, keeping assertions unchanged.

hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go (1)
22-30: expectError is currently unexercised in this test table.

No test case sets expectError: true, so that branch is dead and can mislead readers about coverage. Either add an injected client-error case or remove the field/branch for clarity.

✂️ Minimal cleanup if you keep only non-error-path cases
 type testCase struct {
 	name               string
 	job                *batchv1.Job
 	jobExists          bool
 	expectedExists     bool
 	expectedFinished   bool
 	expectedSuccessful bool
-	expectError        bool
 }
 ...
-	if tt.expectError {
-		g.Expect(err).To(HaveOccurred())
-	} else {
-		g.Expect(err).ToNot(HaveOccurred())
-		g.Expect(status).ToNot(BeNil())
-		g.Expect(status.exists).To(Equal(tt.expectedExists))
-		g.Expect(status.finished).To(Equal(tt.expectedFinished))
-		g.Expect(status.successful).To(Equal(tt.expectedSuccessful))
-	}
+	g.Expect(err).ToNot(HaveOccurred())
+	g.Expect(status).ToNot(BeNil())
+	g.Expect(status.exists).To(Equal(tt.expectedExists))
+	g.Expect(status.finished).To(Equal(tt.expectedFinished))
+	g.Expect(status.successful).To(Equal(tt.expectedSuccessful))

Also applies to: 176-184
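With the dead expectError branch removed, the table-driven shape reduces to something like this sketch (all names here are hypothetical stand-ins, not the real test's types):

```go
package main

import "fmt"

// testCase keeps only fields that every case actually sets, so no
// assertion branch in the loop body is dead code.
type testCase struct {
	name     string
	jobDone  bool
	expected bool
}

// finished is a trivial stand-in for the status derivation under test.
func finished(jobDone bool) bool { return jobDone }

func main() {
	cases := []testCase{
		{name: "running job", jobDone: false, expected: false},
		{name: "completed job", jobDone: true, expected: true},
	}
	for _, tc := range cases {
		got := finished(tc.jobDone)
		fmt.Println(tc.name, got == tc.expected)
	}
}
```

Every field in the table is exercised by every case, which is the property the cleanup above restores.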
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go` around lines 22 - 30, The table-driven test in etcd_recovery_test.go declares an expectError field that is never set, leaving the error branch untested; either add a test case that sets expectError: true and injects a client error (e.g., simulate API error via the fake client used by the test) to exercise the error path, or remove the expectError field and any assertions/branches that reference it from the tests slice and the test loop; update the tests variable and the corresponding loop logic (the table and any helper that checks expected error behavior) so the code and test cases are consistent.

control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go (1)
924-934: Consider parameterizing the expected values in assertions.

The assertions hardcode "test-infra" instead of using tt.infraID. If additional test cases with different infraIDs are added, the assertions would fail unexpectedly or need to be duplicated.

♻️ Suggested improvement
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
 			g := NewGomegaWithT(t)
 			filters := vpcEndpointSecurityGroupFilter(tt.infraID, tt.endpointName)
 			g.Expect(filters).To(HaveLen(2))
-			g.Expect(aws.ToString(filters[0].Name)).To(Equal("tag:kubernetes.io/cluster/test-infra"))
+			g.Expect(aws.ToString(filters[0].Name)).To(Equal(fmt.Sprintf("tag:kubernetes.io/cluster/%s", tt.infraID)))
 			g.Expect(filters[0].Values).To(Equal([]string{"owned"}))
 			g.Expect(aws.ToString(filters[1].Name)).To(Equal("tag:Name"))
-			g.Expect(filters[1].Values).To(Equal([]string{"test-infra-vpce-kube-apiserver-private"}))
+			g.Expect(filters[1].Values).To(Equal([]string{vpcEndpointSecurityGroupName(tt.infraID, tt.endpointName)}))
 		})
 	}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go` around lines 924 - 934, Update the assertions in the test loop to use tt.infraID and tt.endpointName instead of hardcoded "test-infra" and the combined name; specifically assert aws.ToString(filters[0].Name) equals "tag:kubernetes.io/cluster/<tt.infraID>" and assert filters[1].Values equals []string{tt.infraID + "-vpce-" + tt.endpointName} so the expectations follow the inputs used to create filters in vpcEndpointSecurityGroupFilter.

control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go (1)
261-761: Consider extracting a shared converter-test helper.

The converter tests repeat the same setup/assertion scaffold (build vmi, call converter, check err/login/challenge, cast provider). A small helper would reduce maintenance cost and make provider-specific expectations stand out.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go` around lines 261 - 761, Extract a reusable test helper (e.g., runIDPConverterTest) to factor the repeated scaffold used across TestConvertBasicAuthIDP, TestConvertGitHubIDP, TestConvertGitLabIDP, TestConvertGoogleIDP, TestConvertHTPasswdIDP, TestConvertKeystoneIDP, TestConvertLDAPIDP and TestConvertRequestHeaderIDP: the helper should create newTestVolumeMountInfo(), invoke the provided converter function (convertBasicAuthIDP / convertGitHubIDP / convertGitLabIDP / convertGoogleIDP / convertHTPasswdIDP / convertKeystoneIDP / convertLDAPIDP / convertRequestHeaderIDP) with the test config and index, assert expectation on error vs no error and on data.login/data.challenge, and return the resulting provider (typed as runtime.Object or interface{}) for per-case assertions; replace the duplicated scaffolding in each test with a single call to this helper and then only keep the provider-specific Expect checks (e.g., asserting URLs, CA paths, secret file names, Legacy flag, Attributes, etc.).

control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go (2)
909-917: Assert the full selector map here.
`findServiceForMonitor` returns the matched Service's selector from `scrape_config.go`, but this test only checks that a few keys exist. A regression that returned the wrong selector values, or even the `ServiceMonitor` selector instead of `svc.Spec.Selector`, would still pass as long as the key names matched. Please store an `expectedSelector` map in the table and compare it directly.
Suggested tightening
```diff
 tests := []struct {
 	name        string
 	objects     []runtime.Object
 	smName      string
 	selector    metav1.LabelSelector
 	expectedSvc string
 	expectedErr bool
-	expectedSelKeys []string
+	expectedSelector map[string]string
 }{
 	{
 		name: "When service exists with same name as ServiceMonitor, it should find it by direct lookup",
 		objects: []runtime.Object{
 			newService("kube-apiserver", namespace, "kube-apiserver", "client", 6443),
 		},
-		smName:          "kube-apiserver",
-		selector:        metav1.LabelSelector{MatchLabels: map[string]string{"app": "kube-apiserver"}},
-		expectedSvc:     "kube-apiserver",
-		expectedErr:     false,
-		expectedSelKeys: []string{"app"},
+		smName:           "kube-apiserver",
+		selector:         metav1.LabelSelector{MatchLabels: map[string]string{"app": "kube-apiserver"}},
+		expectedSvc:      "kube-apiserver",
+		expectedErr:      false,
+		expectedSelector: map[string]string{"app": "kube-apiserver"},
 	},
@@
 	} else {
 		g.Expect(err).ToNot(HaveOccurred())
 		g.Expect(svcName).To(Equal(tc.expectedSvc))
-		for _, key := range tc.expectedSelKeys {
-			g.Expect(podSelector).To(HaveKey(key))
-		}
+		g.Expect(podSelector).To(Equal(tc.expectedSelector))
 	}
 })
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go` around lines 909 - 917, The test for findServiceForMonitor currently only asserts presence of selector keys; add a full expectedSelector map field to each table entry and replace the partial key-only checks with an equality comparison against svc.Spec.Selector (or the returned selector) to ensure values and keys match exactly; update the table structs to include expectedSelector map[string]string, populate it for each case, and assert deep equality (not just key existence) for the selector returned by findServiceForMonitor and similarly update the related cases around the other test block.
981-1048: Add a case for matching on `TargetPort.String()`. The helper under test in `control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go:282-307` accepts both `p.Name == portName` and `p.TargetPort.String() == portName`, but this table only exercises the first form. A case where the service port is named `"client"` and the lookup asks for `"https"` would cover the second branch and better protect this refactor.
Suggested test case
```diff
 {
 	name: "When service has a named targetPort, it should resolve from a pod",
 	objects: []runtime.Object{
 		newServiceWithTargetPort("my-svc", namespace, "app", "client", 443, intstr.FromString("https")),
 		newPodWithLabels("my-pod", namespace, "https", 6443, map[string]string{"app": "app"}),
 	},
 	serviceName:  "my-svc",
 	portName:     "client",
 	podSelector:  map[string]string{"app": "app"},
 	expectedPort: 6443,
 	expectedErr:  false,
 },
+{
+	name: "When lookup matches the named targetPort, it should resolve from a pod",
+	objects: []runtime.Object{
+		newServiceWithTargetPort("my-svc", namespace, "app", "client", 443, intstr.FromString("https")),
+		newPodWithLabels("my-pod", namespace, "https", 6443, map[string]string{"app": "app"}),
+	},
+	serviceName:  "my-svc",
+	portName:     "https",
+	podSelector:  map[string]string{"app": "app"},
+	expectedPort: 6443,
+	expectedErr:  false,
+},
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go` around lines 981 - 1048, The table-driven tests do not cover the branch where a ServicePort is matched via p.TargetPort.String() (the helper in scrape_config.go checks both p.Name == portName and p.TargetPort.String() == portName), so add a test case to the tests slice that uses newServiceWithTargetPort to create a Service with Name "client" and TargetPort intstr.FromString("https"), add a corresponding pod via newPodWithLabels that has the target container port (e.g. 6443), set serviceName to "my-svc" and portName to "https", and expect expectedPort 6443 and expectedErr false to ensure the TargetPort.String() matching branch is exercised.
hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go (2)
2206-2209: Strengthen `TestClearTransientConditions` assertions. The current checks only validate non-nil outputs. This can miss regressions in condition type/status/reason/message/transition-time content.
Suggested assertion tightening
```diff
 } else {
 	g.Expect(result).ToNot(BeNil())
 	g.Expect(result.applyCfg).ToNot(BeNil())
 	g.Expect(result.applyCfg.Status).ToNot(BeNil())
+	g.Expect(result.applyCfg.Status.Conditions).To(HaveLen(2))
+	g.Expect(*result.applyCfg.Status.Conditions[0].Type).To(Equal(hypershiftv1beta1.ClusterSizeTransitionPending))
+	g.Expect(*result.applyCfg.Status.Conditions[0].Status).To(Equal(metav1.ConditionFalse))
+	g.Expect(*result.applyCfg.Status.Conditions[0].Reason).To(Equal("ClusterSizeTransitioned"))
+	g.Expect(*result.applyCfg.Status.Conditions[1].Type).To(Equal(hypershiftv1beta1.ClusterSizeTransitionRequired))
+	g.Expect(*result.applyCfg.Status.Conditions[1].Status).To(Equal(metav1.ConditionFalse))
+	g.Expect(*result.applyCfg.Status.Conditions[1].Reason).To(Equal(hypershiftv1beta1.AsExpectedReason))
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go` around lines 2206 - 2209, TestClearTransientConditions currently only asserts non-nil for result, result.applyCfg and result.applyCfg.Status; tighten it by asserting that result.applyCfg.Status.Conditions has the expected length and that each Condition's Type, Status, Reason, Message and LastTransitionTime (or its non-zero state) match the expected values for the test case. Locate the test's result variable and its applyCfg.Status.Conditions slice and replace the loose non-nil checks with explicit checks on Conditions[i].Type, Conditions[i].Status, Conditions[i].Reason, Conditions[i].Message and that Conditions[i].LastTransitionTime is set (or equals the expected timestamp) so regressions in condition content will fail the test.
1991: Use identity assertions for all `noOpAction` sentinel comparisons. At lines 1991, 1994, 2134, and 2137, `Equal(&noOpAction)` checks deep value equality, which can pass for non-sentinel zero-value actions. Since `noOpAction` is a sentinel, assert identity instead for both positive and negative comparisons.
Suggested changes
```diff
-g.Expect(result).To(Equal(&noOpAction))
+g.Expect(result).To(BeIdenticalTo(&noOpAction))
```
```diff
-g.Expect(result).ToNot(Equal(&noOpAction))
+g.Expect(result).ToNot(BeIdenticalTo(&noOpAction))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go` at line 1991, Replace deep-equality checks against the sentinel noOpAction with identity assertions: wherever the tests compare result to &noOpAction (e.g., g.Expect(result).To(Equal(&noOpAction))) use Gomega's identity matcher g.Expect(result).To(BeIdenticalTo(&noOpAction)), and for negative checks use g.Expect(result).ToNot(BeIdenticalTo(&noOpAction)) or g.Expect(result).To(Not(BeIdenticalTo(&noOpAction))). Update all occurrences that reference result and noOpAction so the sentinel is asserted by identity, not deep equality.
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go (1)
3250-3273: Platform-specific resource test cases need concrete object assertions
`TestReconcilePlatformSpecificResources` mostly validates "no errors" and has empty verification bodies for `None`/`KubeVirt`. That makes routing regressions easy to miss.
Suggested direction
```diff
 {
 	name:         "When platform is AWS, it should reconcile AWS identity webhook resources",
 	platformType: hyperv1.AWSPlatform,
 	verifyObjects: func(g *WithT, c client.Client) {
 		saList := &corev1.ServiceAccountList{}
 		err := c.List(ctx, saList)
 		g.Expect(err).ToNot(HaveOccurred())
+		g.Expect(saList.Items).ToNot(BeEmpty(), "expected AWS platform-specific resources to be created")
 	},
 },
 {
 	name:         "When platform is None, it should not create any platform-specific resources",
 	platformType: hyperv1.NonePlatform,
 	verifyObjects: func(g *WithT, c client.Client) {
-		// No platform-specific resources expected
+		saList := &corev1.ServiceAccountList{}
+		g.Expect(c.List(ctx, saList)).To(Succeed())
+		g.Expect(saList.Items).To(BeEmpty())
 	},
 },
 {
 	name:         "When platform is KubeVirt, it should not create AWS or Azure resources",
 	platformType: hyperv1.KubevirtPlatform,
 	verifyObjects: func(g *WithT, c client.Client) {
-		// KubeVirt is not in the switch; no platform resources should be created
+		saList := &corev1.ServiceAccountList{}
+		g.Expect(c.List(ctx, saList)).To(Succeed())
+		g.Expect(saList.Items).To(BeEmpty())
 	},
 },
```
Also applies to: 3309-3311
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go` around lines 3250 - 3273, Update TestReconcilePlatformSpecificResources' verifyObjects closures to perform concrete assertions instead of being empty: in the AWS case (the closure that currently creates saList) explicitly assert that expected AWS resources exist by listing and asserting non-zero results for types used by the identity webhook (e.g., corev1.ServiceAccountList and admissionregistrationv1.MutatingWebhookConfigurationList referenced in the test), and in the None and KubeVirt cases list those same resource types (e.g., ServiceAccountList and MutatingWebhookConfigurationList) and assert their counts are zero to ensure no platform-specific resources were created; modify the existing verifyObjects closures (the anonymous funcs passed into TestReconcilePlatformSpecificResources) to perform these positive/negative existence checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/cluster/core/create_test.go`:
- Around line 949-1024: The "no-op/default" test cases in TestApplyEtcdConfig
don't explicitly assert the cluster's untouched state; update the test to, when
tc.etcdStorageClass and tc.etcdStorageSize are empty, assert that
cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName remains nil
(or the original default) and that Size.String() equals
hyperv1.DefaultPersistentVolumeEtcdStorageSize.String() (i.e., the size was not
overwritten), and apply the same explicit assertions to the other similar tests
that exercise applyEtcdConfig (and related tests noted in the review) so
regressions that initialize or overwrite fields will fail the tests.
- Around line 921-924: Replace the test fixture string "key=val=ue" in the test
case named "When a value contains an equals sign, it should split only on the
first equals" so it no longer triggers codespell; update items from
[]string{"key=val=ue"} to a harmless alternate that still contains an equals
(for example []string{"key=value=extra"}) and adjust the expected map value from
map[string]string{"key":"val=ue"} to map[string]string{"key":"value=extra"} so
the parsing behavior tested remains identical.
In `@cmd/install/install_test.go`:
- Around line 1052-1094: TestSetupRBAC currently only asserts a minimum object
count via HaveLen which can miss missing admin RBAC resources; update the test
that calls setupRBAC (using Options.EnableAdminRBACGeneration and
AzurePLSManagedIdentityClientID) to explicitly assert the presence of the admin
RBAC objects when EnableAdminRBACGeneration is true (e.g., check for
Role/RoleBinding/ServiceAccount names like "hypershift-client" and
"hypershift-readers" in the returned objects), and change the length assertion
to a floor check (use BeNumerically(">=", tc.minObjectCount) or assert
len(objects) >= tc.minObjectCount) rather than HaveLen so additions won’t break
the test; also keep the existing SA annotation check for
AzurePLSManagedIdentityClientID.
- Around line 1005-1049: The table-driven TestSetupMonitoring currently ignores
expectSLOAlerts and expectDashboards and asserts exact length; update the test
(in TestSetupMonitoring) to 1) use a lower-bound smoke check
(g.Expect(len(objects)).To(BeNumerically(">=", tc.minResourceCount))) instead of
HaveLen, and 2) assert presence/absence of the specific monitoring resources
described by the table: when tc.expectSLOAlerts is true, verify setupMonitoring
returns at least one PrometheusRule-like object (check
object.GetObjectKind().GroupVersionKind().Kind == "PrometheusRule" or
reflect.Type/name contains "PrometheusRule"); when false, assert no such object
exists; when tc.expectDashboards is true, verify a dashboard-like resource is
present (e.g., a Dashboard CR or a ConfigMap whose Name/labels indicate a
dashboard for tc.opts.Namespace), and when false assert none; keep using
NewGomegaWithT and the setupMonitoring/Options values to locate
namespace-specific dashboard checks.
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go`:
- Around line 3980-4013: The test currently only asserts an error when the List
interceptor forces a failure; update the "When component list fails..." branch
to also fetch the HCP from the fake client after calling
reconcileControlPlaneVersionStatus and assert that the
HCP.Status.ControlPlaneVersion (and/or related partial-version fields) was
written (i.e., contains a non-empty Version or Partial=true/expected condition)
despite the error. Use the same client c (or c.Status()) to Get the latest
tc.hcp by namespaced name and add Gomega assertions that the status fields have
the expected partial values; keep these checks alongside the existing err
expectation in the test that calls reconcileControlPlaneVersionStatus.
In
`@control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go`:
- Around line 325-326: The test assertions in idp_convert_test.go are checking
for underscored helper names like "idp_cm_0_ca"/"idp_secret_0" but the
production helpers IDPVolumeMountInfo.ConfigMapPath and
IDPVolumeMountInfo.SecretPath generate hyphenated names ("idp-cm-0",
"idp-secret-0"), so update the expectations in the test (references to
provider.RemoteConnectionInfo.CA, provider.RemoteConnectionInfo.Secret, and any
similar assertions) to use the hyphenated forms (e.g., "idp-cm-0-ca",
"idp-secret-0") to match the values produced by
IDPVolumeMountInfo.ConfigMapPath/SecretPath; apply the same change to all other
similar assertions in this test file.
- Line 850: The test currently dereferences vmi.Volumes[0].Secret.DefaultMode
directly which can panic if nil; change the assertion to first assert the
pointer is non-nil (e.g.,
Expect(vmi.Volumes[0].Secret.DefaultMode).ToNot(BeNil())) and then check the
value, or use a pointer-aware matcher such as PointTo to assert the pointed
value equals int32(0640)); locate the assertion referencing
vmi.Volumes[0].Secret.DefaultMode in the idp_convert_test.go and replace the
direct dereference with the safer two-step or PointTo-style assertion.
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go`:
- Around line 292-327: The test checks CA mount path but misses TLS cert/key
paths; update the test in convertBasicAuthIDP table run to also assert that
provider.RemoteConnectionInfo.CertFile and provider.RemoteConnectionInfo.KeyFile
(from convertBasicAuthIDP) contain the expected mounted secret names when
tc.config.BasicAuth.TLSClientCert/TLSClientKey are set (e.g. contain substrings
like "idp_secret_0_cert" and "idp_secret_0_key" or the naming your mount logic
uses). Locate the assertions after provider :=
data.provider.(*osinv1.BasicAuthPasswordIdentityProvider) and add checks similar
to the existing CA check to validate CertFile and KeyFile mount path wiring.
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`:
- Around line 2732-2734: The test currently allows a script that contains "curl"
but lacks "PATCH" to pass because it uses a logical AND; update the assertion in
resources_test.go so it fails if either token is missing by changing the
condition that checks the variable "script" to use a logical OR (i.e., assert
that both "curl" and "PATCH" are present), and keep the same error message
"Check script should PATCH the ConfigMap on success".
- Around line 3197-3234: When tt.expectCloudCleanup is true, add a positive
assertion after calling r.reconcileDeletion that the cloud-cleanup path actually
ran by asserting the returned result signals a requeue (e.g., check
result.RequeueAfter is non-zero/greater than 0); locate the block that creates
kasDeployment and the call to r.reconcileDeletion and insert the assertion about
result.RequeueAfter there so regressions to the cleanup gate are caught.
- Around line 3216-3230: The test currently only checks VAP deletion conditioned
on tt.expectVAPDeleted, which ignores tt.expectVAPBDeleted; update the
assertions to independently verify the binding by checking
manifests.ValidatingAdmissionPolicyBinding(fmt.Sprintf("%s-binding",
registry.AdmissionPolicyNameManagementState)) against guestClient.Get depending
on tt.expectVAPBDeleted (i.e., assert IsNotFound when tt.expectVAPBDeleted is
true), and when neither expectVAPDeleted nor expectVAPBDeleted is true (and
len(tt.existingObjects) > 0) assert that the binding Get does not return an
error, mirroring the existing VAP existence checks.
In
`@hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go`:
- Line 1423: Update the test case name string that currently reads "When min
interval is unparseable, it should allow snapshot creation" to correct the
spelling to "unparsable" so codespell no longer flags it; locate the test case's
name field in snapshot_controller_test.go (the test case struct/name entry) and
replace "unparseable" with "unparsable".
In `@hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go`:
- Around line 534-536: The RoleBinding assertions currently check
roleBinding.RoleRef.Kind and .Name but omit the API group; update the test in
etcd_recovery_test.go to also assert roleBinding.RoleRef.APIGroup equals the
expected API group (e.g. "rbac.authorization.k8s.io") so the RoleRef contract is
fully validated alongside the existing checks for roleBinding.RoleRef.Kind,
roleBinding.RoleRef.Name and roleBinding.Subjects length.
- Around line 517-519: The test currently asserts role.Rules[1].Resources and
Verbs for the second RBAC rule (statefulsets) but misses validating APIGroups;
update the assertion for role.Rules[1] to also assert APIGroups contains the
correct API group for statefulsets (e.g.,
g.Expect(role.Rules[1].APIGroups).To(ConsistOf("apps"))) so the rule is fully
validated alongside the existing Resources and Verbs checks.
- Around line 490-497: The test currently only asserts presence of
EtcdRecoveryActive when tt.expectCondition is true; add a negative assertion for
the false branch by fetching the updated HostedCluster (same pattern using
fakeClient.Get and updatedHC := &hyperv1.HostedCluster{}), then call
findCondition(updatedHC.Status.Conditions, string(hyperv1.EtcdRecoveryActive))
and assert that it is nil when tt.expectCondition is false so the test fails if
the condition is erroneously present.
In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`:
- Around line 788-796: The createOrUpdate test stub (the upsert.CreateOrUpdateFN
assigned to createOrUpdate in network_policies_test.go) currently treats
non-*networkingv1.NetworkPolicy objects as success and swallows unexpected
types; update the stub so it validates the obj type, returns a clear error (and
controllerutil.OperationResultNone) when obj is not a
*networkingv1.NetworkPolicy, and keep the existing behavior of calling the
mutate function, recording into createdPolicies (or the equivalent map) and
returning OperationResultCreated only for valid NetworkPolicy objects; apply the
same change to each identical stub instance (the ones creating createOrUpdate
around createdPolicies at the listed locations).
---
Nitpick comments:
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go`:
- Around line 924-934: Update the assertions in the test loop to use tt.infraID
and tt.endpointName instead of hardcoded "test-infra" and the combined name;
specifically assert aws.ToString(filters[0].Name) equals
"tag:kubernetes.io/cluster/<tt.infraID>" and assert filters[1].Values equals
[]string{tt.infraID + "-vpce-" + tt.endpointName} so the expectations follow the
inputs used to create filters in vpcEndpointSecurityGroupFilter.
In
`@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go`:
- Around line 1704-1759: Add parallelization to the test TestRecordNamesForCR by
calling t.Parallel() at the start of the top-level test and inside each t.Run
subtest; update the TestRecordNamesForCR function and each anonymous subtest
closure so they invoke t.Parallel(), keeping the existing setup that creates a
fresh fake client and uses AzurePrivateLinkServiceReconciler.recordNamesForCR to
ensure safe concurrent execution.
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go`:
- Around line 909-917: The test for findServiceForMonitor currently only asserts
presence of selector keys; add a full expectedSelector map field to each table
entry and replace the partial key-only checks with an equality comparison
against svc.Spec.Selector (or the returned selector) to ensure values and keys
match exactly; update the table structs to include expectedSelector
map[string]string, populate it for each case, and assert deep equality (not just
key existence) for the selector returned by findServiceForMonitor and similarly
update the related cases around the other test block.
- Around line 981-1048: The table-driven tests do not cover the branch where a
ServicePort is matched via p.TargetPort.String() (the helper in scrape_config.go
checks both p.Name == portName and p.TargetPort.String() == portName), so add a
test case to the tests slice that uses newServiceWithTargetPort to create a
Service with Name "client" and TargetPort intstr.FromString("https"), add a
corresponding pod via newPodWithLabels that has the target container port (e.g.
6443), set serviceName to "my-svc" and portName to "https", and expect
expectedPort 6443 and expectedErr false to ensure the TargetPort.String()
matching branch is exercised.
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go`:
- Around line 261-761: Extract a reusable test helper (e.g.,
runIDPConverterTest) to factor the repeated scaffold used across
TestConvertBasicAuthIDP, TestConvertGitHubIDP, TestConvertGitLabIDP,
TestConvertGoogleIDP, TestConvertHTPasswdIDP, TestConvertKeystoneIDP,
TestConvertLDAPIDP and TestConvertRequestHeaderIDP: the helper should create
newTestVolumeMountInfo(), invoke the provided converter function
(convertBasicAuthIDP / convertGitHubIDP / convertGitLabIDP / convertGoogleIDP /
convertHTPasswdIDP / convertKeystoneIDP / convertLDAPIDP /
convertRequestHeaderIDP) with the test config and index, assert expectation on
error vs no error and on data.login/data.challenge, and return the resulting
provider (typed as runtime.Object or interface{}) for per-case assertions;
replace the duplicated scaffolding in each test with a single call to this
helper and then only keep the provider-specific Expect checks (e.g., asserting
URLs, CA paths, secret file names, Legacy flag, Attributes, etc.).
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`:
- Around line 3250-3273: Update TestReconcilePlatformSpecificResources'
verifyObjects closures to perform concrete assertions instead of being empty: in
the AWS case (the closure that currently creates saList) explicitly assert that
expected AWS resources exist by listing and asserting non-zero results for types
used by the identity webhook (e.g., corev1.ServiceAccountList and
admissionregistrationv1.MutatingWebhookConfigurationList referenced in the
test), and in the None and KubeVirt cases list those same resource types (e.g.,
ServiceAccountList and MutatingWebhookConfigurationList) and assert their counts
are zero to ensure no platform-specific resources were created; modify the
existing verifyObjects closures (the anonymous funcs passed into
TestReconcilePlatformSpecificResources) to perform these positive/negative
existence checks.
In `@hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go`:
- Around line 22-30: The table-driven test in etcd_recovery_test.go declares an
expectError field that is never set, leaving the error branch untested; either
add a test case that sets expectError: true and injects a client error (e.g.,
simulate API error via the fake client used by the test) to exercise the error
path, or remove the expectError field and any assertions/branches that reference
it from the tests slice and the test loop; update the tests variable and the
corresponding loop logic (the table and any helper that checks expected error
behavior) so the code and test cases are consistent.
In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`:
- Around line 671-700: Extract the repeated test setup into a small helper that
builds the scheme, registers corev1 and configv1, constructs the fake client
with optional objects, and returns the ready-to-use HostedClusterReconciler (and
optionally fakeClient) so individual tests can call it instead of duplicating
the block in tests that exercise
HostedClusterReconciler.getManagementClusterNetwork; update the tests (the
blocks around getManagementClusterNetwork) to call the helper and pass
tc.objects and tc.capabilities, keeping assertions unchanged.
In
`@hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go`:
- Around line 2206-2209: TestClearTransientConditions currently only asserts
non-nil for result, result.applyCfg and result.applyCfg.Status; tighten it by
asserting that result.applyCfg.Status.Conditions has the expected length and
that each Condition's Type, Status, Reason, Message and LastTransitionTime (or
its non-zero state) match the expected values for the test case. Locate the
test's result variable and its applyCfg.Status.Conditions slice and replace the
loose non-nil checks with explicit checks on Conditions[i].Type,
Conditions[i].Status, Conditions[i].Reason, Conditions[i].Message and that
Conditions[i].LastTransitionTime is set (or equals the expected timestamp) so
regressions in condition content will fail the test.
- Line 1991: Replace deep-equality checks against the sentinel noOpAction with
identity assertions: wherever the tests compare result to &noOpAction (e.g.,
g.Expect(result).To(Equal(&noOpAction))) use Gomega's identity matcher
g.Expect(result).To(BeIdenticalTo(&noOpAction)), and for negative checks use
g.Expect(result).ToNot(BeIdenticalTo(&noOpAction)) or
g.Expect(result).To(Not(BeIdenticalTo(&noOpAction))). Update all occurrences
that reference result and noOpAction so the sentinel is asserted by identity,
not deep equality.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 56392441-83ef-4758-9a14-1ab2d72d0d01
📒 Files selected for processing (21)
- cmd/cluster/core/create_test.go
- cmd/install/assets/hypershift_operator_test.go
- cmd/install/install_test.go
- control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
- control-plane-operator/controllers/azureprivatelinkservice/controller_test.go
- control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
- control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go
- control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
- control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go
- control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
- hypershift-operator/controllers/etcdbackup/reconciler_test.go
- hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go
- hypershift-operator/controllers/hostedcluster/network_policies_test.go
- hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
- hypershift-operator/controllers/nodepool/aws_test.go
- hypershift-operator/controllers/nodepool/capi_test.go
- hypershift-operator/controllers/platform/aws/controller_test.go
- hypershift-operator/controllers/scheduler/aws/autoscaler_test.go
- hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes_test.go
- ignition-server/controllers/local_ignitionprovider_test.go
@bryan-cox: This pull request references CNTRLPLANE-3324 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Actionable comments posted: 1
♻️ Duplicate comments (1)
cmd/cluster/core/create_test.go (1)
1014-1020: ⚠️ Potential issue | 🟡 Minor
No-op/default rows still don't assert untouched state.
The "leave defaults unchanged / should not modify the cluster" cases still only assert positive paths. Please add explicit `else` assertions so unintended initialization/overwrites fail tests.
🧪 Minimal tightening
```diff
@@
 } else {
 	g.Expect(err).NotTo(HaveOccurred())
 	if tc.expectStorageClass != nil {
 		g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName).To(Equal(tc.expectStorageClass))
+	} else {
+		g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName).To(BeNil())
 	}
 	if tc.expectSizeString != "" {
 		g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.Size.String()).To(Equal(tc.expectSizeString))
+	} else {
+		g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.Size.String()).
+			To(Equal(hyperv1.DefaultPersistentVolumeEtcdStorageSize.String()))
 	}
 }
@@
 if tc.expectOperatorHub {
 	g.Expect(cluster.Spec.Configuration.OperatorHub).NotTo(BeNil())
 	g.Expect(cluster.Spec.Configuration.OperatorHub.DisableAllDefaultSources).To(BeTrue())
+} else {
+	g.Expect(cluster.Spec.Configuration.OperatorHub).To(BeNil())
 }
 if tc.expectCatalogPlacement != "" {
 	g.Expect(cluster.Spec.OLMCatalogPlacement).To(Equal(tc.expectCatalogPlacement))
+} else {
+	g.Expect(cluster.Spec.OLMCatalogPlacement).To(BeEmpty())
 }
@@
 if tc.expectNodeSelector != nil {
 	g.Expect(cluster.Spec.NodeSelector).To(Equal(tc.expectNodeSelector))
+} else {
+	g.Expect(cluster.Spec.NodeSelector).To(BeNil())
 }
 if tc.expectLabels != nil {
 	g.Expect(cluster.Spec.Labels).To(Equal(tc.expectLabels))
+} else {
+	g.Expect(cluster.Spec.Labels).To(BeNil())
 }
```
Also applies to: 1135-1141, 1332-1337
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cluster/core/create_test.go` around lines 1014 - 1020, The test currently only asserts fields when tc.expectStorageClass or tc.expectSizeString are set; add explicit else branches so the "no-op/default" cases fail on unintended changes: after the existing if blocks for cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName and Size.String(), add corresponding else clauses that assert Storage.PersistentVolume.StorageClassName is nil/empty (e.g., Expect(...).To(BeNil() or Equal(""))) and that the Size is zero/empty (e.g., Expect(...).To(BeZero() or Equal("")) or Expect(...Size.IsZero()).To(BeTrue())). Do this around the same checks (the blocks referencing cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName and .Size.String()) and repeat the pattern for the other occurrences noted (the similar assertions at the other ranges).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go`:
- Around line 1369-1375: The test reads tt.annotations[...] after
getLastObservedRestartCount mutates pod.Annotations (they share the same map),
so the corrupted-annotation branch can be skipped; fix by capturing the original
annotation value before calling getLastObservedRestartCount (e.g., store
tt.annotations[lastObservedRestartCountAnnotation] or a copy of tt.annotations
into a local variable) and use that captured value in the conditional that
verifies updatedPod.Annotations[lastObservedRestartCountAnnotation] equals "0";
reference getLastObservedRestartCount, lastObservedRestartCountAnnotation,
tt.annotations, pod.Annotations and the fakeClient.Get/updatedPod check when
making the change.
---
Duplicate comments:
In `@cmd/cluster/core/create_test.go`:
- Around line 1014-1020: The test currently only asserts fields when
tc.expectStorageClass or tc.expectSizeString are set; add explicit else branches
so the "no-op/default" cases fail on unintended changes: after the existing if
blocks for cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName
and Size.String(), add corresponding else clauses that assert
Storage.PersistentVolume.StorageClassName is nil/empty (e.g.,
Expect(...).To(BeNil() or Equal(""))) and that the Size is zero/empty (e.g.,
Expect(...).To(BeZero() or Equal("")) or Expect(...Size.IsZero()).To(BeTrue())).
Do this around the same checks (the blocks referencing
cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName and
.Size.String()) and repeat the pattern for the other occurrences noted (the
similar assertions at the other ranges).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: ac5353b0-dcb8-4206-b669-fce20cbe753e
📒 Files selected for processing (2)
cmd/cluster/core/create_test.go
hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
c52964e to 47afd7e (compare)
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
hypershift-operator/controllers/hostedcluster/etcd_recovery.go (1)
104-109: ⚠️ Potential issue | 🔴 Critical: Non-NotFound errors are silently ignored.
When `r.Get` returns an error other than `NotFound`, the function continues with an uninitialized `etcdStatefulSet`. This will have zero-valued status fields (`ReadyReplicas == 0`, `AvailableReplicas == 0`), potentially triggering incorrect recovery behavior on transient API errors.
🐛 Proposed fix to return non-NotFound errors
```diff
 if err := r.Get(ctx, crclient.ObjectKeyFromObject(etcdStatefulSet), etcdStatefulSet); err != nil {
 	if apierrors.IsNotFound(err) {
 		log.Info("etcd statefulset does not yet exist")
 		return nil, nil
 	}
+	return nil, fmt.Errorf("failed to get etcd statefulset: %w", err)
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/etcd_recovery.go` around lines 104 - 109, The current r.Get call in etcd_recovery.go silently ignores non-NotFound errors and proceeds with an uninitialized etcdStatefulSet; update the error handling for r.Get(ctx, crclient.ObjectKeyFromObject(etcdStatefulSet), etcdStatefulSet) so that if apierrors.IsNotFound(err) you log and return (nil, nil) as now, but for any other err you immediately return the error (e.g., return nil, err) instead of continuing; this ensures functions like the caller that inspect etcdStatefulSet.Status (ReadyReplicas/AvailableReplicas) don’t operate on zero-valued fields when the API returned a transient error.
hypershift-operator/controllers/hostedcluster/metrics/metrics.go (1)
259-287: ⚠️ Potential issue | 🟠 Major: Add synchronization to protect `lastCollectTime` from concurrent access.
The `Collect` method may be invoked concurrently by the Prometheus registry. Without synchronization, overlapping scrapes can race on `lastCollectTime`, causing missed or duplicated transition duration histogram observations.
Suggested fix
```diff
 import (
 	"context"
 	"fmt"
+	"sync"
 	"time"
@@
 type hostedClustersMetricsCollector struct {
 	client.Client
 	clock clock.Clock
+	mu sync.Mutex
 	transitionDurationMetric *prometheus.HistogramVec
 	lastCollectTime time.Time
 }
@@
 func (c *hostedClustersMetricsCollector) Collect(ch chan<- prometheus.Metric) {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+
 	currentCollectTime := c.clock.Now()
 	log := ctrllog.Log
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/metrics/metrics.go` around lines 259 - 287, The Collect method reads and updates the hostedClustersMetricsCollector.lastCollectTime without synchronization; add a mutex (e.g., sync.Mutex or sync.RWMutex) as a field on hostedClustersMetricsCollector (e.g., mu) and use it to protect all concurrent accesses to lastCollectTime: acquire the lock when reading lastCollectTime inside Collect and in any other method that reads it (notably collectTransitionDurationMetrics if it reads lastCollectTime), and hold the lock briefly when assigning c.lastCollectTime = currentCollectTime at the end of Collect; release the lock promptly to avoid long-held locks during listing/metric emission.
♻️ Duplicate comments (14)
cmd/cluster/core/create_test.go (3)
1135-1141: ⚠️ Potential issue | 🟡 Minor: No-op OLM case should explicitly verify untouched fields.
The `"should not modify the cluster"` row does not assert the negative path, so unexpected initialization of `OperatorHub` or `OLMCatalogPlacement` would still pass. (Line 1135 and Line 1139)
✅ Minimal tightening
```diff
 applyOLMConfig(cluster, opts)
 if tc.expectOperatorHub {
 	g.Expect(cluster.Spec.Configuration.OperatorHub).NotTo(BeNil())
 	g.Expect(cluster.Spec.Configuration.OperatorHub.DisableAllDefaultSources).To(BeTrue())
+} else {
+	g.Expect(cluster.Spec.Configuration.OperatorHub).To(BeNil())
 }
 if tc.expectCatalogPlacement != "" {
 	g.Expect(cluster.Spec.OLMCatalogPlacement).To(Equal(tc.expectCatalogPlacement))
+} else {
+	g.Expect(cluster.Spec.OLMCatalogPlacement).To(BeEmpty())
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cluster/core/create_test.go` around lines 1135 - 1141, The test doesn't assert the negative paths for the "should not modify the cluster" case; update the assertions around cluster.Spec.Configuration.OperatorHub and cluster.Spec.OLMCatalogPlacement so that when tc.expectOperatorHub is false you assert OperatorHub is nil (e.g., g.Expect(cluster.Spec.Configuration.OperatorHub).To(BeNil())), and when tc.expectCatalogPlacement is empty you assert cluster.Spec.OLMCatalogPlacement is empty/zero value (e.g., g.Expect(cluster.Spec.OLMCatalogPlacement).To(Equal("" )) ), using the existing tc.expectOperatorHub and tc.expectCatalogPlacement checks to drive the conditional expectations.
1015-1020: ⚠️ Potential issue | 🟡 Minor: No-op etcd case still does not assert unchanged state.
For the `"leave defaults unchanged"` row, the test still skips assertions when expected fields are empty, so a regression that mutates defaults can pass undetected. (Line 1015 and Line 1018)
✅ Minimal tightening
```diff
 } else {
 	g.Expect(err).NotTo(HaveOccurred())
 	if tc.expectStorageClass != nil {
 		g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName).To(Equal(tc.expectStorageClass))
+	} else {
+		g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName).To(BeNil())
 	}
 	if tc.expectSizeString != "" {
 		g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.Size.String()).To(Equal(tc.expectSizeString))
+	} else {
+		g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.Size.String()).
+			To(Equal(hyperv1.DefaultPersistentVolumeEtcdStorageSize.String()))
 	}
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cluster/core/create_test.go` around lines 1015 - 1020, The test for the "leave defaults unchanged" case is skipping assertions when tc.expectStorageClass is nil and tc.expectSizeString is empty, so regressions that mutate defaults go unnoticed; update the test to explicitly assert unchanged state by either (A) setting tc.expectStorageClass and tc.expectSizeString to the original expected values for that test case, or (B) add an explicit flag (e.g., tc.expectUnchanged) and when set compare cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName and .Size.String() against a saved copy of the pre-run cluster state (or original defaults) instead of skipping the checks; reference the test table entry for the "leave defaults unchanged" row and the assertions around cluster.Spec.Etcd.Managed.Storage.PersistentVolume in the test function to implement the check.
1332-1337: ⚠️ Potential issue | 🟡 Minor: No-op scheduling case should assert nil defaults.
For the `"should not modify the cluster"` path, the test only checks positive cases, so unintended writes to `NodeSelector`/`Labels` can slip through. (Line 1332 and Line 1335)
✅ Minimal tightening
```diff
 applySchedulingConfig(cluster, opts)
 if tc.expectNodeSelector != nil {
 	g.Expect(cluster.Spec.NodeSelector).To(Equal(tc.expectNodeSelector))
+} else {
+	g.Expect(cluster.Spec.NodeSelector).To(BeNil())
 }
 if tc.expectLabels != nil {
 	g.Expect(cluster.Spec.Labels).To(Equal(tc.expectLabels))
+} else {
+	g.Expect(cluster.Spec.Labels).To(BeNil())
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cluster/core/create_test.go` around lines 1332 - 1337, The test currently only asserts cluster.Spec.NodeSelector and cluster.Spec.Labels when tc.expectNodeSelector/expectLabels are non-nil, allowing unintended mutations to slip through; update the assertions in the "should not modify the cluster" path so that when tc.expectNodeSelector is nil you explicitly assert cluster.Spec.NodeSelector is nil (and similarly when tc.expectLabels is nil assert cluster.Spec.Labels is nil), otherwise assert equality to tc.expectNodeSelector / tc.expectLabels; reference the existing symbols tc.expectNodeSelector, tc.expectLabels, cluster.Spec.NodeSelector and cluster.Spec.Labels to locate and modify the checks.
control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go (1)
192-196: ⚠️ Potential issue | 🔴 Critical: Fix the BasicAuth TLS client cert/key volume name collision.
Both `TLSClientCert` and `TLSClientKey` use `"tls-client-key"` as the field name in `SecretPath`. When both are configured, this generates duplicate volume names (`idp-secret-{i}-tls-client-key`), causing the cert mount to be overwritten or invalid pod configuration.
Note that `convertKeystoneIDP` (lines 311-316) correctly uses distinct field names: `"tls-client-cert"` for the cert and `"tls-client-key"` for the key.
Proposed fix
```diff
 if basicAuthConfig.TLSClientCert.Name != "" {
-	provider.RemoteConnectionInfo.CertFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientCert.Name, "tls-client-key", corev1.TLSCertKey)
+	provider.RemoteConnectionInfo.CertFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientCert.Name, "tls-client-cert", corev1.TLSCertKey)
 }
 if basicAuthConfig.TLSClientKey.Name != "" {
 	provider.RemoteConnectionInfo.KeyFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientKey.Name, "tls-client-key", corev1.TLSPrivateKeyKey)
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go` around lines 192 - 196, The TLS client cert and key SecretPath calls under basicAuthConfig are using the same field name "tls-client-key" causing duplicate volume names; update the cert call so idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientCert.Name, "tls-client-cert", corev1.TLSCertKey) sets provider.RemoteConnectionInfo.CertFile while keeping the key call as SecretPath(..., "tls-client-key", corev1.TLSPrivateKeyKey); locate these in the same block that sets provider.RemoteConnectionInfo in idp_convert.go and mirror the distinct field naming used by convertKeystoneIDP.
control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go (1)
152-156: ⚠️ Potential issue | 🔴 Critical: Fix the BasicAuth TLS client cert/key volume name collision.
Same issue as the v2 version: both `TLSClientCert` and `TLSClientKey` use `"tls-client-key"` as the field name. This causes volume name collisions when both are configured.
`convertKeystoneIDP` (lines 243-248) correctly uses distinct field names.
Proposed fix
```diff
 if basicAuthConfig.TLSClientCert.Name != "" {
-	provider.RemoteConnectionInfo.CertFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientCert.Name, "tls-client-key", corev1.TLSCertKey)
+	provider.RemoteConnectionInfo.CertFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientCert.Name, "tls-client-cert", corev1.TLSCertKey)
 }
 if basicAuthConfig.TLSClientKey.Name != "" {
 	provider.RemoteConnectionInfo.KeyFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientKey.Name, "tls-client-key", corev1.TLSPrivateKeyKey)
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go` around lines 152 - 156, The TLS client cert/key volume name is colliding because both calls to idpVolumeMounts.SecretPath use the same field name "tls-client-key"; update the call that sets provider.RemoteConnectionInfo.CertFile (in the same conversion function where basicAuthConfig.TLSClientCert and TLSClientKey are handled) to use a distinct field name such as "tls-client-cert" (mirror the pattern used in convertKeystoneIDP) so CertFile uses "tls-client-cert" and KeyFile keeps "tls-client-key", leaving idpVolumeMounts.SecretPath and provider.RemoteConnectionInfo.* assignments otherwise unchanged.
hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)
788-796: ⚠️ Potential issue | 🟡 Minor: Fail fast on unexpected object types in test stubs.
These `createOrUpdate` stubs still treat non-`*networkingv1.NetworkPolicy` objects as success, which can mask regressions.
Suggested hardening pattern (apply to each stub)
```diff
+import "fmt"
 ...
 createOrUpdate := upsert.CreateOrUpdateFN(func(ctx context.Context, c client.Client, obj client.Object, f controllerutil.MutateFn) (controllerutil.OperationResult, error) {
-	if netPol, ok := obj.(*networkingv1.NetworkPolicy); ok {
-		if err := f(); err != nil {
-			return controllerutil.OperationResultNone, err
-		}
-		createdPolicies[netPol.Name] = netPol
-	}
-	return controllerutil.OperationResultCreated, nil
+	netPol, ok := obj.(*networkingv1.NetworkPolicy)
+	if !ok {
+		return controllerutil.OperationResultNone, fmt.Errorf("unexpected object type: %T", obj)
+	}
+	if err := f(); err != nil {
+		return controllerutil.OperationResultNone, err
+	}
+	createdPolicies[netPol.Name] = netPol
+	return controllerutil.OperationResultCreated, nil
 })
```

Also applies to: 912-920, 1060-1068, 1139-1147, 1214-1222, 1289-1297
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go` around lines 788 - 796, The test stub createOrUpdate (constructed via upsert.CreateOrUpdateFN) currently treats any non-*networkingv1.NetworkPolicy as success and silently ignores type mismatches; change the stub to fail fast by returning an error when obj is not a *networkingv1.NetworkPolicy (e.g., check type in the func for createOrUpdate and return controllerutil.OperationResultNone with a descriptive error if the assertion fails), and apply the same hardening to the other stubs mentioned (the similar CreateOrUpdateFN instances that track createdPolicies in this file) so unexpected object types surface as test failures rather than being treated as success.
hypershift-operator/controllers/hostedcluster/network_policies.go (1)
115-118: ⚠️ Potential issue | 🔴 Critical: Add nil-safety before iterating management cluster networks.
`getManagementClusterNetwork` can return `nil` on Line 117, but `reconcileVirtLauncherNetworkPolicy` dereferences it on Lines 612 and 616. This can panic for KubeVirt when `CapabilityNetworks` is unavailable.
Suggested fix
```diff
 func reconcileVirtLauncherNetworkPolicy(log logr.Logger, policy *networkingv1.NetworkPolicy, hcluster *hyperv1.HostedCluster, managementClusterNetwork *configv1.Network) error {
 	protocolTCP := corev1.ProtocolTCP
 	protocolUDP := corev1.ProtocolUDP
 	protocolSCTP := corev1.ProtocolSCTP
 
 	blockedIPv4Networks := []string{}
 	blockedIPv6Networks := []string{}
-	for _, network := range managementClusterNetwork.Spec.ClusterNetwork {
-		blockedIPv4Networks, blockedIPv6Networks = addToBlockedNetworks(network.CIDR, blockedIPv4Networks, blockedIPv6Networks)
-	}
-
-	for _, network := range managementClusterNetwork.Spec.ServiceNetwork {
-		blockedIPv4Networks, blockedIPv6Networks = addToBlockedNetworks(network, blockedIPv4Networks, blockedIPv6Networks)
+	if managementClusterNetwork != nil {
+		for _, network := range managementClusterNetwork.Spec.ClusterNetwork {
+			blockedIPv4Networks, blockedIPv6Networks = addToBlockedNetworks(network.CIDR, blockedIPv4Networks, blockedIPv6Networks)
+		}
+		for _, network := range managementClusterNetwork.Spec.ServiceNetwork {
+			blockedIPv4Networks, blockedIPv6Networks = addToBlockedNetworks(network, blockedIPv4Networks, blockedIPv6Networks)
+		}
 	}
```

Also applies to: 612-617
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around lines 115 - 118, getManagementClusterNetwork can return nil when CapabilityNetworks is unavailable; update reconcileVirtLauncherNetworkPolicy to guard against that by checking the returned *configv1.Network for nil before dereferencing or iterating its fields (e.g., before accessing network.Spec or looping over network.Status/NetworkType), and if nil, skip or return early from the parts that require management-cluster networks; alternatively ensure getManagementClusterNetwork never returns nil by returning an empty Network object—refer to getManagementClusterNetwork and reconcileVirtLauncherNetworkPolicy to apply the nil-check and early-return fix.
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go (1)
627-635: ⚠️ Potential issue | 🔴 Critical: Index out of bounds: length check must precede slice access.
Line 627 accesses `output.VpcEndpoints[0]` before the empty-slice check at lines 631-634. If AWS returns a successful response with zero endpoints, this causes a panic.
🐛 Proposed fix
```diff
-if err := deleteEndpointIfWrongService(ctx, ec2Client, output.VpcEndpoints[0], awsEndpointService.Status.EndpointServiceName, log); err != nil {
-	return "", nil, err
-}
-
 if len(output.VpcEndpoints) == 0 {
 	awsEndpointService.Status.EndpointID = ""
 	return "", nil, fmt.Errorf("endpoint with id %s not found, resetting status", endpointID)
 }
+
+if err := deleteEndpointIfWrongService(ctx, ec2Client, output.VpcEndpoints[0], awsEndpointService.Status.EndpointServiceName, log); err != nil {
+	return "", nil, err
+}
+
+log.Info("endpoint exists", "endpointID", endpointID)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go` around lines 627 - 635, The code calls deleteEndpointIfWrongService(ctx, ec2Client, output.VpcEndpoints[0], awsEndpointService.Status.EndpointServiceName, log) before checking len(output.VpcEndpoints), which can panic when VpcEndpoints is empty; fix by first verifying len(output.VpcEndpoints) > 0 and handling the empty case (reset awsEndpointService.Status.EndpointID, return the formatted error referencing endpointID) before any access to output.VpcEndpoints, then call deleteEndpointIfWrongService with output.VpcEndpoints[0] and proceed to log the existing endpoint.
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go (2)
4008-4012: ⚠️ Potential issue | 🟡 Minor: Verify the partial version status on the forced list-failure path.
The behavior under test here is not just "returns an error"; `reconcileControlPlaneVersionStatus` is also supposed to persist a Partial `ControlPlaneVersion` before returning. A broken status patch still passes this test today.
Suggested assertion
```diff
 originalHCP := tc.hcp.DeepCopy()
 err := r.reconcileControlPlaneVersionStatus(t.Context(), tc.hcp, originalHCP, releaseImage)
 if tc.expectError {
 	g.Expect(err).To(HaveOccurred())
+	if tc.name == "When component list fails, it should set partial version and return error" {
+		updated := &hyperv1.HostedControlPlane{}
+		g.Expect(c.Get(t.Context(), client.ObjectKeyFromObject(tc.hcp), updated)).To(Succeed())
+		g.Expect(updated.Status.ControlPlaneVersion.Desired.Image).To(Equal(resolvedRef.String()))
+		g.Expect(updated.Status.ControlPlaneVersion.History).ToNot(BeEmpty())
+		g.Expect(updated.Status.ControlPlaneVersion.History[0].State).To(Equal(configv1.PartialUpdate))
+	}
 	return
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go` around lines 4008 - 4012, The test currently only asserts an error was returned from reconcileControlPlaneVersionStatus but must also verify that a Partial ControlPlaneVersion was persisted before the error; after calling reconcileControlPlaneVersionStatus with originalHCP, tc.hcp and releaseImage (and when tc.expectError is true) fetch the updated HCP from the fake client and assert its Status.ControlPlaneVersion (or the ControlPlaneVersion slice/field used by reconcileControlPlaneVersionStatus) contains a new entry with Version equal to the releaseImage version and State/Type set to Partial (or the enum/string the controller uses) — add this assertion immediately after expecting the error to ensure the forced-list-failure path writes the partial status.
2780-2900: ⚠️ Potential issue | 🟡 Minor: Assert `EtcdSnapshotRestored` in the restore-snapshot case.
This case is named around restore completion, but it only verifies `EtcdAvailable`. A regression in the restore-status branch would still pass.
Suggested assertion
```diff
 cond := meta.FindStatusCondition(tc.hcp.Status.Conditions, string(hyperv1.EtcdAvailable))
 g.Expect(cond).ToNot(BeNil())
 g.Expect(cond.Type).To(Equal(tc.expectedCondition.Type))
 g.Expect(cond.Status).To(Equal(tc.expectedCondition.Status))
 g.Expect(cond.Reason).To(Equal(tc.expectedCondition.Reason))
 g.Expect(cond.ObservedGeneration).To(Equal(tc.expectedCondition.ObservedGeneration))
 if tc.expectedCondition.Message != "" {
 	g.Expect(cond.Message).To(Equal(tc.expectedCondition.Message))
 }
+
+if tc.hcp.Spec.Etcd.Managed != nil && len(tc.hcp.Spec.Etcd.Managed.Storage.RestoreSnapshotURL) > 0 {
+	restored := meta.FindStatusCondition(tc.hcp.Status.Conditions, string(hyperv1.EtcdSnapshotRestored))
+	g.Expect(restored).ToNot(BeNil())
+	g.Expect(restored.Status).To(Equal(metav1.ConditionTrue))
+	g.Expect(restored.ObservedGeneration).To(Equal(tc.hcp.Generation))
+}
 		})
 	}
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go` around lines 2780 - 2900, The test case for restore-snapshot should also assert the EtcdSnapshotRestored condition: in the test loop after calling r.reconcileEtcdStatus(t.Context(), tc.hcp) (the HostedControlPlaneReconciler.reconcileEtcdStatus test), when tc.hcp.Spec.Etcd.Managed.Storage.RestoreSnapshotURL is set and the StatefulSet/Pod readiness objects are present, add an assertion that meta.FindStatusCondition(tc.hcp.Status.Conditions, string(hyperv1.EtcdSnapshotRestored)) is not nil and that its Status equals metav1.ConditionTrue (and check ObservedGeneration matches tc.hcp.Generation if desired); place this check alongside the existing EtcdAvailable assertions so the restore-specific expectation fails if restore-status logic regresses.
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go (1)
386-392: ⚠️ Potential issue | 🔴 Critical: Prevent `UnauthorizedOperation` from falling through to the error return.
This still returns `destroyErr` for `UnauthorizedOperation` because the `else` only belongs to the `DependencyViolation` check. That can leave the finalizer stuck even after logging that deletion is being skipped.
Suggested fix
```diff
-if code == "UnauthorizedOperation" {
-	r.Log.Error(destroyErr, "Skipping AWS default security group deletion because of unauthorized operation.")
-}
-if code == "DependencyViolation" {
-	r.Log.Error(destroyErr, "Skipping AWS default security group deletion because of dependency violation.")
-} else {
-	return ctrl.Result{}, fmt.Errorf("failed to delete AWS default security group: %w", destroyErr)
-}
+switch code {
+case "UnauthorizedOperation":
+	r.Log.Error(destroyErr, "Skipping AWS default security group deletion because of unauthorized operation.")
+case "DependencyViolation":
+	r.Log.Error(destroyErr, "Skipping AWS default security group deletion because of dependency violation.")
+default:
+	return ctrl.Result{}, fmt.Errorf("failed to delete AWS default security group: %w", destroyErr)
+}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go` around lines 386 - 392, The two separate if checks for error codes allow the "UnauthorizedOperation" case to fall through into the final error return; change the branching around the AWS default security group deletion to use an if / else if / else (or explicitly return after logging) so that when code == "UnauthorizedOperation" or code == "DependencyViolation" you only log via r.Log.Error(destroyErr, ...) and then return ctrl.Result{}, nil (or otherwise skip returning the fmt.Errorf), and only execute the fmt.Errorf return in the final else; reference the variables and calls in this block (code, destroyErr, r.Log.Error, ctrl.Result{}, fmt.Errorf) to locate and update the logic.
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go (3)
3232-3234: ⚠️ Potential issue | 🟡 Minor: Add a positive assertion for the cloud-cleanup branch.
The `expectCloudCleanup` cases still pass without proving that `destroyCloudResources` actually ran. A direct assertion on a cleanup side effect, like the `CloudResourcesDestroyed` condition, would catch cleanup-gate regressions.
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go` around lines 3232 - 3234, Test currently only asserts no-requeue when expectCloudCleanup is false; add a positive assertion for the branch where tt.expectCloudCleanup is true by asserting the cleanup side-effect performed (e.g., verify the CloudResourcesDestroyed condition was set/updated and/or that destroyCloudResources was invoked). Locate the table-driven test using tt.expectCloudCleanup and add an assertion after reconciliation that when tt.expectCloudCleanup == true the status contains the CloudResourcesDestroyed condition (or the equivalent condition name used in the test) and/or that the mocked destroyCloudResources recorder was called, so the test proves the cleanup path executed.
3216-3230: ⚠️ Potential issue | 🟡 Minor: Verify the binding independently from the policy.
The table has separate `expectVAPDeleted` and `expectVAPBDeleted` flags, but the binding assertion is still gated by `expectVAPDeleted`. A binding-only regression won't be caught.
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go` around lines 3216 - 3230, The binding deletion check is incorrectly tied to expectVAPDeleted; update the test to assert the ValidatingAdmissionPolicyBinding independently using the expectVAPBDeleted flag: when expectVAPBDeleted is true call guestClient.Get on manifests.ValidatingAdmissionPolicyBinding(fmt.Sprintf("%s-binding", registry.AdmissionPolicyNameManagementState)) and assert apierrors.IsNotFound(getErr), and when expectVAPBDeleted is false assert Get returns no error; keep the existing manifests.ValidatingAdmissionPolicy checks controlled by expectVAPDeleted.
2732-2734: ⚠️ Potential issue | 🟡 Minor: Make the `PATCH` assertion fail when `PATCH` is missing.
The earlier `curl` check already covers `curl`, so this `&&` still lets a script without `PATCH` pass.
Suggested fix
```diff
-if !strings.Contains(script, "curl") && !strings.Contains(script, "PATCH") {
+if !strings.Contains(script, "PATCH") {
 	t.Error("Check script should PATCH the ConfigMap on success")
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go` around lines 2732 - 2734, The test currently uses if !strings.Contains(script, "curl") && !strings.Contains(script, "PATCH") which only fails when both are missing; change the check so the test fails when either is missing (e.g. use a logical OR or split into two assertions) so that absence of "PATCH" alone will trigger t.Error; update the condition referencing the script variable and the t.Error call in the test function to assert both "curl" and "PATCH" are present.
🧹 Nitpick comments (8)
control-plane-operator/controllers/azureprivatelinkservice/controller_test.go (2)
1841-1844: Prefer order-insensitive assertions for deleted record sets.
At Line 1843, using `Equal` couples this test to deletion order rather than behavior. `ConsistOf` is safer here.
Suggested change
```diff
-if tt.expectDeleteCalled {
-	g.Expect(mockRecords.deletedRecordNames).To(Equal(tt.expectedDeletedRecords))
-}
+if tt.expectDeleteCalled {
+	g.Expect(mockRecords.deletedRecordNames).To(ConsistOf(tt.expectedDeletedRecords))
+}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go` around lines 1841 - 1844, The test currently asserts deletion order by comparing mockRecords.deletedRecordNames with tt.expectedDeletedRecords using Equal; change this to an order-insensitive assertion (e.g., Gomega's ConsistOf) so the test checks that the same set of record names were deleted regardless of order; update the assertion that reads g.Expect(mockRecords.deletedRecordNames).To(Equal(tt.expectedDeletedRecords)) to use ConsistOf (or an equivalent order-insensitive matcher) while keeping the surrounding conditional logic that checks mockRecords.deleteCalled.
1850-1905: Strengthen mapping assertions beyond count-only validation.
At Line 1904, `HaveLen` alone can miss wrong targets. Consider asserting `NamespacedName` contents to verify correct mapping semantics.
Suggested change
```diff
 tests := []struct {
 	name           string
 	hcp            *hyperv1.HostedControlPlane
 	plsCRs         []client.Object
 	expectRequests int
+	expectedNNs    []types.NamespacedName
 }{
 	{
 		name: "When HCP has the Azure PLS finalizer and PLS CRs exist, it should return requests for all PLS CRs",
@@
 		plsCRs: []client.Object{
 			newTestAzurePLS(t, "private-router", "test-ns"),
 			newTestAzurePLS(t, "oauth-openshift", "test-ns"),
 		},
 		expectRequests: 2,
+		expectedNNs: []types.NamespacedName{
+			{Name: "private-router", Namespace: "test-ns"},
+			{Name: "oauth-openshift", Namespace: "test-ns"},
+		},
 	},
@@
 requests := mapFn(ctx, tt.hcp)
 g.Expect(requests).To(HaveLen(tt.expectRequests))
+if len(tt.expectedNNs) > 0 {
+	actual := make([]types.NamespacedName, 0, len(requests))
+	for _, req := range requests {
+		actual = append(actual, req.NamespacedName)
+	}
+	g.Expect(actual).To(ConsistOf(tt.expectedNNs))
+}
 	})
 }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go` around lines 1850 - 1905, The test currently only checks the count of requests returned by mapHCPToAzurePLS; update the assertion to validate the actual NamespacedName targets so wrong mappings aren't missed: after calling mapFn(ctx, tt.hcp) build an expected slice of reconcile.Request (or NamespacedName) from tt.plsCRs (using each object's GetName and GetNamespace, e.g., for newTestAzurePLS entries "private-router", "oauth-openshift") and replace the HaveLen check with a Gomega assertion that the returned requests contain exactly those expected requests (e.g., Expect(requests).To(ConsistOf(...)) or Expect(requests).To(Equal(...))) so the test verifies both count and specific mapped objects; locate this change around the test using AzurePrivateLinkServiceReconciler.mapHCPToAzurePLS and the local variable requests.

control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go (1)
830-851: Path assertions are correct; minor improvement for pointer safety.

The path assertions using underscores (e.g., `idp_cm_2_ca`, `idp_secret_3_client-secret`) correctly match the mount path format from `IDPVolumeMountInfo`, while the volume name assertions using hyphens (e.g., `idp-cm-2-ca`, `idp-secret-3-client-secret`) correctly match the volume naming format. These tests accurately verify both aspects of the implementation.

The pointer dereference on line 850 could be made safer, though `SecretPath` always sets `DefaultMode` via `ptr.To[int32](0640)`:

Safer assertion pattern
```diff
 g.Expect(vmi.Volumes[0].Secret.SecretName).To(Equal("my-secret"))
-g.Expect(*vmi.Volumes[0].Secret.DefaultMode).To(Equal(int32(0640)))
+g.Expect(vmi.Volumes[0].Secret.DefaultMode).ToNot(BeNil())
+g.Expect(*vmi.Volumes[0].Secret.DefaultMode).To(Equal(int32(0640)))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go` around lines 830 - 851, The pointer dereference in TestIDPVolumeMountInfo_SecretPath is unsafe; before dereferencing vmi.Volumes[0].Secret.DefaultMode (from SecretPath) add an assertion that the pointer is non-nil (e.g., Expect(...).ToNot(BeNil())) and then assert its value, or use a pointer-aware matcher (PointTo/Equal) to check the int32(0640) value; update the test referencing TestIDPVolumeMountInfo_SecretPath and the SecretPath result so the DefaultMode pointer is validated for nil before comparing its dereferenced value.

hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go (1)
1380-1466: Assert corrupted-annotation cleanup side effect, not just booleans.

At Line 1461 you validate return flags, but the corrupted-input path also patches the Pod to remove `lastSnapshotTimeAnnotation`. Adding that assertion would harden this test against regressions.

🔧 Suggested test hardening
```diff
 func TestCheckSnapshotInterval(t *testing.T) {
 	tests := []struct {
 		name                 string
 		annotations          map[string]string
 		minInterval          string
 		expectedShouldCreate bool
 		expectedSkip         bool
+		verifyAnnotationRemoved bool
 	}{
@@
 		{
 			name: "When last snapshot time annotation is corrupted, it should allow snapshot creation",
 			annotations: map[string]string{
 				lastSnapshotTimeAnnotation: "invalid-time",
 			},
 			minInterval:          "1h",
 			expectedShouldCreate: true,
 			expectedSkip:         false,
+			verifyAnnotationRemoved: true,
 		},
@@
 		shouldSnapshot, skipReconcile := reconciler.checkSnapshotInterval(context.Background(), pod, spec, logr.Discard())
 		g.Expect(shouldSnapshot).To(Equal(tt.expectedShouldCreate))
 		g.Expect(skipReconcile).To(Equal(tt.expectedSkip))
+
+		if tt.verifyAnnotationRemoved {
+			updatedPod := &corev1.Pod{}
+			err := fakeClient.Get(context.Background(), types.NamespacedName{Name: pod.Name, Namespace: pod.Namespace}, updatedPod)
+			g.Expect(err).ToNot(HaveOccurred())
+			_, exists := updatedPod.Annotations[lastSnapshotTimeAnnotation]
+			g.Expect(exists).To(BeFalse(), "Corrupted last snapshot time annotation should be removed")
+		}
 	})
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go` around lines 1380 - 1466, The corrupted-annotation test case should also assert the side-effect that checkSnapshotInterval removes the invalid lastSnapshotTimeAnnotation from the Pod; update TestCheckSnapshotInterval to, for the scenario where annotations[lastSnapshotTimeAnnotation] == "invalid-time", after calling reconciler.checkSnapshotInterval retrieve the Pod from reconciler.client (fake client) and assert that the annotations map no longer contains lastSnapshotTimeAnnotation (or that it's empty/nil). This ensures SnapshotReconciler.checkSnapshotInterval's cleanup behavior is tested in addition to the boolean flags.

cmd/infra/azure/create.go (1)
167-313: Add operation context to propagated errors.

Several branches now return bare `err` (for example Line 177, Line 198, Line 227, Line 290, Line 303, Line 309). Wrapping these with action context will make CLI failures much easier to triage.

Proposed pattern
```diff
-	return "", "", "", err
+	return "", "", "", fmt.Errorf("failed to get base domain ID for %q: %w", o.BaseDomain, err)
```

```diff
-	return err
+	return fmt.Errorf("failed to create private DNS zone link: %w", err)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/infra/azure/create.go` around lines 167 - 313, Several branches in createNetworkResources, handleIdentitiesAndRBAC, and createDNSAndLBResources return raw errors; wrap those errors with contextual messages. For each call to netMgr.GetBaseDomainID, azureutil.GetNameAndResourceGroupFromNetworkSecurityGroupID, netMgr.CreateSecurityGroup, netMgr.CreateVirtualNetwork, rbacMgr.AssignControlPlaneRoles, rbacMgr.AssignDataPlaneRoles, rbacMgr.AssignWorkloadIdentities, netMgr.CreatePrivateDNSZone, netMgr.CreatePrivateDNSZoneLink, netMgr.CreatePublicIPAddressForLB, and netMgr.CreateLoadBalancer, replace returning bare err with fmt.Errorf("failed to <action> [resource/context]: %w", err) where <action> describes the operation (e.g., "get base domain ID", "parse NSG ID", "create network security group", "create virtual network", "assign control plane roles", "create private DNS zone", "create public IP for LB", etc.) so the functions createNetworkResources, handleIdentitiesAndRBAC, and createDNSAndLBResources propagate errors with clear operation context.

hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)
810-865: Add a nil-management-network KubeVirt test case.
`TestReconcilePlatformNetworkPolicies` currently always passes a non-nil `managementClusterNetwork`; add a case with `platformType = KubevirtPlatform` and `managementClusterNetwork = nil` to protect against regressions in the nil path.

Also applies to: 925-939
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go` around lines 810 - 865, Add a test case to TestReconcilePlatformNetworkPolicies to exercise the nil managementClusterNetwork path: in the testCases slice add an entry with name indicating "KubeVirt with nil managementClusterNetwork", platformType set to hyperv1.KubevirtPlatform, kubevirtCredentials left nil, managementClusterNetwork explicitly nil, version set (e.g. "4.15.0"), and the expected outcome (expectVirtLauncher true if you want the virt-launcher policy created when no credentials are present). Update the parallel KubeVirt-related test block(s) (the other testCases region referenced around the later lines) the same way so the nil-management-cluster-network code path is covered.

control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go (1)
1238-1242: `t.Parallel()` should be called before test setup.

The call to `t.Parallel()` at line 1241 occurs after `NewGomegaWithT(t)` at line 1240. Per Go testing conventions, `t.Parallel()` should be the first statement in a test or subtest to ensure proper parallel execution setup.

🔧 Suggested fix
```diff
 for _, tt := range tests {
 	t.Run(tt.name, func(t *testing.T) {
-		g := NewGomegaWithT(t)
 		t.Parallel()
+		g := NewGomegaWithT(t)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go` around lines 1238 - 1242, Move the t.Parallel() call to the very start of each subtest so parallel execution is configured before any setup; specifically, in the table-driven loop where you call t.Run for each case, call t.Parallel() as the first statement inside the subtest function (before NewGomegaWithT(t) and any other setup) so that the NewGomegaWithT(t) initialization occurs after the subtest has been marked parallel.

control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go (1)
680-687: Duplicate error logging.

Lines 680 and 686 both log the same error. Consider removing one to avoid log noise.
♻️ Proposed fix
```diff
 if err != nil {
-	log.Error(err, "failed to modify vpc endpoint", "id", endpointID, "addSubnets", addedSubnet, "removeSubnets", removedSubnet, "addSG", addedSG)
 	msg := err.Error()
 	var apiErr smithy.APIError
 	if errors.As(err, &apiErr) {
 		msg = apiErr.ErrorCode()
 	}
-	log.Error(err, "failed to modify vpc endpoint")
+	log.Error(err, "failed to modify vpc endpoint", "id", endpointID, "addSubnets", addedSubnet, "removeSubnets", removedSubnet, "addSG", addedSG)
 	return fmt.Errorf("failed to modify vpc endpoint: %s", msg)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go` around lines 680 - 687, The code logs the same error twice in the modify VPC endpoint path; remove the redundant log call so the error is only logged once with full context (keep the detailed log that includes endpointID, addedSubnet, removedSubnet, addedSG and remove the second log.Error call). Update the block around the modify call (referencing variables endpointID, addedSubnet, removedSubnet, addedSG and the log variable) to log once, then build the msg from the error (using errors.As to extract smithy.APIError) and return the formatted error as currently done.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 6139b117-5481-4551-a6a9-00051a22f5b8
📒 Files selected for processing (58)
- .golangci.yml
- cmd/cluster/core/create.go
- cmd/cluster/core/create_test.go
- cmd/fix/dr_oidc_iam.go
- cmd/infra/aws/create.go
- cmd/infra/azure/create.go
- cmd/install/assets/hypershift_operator.go
- cmd/install/assets/hypershift_operator_test.go
- cmd/install/install.go
- cmd/install/install_test.go
- control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
- control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
- control-plane-operator/controllers/azureprivatelinkservice/controller.go
- control-plane-operator/controllers/azureprivatelinkservice/controller_test.go
- control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
- control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
- control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go
- control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go
- control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
- control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
- control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
- control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go
- control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go
- control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
- control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
- control-plane-operator/main.go
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
- hypershift-operator/controllers/etcdbackup/reconciler.go
- hypershift-operator/controllers/etcdbackup/reconciler_test.go
- hypershift-operator/controllers/hostedcluster/etcd_recovery.go
- hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
- hypershift-operator/controllers/hostedcluster/metrics/metrics.go
- hypershift-operator/controllers/hostedcluster/network_policies.go
- hypershift-operator/controllers/hostedcluster/network_policies_test.go
- hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go
- hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
- hypershift-operator/controllers/nodepool/aws.go
- hypershift-operator/controllers/nodepool/aws_test.go
- hypershift-operator/controllers/nodepool/capi.go
- hypershift-operator/controllers/nodepool/capi_test.go
- hypershift-operator/controllers/nodepool/metrics/metrics.go
- hypershift-operator/controllers/nodepool/nodepool_controller.go
- hypershift-operator/controllers/platform/aws/controller.go
- hypershift-operator/controllers/platform/aws/controller_test.go
- hypershift-operator/controllers/scheduler/aws/autoscaler.go
- hypershift-operator/controllers/scheduler/aws/autoscaler_test.go
- hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes.go
- hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes_test.go
- hypershift-operator/main.go
- ignition-server/controllers/local_ignitionprovider.go
- ignition-server/controllers/local_ignitionprovider_test.go
- test/e2e/util/dump/journals.go
- test/e2e/util/reqserving/verifycp.go
- test/e2e/util/reqserving/verifyenv.go
- test/e2e/util/util.go
✅ Files skipped from review due to trivial changes (7)
- .golangci.yml
- hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- cmd/install/assets/hypershift_operator_test.go
- control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go
- cmd/install/assets/hypershift_operator.go
- cmd/cluster/core/create.go
🚧 Files skipped from review as they are similar to previous changes (9)
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
- hypershift-operator/controllers/nodepool/aws.go
- hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
- control-plane-operator/main.go
- control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
- control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
- cmd/install/install_test.go
- control-plane-operator/controllers/azureprivatelinkservice/controller.go
- cmd/install/install.go
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/fix/dr_oidc_iam.go (1)
452-468: ⚠️ Potential issue | 🟠 Major

Fail fast when docs are missing in manual mode.

With the new `oidcDocsExist && exists` gate, the manual flow now falls through into Step 3 whenever the IAM provider exists but the S3 docs do not. That path can create/reconfigure the bucket and only then fail in `ensureOIDCDocuments` because regenerating docs requires `--hc-name`. Since this command explicitly advertises manual recovery when the `HostedCluster` is not accessible, please reject that combination before any live mutations.

Possible guard
```diff
 oidcDocsExist, providerARN, exists, err := o.checkOIDCState(ctx, s3Client, iamClient)
 if err != nil {
 	return err
 }
+if !oidcDocsExist && o.HostedClusterName == "" && !o.DryRun {
+	return fmt.Errorf("OIDC documents are missing; rerun with --hc-name/--hc-namespace so they can be regenerated from the cluster signing key")
+}
 if oidcDocsExist && exists && !o.ForceRecreate {
 	fmt.Println("\nAll OIDC components are in place - no action needed!")
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/fix/dr_oidc_iam.go` around lines 452 - 468, The code allows mutating S3 when the IAM provider exists but OIDC docs are missing, which in "manual" mode (no HostedCluster name/--hc-name) will later fail; add an early guard before calling ensureOIDCBucket: if provider exists (exists == true) && oidcDocsExist == false && o.ForceRecreate == false && the command is in manual/no-hosted-cluster mode (e.g. o.HCName is empty or equivalent), return an error instructing the user to supply --hc-name or use --force-recreate instead of proceeding. Place this check after checkOIDCState(...) and before ensureOIDCBucket(...), referencing checkOIDCState, oidcDocsExist, exists, o.ForceRecreate, and ensureOIDCBucket/ensureOIDCDocuments.
♻️ Duplicate comments (8)
cmd/infra/aws/create.go (2)
385-411: ⚠️ Potential issue | 🟠 Major

Ensure temporary VPC cleanup runs on all exit paths.

If `CreatePrivateZone` fails after `createVPC`, `initialVPC` is leaked because deletion only happens at the end. Register cleanup with `defer` immediately after VPC creation.

Proposed fix
```diff
-func (o *CreateInfraOptions) createDNSZones(ctx context.Context, l logr.Logger, clusterCreatorEC2Client awsapi.EC2API, route53Client, vpcOwnerRoute53Client awsapi.ROUTE53API, result *CreateInfraOutput) error {
+func (o *CreateInfraOptions) createDNSZones(ctx context.Context, l logr.Logger, clusterCreatorEC2Client awsapi.EC2API, route53Client, vpcOwnerRoute53Client awsapi.ROUTE53API, result *CreateInfraOutput) (err error) {
 	privateZoneClient := vpcOwnerRoute53Client
 	var initialVPC string
-	var err error
 	if o.PrivateZonesInClusterAccount {
 		privateZoneClient = route53Client
 		if initialVPC, err = o.createVPC(ctx, l, clusterCreatorEC2Client); err != nil {
 			return err
 		}
+		defer func() {
+			if initialVPC == "" {
+				return
+			}
+			if cleanupErr := o.deleteVPC(ctx, l, clusterCreatorEC2Client, initialVPC); cleanupErr != nil {
+				if err == nil {
+					err = cleanupErr
+				} else {
+					l.Error(cleanupErr, "failed to delete temporary VPC", "vpcID", initialVPC)
+				}
+			}
+		}()
 	}
@@
-	if initialVPC != "" {
-		if err := o.deleteVPC(ctx, l, clusterCreatorEC2Client, initialVPC); err != nil {
-			return err
-		}
-	}
 	return nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/infra/aws/create.go` around lines 385 - 411, createDNSZones may leak initialVPC if createVPC succeeds but a subsequent CreatePrivateZone fails; after calling createVPC (inside the o.PrivateZonesInClusterAccount branch) immediately register a defer that calls o.deleteVPC(ctx, l, clusterCreatorEC2Client, initialVPC) only when initialVPC is non-empty, and ensure the deferred delete does not overwrite the function's returned error (log or wrap the delete error instead). In other words, after createVPC returns set initialVPC and then defer a safe cleanup that references initialVPC and handles/delete errors without clobbering the outer error, leaving the rest of createDNSZones (calls to CreatePrivateZone) unchanged.
255-256: ⚠️ Potential issue | 🔴 Critical

Pass a public subnet into the proxy-host path.

Line 428 currently launches the proxy host into `result.Zones[0].SubnetID` (the private subnet in the normal flow). That breaks public IP/SSH-based setup for non-`PublicOnly` clusters.

Proposed fix
```diff
-	if err := o.createProxyResources(ctx, l, ec2Client, result); err != nil {
+	if err := o.createProxyResources(ctx, l, ec2Client, result, publicSubnetIDs[0]); err != nil {
 		return nil, err
 	}
```

```diff
-func (o *CreateInfraOptions) createProxyResources(ctx context.Context, l logr.Logger, ec2Client awsapi.EC2API, result *CreateInfraOutput) error {
+func (o *CreateInfraOptions) createProxyResources(ctx context.Context, l logr.Logger, ec2Client awsapi.EC2API, result *CreateInfraOutput, publicSubnetID string) error {
@@
-	proxyResult, err := o.createProxyHost(ctx, l, ec2Client, result.Zones[0].SubnetID, sgGroupID, o.EnableSecureProxy)
+	proxyResult, err := o.createProxyHost(ctx, l, ec2Client, publicSubnetID, sgGroupID, o.EnableSecureProxy)
 	if err != nil {
 		return fmt.Errorf("failed to create proxy host: %w", err)
 	}
```

Also applies to: 413-438
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/infra/aws/create.go` around lines 255 - 256, The proxy host is being launched into result.Zones[0].SubnetID (a private subnet) which prevents obtaining a public IP/SSH access; update createProxyResources (and the proxy-host creation path in the same block that iterates result.Zones) to use the zone's public subnet field instead — e.g., choose Zone.PublicSubnetID (or the equivalent public-subnet property on each element of result.Zones) when provisioning the proxy host unless the cluster is explicitly PublicOnly; ensure the subnet ID passed to the proxy host creation call is the public subnet for that zone so the instance gets a public IP.

hypershift-operator/controllers/hostedcluster/metrics/metrics.go (1)
267-287: ⚠️ Potential issue | 🟠 Major

Return immediately after the HostedCluster list fails.

After the error on Line 268, `Collect` still emits zeroed aggregates and advances `lastCollectTime` on Line 287. That turns a transient API failure into "0 clusters" and can permanently drop transition observations from the failed interval.

Suggested fix
```diff
 hclusters := &hyperv1.HostedClusterList{}
 if err := c.List(context.Background(), hclusters); err != nil {
 	log.Error(err, "failed to list hosted clusters while collecting metrics")
+	return
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/metrics/metrics.go` around lines 267 - 287, The HostedCluster list error handling in Collect is incomplete: after c.List(context.Background(), hclusters) fails you must return immediately instead of continuing to emit zeroed aggregates and updating c.lastCollectTime; update the Collect function so that when the List call returns an error (the log.Error(err, "failed to list hosted clusters while collecting metrics") path) the function exits (return) and does not call emitAggregatedMetrics, c.transitionDurationMetric.Collect, c.collectPerClusterMetrics, or update c.lastCollectTime, preserving previous metrics and transition observations.

hypershift-operator/controllers/etcdbackup/reconciler.go (1)
165-168: ⚠️ Potential issue | 🟠 Major

Clean up temporary resources on job-creation NotFound path.

At Line 165, this path marks the backup failed but leaves the RBAC/NetworkPolicy created earlier in `createResourcesAndJob` until a later reconcile. That leaves elevated access longer than needed.

Suggested fix
```diff
 if err := r.createBackupJob(ctx, backup, hcp); err != nil {
 	if apierrors.IsNotFound(err) {
+		if cleanupErr := r.cleanupResources(ctx, backup); cleanupErr != nil {
+			logger.Error(cleanupErr, "failed to cleanup resources after backup job creation failure")
+		}
 		return r.setFailedConditionAndUpdate(ctx, backup, hyperv1.BackupFailedReason, err.Error())
 	}
 	return ctrl.Result{}, fmt.Errorf("failed to create backup Job: %w", err)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/etcdbackup/reconciler.go` around lines 165 - 168, The NotFound error path in the createBackupJob call marks the Backup failed but leaves temporary RBAC/NetworkPolicy created earlier by createResourcesAndJob; modify the NotFound branch so before returning from setFailedConditionAndUpdate you clean up those temporary resources (e.g., invoke the same cleanup logic used on other failure paths or add a helper like deleteTemporaryResourcesForBackup(ctx, backup) that deletes the Role/RoleBinding/NetworkPolicy and ignores NotFound errors), ensuring the cleanup is idempotent and errors from cleanup are logged/handled before calling setFailedConditionAndUpdate.

control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go (1)
821-825: ⚠️ Potential issue | 🟠 Major

Add namespace filter to List operation to prevent cross-namespace deletion.

Line 823 lists services with `externalPrivateServiceLabel` across all namespaces. Since services are created in `hcp.Namespace` (Lines 805-808), the deletion loop at Lines 829-833 could inadvertently remove services belonging to other HCPs in different namespaces that share the same label.

♻️ Proposed fix
```diff
 privateExternalServices := &corev1.ServiceList{}
-if err := r.List(ctx, privateExternalServices, client.HasLabels{externalPrivateServiceLabel}); err != nil {
+if err := r.List(ctx, privateExternalServices, client.InNamespace(hcp.Namespace), client.HasLabels{externalPrivateServiceLabel}); err != nil {
 	return fmt.Errorf("cannot list private external services: %w", err)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go` around lines 821 - 825, The List call that fetches privateExternalServices uses only client.HasLabels{externalPrivateServiceLabel} and thus queries across all namespaces; update the r.List invocation to include a namespace filter e.g. add client.InNamespace(hcp.Namespace) alongside the existing client.HasLabels option so privateExternalServices only contains services in hcp.Namespace (this ensures the subsequent deletion loop over privateExternalServices.Items only affects services in the target HCP namespace).

control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go (2)
583-586: ⚠️ Potential issue | 🟠 Major

Keep this helper in collect-and-continue mode on registry read failures.
Returning here skips ingress reconciliation and OIDC secret sync for the whole pass. That reintroduces fail-fast behavior inside a helper that otherwise aggregates errors.
Suggested fix
```diff
 if err := r.client.Get(ctx, client.ObjectKeyFromObject(registryConfig), registryConfig); err != nil {
 	if !apierrors.IsNotFound(err) {
-		return []error{fmt.Errorf("failed to get registry config: %w", err)}
+		errs = append(errs, fmt.Errorf("failed to get registry config: %w", err))
 	}
 } else {
 	registryConfigExists = true
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go` around lines 583 - 586, The helper currently returns immediately when r.client.Get(ctx, client.ObjectKeyFromObject(registryConfig), registryConfig) fails (non-NotFound), which short-circuits the rest of the pass; instead, change this to collect the error and continue: on non-NotFound errors append fmt.Errorf("failed to get registry config: %w", err) to the helper's error-collection slice (e.g., errs) and avoid returning, so ingress reconciliation and OIDC secret sync still run for this pass while the error is aggregated.
427-429: ⚠️ Potential issue | 🟠 Major

Don't mask accumulated errors behind the recovery requeue.

If recovery asks for a requeue, this branch exits before `errs` is aggregated, so failures from earlier reconciliation steps are silently dropped.

Suggested fix
```diff
-	if result, err := r.reconcileClusterRecovery(ctx, log, hcp, errs); err != nil || result.RequeueAfter > 0 {
-		return result, err
-	}
-
-	return ctrl.Result{}, utilerrors.NewAggregate(errs)
+	if result, err := r.reconcileClusterRecovery(ctx, log, hcp, errs); err != nil {
+		return result, err
+	} else if result.RequeueAfter > 0 {
+		if len(errs) == 0 {
+			return result, nil
+		}
+		return ctrl.Result{}, utilerrors.NewAggregate(errs)
+	}
+
+	return ctrl.Result{}, utilerrors.NewAggregate(errs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go` around lines 427 - 429, The current early return when reconcileClusterRecovery(ctx, log, hcp, errs) requests a requeue or returns an error drops previously accumulated errors in errs; instead, call reconcileClusterRecovery, append its error (if any) to the errs slice, and before returning on result.RequeueAfter > 0 aggregate and return errs (e.g., using errors.NewAggregate or the project's error-aggregation helper) so earlier reconciliation failures are not masked; reference reconcileClusterRecovery, the result variable, and errs when making this change.

control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go (1)
826-830: ⚠️ Potential issue | 🟠 Major

Clear `Status.KubeadminPassword` when the secret is missing.

On `IsNotFound`, this leaves any previously published ref untouched, so status can continue pointing at a deleted `kubeadmin-password` secret.

Suggested fix
```diff
 if err := r.Get(ctx, client.ObjectKeyFromObject(kubeadminPassword), kubeadminPassword); err != nil {
 	if !apierrors.IsNotFound(err) {
 		return fmt.Errorf("failed to get kubeadmin password: %w", err)
 	}
+	hostedControlPlane.Status.KubeadminPassword = nil
 } else {
 	hostedControlPlane.Status.KubeadminPassword = &corev1.LocalObjectReference{
 		Name: kubeadminPassword.Name,
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go` around lines 826 - 830, When r.Get(ctx, client.ObjectKeyFromObject(kubeadminPassword), kubeadminPassword) returns apierrors.IsNotFound, clear the published reference on the HostedControlPlane status (e.g. set hostedControlPlane.Status.KubeadminPassword = "" or nil) and persist that change with r.Status().Update(ctx, hostedControlPlane), handling any update error; keep the existing behavior for non-NotFound errors and for the else branch where the secret exists. Ensure you reference the kubeadminPassword variable, r.Get call, and Status.KubeadminPassword field when making the change.
🧹 Nitpick comments (3)
cmd/cluster/core/create.go (1)
347-352: Avoid duplicate KAS DNS validation in construction path.
`KubeAPIServerDNSName` is already validated in `validateCapabilities` (Line 838+). Re-validating here increases drift risk; keep one validation source and just assign here.

Refactor suggestion
```diff
 if len(opts.KubeAPIServerDNSName) > 0 {
-	if err := validation.IsDNS1123Subdomain(opts.KubeAPIServerDNSName); len(err) > 0 {
-		return nil, fmt.Errorf("KubeAPIServerDNSName failed DNS validation: %s", strings.Join(err[:], " "))
-	}
 	prototype.Cluster.Spec.KubeAPIServerDNSName = opts.KubeAPIServerDNSName
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cluster/core/create.go` around lines 347 - 352, Remove the duplicate DNS validation here: instead of calling validation.IsDNS1123Subdomain on opts.KubeAPIServerDNSName in the create path, simply assign opts.KubeAPIServerDNSName to prototype.Cluster.Spec.KubeAPIServerDNSName when non-empty; rely on the existing validation already performed in validateCapabilities (which validates KubeAPIServerDNSName), so delete the validation block around opts.KubeAPIServerDNSName in the create logic and keep only the assignment guard (if len(opts.KubeAPIServerDNSName) > 0 { prototype.Cluster.Spec.KubeAPIServerDNSName = opts.KubeAPIServerDNSName }).

hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)
686-686: Standardize on `t.Context()` for consistency.

This test file mixes `context.Background()` (lines 119, 561, 1445) with `t.Context()` (lines 686, 798, 925, 1070, 1149, 1224, 1299). Since the project targets Go 1.25.3 (which supports `t.Context()` without issue) and `t.Context()` is already used elsewhere in the codebase, consider using `t.Context()` consistently throughout this file for modern Go testing practices.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go` at line 686, Replace uses of context.Background() in this test file with t.Context() for consistency; locate the test functions that call reconciler.getManagementClusterNetwork and other helper/verification calls and change their ctx argument from context.Background() to t.Context() so all tests use the same testing context; ensure imports remain unchanged and run `go test ./...` to confirm no context-related failures.

control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go (1)
1238-1242: Minor: Inconsistent `t.Parallel()` placement.

In this test, `NewGomegaWithT(t)` is called before `t.Parallel()`, while other table-driven tests in this file (e.g., lines 653, 680) create Gomega after declaring the test parallel. Consider reordering for consistency:

♻️ Suggested reorder

```diff
 for _, tt := range tests {
 	t.Run(tt.name, func(t *testing.T) {
-		g := NewGomegaWithT(t)
 		t.Parallel()
+		g := NewGomegaWithT(t)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go` around lines 1238 - 1242, Move the call to t.Parallel() to run before creating the Gomega instance so the test goroutine is marked parallel immediately; specifically, inside the table-driven loop over tests (the t.Run closure that currently calls NewGomegaWithT(t) then t.Parallel()), call t.Parallel() first and then create g := NewGomegaWithT(t) to match the other table-driven tests and avoid racey Gomega setup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/cluster/core/create.go`:
- Around line 373-381: The parseKeyValuePairs function currently accepts entries
like "=value" which produce empty keys; update parseKeyValuePairs to validate
that pair[0] (the key) is non-empty after splitting and return a clear error
(e.g., fmt.Errorf("invalid %s: %s (empty key)", kind, s)) when it is empty so
CLI parsing of --annotations/--labels fails fast and prevents invalid metadata
downstream.
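The empty-key guard described above can be sketched as a small pure function; the signature and error messages below are hypothetical, mirroring the suggested `fmt.Errorf` wording rather than the actual HyperShift helper:

```go
package main

import (
	"fmt"
	"strings"
)

// parseKeyValuePairs splits each "key=value" entry and rejects empty keys so
// invalid --annotations/--labels input fails fast at CLI parse time instead
// of producing metadata with an empty-string key.
func parseKeyValuePairs(kind string, entries []string) (map[string]string, error) {
	out := map[string]string{}
	for _, s := range entries {
		pair := strings.SplitN(s, "=", 2)
		if len(pair) != 2 {
			return nil, fmt.Errorf("invalid %s: %s (expected key=value)", kind, s)
		}
		if pair[0] == "" {
			return nil, fmt.Errorf("invalid %s: %s (empty key)", kind, s)
		}
		out[pair[0]] = pair[1]
	}
	return out, nil
}

func main() {
	m, _ := parseKeyValuePairs("label", []string{"team=hypershift"})
	fmt.Println(m["team"]) // hypershift
	_, err := parseKeyValuePairs("annotation", []string{"=value"})
	fmt.Println(err)
}
```

Using `SplitN` with a limit of 2 keeps `=` characters inside the value intact, so only the key position is validated.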
- Around line 468-476: The code currently returns early when opts.PausedUntil ==
"true", so cluster.Spec.PausedUntil never gets set; change the logic in the
handler that processes opts.PausedUntil so that an empty string still returns
nil, but when opts.PausedUntil == "true" you set cluster.Spec.PausedUntil =
&opts.PausedUntil and return nil; otherwise validate with
time.Parse(time.RFC3339, opts.PausedUntil) and on success assign
cluster.Spec.PausedUntil = &opts.PausedUntil (and on parse error return the
formatted error). Ensure you update the block referencing opts.PausedUntil and
cluster.Spec.PausedUntil accordingly.
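A minimal Go sketch of the corrected flow, with a hypothetical `applyPausedUntil` helper standing in for the real handler: empty means unset, the literal "true" pauses indefinitely, and anything else must parse as RFC3339 before being assigned.

```go
package main

import (
	"fmt"
	"time"
)

// applyPausedUntil returns the value to assign to cluster.Spec.PausedUntil,
// or nil when the option is unset. The "true" literal no longer short-circuits
// without assignment.
func applyPausedUntil(pausedUntil string) (*string, error) {
	if pausedUntil == "" {
		return nil, nil
	}
	if pausedUntil == "true" {
		return &pausedUntil, nil
	}
	if _, err := time.Parse(time.RFC3339, pausedUntil); err != nil {
		return nil, fmt.Errorf("invalid pausedUntil value %q: expected \"true\" or an RFC3339 timestamp: %w", pausedUntil, err)
	}
	return &pausedUntil, nil
}

func main() {
	v, _ := applyPausedUntil("true")
	fmt.Println(*v) // true
	if _, err := applyPausedUntil("not-a-timestamp"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```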
In `@cmd/fix/dr_oidc_iam.go`:
- Around line 496-510: The S3 probe errors are being swallowed by
checkOIDCDocumentsExist; change checkOIDCDocumentsExist to return (bool, error)
and update checkOIDCState to call it accordingly (use oidcDocsExist, err :=
o.checkOIDCDocumentsExist(ctx, s3Client)); if err != nil, return that error
immediately unless it's an explicit NotFound/NoSuchKey condition which you
should translate to oidcDocsExist=false; preserve existing messaging around
o.ForceRecreate and only treat absence (not transient/access errors) as missing
so you do not proceed to mutate bucket/provider on S3 probe failures; update any
other callers of checkOIDCDocumentsExist to handle the new (bool, error)
signature.
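The suggested (bool, error) contract can be sketched as below. The probe closure and the `errNotFound` sentinel are hypothetical stand-ins; real code would inspect the AWS SDK's NoSuchKey/NotFound error types. Absence maps to (false, nil), while any other probe failure propagates so callers never mutate the bucket or provider on transient or access errors.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for S3's NoSuchKey/NotFound error codes.
var errNotFound = errors.New("NoSuchKey")

// errDenied stands in for a transient or access failure from the probe.
var errDenied = errors.New("access denied")

// checkOIDCDocumentsExist distinguishes "documents absent" from "probe failed":
// only genuine absence returns false with a nil error.
func checkOIDCDocumentsExist(probe func() error) (bool, error) {
	switch err := probe(); {
	case err == nil:
		return true, nil
	case errors.Is(err, errNotFound):
		return false, nil
	default:
		return false, fmt.Errorf("probing OIDC documents: %w", err)
	}
}

func main() {
	exists, err := checkOIDCDocumentsExist(func() error { return errNotFound })
	fmt.Println(exists, err) // false <nil>
	_, err = checkOIDCDocumentsExist(func() error { return errDenied })
	fmt.Println(err)
}
```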
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 801-805: The code currently skips calling
controlPlaneComponentsAvailable once HostedControlPlaneAvailable is true
(variables availableCondition / alreadyAvailable), which prevents flipping
availability back when components later fail; remove the guard that checks
alreadyAvailable and always call r.controlPlaneComponentsAvailable(ctx,
hostedControlPlane), then use its returned componentsErr /
componentsNotAvailableMsg to set or update the HostedControlPlaneAvailable
condition (setting False with the returned message/error when components are
unavailable, and True when they are all available) so availability is
re-evaluated on each reconciliation.
- Around line 724-759: When reconcileInfrastructureStatus (called as
infraStatus, err := r.reconcileInfrastructureStatus(...)) returns an error or
infraStatus.IsReady() is false, clear any previously-set infrastructure-derived
fields so stale endpoints aren't advertised: explicitly reset
hostedControlPlane.Status.ControlPlaneEndpoint (empty Host and zero Port) and
hostedControlPlane.Status.OAuthCallbackURLTemplate to empty strings in the error
branch and the "not ready" branch (the else block), leaving the current
assignments only in the infraStatus.IsReady() branch; ensure these resets occur
before setting newCondition and logging so the status accurately reflects the
lack of a reachable control plane.
- Around line 587-590: The reconcile adds a finalizer to hostedControlPlane
using controllerutil.AddFinalizer and then patches with r.Patch using
originalHCP as the merge base, but subsequent optimistic-lock status patches
(e.g., in reconcileControlPlaneVersionStatus) still use the stale
originalHCP/resourceVersion; fix by either returning immediately after
successfully adding and patching the finalizer (so later work runs on next
reconcile with fresh object) or refresh the patch base by reloading the
hostedControlPlane (e.g., DeepCopyFrom the live object or re-Get) before any
further MergeFromWithOptions/optimistic-lock status patches so their
resourceVersion is current.
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`:
- Around line 773-789: The error-wrapping strings are using the wrong service
name ("openshift apiserver") for the OAuth-specific helpers; update the
fmt.Errorf messages in the reconcile block around
reconcileOpenshiftOAuthAPIServerAPIServices, the CreateOrUpdate call for
openshiftOAuthAPIServerService, and reconcileOpenshiftOAuthAPIServerEndpoints so
they reference "openshift oauth apiserver" (or "openshift oauth apiserver
service/endpoints" as appropriate) to match the actual reconcile functions and
variables (reconcileOpenshiftOAuthAPIServerAPIServices,
openshiftOAuthAPIServerService, reconcileOpenshiftOAuthAPIServerEndpoints).
In `@hypershift-operator/controllers/hostedcluster/etcd_recovery.go`:
- Around line 48-55: The code calls handleExistingEtcdRecoveryJob when
jobStatus.exists but then always proceeds to detectAndTriggerEtcdRecovery, which
can overwrite terminal EtcdRecoveryActive=False and retrigger recovery; change
the control flow so that when jobStatus.exists you short-circuit after
handleExistingEtcdRecoveryJob: if handleExistingEtcdRecoveryJob returns an error
propagate it, otherwise return immediately (do not call
detectAndTriggerEtcdRecovery) to avoid re-entering creation logic for already
completed/failed recovery jobs; update the block around jobStatus.exists,
handleExistingEtcdRecoveryJob, and the subsequent call to
detectAndTriggerEtcdRecovery accordingly.
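A minimal Go sketch of the short-circuit control flow described above; `jobStatus` and the two handler closures are hypothetical stand-ins for the real reconciler helpers:

```go
package main

import "fmt"

type jobStatus struct{ exists bool }

// reconcileEtcdRecovery returns after handling an existing job, propagating
// any handling error but never falling through to the trigger path, so
// completed or failed recovery jobs are not retriggered.
func reconcileEtcdRecovery(js jobStatus, handleExisting, detectAndTrigger func() error) error {
	if js.exists {
		return handleExisting()
	}
	return detectAndTrigger()
}

func main() {
	ran := false
	_ = reconcileEtcdRecovery(jobStatus{exists: true},
		func() error { return nil },
		func() error { ran = true; return nil })
	fmt.Println(ran) // false: the trigger path is skipped when a job exists
}
```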
- Around line 92-98: The current gating only updates the condition when
meta.FindStatusCondition(hcluster.Status.Conditions,
string(hyperv1.EtcdRecoveryActive)) returns nil or oldCondition.Status !=
condition.Status, which misses changes in Reason/Message; change the predicate
so you call meta.SetStatusCondition(&hcluster.Status.Conditions, condition) (and
then r.Client.Status().Update(ctx, hcluster)) whenever oldCondition is nil OR
any of oldCondition.Status, oldCondition.Reason, or oldCondition.Message differs
from condition.Status, condition.Reason, or condition.Message respectively, so
the EtcdRecoveryActive condition is replaced when reason or message changes as
well.
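The widened predicate can be sketched as a pure comparison; the `condition` struct below mirrors only the relevant fields of `metav1.Condition` for illustration:

```go
package main

import "fmt"

// condition mirrors the fields of metav1.Condition that matter here.
type condition struct {
	Status, Reason, Message string
}

// conditionChanged reports whether the status condition should be replaced:
// when no prior condition exists, or when Status, Reason, or Message differ,
// so reason/message-only edits are no longer silently dropped.
func conditionChanged(old *condition, desired condition) bool {
	return old == nil ||
		old.Status != desired.Status ||
		old.Reason != desired.Reason ||
		old.Message != desired.Message
}

func main() {
	old := &condition{Status: "False", Reason: "Recovered", Message: "done"}
	fmt.Println(conditionChanged(old, condition{Status: "False", Reason: "Recovered", Message: "retrying"})) // true
	fmt.Println(conditionChanged(old, *old)) // false
}
```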
---
Outside diff comments:
In `@cmd/fix/dr_oidc_iam.go`:
- Around line 452-468: The code allows mutating S3 when the IAM provider exists
but OIDC docs are missing, which in "manual" mode (no HostedCluster
name/--hc-name) will later fail; add an early guard before calling
ensureOIDCBucket: if provider exists (exists == true) && oidcDocsExist == false
&& o.ForceRecreate == false && the command is in manual/no-hosted-cluster mode
(e.g. o.HCName is empty or equivalent), return an error instructing the user to
supply --hc-name or use --force-recreate instead of proceeding. Place this check
after checkOIDCState(...) and before ensureOIDCBucket(...), referencing
checkOIDCState, oidcDocsExist, exists, o.ForceRecreate, and
ensureOIDCBucket/ensureOIDCDocuments.
---
Duplicate comments:
In `@cmd/infra/aws/create.go`:
- Around line 385-411: createDNSZones may leak initialVPC if createVPC succeeds
but a subsequent CreatePrivateZone fails; after calling createVPC (inside the
o.PrivateZonesInClusterAccount branch) immediately register a defer that calls
o.deleteVPC(ctx, l, clusterCreatorEC2Client, initialVPC) only when initialVPC is
non-empty, and ensure the deferred delete does not overwrite the function's
returned error (log or wrap the delete error instead). In other words, after
createVPC returns set initialVPC and then defer a safe cleanup that references
initialVPC and handles/delete errors without clobbering the outer error, leaving
the rest of createDNSZones (calls to CreatePrivateZone) unchanged.
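The deferred-cleanup pattern suggested above can be sketched with a named return; the closures stand in for the real createVPC/CreatePrivateZone/deleteVPC calls, and a cleanup failure is joined with the original error instead of overwriting it:

```go
package main

import (
	"errors"
	"fmt"
)

var errZoneCreate = errors.New("CreatePrivateZone failed")

// createDNSZonesSketch deletes the VPC only when a later step fails; the
// named return lets the defer inspect and augment the outgoing error without
// clobbering it.
func createDNSZonesSketch(createVPC func() (string, error), createZone func(string) error, deleteVPC func(string) error) (retErr error) {
	vpcID, err := createVPC()
	if err != nil {
		return err
	}
	defer func() {
		if retErr != nil && vpcID != "" {
			if delErr := deleteVPC(vpcID); delErr != nil {
				retErr = errors.Join(retErr, fmt.Errorf("cleaning up VPC %s: %w", vpcID, delErr))
			}
		}
	}()
	return createZone(vpcID)
}

func main() {
	var deleted string
	err := createDNSZonesSketch(
		func() (string, error) { return "vpc-123", nil },
		func(string) error { return errZoneCreate },
		func(vpc string) error { deleted = vpc; return nil },
	)
	fmt.Println(deleted, err) // vpc-123 CreatePrivateZone failed
}
```

`errors.Join` (Go 1.20+) keeps both failures visible when the cleanup itself errors.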
- Around line 255-256: The proxy host is being launched into
result.Zones[0].SubnetID (a private subnet) which prevents obtaining a public
IP/SSH access; update createProxyResources (and the proxy-host creation path in
the same block that iterates result.Zones) to use the zone's public subnet field
instead — e.g., choose Zone.PublicSubnetID (or the equivalent public-subnet
property on each element of result.Zones) when provisioning the proxy host
unless the cluster is explicitly PublicOnly; ensure the subnet ID passed to the
proxy host creation call is the public subnet for that zone so the instance gets
a public IP.
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`:
- Around line 821-825: The List call that fetches privateExternalServices uses
only client.HasLabels{externalPrivateServiceLabel} and thus queries across all
namespaces; update the r.List invocation to include a namespace filter e.g. add
client.InNamespace(hcp.Namespace) alongside the existing client.HasLabels option
so privateExternalServices only contains services in hcp.Namespace (this ensures
the subsequent deletion loop over privateExternalServices.Items only affects
services in the target HCP namespace).
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 826-830: When r.Get(ctx,
client.ObjectKeyFromObject(kubeadminPassword), kubeadminPassword) returns
apierrors.IsNotFound, clear the published reference on the HostedControlPlane
status (e.g. set hostedControlPlane.Status.KubeadminPassword = "" or nil) and
persist that change with r.Status().Update(ctx, hostedControlPlane), handling
any update error; keep the existing behavior for non-NotFound errors and for the
else branch where the secret exists. Ensure you reference the kubeadminPassword
variable, r.Get call, and Status.KubeadminPassword field when making the change.
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`:
- Around line 583-586: The helper currently returns immediately when
r.client.Get(ctx, client.ObjectKeyFromObject(registryConfig), registryConfig)
fails (non-NotFound), which short-circuits the rest of the pass; instead, change
this to collect the error and continue: on non-NotFound errors append
fmt.Errorf("failed to get registry config: %w", err) to the helper's
error-collection slice (e.g., errs) and avoid returning, so ingress
reconciliation and OIDC secret sync still run for this pass while the error is
aggregated.
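The collect-and-continue pattern described here can be sketched as follows; the step closures are hypothetical stand-ins for the per-resource reconcile helpers, and a non-fatal failure is appended rather than short-circuiting so later steps still run and every failure surfaces in the aggregate:

```go
package main

import (
	"errors"
	"fmt"
)

var errRegistryGet = errors.New("failed to get registry config")

// reconcilePass runs every step, collecting errors instead of returning on
// the first failure, then aggregates them into a single error (nil when all
// steps succeeded).
func reconcilePass(steps []func() error) error {
	var errs []error
	for _, step := range steps {
		if err := step(); err != nil {
			errs = append(errs, err) // collect, keep reconciling
		}
	}
	return errors.Join(errs...)
}

func main() {
	err := reconcilePass([]func() error{
		func() error { return errRegistryGet },
		func() error { return nil }, // still runs despite the earlier failure
	})
	fmt.Println(err) // failed to get registry config
}
```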
- Around line 427-429: The current early return when
reconcileClusterRecovery(ctx, log, hcp, errs) requests a requeue or returns an
error drops previously accumulated errors in errs; instead, call
reconcileClusterRecovery, append its error (if any) to the errs slice, and
before returning on result.RequeueAfter > 0 aggregate and return errs (e.g.,
using errors.NewAggregate or the project's error-aggregation helper) so earlier
reconciliation failures are not masked; reference reconcileClusterRecovery, the
result variable, and errs when making this change.
In `@hypershift-operator/controllers/etcdbackup/reconciler.go`:
- Around line 165-168: The NotFound error path in the createBackupJob call marks
the Backup failed but leaves temporary RBAC/NetworkPolicy created earlier by
createResourcesAndJob; modify the NotFound branch so before returning from
setFailedConditionAndUpdate you clean up those temporary resources (e.g., invoke
the same cleanup logic used on other failure paths or add a helper like
deleteTemporaryResourcesForBackup(ctx, backup) that deletes the
Role/RoleBinding/NetworkPolicy and ignores NotFound errors), ensuring the
cleanup is idempotent and errors from cleanup are logged/handled before calling
setFailedConditionAndUpdate.
In `@hypershift-operator/controllers/hostedcluster/metrics/metrics.go`:
- Around line 267-287: The HostedCluster list error handling in Collect is
incomplete: after c.List(context.Background(), hclusters) fails you must return
immediately instead of continuing to emit zeroed aggregates and updating
c.lastCollectTime; update the Collect function so that when the List call
returns an error (the log.Error(err, "failed to list hosted clusters while
collecting metrics") path) the function exits (return) and does not call
emitAggregatedMetrics, c.transitionDurationMetric.Collect,
c.collectPerClusterMetrics, or update c.lastCollectTime, preserving previous
metrics and transition observations.
---
Nitpick comments:
In `@cmd/cluster/core/create.go`:
- Around line 347-352: Remove the duplicate DNS validation here: instead of
calling validation.IsDNS1123Subdomain on opts.KubeAPIServerDNSName in the create
path, simply assign opts.KubeAPIServerDNSName to
prototype.Cluster.Spec.KubeAPIServerDNSName when non-empty; rely on the existing
validation already performed in validateCapabilities (which validates
KubeAPIServerDNSName), so delete the validation block around
opts.KubeAPIServerDNSName in the create logic and keep only the assignment guard
(if len(opts.KubeAPIServerDNSName) > 0 {
prototype.Cluster.Spec.KubeAPIServerDNSName = opts.KubeAPIServerDNSName }).
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go`:
- Around line 1238-1242: Move the call to t.Parallel() to run before creating
the Gomega instance so the test goroutine is marked parallel immediately;
specifically, inside the table-driven loop over tests (the t.Run closure that
currently calls NewGomegaWithT(t) then t.Parallel()), call t.Parallel() first
and then create g := NewGomegaWithT(t) to match the other table-driven tests and
avoid racey Gomega setup.
In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`:
- Line 686: Replace uses of context.Background() in this test file with
t.Context() for consistency; locate the test functions that call
reconciler.getManagementClusterNetwork and other helper/verification calls and
change their ctx argument from context.Background() to t.Context() so all tests
use the same testing context; ensure imports remain unchanged and run `go test
./...` to confirm no context-related failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 9dfb8731-f9fa-4ee6-97da-1708a77b39c0
📒 Files selected for processing (58)
- .golangci.yml
- cmd/cluster/core/create.go
- cmd/cluster/core/create_test.go
- cmd/fix/dr_oidc_iam.go
- cmd/infra/aws/create.go
- cmd/infra/azure/create.go
- cmd/install/assets/hypershift_operator.go
- cmd/install/assets/hypershift_operator_test.go
- cmd/install/install.go
- cmd/install/install_test.go
- control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
- control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
- control-plane-operator/controllers/azureprivatelinkservice/controller.go
- control-plane-operator/controllers/azureprivatelinkservice/controller_test.go
- control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
- control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
- control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go
- control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go
- control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
- control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
- control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
- control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go
- control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go
- control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
- control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
- control-plane-operator/main.go
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
- hypershift-operator/controllers/etcdbackup/reconciler.go
- hypershift-operator/controllers/etcdbackup/reconciler_test.go
- hypershift-operator/controllers/hostedcluster/etcd_recovery.go
- hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
- hypershift-operator/controllers/hostedcluster/metrics/metrics.go
- hypershift-operator/controllers/hostedcluster/network_policies.go
- hypershift-operator/controllers/hostedcluster/network_policies_test.go
- hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go
- hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
- hypershift-operator/controllers/nodepool/aws.go
- hypershift-operator/controllers/nodepool/aws_test.go
- hypershift-operator/controllers/nodepool/capi.go
- hypershift-operator/controllers/nodepool/capi_test.go
- hypershift-operator/controllers/nodepool/metrics/metrics.go
- hypershift-operator/controllers/nodepool/nodepool_controller.go
- hypershift-operator/controllers/platform/aws/controller.go
- hypershift-operator/controllers/platform/aws/controller_test.go
- hypershift-operator/controllers/scheduler/aws/autoscaler.go
- hypershift-operator/controllers/scheduler/aws/autoscaler_test.go
- hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes.go
- hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes_test.go
- hypershift-operator/main.go
- ignition-server/controllers/local_ignitionprovider.go
- ignition-server/controllers/local_ignitionprovider_test.go
- test/e2e/util/dump/journals.go
- test/e2e/util/reqserving/verifycp.go
- test/e2e/util/reqserving/verifyenv.go
- test/e2e/util/util.go
✅ Files skipped from review due to trivial changes (8)
- .golangci.yml
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
- hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
- cmd/infra/azure/create.go
- cmd/install/assets/hypershift_operator_test.go
- cmd/cluster/core/create_test.go
- control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
- control-plane-operator/main.go
🚧 Files skipped from review as they are similar to previous changes (10)
- control-plane-operator/controllers/azureprivatelinkservice/controller_test.go
- hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
- control-plane-operator/controllers/azureprivatelinkservice/controller.go
- control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go
- control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
- cmd/install/install.go
- cmd/install/assets/hypershift_operator.go
- cmd/install/install_test.go
- control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
9e90018 to ad30a1c
/pipeline required

Scheduling tests matching the
Test Results
- e2e-aws
- e2e-aks
/retest

/test images

/lgtm

Hey @cblecker 👋🏻 - do you mind adding the /lgtm again? It didn't take on Friday.

/lgtm

Scheduling tests matching the

/verified by e2e
@bryan-cox: This PR has been marked as verified by

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retest

/retest

1 similar comment

/retest

/test e2e-aws
Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Summary

Both test failures (

Root Cause

The root cause is an AWS infrastructure resource cleanup timeout during hosted cluster teardown — a known flaky condition in HyperShift e2e tests:

This failure is NOT related to PR #8309. The PR exclusively modifies

Recommendations

Evidence
@bryan-cox: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Restore and improve comments that document hidden constraints, edge cases, and design decisions across files affected by the gocyclo complexity refactor in PR openshift#8309.

Changes:
- Add missing requeue comment in AWS private link deletion flow
- Restore OAuth IDP conversion comments: GitLab OIDC requirement, Keystone ID, IBM Cloud special case, ID claim safety, challenge flow validation constraint
- Fix comment formatting (//Handle -> // Handle) in v2 OAuth IDP
- Add ARO-HCP OpenAPI spec link for Azure resource type constant
- Restore CAPI label propagation doc comment with TODO and JIRA link (HOSTEDCP-971) explaining why labels/taints bypass rolling upgrades
- Restore globalPS DaemonSet scheduling comment

Signed-off-by: Bryan Cox <brcox@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What this PR does / why we need it:
Enables the `gocyclo` linter in golangci-lint to enforce a maximum cyclomatic complexity threshold of 30 across the codebase. The default `gocyclo` threshold is 10, but 30 was chosen as a pragmatic starting point that catches the most egregious complexity while avoiding excessive churn. This improves code maintainability by ensuring functions stay focused and testable.

Changes:
- Added `gocyclo` to the enabled linters in `.golangci.yml` with complexity threshold 30
- Added `//nolint:gocyclo` exceptions for 4 inherently complex reconciliation functions:
  - `HostedClusterReconciler.reconcile` (complexity 318) — core HC reconciliation loop
  - `HostedClusterReconciler.delete` (complexity 41) — HC deletion handler
  - `GetPlatform` (complexity 35) — platform factory switch
  - `NodePoolReconciler.reconcile` (complexity 33) — core NP reconciliation loop
- Fixed `unparam` lint warnings introduced by the refactoring (removed unused parameters, simplified return types)
- `NewCommonRegistryProvider` exposed by the refactoring

Bug fixes (discovered during refactoring):
- `awsprivatelink_controller.go`: `VpcEndpoints` slice was indexed before bounds check, causing potential panic when `DescribeVpcEndpoints` returns an empty slice
- `hostedcontrolplane_controller.go`: `UnauthorizedOperation` error code in `reconcileDeletion` fell through `if/else` to the error return branch instead of being handled gracefully like `DependencyViolation`
- `oauth/idp_convert.go`: BasicAuth TLS cert and key mounts collided (both used the same volume name `v4-0-config-system-ocp-branding-template`)
- `network_policies.go`: nil dereference in `reconcileVirtLauncherNetworkPolicy` when `CapabilityNetworks` was absent from the infrastructure cluster network config

Intentional behavioral changes:
- `awsprivatelink_controller.go`: Public-endpoint cleanup now scopes the `List` to `hcp.Namespace` (previously listed cluster-wide). These services are only ever created in `hcp.Namespace`, so this prevents accidental cross-namespace deletes.
- `hostedcontrolplane_controller.go`: `reconcileKubeadminPasswordStatus` now explicitly nils `Status.KubeadminPassword` when the secret is NotFound. The original code left stale references; this clears them.
- `cmd/cluster/core/create.go`: `parseKeyValuePairs` now rejects entries with an empty key (e.g. `=value`). The original code silently created an entry with an empty string key.

Verification:
- `make lint` passes with 0 issues
- `make test` passes with 0 failures
- `make verify-codespell` passes with 0 issues
- `go build ./...` compiles cleanly

Which issue(s) this PR fixes:
CNTRLPLANE-3324
Special notes for your reviewer:
This PR is primarily refactoring (complexity reduction via helper extraction), but it also includes four bug fixes that were discovered during the refactoring. The bug fixes are listed explicitly above and each has dedicated test coverage:
- `TestReconcileExistingEndpoint` — covers the VpcEndpoints bounds check
- `TestReconcileDeletion` — covers the `UnauthorizedOperation` switch fix
- `TestConvertBasicAuthIdentityProvider` — covers the cert/key mount collision
- `TestFetchInfraClusterNetwork` — covers the namespace fix and error propagation in network policies

The 4 nolint exceptions are for functions whose complexity is inherent to their role as top-level orchestrators. Breaking them down further would obscure the reconciliation flow rather than improve it.
Checklist:
AI Assistance