
CNTRLPLANE-3324: Enable gocyclo linter and reduce cyclomatic complexity #8309

Merged
openshift-merge-bot[bot] merged 7 commits into openshift:main from bryan-cox:cyclo-magic
May 11, 2026

Conversation

@bryan-cox (Member) commented Apr 22, 2026

What this PR does / why we need it:

Enables the gocyclo linter in golangci-lint to enforce a maximum cyclomatic complexity threshold of 30 across the codebase. The default gocyclo threshold is 10, but 30 was chosen as a pragmatic starting point that catches the most egregious complexity while avoiding excessive churn. This improves code maintainability by ensuring functions stay focused and testable.

Changes:

  • Added gocyclo to the enabled linters in .golangci.yml with a complexity threshold of 30 (a configuration sketch follows this list)
  • Refactored ~35 files to extract focused helper methods from functions exceeding the complexity threshold
  • Added //nolint:gocyclo exceptions for 4 inherently complex reconciliation functions:
    • HostedClusterReconciler.reconcile (complexity 318) — core HC reconciliation loop
    • HostedClusterReconciler.delete (complexity 41) — HC deletion handler
    • GetPlatform (complexity 35) — platform factory switch
    • NodePoolReconciler.reconcile (complexity 33) — core NP reconciliation loop
  • Fixed all new unparam lint warnings introduced by the refactoring (removed unused parameters, simplified return types)
  • Fixed a missing error check for NewCommonRegistryProvider exposed by the refactoring
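
For reference, a minimal sketch of the .golangci.yml change described above (golangci-lint v1-style schema shown; the repository's actual file may organize these sections differently):

```yaml
linters:
  enable:
    - gocyclo        # flag functions whose cyclomatic complexity is too high

linters-settings:
  gocyclo:
    # gocyclo's default threshold is 10; this PR starts at a pragmatic 30.
    min-complexity: 30
```

Functions that must stay above the threshold carry a //nolint:gocyclo directive on their declaration, as listed above.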

Bug fixes (discovered during refactoring):

  • awsprivatelink_controller.go: the VpcEndpoints slice was indexed before a bounds check, causing a potential panic when DescribeVpcEndpoints returns an empty slice (see the sketch after this list)
  • hostedcontrolplane_controller.go: UnauthorizedOperation error code in reconcileDeletion fell through if/else to the error return branch instead of being handled gracefully like DependencyViolation
  • oauth/idp_convert.go: BasicAuth TLS cert and key mounts collided (both used the same volume name v4-0-config-system-ocp-branding-template)
  • network_policies.go: nil dereference in reconcileVirtLauncherNetworkPolicy when CapabilityNetworks was absent from the infrastructure cluster network config
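
For illustration, the VpcEndpoints fix follows the standard bounds-check-before-index pattern. A hypothetical, self-contained sketch (types and names are placeholders, not the actual controller code):

```go
package main

import "fmt"

type vpcEndpoint struct{ ID string }

// firstEndpointID guards the index with a length check; the original bug
// indexed the slice before checking len, which panics when
// DescribeVpcEndpoints returns an empty result.
func firstEndpointID(endpoints []vpcEndpoint) (string, bool) {
	if len(endpoints) == 0 {
		return "", false
	}
	return endpoints[0].ID, true
}

func main() {
	if id, ok := firstEndpointID(nil); ok {
		fmt.Println("existing endpoint:", id)
	} else {
		fmt.Println("no endpoint yet; nothing to reuse")
	}
}
```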

Intentional behavioral changes:

  • awsprivatelink_controller.go: Public-endpoint cleanup now scopes the List to hcp.Namespace (previously listed cluster-wide). These services are only ever created in hcp.Namespace, so this prevents accidental cross-namespace deletes.
  • hostedcontrolplane_controller.go: reconcileKubeadminPasswordStatus now explicitly nils Status.KubeadminPassword when the secret is NotFound. The original code left stale references; this clears them.
  • cmd/cluster/core/create.go: parseKeyValuePairs now rejects entries with an empty key (e.g. =value). The original code silently created an entry with an empty string key.
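
A minimal sketch of the empty-key rejection described in the last item; the real parseKeyValuePairs in cmd/cluster/core/create.go may differ in shape:

```go
package main

import (
	"fmt"
	"strings"
)

// parseKeyValuePairs is a simplified stand-in for the CLI helper: it now
// rejects entries whose key is empty (e.g. "=value") instead of silently
// creating a map entry with an empty-string key.
func parseKeyValuePairs(pairs []string) (map[string]string, error) {
	result := make(map[string]string, len(pairs))
	for _, pair := range pairs {
		key, value, found := strings.Cut(pair, "=")
		if !found {
			return nil, fmt.Errorf("malformed pair %q, expected key=value", pair)
		}
		if key == "" {
			return nil, fmt.Errorf("empty key in pair %q", pair)
		}
		result[key] = value
	}
	return result, nil
}

func main() {
	if _, err := parseKeyValuePairs([]string{"=value"}); err != nil {
		fmt.Println("rejected:", err)
	}
}
```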

Verification:

  • make lint passes with 0 issues
  • make test passes with 0 failures
  • make verify-codespell passes with 0 issues
  • go build ./... compiles cleanly

Which issue(s) this PR fixes:

CNTRLPLANE-3324

Special notes for your reviewer:

This PR is primarily refactoring (complexity reduction via helper extraction), but it also includes four bug fixes that were discovered during the refactoring. The bug fixes are listed explicitly above and each has dedicated test coverage:

  • TestReconcileExistingEndpoint — covers the VpcEndpoints bounds check
  • TestReconcileDeletion — covers the UnauthorizedOperation switch fix
  • TestConvertBasicAuthIdentityProvider — covers the cert/key mount collision
  • TestFetchInfraClusterNetwork — covers the namespace fix and error propagation in network policies

The 4 nolint exceptions are for functions whose complexity is inherent to their role as top-level orchestrators. Breaking them down further would obscure the reconciliation flow rather than improve it.

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

AI Assistance

Tool: Claude Code 2.1.126
Model: claude-opus-4-6

Skills used:
- personal-claude-skills:behavior-driven-testing@0.1.0 (bryan-cox/personal-claude-skills@13dbad17) — Gherkin-style behavior-driven tests
- ai-helpers:utils:address-reviews@0.0.1 (openshift-eng/ai-helpers@c8e587eb) — Systematic PR review comment resolution

@openshift-merge-bot (Contributor)

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci Bot (Contributor) commented Apr 22, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci Bot added the do-not-merge/work-in-progress and do-not-merge/needs-area labels on Apr 22, 2026.
coderabbitai Bot (Contributor) commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR refactors many large functions across the repository into smaller unexported helper methods, reorganizing control flow while preserving external behavior and public signatures. Areas changed include cluster create flows, AWS/Azure infra orchestration, install validation, multiple controllers (control-plane, hypershift, ignition), metrics/scheduler collectors, and extensive unit tests. .golangci.yml enables the gocyclo linter and several high-complexity functions received //nolint:gocyclo annotations. No public API signatures were intentionally changed except validated internal method decompositions.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI
  participant CreateInfra as CreateInfra
  participant SessionMgr as AWS Session Manager
  participant EC2 as AWS EC2
  participant R53 as AWS Route53
  participant VPC as VPC Resources
  participant Proxy as Proxy Helper

  CLI->>CreateInfra: start CreateInfra
  CreateInfra->>SessionMgr: init AWSSessions
  SessionMgr->>EC2: init per-account clients
  CreateInfra->>VPC: createVPCResources
  VPC-->>CreateInfra: VPC info
  CreateInfra->>EC2: createPerZoneResources (subnets, routes, endpoints)
  CreateInfra->>R53: LookupPublicZone
  R53-->>CreateInfra: hosted zone info
  CreateInfra->>CreateInfra: shareSubnets if vpc-owner session exists
  CreateInfra->>Proxy: createProxyResources (or return if disabled)
  Proxy-->>EC2: create SG / endpoints (wrap SG errors)
  CreateInfra-->>CLI: return output / errors
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 1.89%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Test Structure And Quality ⚠️ Warning: Tests violate the single-responsibility principle with multiple unrelated assertions per test case, lack cleanup mechanisms entirely, have inconsistent assertion messages, and lack timeout protection. Resolution: refactor tests for single responsibility, add t.Cleanup() blocks, include meaningful assertion messages on all Expect() calls, and add explicit timeouts with context.WithTimeout() or WithTimeout().

✅ Passed checks (10 passed)
  • Title check ✅ Passed: The pull request title clearly and specifically describes the main change: enabling the gocyclo linter and refactoring code to reduce cyclomatic complexity.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names ✅ Passed: PR uses the standard Go testing framework, not Ginkgo. All test names are static and descriptive, containing no dynamic information.
  • Microshift Test Compatibility ✅ Passed: PR does not add new Ginkgo e2e tests; it only refactors utility helper functions in test/e2e/util/, with the substantial test additions being unit tests in non-e2e directories.
  • Single Node Openshift (Sno) Test Compatibility ✅ Passed: This pull request does not add any new Ginkgo e2e tests. The test additions are exclusively unit tests using the standard Go testing package, along with helper utility functions in test/e2e/util packages that contain reusable helpers, not Ginkgo test definitions.
  • Topology-Aware Scheduling Compatibility ✅ Passed: PR is a pure refactoring that extracts functions into helpers. No new scheduling constraints (affinity, tolerations, topology spread, pod disruption budgets, or replica modifications) are introduced that would affect OpenShift topology compatibility.
  • Ote Binary Stdout Contract ✅ Passed: The PR does not introduce stdout writes in process-level code that would violate the OTE binary stdout contract. All stdout writes found are within regular helper functions, not in process-level contexts.
  • Ipv6 And Disconnected Network Test Compatibility ✅ Passed: This PR does not introduce any new Ginkgo e2e tests, only standard Go unit tests and refactored e2e utility helpers.
  • Description Check ✅ Passed: Check skipped (CodeRabbit's high-level summary is enabled).





Comment @coderabbitai help to get the list of available commands and usage tips.

openshift-ci Bot added the approved, area/cli, area/control-plane-operator, area/hypershift-operator, area/platform/aws, area/platform/azure, and area/testing labels, and removed the do-not-merge/needs-area label on Apr 22, 2026.
codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 71.05344% with 937 lines in your changes missing coverage. Please review.
✅ Project coverage is 39.72%. Comparing base (37f46b9) to head (f2423d3).
⚠️ Report is 41 commits behind head on main.

Files with missing lines Patch % Lines
...ostedcontrolplane/hostedcontrolplane_controller.go 62.91% 159 Missing and 42 partials ⚠️
...ollers/awsprivatelink/awsprivatelink_controller.go 33.72% 108 Missing and 4 partials ⚠️
cmd/infra/aws/create.go 0.00% 96 Missing ⚠️
cmd/cluster/core/create.go 60.36% 72 Missing and 16 partials ⚠️
...ator/controllers/hostedcluster/network_policies.go 60.73% 53 Missing and 22 partials ⚠️
cmd/fix/dr_oidc_iam.go 0.00% 70 Missing ⚠️
cmd/infra/azure/create.go 0.00% 53 Missing ⚠️
...perator/controllers/hostedcluster/etcd_recovery.go 39.70% 37 Missing and 4 partials ⚠️
...rconfigoperator/controllers/resources/resources.go 77.01% 31 Missing and 9 partials ⚠️
...hift-operator/controllers/etcdbackup/reconciler.go 56.57% 24 Missing and 9 partials ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8309      +/-   ##
==========================================
+ Coverage   37.53%   39.72%   +2.18%     
==========================================
  Files         751      751              
  Lines       92025    92461     +436     
==========================================
+ Hits        34543    36729    +2186     
+ Misses      54841    53076    -1765     
- Partials     2641     2656      +15     
Files with missing lines Coverage Δ
...trollers/hostedcluster/hostedcluster_controller.go 43.23% <ø> (ø)
...ollers/hostedcluster/internal/platform/platform.go 0.00% <ø> (ø)
hypershift-operator/controllers/nodepool/aws.go 69.41% <ø> (+14.36%) ⬆️
hypershift-operator/controllers/nodepool/capi.go 71.74% <ø> (+2.87%) ⬆️
...t-operator/controllers/nodepool/metrics/metrics.go 72.72% <ø> (+1.38%) ⬆️
...erator/controllers/nodepool/nodepool_controller.go 41.30% <ø> (ø)
...ft-operator/controllers/platform/aws/controller.go 41.06% <ø> (+2.55%) ⬆️
...t-operator/controllers/scheduler/aws/autoscaler.go 60.42% <ø> (+3.40%) ⬆️
...s/scheduler/aws/dedicated_request_serving_nodes.go 61.08% <ø> (+7.53%) ⬆️
hypershift-operator/main.go 0.00% <ø> (ø)
... and 20 more

... and 1 file with indirect coverage changes

Flag Coverage Δ
cmd-support 33.99% <59.52%> (+1.23%) ⬆️
cpo-hostedcontrolplane 40.52% <76.60%> (+3.74%) ⬆️
cpo-other 40.02% <66.37%> (+2.26%) ⬆️
hypershift-operator 50.29% <76.99%> (+2.35%) ⬆️
other 29.70% <ø> (+1.92%) ⬆️

Flags with carried forward coverage won't be shown.

coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 5

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go (1)

1256-1265: ⚠️ Potential issue | 🟠 Major

Populate TypeMeta in manifest constructors to fix capability-gating lookups.

The manifest constructors in rbac.go initialize only ObjectMeta, not TypeMeta. When getKey() calls obj.GetObjectKind().GroupVersionKind() on these freshly constructed objects, the Kind field is empty because TypeMeta was never populated. This produces malformed keys (e.g., "name/" instead of "name/ClusterRole"), causing silent mismatches in the RbacCapabilityMap lookup at line 1331. RBAC capability gating is then bypassed entirely.

Each manifest constructor must explicitly set TypeMeta with the appropriate Kind and APIVersion. For example:

func IngressToRouteControllerClusterRole() *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		TypeMeta: metav1.TypeMeta{
			Kind:       "ClusterRole",
			APIVersion: rbacv1.SchemeGroupVersion.String(),
		},
		ObjectMeta: metav1.ObjectMeta{
			Name: "system:openshift:openshift-controller-manager:ingress-to-route-controller",
		},
	}
}

Apply this fix to all manifest constructors in control-plane-operator/hostedclusterconfigoperator/controllers/resources/manifests/rbac.go and any other manifest files that feed into getKey().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`
around lines 1256 - 1265, getKey() relies on objects' TypeMeta but many manifest
constructors (e.g., IngressToRouteControllerClusterRole and other constructors
in manifests/rbac.go) only set ObjectMeta, leaving Kind/APIVersion empty and
producing malformed keys that break the RbacCapabilityMap lookup; fix each
manifest constructor that returns Kubernetes types (ClusterRole, Role,
ClusterRoleBinding, RoleBinding, etc.) to populate TypeMeta with the correct
Kind and APIVersion (use the corresponding SchemeGroupVersion.String() for
APIVersion and the literal Kind name) so obj.GetObjectKind().GroupVersionKind()
returns valid values and getKey()/RbacCapabilityMap lookups work correctly.
hypershift-operator/controllers/scheduler/aws/autoscaler.go (1)

420-432: ⚠️ Potential issue | 🟠 Major

Don't mark the empty pair label as taken.

This now inserts "" for any hosted-cluster node that lacks OSDFleetManagerPairedNodesLabel. Downstream, takenPairLabels.Has(machineSetPairLabel(ms)) will then reject every unpaired node/machineset, which can block future scale-ups for the unpaired case entirely.

Suggested fix
 	for _, n := range nodes {
 		if n.Labels[hyperv1.HostedClusterLabel] != "" {
-			takenPairLabels.Insert(n.Labels[OSDFleetManagerPairedNodesLabel])
+			if pairLabel := n.Labels[OSDFleetManagerPairedNodesLabel]; pairLabel != "" {
+				takenPairLabels.Insert(pairLabel)
+			}
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/scheduler/aws/autoscaler.go` around lines 420
- 432, collectTakenPairLabels currently inserts an empty string when a
hosted-cluster node lacks OSDFleetManagerPairedNodesLabel, causing downstream
has checks (e.g., machineSetPairLabel comparisons) to incorrectly reject
unpaired resources; update collectTakenPairLabels so it only inserts
n.Labels[OSDFleetManagerPairedNodesLabel] when that label exists and is
non-empty (i.e., check the paired-label string != "" before calling
takenPairLabels.Insert), keeping the existing pod check via podPairLabel as-is.
hypershift-operator/controllers/nodepool/capi.go (1)

520-543: ⚠️ Potential issue | 🟠 Major

Removed NodePool labels will stick on existing Machines.

This helper only writes the current Spec.NodeLabels, so previously propagated labelManagedPrefix.* keys are never pruned from existing Machines. Label removals on the NodePool therefore never fully take effect, which can leave stale scheduling selectors on Machines/Nodes indefinitely.

Suggested fix
 		if result, err := controllerutil.CreateOrPatch(ctx, c.Client, &machine, func() error {
 			if machine.Labels == nil {
 				machine.Labels = make(map[string]string)
 			}
 			if machine.Annotations == nil {
 				machine.Annotations = make(map[string]string)
 			}
+
+			for k := range machine.Labels {
+				if strings.HasPrefix(k, labelManagedPrefix+".") {
+					delete(machine.Labels, k)
+				}
+			}
 
 			for k, v := range nodePool.Spec.NodeLabels {
 				labelKey := fmt.Sprintf("%s.%s", labelManagedPrefix, k)
 				machine.Labels[labelKey] = v
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/capi.go` around lines 520 - 543, The
current CreateOrPatch patcher only adds/updates labelManagedPrefix.* keys from
nodePool.Spec.NodeLabels but never removes managed keys that were deleted from
the NodePool; update the mutation function passed to
controllerutil.CreateOrPatch to prune stale managed labels: scan machine.Labels
for keys with the prefix labelManagedPrefix and delete any whose suffix key is
not present in nodePool.Spec.NodeLabels, then add/update all keys from
nodePool.Spec.NodeLabels as before; also ensure the globalPSNodeLabel key
(constructed via fmt.Sprintf("%s.%s", labelManagedPrefix, globalPSNodeLabel)) is
set when Platform.Type is AWS/Azure and removed when it is not; leave taints
handling unchanged (taintsToJSON and nodePoolAnnotationTaints).
🟠 Major comments (18)
test/e2e/util/dump/journals.go-37-45 (1)

37-45: ⚠️ Potential issue | 🟠 Major

Temporary SSH key/script artifacts are not cleaned up

Line [37] and Line [42] create temp artifacts, but DumpJournals never removes them. This leaves private key material and temp scripts on disk after use.

💡 Proposed fix
 	privateKeyFile, err := setupSSHKey(ctx, hc)
 	if err != nil {
 		return err
 	}
+	defer os.RemoveAll(filepath.Dir(privateKeyFile))

 	copyJournalFile, err := writeCopyJournalScript()
 	if err != nil {
 		return err
 	}
+	defer os.Remove(copyJournalFile)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/util/dump/journals.go` around lines 37 - 45, DumpJournals creates
temporary artifacts (privateKeyFile from setupSSHKey and copyJournalFile from
writeCopyJournalScript) but never removes them; update DumpJournals to remove
these temp files when done and on error: after each successful creation defer
removal (or add a cleanup block at function exit) that calls os.Remove for
privateKeyFile and copyJournalFile, and ensure deferred removals are set up
immediately after each successful call so error paths also clean up; reference
the variables privateKeyFile, copyJournalFile and the function DumpJournals when
applying the change.
test/e2e/util/dump/journals.go-243-246 (1)

243-246: ⚠️ Potential issue | 🟠 Major

Avoid defaulting SSH ingress to world-open CIDR

Line [245] falls back to 0.0.0.0/0, which opens SSH access globally if SSH_SOURCE_CIDR is unset.

💡 Proposed fix
 	sourceCIDR := os.Getenv("SSH_SOURCE_CIDR")
 	if sourceCIDR == "" {
-		sourceCIDR = "0.0.0.0/0"
+		return fmt.Errorf("SSH_SOURCE_CIDR must be set; refusing to authorize SSH from 0.0.0.0/0")
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/util/dump/journals.go` around lines 243 - 246, Do not default
SSH_SOURCE_CIDR to the world-open "0.0.0.0/0"; instead make sourceCIDR (the
SSH_SOURCE_CIDR env var used in test/e2e/util/dump/journals.go) required or
default to a safe narrow CIDR (e.g. "127.0.0.1/32" or a private CIDR) and fail
fast if unset. Update the code where sourceCIDR is read so it validates the env
var, returns an error (or calls t.Fatal / os.Exit in the test helper) when the
value is missing/invalid, or uses the safe narrow default, and ensure any place
that consumes sourceCIDR handles the error/validated value.
test/e2e/util/dump/journals.go-199-217 (1)

199-217: ⚠️ Potential issue | 🟠 Major

sgID is never captured for private-IP path

Line [210] only initializes sgID when usePublicIPs is true. For private-IP flow, authorizeSSHAccess receives empty sgID and becomes a no-op.

💡 Proposed fix
 	for _, reservation := range result.Reservations {
 		for _, instance := range reservation.Instances {
 			if isBastionInstance(instance, hc.Spec.InfraID) {
 				continue
 			}
 			if string(instance.State.Name) != "running" {
 				continue
 			}
+			if sgID == "" && len(instance.SecurityGroups) > 0 {
+				sgID = aws.ToString(instance.SecurityGroups[0].GroupId)
+			}
 			if usePublicIPs {
 				if aws.ToString(instance.PublicIpAddress) != "" {
-					if sgID == "" && len(instance.SecurityGroups) > 0 {
-						sgID = aws.ToString(instance.SecurityGroups[0].GroupId)
-					}
 					machineIPs = append(machineIPs, aws.ToString(instance.PublicIpAddress))
 				}
 			} else {
 				machineIPs = append(machineIPs, aws.ToString(instance.PrivateIpAddress))
 			}
 			machineInstances = append(machineInstances, instance)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/util/dump/journals.go` around lines 199 - 217, The loop over EC2
instances only sets sgID inside the usePublicIPs branch so authorizeSSHAccess
gets an empty sgID for private-IP flows; move or duplicate the sgID assignment
so that whenever sgID == "" and instance.SecurityGroups has entries you set sgID
= aws.ToString(instance.SecurityGroups[0].GroupId) regardless of usePublicIPs
(inside the same for _, instance range before appending IPs), ensuring
authorizeSSHAccess receives the captured security group ID; reference
variables/functions: sgID, machineIPs, authorizeSSHAccess, isBastionInstance.
test/e2e/util/dump/journals.go-164-173 (1)

164-173: ⚠️ Potential issue | 🟠 Major

Bastion teardown timing moved too late in lifecycle

Line [164] registers bastion destruction with t.Cleanup, but DumpJournals is already called from cleanup (test/e2e/util/fixture.go, Line [410]). This can delay teardown ordering and run with a stale/canceled ctx, causing bastion leaks.

💡 Proposed fix
-func setupBastion(t *testing.T, ctx context.Context, hc *hyperv1.HostedCluster, awsCreds string, createLogger, destroyLogger *zap.Logger) (string, error) {
+func setupBastion(t *testing.T, ctx context.Context, hc *hyperv1.HostedCluster, awsCreds string, createLogger, destroyLogger *zap.Logger) (string, func(), error) {
 	if hc.Annotations[hyperv1.AWSMachinePublicIPs] == "true" {
-		return "", nil
+		return "", nil, nil
 	}
 	createBastion := bastionaws.CreateBastionOpts{
 		Namespace:          hc.Namespace,
 		Name:               hc.Name,
 		AWSCredentialsFile: awsCreds,
 		Wait:               true,
 	}
 	_, bastionIP, err := createBastion.Run(ctx, zapr.NewLoggerWithOptions(createLogger))
 	if err != nil {
-		return "", err
+		return "", nil, err
 	}
-	t.Cleanup(func() {
+	cleanup := func() {
 		destroyBastion := bastionaws.DestroyBastionOpts{
 			Namespace:          hc.Namespace,
 			Name:               hc.Name,
 			AWSCredentialsFile: awsCreds,
 		}
-		if err := destroyBastion.Run(ctx, zapr.NewLoggerWithOptions(destroyLogger)); err != nil {
+		if err := destroyBastion.Run(context.Background(), zapr.NewLoggerWithOptions(destroyLogger)); err != nil {
 			t.Logf("error destroying bastion: %v", err)
 		}
-	})
-	return bastionIP, nil
+	}
+	return bastionIP, cleanup, nil
 }
-	bastionIP, err := setupBastion(t, ctx, hc, awsCreds, createLogger, destroyLogger)
+	bastionIP, cleanupBastion, err := setupBastion(t, ctx, hc, awsCreds, createLogger, destroyLogger)
 	if err != nil {
 		return err
 	}
+	if cleanupBastion != nil {
+		defer cleanupBastion()
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/util/dump/journals.go` around lines 164 - 173, The bastion
destruction is being deferred via t.Cleanup which runs after DumpJournals
(registered elsewhere) and may use a canceled ctx; instead, execute the teardown
synchronously or register it earlier so it runs before DumpJournals. Locate the
bastionaws.DestroyBastionOpts creation (destroyBastion) in
DumpJournals/journals.go and replace the t.Cleanup wrapper with an immediate
call to destroyBastion.Run(ctx, zapr.NewLoggerWithOptions(destroyLogger))
(handling/logging the error), or move the t.Cleanup registration to execute
before DumpJournals is registered (ensure the destroy registration occurs
earlier in the test lifecycle) so the bastion is torn down while ctx is still
valid.
control-plane-operator/controllers/azureprivatelinkservice/controller.go-960-974 (1)

960-974: ⚠️ Potential issue | 🟠 Major

Don't tear down the shared base-domain VNet link before checking for siblings.

reconcileDNSZone already handles the case where only one sibling CR actually owns the base-domain VNet link. Here we always try to delete baseDomainVNetLinkName(azPLS.Name) before hasSiblingCR() runs, so if the deleting CR is the one that created the real link, the surviving sibling loses name resolution even though we keep the zone. Please gate shared-link teardown on the sibling check, or otherwise resolve and delete the actual shared link only when the last sibling is gone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/azureprivatelinkservice/controller.go`
around lines 960 - 974, The code deletes the base-domain VNet link
(deleteVNetLink with baseDomainVNetLinkName(azPLS.Name)) before checking for
sibling CRs (hasSiblingCR), which can break name resolution for surviving
siblings; move or gate the shared-link teardown so it only runs after confirming
there are no sibling CRs (call hasSiblingCR first and only invoke
deleteVNetLink/deleteDNSZone when hasSiblings is false), or alternatively
resolve the actual shared link owner and delete it only when the last sibling is
gone (adjust reconcileDNSZone logic to perform the shared-link deletion after
the sibling check).
control-plane-operator/main.go-181-223 (1)

181-223: ⚠️ Potential issue | 🟠 Major

Preserve the caller's context in newOperatorImageLookup.

Line 187 switches these API reads to context.Background(), so the image lookup no longer honors the startup/shutdown context or the 30s PollUntilContextTimeout budget. A stuck apiserver read can now block this path past the caller's timeout. Please thread the caller's ctx into the helper instead of creating a background context here.

Suggested fix
-func newOperatorImageLookup(mgr ctrl.Manager, namespace, deploymentName string) func(string) (string, error) {
+func newOperatorImageLookup(ctx context.Context, mgr ctrl.Manager, namespace, deploymentName string) func(string) (string, error) {
 	return func(userSpecifiedImage string) (string, error) {
 		if len(userSpecifiedImage) > 0 {
 			setupLog.Info("using image from arguments", "image", userSpecifiedImage)
 			return userSpecifiedImage, nil
 		}
-		ctx := context.Background()
 		if podName := os.Getenv("POD_NAME"); podName != "" {
 			me := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: podName}}
 			if err := mgr.GetAPIReader().Get(ctx, crclient.ObjectKeyFromObject(me), me); err != nil {
 				return "", fmt.Errorf("failed to get operator pod %s: %w", crclient.ObjectKeyFromObject(me), err)
 			}
@@
-		lookupOperatorImage := newOperatorImageLookup(mgr, namespace, deploymentName)
+		lookupOperatorImage := newOperatorImageLookup(ctx, mgr, namespace, deploymentName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/main.go` around lines 181 - 223, The returned lookup
currently creates a background context and ignores the caller's timeout; change
newOperatorImageLookup to accept a context from the caller by altering its
returned function signature to func(ctx context.Context, userSpecifiedImage
string) (string, error) and replace ctx := context.Background() with the
provided ctx so all mgr.GetAPIReader().Get calls (and the pod/deployment lookups
in the closure) honor the caller's cancellation/deadline (references:
newOperatorImageLookup, the closure's Get calls, the podName branch, and the
deployment lookup).
ignition-server/controllers/local_ignitionprovider.go-544-550 (1)

544-550: ⚠️ Potential issue | 🟠 Major

Propagate early machine-config-server exits instead of timing out.

The goroutine only logs CombinedOutput(). If machine-config-server dies before serving /config/master, the poll loop keeps retrying until the outer context expires, so callers lose the real stderr and get a misleading timeout. Feed the subprocess result back into the poll loop and fail as soon as the process exits unexpectedly.

Also applies to: 556-590

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ignition-server/controllers/local_ignitionprovider.go` around lines 544 -
550, The goroutine that starts machine-config-server currently only logs
cmd.CombinedOutput() and doesn't propagate process exit to the poll loop; change
it to send the subprocess result back so the poll loop can fail fast. Create a
channel (e.g., cmdErrChan) and in the goroutine run out, err :=
cmd.CombinedOutput(); send a wrapped error (including stdout/stderr) on
cmdErrChan (or nil on success), then in the poll loop select on ctx.Done(),
time.After(...), and cmdErrChan so that if machine-config-server exits early the
function returns the subprocess error immediately. Apply the same change for the
duplicate block covering the other machine-config-server start (the section
around lines 556-590).
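
A hedged sketch of the channel-based pattern this comment suggests: the subprocess exit is surfaced to the poll loop so it can fail fast with the real output instead of timing out. The names here (runMCS, waitForConfig, probe) are illustrative assumptions, not the provider's actual code:

```go
package mcs

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runMCS starts the machine-config-server subprocess and reports its exit,
// including combined output, on the returned channel.
func runMCS(cmd *exec.Cmd) <-chan error {
	done := make(chan error, 1)
	go func() {
		out, err := cmd.CombinedOutput()
		if err != nil {
			done <- fmt.Errorf("machine-config-server exited early: %w; output: %s", err, out)
			return
		}
		done <- fmt.Errorf("machine-config-server exited before serving config; output: %s", out)
	}()
	return done
}

// waitForConfig polls until probe succeeds, the context expires, or the
// subprocess exits, whichever comes first.
func waitForConfig(ctx context.Context, done <-chan error, probe func() bool) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			return err // propagate the subprocess failure immediately
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if probe() {
				return nil
			}
		}
	}
}
```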
ignition-server/controllers/local_ignitionprovider.go-352-363 (1)

352-363: ⚠️ Potential issue | 🟠 Major

Close the extracted binary file on every error path.

If Line 356 or Line 359 fails, file.Close() is never reached. On repeated retries that can leak descriptors in the controller process. Wrap this block so the file is closed via defer immediately after os.Create.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ignition-server/controllers/local_ignitionprovider.go` around lines 352 -
363, The created file handle from os.Create (variable file) can leak on the
error paths in the block around file.Chmod and
p.ImageFileCache.extractImageFile; immediately after the os.Create call in the
same scope, add a deferred close (defer func() { _ = file.Close() }()) so the
file is always closed on any early return, and then remove or adapt the explicit
file.Close() later (or check and return any Close error) to avoid double-close
handling; keep the reframing local to this block around os.Create, file.Chmod,
p.ImageFileCache.extractImageFile and file.Close.
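
A small, self-contained sketch of the defer-based cleanup being suggested; writeExecutable, binPath, and the chmod mode are placeholders rather than the provider's actual extraction logic:

```go
package extract

import (
	"fmt"
	"os"
)

// writeExecutable defers Close immediately after os.Create so the file
// descriptor is released on every return path, including the error paths
// between creation and the final write.
func writeExecutable(binPath string, contents []byte) error {
	file, err := os.Create(binPath)
	if err != nil {
		return fmt.Errorf("failed to create %s: %w", binPath, err)
	}
	defer file.Close()

	if err := file.Chmod(0o755); err != nil {
		return fmt.Errorf("failed to chmod %s: %w", binPath, err)
	}
	if _, err := file.Write(contents); err != nil {
		return fmt.Errorf("failed to write %s: %w", binPath, err)
	}
	return nil
}
```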
ignition-server/controllers/local_ignitionprovider.go-266-267 (1)

266-267: ⚠️ Potential issue | 🟠 Major

Restrict permissions on the kubeconfig and pull secret files.

Lines 266-267 and 404-405 write credentials with 0644, which makes them readable to any user that can access the workdir. These files only need to be readable by the current process and its child processes, so they should be 0600.

Suggested fix
-	if err := os.WriteFile(filepath.Join(dirs.mcsDir, "kubeconfig"), bootstrapKubeConfig, 0644); err != nil {
+	if err := os.WriteFile(filepath.Join(dirs.mcsDir, "kubeconfig"), bootstrapKubeConfig, 0600); err != nil {
 		return fmt.Errorf("failed to write bootstrap kubeconfig: %w", err)
 	}
-	if err = os.WriteFile(fmt.Sprintf("%s/pull-secret.yaml", dirs.configDir), []byte(serializedPullSecret), 0644); err != nil {
+	if err = os.WriteFile(fmt.Sprintf("%s/pull-secret.yaml", dirs.configDir), []byte(serializedPullSecret), 0600); err != nil {
 		return fmt.Errorf("failed to write pull secret to config dir: %w", err)
 	}

Also applies to: 404-405

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ignition-server/controllers/local_ignitionprovider.go` around lines 266 -
267, Change the file write permissions for the credential files written with
os.WriteFile to be owner-only readable/writable (0600) instead of 0644: update
the os.WriteFile call that writes filepath.Join(dirs.mcsDir, "kubeconfig") with
bootstrapKubeConfig to use file mode 0600, and likewise update the other
os.WriteFile call (the pull secret write at the second occurrence around lines
404-405) to use 0600 so both kubeconfig and pull secret are only accessible to
the current user/process.
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go-578-586 (1)

578-586: ⚠️ Potential issue | 🟠 Major

Don’t return from the whole helper on a registry Get failure.

A non-NotFound error at Line 585 exits reconcileRegistryAndIngress before the additional trusted CA sync, ingress reconciliation, and reconcileAuthOIDC run. The surrounding refactor moved toward aggregating sub-step failures, so this now blocks unrelated resources behind a registry read error.

Suggested fix
 func (r *reconciler) reconcileRegistryAndIngress(ctx context.Context, hcp *hyperv1.HostedControlPlane, log logr.Logger) []error {
 	var errs []error
+	registryConfigReadable := true
 
 	registryConfig := manifests.Registry()
 	var registryConfigExists bool
 	if err := r.client.Get(ctx, client.ObjectKeyFromObject(registryConfig), registryConfig); err != nil {
 		if !apierrors.IsNotFound(err) {
-			return []error{fmt.Errorf("failed to get registry config: %w", err)}
+			errs = append(errs, fmt.Errorf("failed to get registry config: %w", err))
+			registryConfigReadable = false
 		}
 	} else {
 		registryConfigExists = true
 	}
 
-	if capabilities.IsImageRegistryCapabilityEnabled(hcp.Spec.Capabilities) {
+	if registryConfigReadable && capabilities.IsImageRegistryCapabilityEnabled(hcp.Spec.Capabilities) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`
around lines 578 - 586, The helper reconcileRegistryAndIngress currently returns
early when r.client.Get for registryConfig errors (non-NotFound), which prevents
later steps (trusted CA sync, ingress reconciliation, reconcileAuthOIDC) from
running; instead, change the error handling for r.client.Get to append the
formatted error to errs (e.g., errs = append(errs, fmt.Errorf("failed to get
registry config: %w", err))) and continue execution so subsequent reconciliation
steps still run, while preserving the registryConfigExists boolean semantics
(set true only when Get succeeds) and leaving NotFound treated as a non-error.
hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go-88-107 (1)

88-107: ⚠️ Potential issue | 🟠 Major

Propagate annotation-repair patch failures instead of continuing on stale state.

Line 101 and Line 122 currently log patch failures and continue. That can preserve corrupted annotations and cause repeated snapshots for the same restart progression.

🔧 Suggested fix
-func (r *SnapshotReconciler) getLastObservedRestartCount(ctx context.Context, pod *corev1.Pod, log logr.Logger) int32 {
+func (r *SnapshotReconciler) getLastObservedRestartCount(ctx context.Context, pod *corev1.Pod, log logr.Logger) (int32, error) {
 	val, ok := pod.Annotations[lastObservedRestartCountAnnotation]
 	if !ok {
-		return 0
+		return 0, nil
 	}
 	count, err := parseInt32(val)
 	if err != nil {
 		log.V(1).Info("Failed to parse last observed restart count annotation, resetting to 0", "annotationValue", val, "error", err)
 		podCopy := pod.DeepCopy()
@@
-		if patchErr := r.client.Patch(ctx, podCopy, client.MergeFrom(pod)); patchErr != nil {
-			log.Error(patchErr, "Failed to reset corrupted annotation")
-		}
-		return 0
+		if patchErr := r.client.Patch(ctx, podCopy, client.MergeFrom(pod)); patchErr != nil {
+			return 0, fmt.Errorf("failed to reset corrupted %s annotation: %w", lastObservedRestartCountAnnotation, patchErr)
+		}
+		return 0, nil
 	}
-	return count
+	return count, nil
 }
 
-func (r *SnapshotReconciler) checkSnapshotInterval(ctx context.Context, pod *corev1.Pod, spec *auditlogpersistencev1alpha1.AuditLogPersistenceConfigSpec, log logr.Logger) (shouldSnapshot bool, skipReconcile bool) {
+func (r *SnapshotReconciler) checkSnapshotInterval(ctx context.Context, pod *corev1.Pod, spec *auditlogpersistencev1alpha1.AuditLogPersistenceConfigSpec, log logr.Logger) (shouldSnapshot bool, skipReconcile bool, err error) {
@@
-		if patchErr := r.client.Patch(ctx, podCopy, client.MergeFrom(pod)); patchErr != nil {
-			log.Error(patchErr, "Failed to remove corrupted last snapshot time annotation")
-		}
-		return true, false
+		if patchErr := r.client.Patch(ctx, podCopy, client.MergeFrom(pod)); patchErr != nil {
+			return false, false, fmt.Errorf("failed to remove corrupted %s annotation: %w", lastSnapshotTimeAnnotation, patchErr)
+		}
+		return true, false, nil
 	}
@@
-		return true, false
+		return true, false, nil
 	}
 	if time.Since(lastSnapshotTime) >= minInterval {
-		return true, false
+		return true, false, nil
 	}
-	return false, true
+	return false, true, nil
 }
-	lastObservedRestartCount := r.getLastObservedRestartCount(ctx, pod, log)
+	lastObservedRestartCount, err := r.getLastObservedRestartCount(ctx, pod, log)
+	if err != nil {
+		return ctrl.Result{}, err
+	}
@@
-	shouldSnapshot, skipReconcile := r.checkSnapshotInterval(ctx, pod, spec, log)
+	shouldSnapshot, skipReconcile, err := r.checkSnapshotInterval(ctx, pod, spec, log)
+	if err != nil {
+		return ctrl.Result{}, err
+	}

Also applies to: 109-136

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go`
around lines 88 - 107, The getLastObservedRestartCount function
(SnapshotReconciler.getLastObservedRestartCount) currently swallows
r.client.Patch errors and continues, leaving corrupted annotations; change its
signature to return (int32, error), attempt the annotation reset as now, and if
r.client.Patch(ctx, podCopy, client.MergeFrom(pod)) fails return 0 and the patch
error (wrap with context) instead of only logging; update all callers to
handle/propagate the error (or requeue) so the reconcile loop can retry; apply
the same change/behavior to the other similar block handling the
lastObservedRestartCountAnnotation in the file (lines ~109-136), ensuring
parseInt32 parsing errors lead to a patch attempt and any patch failure is
returned upward.
test/e2e/util/util.go-2493-2510 (1)

2493-2510: ⚠️ Potential issue | 🟠 Major

Move the fallback out of the service loop.

The default branch runs for every non-APIServer service, so it can overwrite a valid destination found earlier. It also means an APIServer entry with an empty hostname never falls back unless some other service happens to be present later in the slice.

[suggested fix]

Proposed fix
 func findExternalNameDestination(t *testing.T, ctx context.Context, g Gomega, mgmtClient crclient.Client, hc *hyperv1.HostedCluster, hcpNamespace string) string {
-	var dest string
 	for _, service := range hc.Spec.Services {
-		fmt.Printf("service: %+v\n", service)
-		switch service.Service {
-		case hyperv1.APIServer:
-			dest = apiServerExternalName(service)
-		default:
-			t.Log("service custom DNS name not found, using the control plane endpoint")
-			hcp := &hyperv1.HostedControlPlane{}
-			err := mgmtClient.Get(ctx, types.NamespacedName{Namespace: hcpNamespace, Name: hc.Name}, hcp)
-			g.Expect(err).NotTo(HaveOccurred(), "failed to get updated HostedControlPlane")
-			g.Expect(hcp.Status.ControlPlaneEndpoint.Host).NotTo(BeEmpty(), "failed to get the control plane endpoint")
-			dest = hcp.Status.ControlPlaneEndpoint.Host
-		}
+		if service.Service != hyperv1.APIServer {
+			continue
+		}
+		if dest := apiServerExternalName(service); dest != "" {
+			return dest
+		}
+		break
 	}
-	g.Expect(dest).NotTo(BeEmpty(), "failed to get the external name destination")
-	return dest
+
+	t.Log("service custom DNS name not found, using the control plane endpoint")
+	hcp := &hyperv1.HostedControlPlane{}
+	err := mgmtClient.Get(ctx, types.NamespacedName{Namespace: hcpNamespace, Name: hc.Name}, hcp)
+	g.Expect(err).NotTo(HaveOccurred(), "failed to get updated HostedControlPlane")
+	g.Expect(hcp.Status.ControlPlaneEndpoint.Host).NotTo(BeEmpty(), "failed to get the control plane endpoint")
+	return hcp.Status.ControlPlaneEndpoint.Host
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/util/util.go` around lines 2493 - 2510, The fallback that reads the
HostedControlPlane and sets dest should be moved out of the loop in
findExternalNameDestination so it does not run for every non-APIServer service
and overwrite a previously found value; iterate hc.Spec.Services, set dest only
when service.Service == hyperv1.APIServer (using apiServerExternalName(service))
and break when a non-empty dest is found, then after the loop, if dest is empty
perform the mgmtClient.Get(...) on HostedControlPlane and set dest =
hcp.Status.ControlPlaneEndpoint.Host with the same g.Expect checks.
hypershift-operator/controllers/hostedcluster/etcd_recovery.go-48-55 (1)

48-55: ⚠️ Potential issue | 🟠 Major

Don’t fall through while a recovery Job is still running.

Line 49 only logs “waiting” for an unfinished Job, but Line 54 still re-enters detection/trigger logic. The outer reconcile loop only keeps polling when this method returns a requeueAfter, so an in-flight recovery can stop being watched as soon as ETCD looks healthy, leaving cleanup and EtcdRecoveryActive=False to chance.

Also applies to: 57-60

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/etcd_recovery.go` around lines
48 - 55, When a recovery Job is still running you must not proceed into
detection/trigger logic; after calling handleExistingEtcdRecoveryJob (and when
jobStatus.exists is true and indicates in-flight) return immediately with a
requeue result so the outer reconcile continues watching instead of calling
detectAndTriggerEtcdRecovery; apply the same early-return/requeue behavior to
the other analogous block (lines 57-60) that currently falls through. Ensure you
detect the "running"/in-flight state from jobStatus and return a requeue rather
than calling detectAndTriggerEtcdRecovery or further logic.
hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go-169-170 (1)

169-170: ⚠️ Potential issue | 🟠 Major

Keep the requeue in the “no-op” transition paths.

Line 425 and Line 465 return &noOpAction precisely when the delay/window has not elapsed but the conditions already match. That sentinel carries no requeueAfter, and Line 342/Line 349 then suppress it entirely, so the controller can stop reconciling before the pending resize is actually allowed to proceed.

Suggested direction
- // Conditions already match status; no update needed but delay has not elapsed, so stop processing.
- return &noOpAction
+ // Conditions already match status; no update needed, but we still must wake up
+ // when the delay expires so the transition can continue.
+ return &action{requeueAfter: delayStart.Add(delay).Sub(r.now())}
- if result == &noOpAction {
- 	return nil, nil
- }
  return result, nil

Do the same for the concurrency-limit path by returning a plain action{requeueAfter: durationUntilChanges} even when no status patch is needed.

Also applies to: 341-352, 381-425, 428-465

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go`
around lines 169 - 170, The controller returns the sentinel noOpAction (var
noOpAction = action{}) on transition paths where a delay/window hasn't elapsed,
which carries no requeueAfter and prevents the reconcile from being requeued;
instead, in those paths (including the concurrency-limit branch) return an
action with requeueAfter set to the computed duration (e.g. return
action{requeueAfter: durationUntilChanges}) rather than &noOpAction so the
controller will requeue until the delay/window has passed; update all places
that currently return &noOpAction when the only reason to stop is timing or
concurrency (keep using the existing action struct and durationUntilChanges
variable name) so downstream logic still patches nothing but schedules a
requeue.
cmd/cluster/core/create.go-467-476 (1)

467-476: ⚠️ Potential issue | 🟠 Major

--pausedUntil=true is ignored by the new helper.

Line 468 treats "true" the same as “unset” and returns before assigning cluster.Spec.PausedUntil. That drops the documented “pause until the field is removed” behavior for the boolean form.

Suggested fix
 func applyPausedUntil(cluster *hyperv1.HostedCluster, opts *CreateOptions) error {
-	if len(opts.PausedUntil) == 0 || opts.PausedUntil == "true" {
+	if len(opts.PausedUntil) == 0 {
 		return nil
 	}
+	if opts.PausedUntil == "true" {
+		cluster.Spec.PausedUntil = &opts.PausedUntil
+		return nil
+	}
 	_, err := time.Parse(time.RFC3339, opts.PausedUntil)
 	if err != nil {
 		return fmt.Errorf("invalid pausedUntil value, should be \"true\" or a valid RFC3339 date format: %w", err)
 	}
 	cluster.Spec.PausedUntil = &opts.PausedUntil
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/core/create.go` around lines 467 - 476, The helper currently
treats opts.PausedUntil == "true" as unset and returns early, dropping the
documented boolean behavior; change applyPausedUntil so that when
opts.PausedUntil == "true" you explicitly set cluster.Spec.PausedUntil to a
pointer to an empty string (e.g. empty := ""; cluster.Spec.PausedUntil = &empty)
and return nil, keep the existing len(opts.PausedUntil) == 0 early return for
unset, and only run time.Parse when opts.PausedUntil is neither empty nor
"true".
hypershift-operator/controllers/nodepool/aws.go-154-175 (1)

154-175: ⚠️ Potential issue | 🟠 Major

Keep the default EBS type in the nil RootVolume path.

buildAWSRootVolume only applies EC2VolumeDefaultType when nodePool.Spec.Platform.AWS.RootVolume is present. If the field is omitted entirely, this helper now returns a volume with Type == "", so NodePools relying on the controller default no longer get the same machine template.

Suggested fix
 func buildAWSRootVolume(nodePool *hyperv1.NodePool) *capiaws.Volume {
 	rootVolume := &capiaws.Volume{
-		Size: EC2VolumeDefaultSize,
+		Size: EC2VolumeDefaultSize,
+		Type: capiaws.VolumeType(EC2VolumeDefaultType),
 	}
 	if nodePool.Spec.Platform.AWS.RootVolume == nil {
 		return rootVolume
 	}
 	if nodePool.Spec.Platform.AWS.RootVolume.Type != "" {
 		rootVolume.Type = capiaws.VolumeType(nodePool.Spec.Platform.AWS.RootVolume.Type)
-	} else {
-		rootVolume.Type = capiaws.VolumeType(EC2VolumeDefaultType)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/aws.go` around lines 154 - 175,
buildAWSRootVolume currently returns a rootVolume with Size set to
EC2VolumeDefaultSize but leaves Type empty when
nodePool.Spec.Platform.AWS.RootVolume is nil; update the function
(buildAWSRootVolume) to set rootVolume.Type to
capiaws.VolumeType(EC2VolumeDefaultType) in the nil RootVolume path (or
initialize Type when creating rootVolume) so the default EBS type is applied
consistently even when RootVolume is omitted.
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go-799-816 (1)

799-816: ⚠️ Potential issue | 🟠 Major

Availability stops checking component health after the first success.

Once HostedControlPlaneAvailable becomes true, controlPlaneComponentsAvailable is never called again. A later component regression will therefore keep Status.Ready/HostedControlPlaneAvailable green even though the control plane is no longer fully available.

Suggested fix
-	availableCondition := meta.FindStatusCondition(hostedControlPlane.Status.Conditions, string(hyperv1.HostedControlPlaneAvailable))
-	alreadyAvailable := availableCondition != nil && availableCondition.Status == metav1.ConditionTrue
-	if !alreadyAvailable {
-		componentsNotAvailableMsg, componentsErr = r.controlPlaneComponentsAvailable(ctx, hostedControlPlane)
-	}
+	componentsNotAvailableMsg, componentsErr = r.controlPlaneComponentsAvailable(ctx, hostedControlPlane)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`
around lines 799 - 816, The code skips calling controlPlaneComponentsAvailable
once HostedControlPlaneAvailable is true, so regressions aren't detected; always
invoke controlPlaneComponentsAvailable (instead of conditioning it on
alreadyAvailable) and feed its returned componentsNotAvailableMsg and
componentsErr into reconcileAvailabilityStatus (the functions and symbols to
update are controlPlaneComponentsAvailable, reconcileAvailabilityStatus, and the
use of hostedControlPlane.Status.Conditions/HostedControlPlaneAvailable) so
component health is rechecked every reconciliation and the
HostedControlPlaneAvailable/Status.Ready can transition back to false when
needed.
test/e2e/util/reqserving/verifycp.go-238-267 (1)

238-267: ⚠️ Potential issue | 🟠 Major

The timeout fallback can silently pass or panic on missing requests.

If the expected container is absent, this loop appends nothing and verifyResourceRequests can return nil after a timeout. Also, unlike containerResourcesMatch, the fallback path calls Memory().Cmp(...) / Cpu().Cmp(...) without nil checks, so a missing request can panic instead of producing a useful assertion error.

Suggested fix
 	for _, effect := range effects.ResourceRequests {
 		workloadKind := "deployment"
 		if effect.DeploymentName == "etcd" {
 			workloadKind = "statefulset"
 		}
 		containers, getErr := getContainersForEffect(ctx, client, cpNamespace, effect)
 		if getErr != nil {
 			errs = append(errs, fmt.Errorf("failed to get %s %s: %w", workloadKind, effect.DeploymentName, getErr))
 			continue
 		}
+		foundContainer := false
 		for _, container := range containers {
 			if container.Name != effect.ContainerName {
 				continue
 			}
+			foundContainer = true
 			if effect.Memory != nil {
-				if container.Resources.Requests.Memory().Cmp(*effect.Memory) != 0 {
+				if mr := container.Resources.Requests.Memory(); mr == nil || mr.Cmp(*effect.Memory) != 0 {
 					errs = append(errs, fmt.Errorf("%s %s has memory request set to %v, expected %v", workloadKind, effect.DeploymentName, container.Resources.Requests.Memory(), *effect.Memory))
 				}
 			}
 			if effect.CPU != nil {
-				if container.Resources.Requests.Cpu().Cmp(*effect.CPU) != 0 {
+				if cr := container.Resources.Requests.Cpu(); cr == nil || cr.Cmp(*effect.CPU) != 0 {
 					errs = append(errs, fmt.Errorf("%s %s has cpu request set to %v, expected %v", workloadKind, effect.DeploymentName, container.Resources.Requests.Cpu(), *effect.CPU))
 				}
 			}
 			break
 		}
+		if !foundContainer {
+			errs = append(errs, fmt.Errorf("%s %s does not contain container %s", workloadKind, effect.DeploymentName, effect.ContainerName))
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/util/reqserving/verifycp.go` around lines 238 - 267,
reportResourceMismatches currently can silently succeed or panic when the
expected container or resource requests are missing; update
reportResourceMismatches to (1) detect when no matching container was found for
an effect and append a descriptive error (so verifyResourceRequests won’t return
nil on timeout), and (2) guard access to container.Resources.Requests memory/cpu
before calling Cmp — if a request is absent, append a clear error instead of
calling Memory().Cmp or Cpu().Cmp. Reference the reportResourceMismatches
function and the effect loop (effects.ResourceRequests / effect.ContainerName)
and make the error messages mirror the style used by containerResourcesMatch so
assertions remain informative.
🟡 Minor comments (10)
cmd/infra/aws/create.go-259-259 (1)

259-259: ⚠️ Potential issue | 🟡 Minor

Remove endpointRouteTableIds from the return signature of createPerZoneResources or use it in the caller.

The endpointRouteTableIds return value is computed and returned by createPerZoneResources (line 382) but immediately discarded by the caller at line 259. While the value is used within the helper function itself (passed to CreateVPCS3Endpoint), there's no reason to return it if the caller doesn't need it. Either remove it from the return signature to clean up the API, or if there's a specific reason for returning it, use it in the CreateInfra method.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/infra/aws/create.go` at line 259, The helper createPerZoneResources
currently builds and returns endpointRouteTableIds which the caller in
CreateInfra ignores; remove endpointRouteTableIds from createPerZoneResources'
return signature and from its internal returned tuple, and update the
CreateInfra call site to stop discarding that value (remove the placeholder
assignment). Ensure CreateVPCS3Endpoint still receives the routeTableIds inside
createPerZoneResources where it’s needed and update any related variable
names/returns accordingly so no unused value is returned.
hypershift-operator/controllers/nodepool/metrics/metrics.go-430-433 (1)

430-433: ⚠️ Potential issue | 🟡 Minor

Same nil pointer risk as collectMachineSetReplicas.

MachineDeployment.Spec.Replicas is also a pointer; apply the same guard.

🛡️ Suggested fix
 	for k := range machineDeployments.Items {
 		md := &machineDeployments.Items[k]
-		result[md.Namespace+"/"+md.Name] = *md.Spec.Replicas
+		replicas := int32(0)
+		if md.Spec.Replicas != nil {
+			replicas = *md.Spec.Replicas
+		}
+		result[md.Namespace+"/"+md.Name] = replicas
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/metrics/metrics.go` around lines 430
- 433, The loop that sets result[md.Namespace+"/"+md.Name] = *md.Spec.Replicas
dereferences md.Spec.Replicas without checking for nil (same risk as in
collectMachineSetReplicas); update the loop in metrics.go (the
machineDeployments iteration where md := &machineDeployments.Items[k]) to guard
against nil by checking md.Spec != nil && md.Spec.Replicas != nil before
dereferencing, and only assign to the result map when the pointer is non-nil
(otherwise skip or set a safe default).
hypershift-operator/controllers/nodepool/metrics/metrics.go-417-420 (1)

417-420: ⚠️ Potential issue | 🟡 Minor

Potential nil pointer dereference on Spec.Replicas.

MachineSet.Spec.Replicas is a pointer that could be nil (CAPI defaults it but doesn't guarantee it). Dereferencing without a nil check may cause a panic.

🛡️ Suggested fix
 	for k := range machineSets.Items {
 		machineSet := &machineSets.Items[k]
-		result[machineSet.Namespace+"/"+machineSet.Name] = *machineSet.Spec.Replicas
+		replicas := int32(0)
+		if machineSet.Spec.Replicas != nil {
+			replicas = *machineSet.Spec.Replicas
+		}
+		result[machineSet.Namespace+"/"+machineSet.Name] = replicas
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/nodepool/metrics/metrics.go` around lines 417
- 420, The loop dereferences MachineSet.Spec.Replicas without a nil check which
can panic; update the loop that iterates machineSets.Items (the block
referencing machineSet := &machineSets.Items[k] and
result[...]=*machineSet.Spec.Replicas) to safely handle nil by first checking
machineSet.Spec and machineSet.Spec.Replicas, defaulting to a safe value (e.g.,
int32(0) or another appropriate default) when nil, and then assigning that value
into the result map.
test/e2e/util/dump/journals.go-274-277 (1)

274-277: ⚠️ Potential issue | 🟡 Minor

Close dumpJournalsLog to avoid descriptor leaks

Line 274 opens a file and never closes it. Add a deferred close after successful creation.

💡 Proposed fix
 	dumpJournalsLog, err := os.Create(dumpJournalsLogFile)
 	if err != nil {
 		return fmt.Errorf("failed to create dumpJournals log: %w", err)
 	}
+	defer dumpJournalsLog.Close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/util/dump/journals.go` around lines 274 - 277, The created file
handle dumpJournalsLog opened with os.Create is never closed, causing a file
descriptor leak; after the successful creation in the block that sets
dumpJournalsLog, add a deferred close (e.g., defer dumpJournalsLog.Close())
immediately after the os.Create succeeds so the file is closed when the
surrounding function returns (optionally handle/ignore the Close() error).
hypershift-operator/controllers/hostedcluster/network_policies.go-244-249 (1)

244-249: ⚠️ Potential issue | 🟡 Minor

Duplicate error messages hinder debugging.

Both line 242 and line 248 use the same error message "failed to reconcile konnectivity nodeport network policy". When this error surfaces, it's unclear whether NodePortKonnectivityNetworkPolicy or NodePortKonnectivityKASNetworkPolicy failed.

Proposed fix for distinct error messages
 	policy = networkpolicy.NodePortKonnectivityKASNetworkPolicy(controlPlaneNamespaceName)
 	if _, err := createOrUpdate(ctx, r.Client, policy, func() error {
 		return reconcileNodePortKonnectivityKASNetworkPolicy(policy, hcluster)
 	}); err != nil {
-		return fmt.Errorf("failed to reconcile konnectivity nodeport network policy: %w", err)
+		return fmt.Errorf("failed to reconcile konnectivity KAS nodeport network policy: %w", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around
lines 244 - 249, The error message is ambiguous because both
NodePortKonnectivityNetworkPolicy and NodePortKonnectivityKASNetworkPolicy use
the same text; update the createOrUpdate error return in the block that calls
NodePortKonnectivityKASNetworkPolicy and
reconcileNodePortKonnectivityKASNetworkPolicy to use a distinct message (e.g.,
include "KAS" or "konnectivity KAS nodeport network policy") so it clearly
identifies the failing resource when createOrUpdate (and
reconcileNodePortKonnectivityKASNetworkPolicy) errors.
ignition-server/controllers/local_ignitionprovider.go-566-580 (1)

566-580: ⚠️ Potential issue | 🟡 Minor

Close res.Body before retrying non-200 responses.

On the non-200 OK path, the function returns before the deferred close is registered, so each retry leaks a response body/connection. Move the close immediately after httpclient.Do(req) succeeds, then handle the status code.

Suggested fix
 		res, err := httpclient.Do(req)
 		if err != nil {
 			log.Error(err, "mcs request failed")
 			return false, nil
 		}
+		defer func() {
+			if err := res.Body.Close(); err != nil {
+				log.Error(err, "failed to close mcs response body")
+			}
+		}()
 		if res.StatusCode != http.StatusOK {
-			log.Error(err, "mcs returned unexpected response code", "code", res.StatusCode)
+			log.Info("mcs returned unexpected response code", "code", res.StatusCode)
 			return false, nil
 		}
-
-		defer func() {
-			if err := res.Body.Close(); err != nil {
-				log.Error(err, "failed to close mcs response body")
-			}
-		}()
 		body, err := io.ReadAll(res.Body)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ignition-server/controllers/local_ignitionprovider.go` around lines 566 -
580, The HTTP response body from httpclient.Do(req) is leaked on non-200 paths
because the function returns before res.Body.Close() is registered; move the
body close immediately after a successful httpclient.Do(req) (e.g., call defer
or explicit Close right after res is non-nil) so the body is closed on all
return paths, and update the non-200 log in the status-checking branch to not
reference the previous err variable (use res.StatusCode and include the response
or error text if needed).
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go-530-549 (1)

530-549: ⚠️ Potential issue | 🟡 Minor

Set the recovery condition Reason inside the finished/not-finished branches.

Reason is initialized to hyperv1.RecoveryFinishedReason before finished is known, so the condition can report a “finished” reason while Line 541 still marks recovery incomplete and requeues. That makes the status misleading for consumers.
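
Illustrative sketch (a minimal shape, assuming the identifiers referenced in the prompt below: reconcileRestoredCluster, the Recovery* reasons, hyperv1.HostedClusterRestoredFromBackup, and a conditions slice updated via meta.SetStatusCondition; the requeue handling and exact message text would stay as in the current code):
	// Sketch only: compute finished first, then derive Reason/Status/Message per branch.
	finished, err := r.reconcileRestoredCluster(ctx, hcp)
	if err != nil {
		return fmt.Errorf("failed to reconcile restored cluster: %w", err)
	}
	condition := metav1.Condition{Type: string(hyperv1.HostedClusterRestoredFromBackup)}
	if finished {
		condition.Status = metav1.ConditionTrue
		condition.Reason = hyperv1.RecoveryFinishedReason
		condition.Message = "Restore from backup has finished" // placeholder message
	} else {
		condition.Status = metav1.ConditionFalse
		condition.Reason = hyperv1.RecoveryInProgressReason
		condition.Message = "Restore from backup is still in progress" // placeholder message
	}
	meta.SetStatusCondition(&hcp.Status.Conditions, condition)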

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`
around lines 530 - 549, The condition Reason is set before knowing finished
state which can misreport completion; move or overwrite the Reason inside the
two branches so the condition's Reason reflects the actual state: in reconcile
logic that checks finished from reconcileRestoredCluster(ctx, hcp), set
condition.Reason = hyperv1.RecoveryInProgressReason and condition.Status =
metav1.ConditionFalse (with an appropriate Message) in the !finished branch, and
set condition.Reason = hyperv1.RecoveryFinishedReason and condition.Status =
metav1.ConditionTrue (with the finished Message) in the else branch; update the
same metav1.Condition object (named condition) so only the correct
Reason/Status/Message are emitted for hyperv1.HostedClusterRestoredFromBackup.
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go-2728-2730 (1)

2728-2730: ⚠️ Potential issue | 🟡 Minor

This helper no longer verifies that the script still PATCHes the ConfigMap.

Line 2728 uses &&, but "curl" is already asserted above. If PATCH disappears from the script while curl remains, this check still passes and the regression goes unnoticed.

Suggested fix
-	if !strings.Contains(script, "curl") && !strings.Contains(script, "PATCH") {
+	if !strings.Contains(script, "PATCH") {
 		t.Error("Check script should PATCH the ConfigMap on success")
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`
around lines 2728 - 2730, The current test uses if !strings.Contains(script,
"curl") && !strings.Contains(script, "PATCH") which only fails if BOTH "curl"
and "PATCH" are missing; since "curl" is already asserted elsewhere the test
will not catch a missing "PATCH". Update the condition to if
!strings.Contains(script, "curl") || !strings.Contains(script, "PATCH") or,
better, add a dedicated assertion that strings.Contains(script, "PATCH") to
ensure the script explicitly contains the PATCH operation (referencing the test
variable script in resources_test.go).
hypershift-operator/controllers/platform/aws/controller.go-636-643 (1)

636-643: ⚠️ Potential issue | 🟡 Minor

Stop before the success-status path when the current CR was just deleted.

When Line 641 returns after deleting the obsolete AWSEndpointService, the outer Reconcile still sets AWSEndpointServiceAvailable=True and may try to update status on a resource that is already terminating. Bubble this state up so the caller exits before the “healthy” status update.
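
Illustrative sketch of the sentinel-error wiring (two call sites shown together; deletedObsoleteCR and awsEndpointService are placeholder names, and the real signatures in controller.go differ):
	// Sketch only: a package-level sentinel the caller can test for.
	var errAWSEndpointServiceDeleted = errors.New("obsolete AWSEndpointService was deleted")

	// In reconcileAWSEndpointServiceStatus: report the deletion instead of returning nil.
	if deletedObsoleteCR {
		return errAWSEndpointServiceDeleted
	}

	// In the outer Reconcile: exit before the AWSEndpointServiceAvailable=True status update.
	if err := r.reconcileAWSEndpointServiceStatus(ctx, awsEndpointService); err != nil {
		if errors.Is(err, errAWSEndpointServiceDeleted) {
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}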

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/platform/aws/controller.go` around lines 636
- 643, When deleteObsoleteEndpointService reports it deleted the CR (done ==
true), instead of returning nil from reconcileAWSEndpointServiceStatus, return a
sentinel error (e.g. var errAWSEndpointServiceDeleted = errors.New("aws endpoint
service deleted")) and update the outer Reconcile to check for that sentinel and
exit immediately without setting AWSEndpointServiceAvailable=True or updating
status; modify reconcileAWSEndpointServiceStatus to return
errAWSEndpointServiceDeleted when done==true, add the sentinel error definition
near the top of the file, and update the caller (Reconcile) to handle that
sentinel by returning early.
test/e2e/util/reqserving/verifyenv.go-114-117 (1)

114-117: ⚠️ Potential issue | 🟡 Minor

Verify whether the zone count should be derived from configuration or remain hardcoded.

The check len(actualCount) < 3 requires non-request-serving nodes in at least 3 zones. While setup.go validates that the cluster has at least 3 AWS availability zones and explicitly uses only the first 3 (commonZones = commonZones[:3]), this verification function hardcodes the requirement without sourcing it from the ClusterSizingConfiguration object it receives. If the zone configuration changes in the future, this check won't adapt.

Consider deriving the expected zone count from the zones actually configured for non-request serving machinesets, or document why this must remain hardcoded to 3.
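
Illustrative sketch of the derived expectation (nonRequestServingZones and sizingConfig are hypothetical names; the real source of truth would be the zones configured for the non-request-serving machinesets, or the same slicing setup.go applies):
	// Sketch only: compare against the configured zones instead of the literal 3.
	expectedZones := nonRequestServingZones(sizingConfig)
	if len(actualCount) < len(expectedZones) {
		return fmt.Errorf("expected non-request-serving nodes in %d zones, found %d", len(expectedZones), len(actualCount))
	}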

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/util/reqserving/verifyenv.go` around lines 114 - 117, The hardcoded
check using len(actualCount) < 3 should instead derive the expected zone count
from the cluster sizing configuration passed into the verification routine: read
the configured non-request-serving zones (the same source used in setup.go which
slices commonZones to 3) from the ClusterSizingConfiguration object supplied to
the verify function (or compute expectedZones by inspecting the
non-request-serving machineset definitions) and replace the literal 3 with
len(expectedZones); if the codebase intentionally limits to the first 3 zones,
add a clear comment referencing setup.go's commonZones = commonZones[:3] and
derive expectedZones by applying the same slicing logic so the verification
stays consistent with setup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 04d271b2-18e0-4ed4-a175-ace4e25c63a0

📥 Commits

Reviewing files that changed from the base of the PR and between 1d9ea73 and 08325f7.

📒 Files selected for processing (38)
  • .golangci.yml
  • cmd/cluster/core/create.go
  • cmd/fix/dr_oidc_iam.go
  • cmd/infra/aws/create.go
  • cmd/infra/azure/create.go
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/install.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • control-plane-operator/main.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
  • hypershift-operator/controllers/hostedcluster/metrics/metrics.go
  • hypershift-operator/controllers/hostedcluster/network_policies.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/platform/aws/controller.go
  • hypershift-operator/controllers/scheduler/aws/autoscaler.go
  • hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes.go
  • hypershift-operator/main.go
  • ignition-server/controllers/local_ignitionprovider.go
  • test/e2e/util/dump/journals.go
  • test/e2e/util/reqserving/verifycp.go
  • test/e2e/util/reqserving/verifyenv.go
  • test/e2e/util/util.go

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 17

🧹 Nitpick comments (10)
control-plane-operator/controllers/azureprivatelinkservice/controller_test.go (1)

1704-1759: Consider adding t.Parallel() for consistency.

The other new tests in this PR (TestBaseDomainVNetLinkName, TestDNSZoneConfigErrMsgQualifier) use t.Parallel() at both the test and subtest level. This test and the following ones (TestDeleteBaseDomainARecords, TestMapHCPToAzurePLS) do not. Since each subtest creates its own fake client, parallel execution should be safe.

♻️ Suggested change
 func TestRecordNamesForCR(t *testing.T) {
+	t.Parallel()
 	tests := []struct {

And in the subtest:

 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
 			g := NewGomegaWithT(t)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go`
around lines 1704 - 1759, Add parallelization to the test TestRecordNamesForCR
by calling t.Parallel() at the start of the top-level test and inside each t.Run
subtest; update the TestRecordNamesForCR function and each anonymous subtest
closure so they invoke t.Parallel(), keeping the existing setup that creates a
fresh fake client and uses AzurePrivateLinkServiceReconciler.recordNamesForCR to
ensure safe concurrent execution.
hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)

671-700: Extract shared test fixtures to reduce repeated setup drift.

Scheme/client/reconciler/callback setup is duplicated across many new tests. A small fixture helper would make this file easier to maintain and less error-prone.
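
Illustrative sketch of such a helper (a minimal shape, assuming these tests only need a fake client with corev1 and configv1 registered; capability stubs and createOrUpdate callbacks would stay per-test or become extra parameters):
	// Sketch only: shared fixture for the getManagementClusterNetwork-style tests.
	func newNetworkPolicyTestReconciler(t *testing.T, objects ...client.Object) *HostedClusterReconciler {
		t.Helper()
		scheme := runtime.NewScheme()
		_ = corev1.AddToScheme(scheme)
		_ = configv1.AddToScheme(scheme)
		c := fake.NewClientBuilder().WithScheme(scheme).WithObjects(objects...).Build()
		return &HostedClusterReconciler{Client: c}
	}
Each test could then build its reconciler from tc.objects and keep its assertions unchanged.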

Also applies to: 776-807, 901-940, 1049-1084, 1128-1159, 1203-1234, 1278-1309

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`
around lines 671 - 700, Extract the repeated test setup into a small helper that
builds the scheme, registers corev1 and configv1, constructs the fake client
with optional objects, and returns the ready-to-use HostedClusterReconciler (and
optionally fakeClient) so individual tests can call it instead of duplicating
the block in tests that exercise
HostedClusterReconciler.getManagementClusterNetwork; update the tests (the
blocks around getManagementClusterNetwork) to call the helper and pass
tc.objects and tc.capabilities, keeping assertions unchanged.
hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go (1)

22-30: expectError is currently unexercised in this test table.

No test case sets expectError: true, so that branch is dead and can mislead readers about coverage. Either add an injected client-error case (a sketch follows the cleanup below) or remove the field/branch for clarity.

✂️ Minimal cleanup if you keep only non-error-path cases
 type testCase struct {
   name               string
   job                *batchv1.Job
   jobExists          bool
   expectedExists     bool
   expectedFinished   bool
   expectedSuccessful bool
-  expectError        bool
 }
 ...
- if tt.expectError {
-   g.Expect(err).To(HaveOccurred())
- } else {
-   g.Expect(err).ToNot(HaveOccurred())
-   g.Expect(status).ToNot(BeNil())
-   g.Expect(status.exists).To(Equal(tt.expectedExists))
-   g.Expect(status.finished).To(Equal(tt.expectedFinished))
-   g.Expect(status.successful).To(Equal(tt.expectedSuccessful))
- }
+ g.Expect(err).ToNot(HaveOccurred())
+ g.Expect(status).ToNot(BeNil())
+ g.Expect(status.exists).To(Equal(tt.expectedExists))
+ g.Expect(status.finished).To(Equal(tt.expectedFinished))
+ g.Expect(status.successful).To(Equal(tt.expectedSuccessful))

Also applies to: 176-184
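
If the expectError field is kept instead, one illustrative way to exercise it is to inject a Get failure through controller-runtime's fake-client interceptors (assuming the test already builds a scheme and uses fake.NewClientBuilder, as elsewhere in this file):
	// Sketch only: a client whose Get always fails, for an expectError: true case.
	errClient := fake.NewClientBuilder().
		WithScheme(scheme).
		WithInterceptorFuncs(interceptor.Funcs{
			Get: func(ctx context.Context, c client.WithWatch, key client.ObjectKey, obj client.Object, opts ...client.GetOption) error {
				return errors.New("injected get failure")
			},
		}).
		Build()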

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go` around
lines 22 - 30, The table-driven test in etcd_recovery_test.go declares an
expectError field that is never set, leaving the error branch untested; either
add a test case that sets expectError: true and injects a client error (e.g.,
simulate API error via the fake client used by the test) to exercise the error
path, or remove the expectError field and any assertions/branches that reference
it from the tests slice and the test loop; update the tests variable and the
corresponding loop logic (the table and any helper that checks expected error
behavior) so the code and test cases are consistent.
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go (1)

924-934: Consider parameterizing the expected values in assertions.

The assertions hardcode "test-infra" instead of using tt.infraID. If additional test cases with different infraIDs are added, the assertions would fail unexpectedly or need to be duplicated.

♻️ Suggested improvement
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
 			g := NewGomegaWithT(t)
 			filters := vpcEndpointSecurityGroupFilter(tt.infraID, tt.endpointName)
 			g.Expect(filters).To(HaveLen(2))
-			g.Expect(aws.ToString(filters[0].Name)).To(Equal("tag:kubernetes.io/cluster/test-infra"))
+			g.Expect(aws.ToString(filters[0].Name)).To(Equal(fmt.Sprintf("tag:kubernetes.io/cluster/%s", tt.infraID)))
 			g.Expect(filters[0].Values).To(Equal([]string{"owned"}))
 			g.Expect(aws.ToString(filters[1].Name)).To(Equal("tag:Name"))
-			g.Expect(filters[1].Values).To(Equal([]string{"test-infra-vpce-kube-apiserver-private"}))
+			g.Expect(filters[1].Values).To(Equal([]string{vpcEndpointSecurityGroupName(tt.infraID, tt.endpointName)}))
 		})
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go`
around lines 924 - 934, Update the assertions in the test loop to use tt.infraID
and tt.endpointName instead of hardcoded "test-infra" and the combined name;
specifically assert aws.ToString(filters[0].Name) equals
"tag:kubernetes.io/cluster/<tt.infraID>" and assert filters[1].Values equals
[]string{tt.infraID + "-vpce-" + tt.endpointName} so the expectations follow the
inputs used to create filters in vpcEndpointSecurityGroupFilter.
control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go (1)

261-761: Consider extracting a shared converter-test helper.

The converter tests repeat the same setup/assertion scaffold (build vmi, call converter, check err/login/challenge, cast provider). A small helper would reduce maintenance cost and make provider-specific expectations stand out.
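
Illustrative sketch of such a helper, assuming the converters return an idpData-style value with provider/login/challenge fields (as the assertions above reference) and that each test wraps its own converter call in a closure so the differing signatures stay local:
	// Sketch only: shared scaffold; per-provider expectations stay in the caller.
	func runIDPConverterTest(t *testing.T, wantErr bool, convert func(vmi *IDPVolumeMountInfo) (*idpData, error)) *idpData {
		t.Helper()
		g := NewGomegaWithT(t)
		data, err := convert(newTestVolumeMountInfo())
		if wantErr {
			g.Expect(err).To(HaveOccurred())
			return nil
		}
		g.Expect(err).ToNot(HaveOccurred())
		g.Expect(data).ToNot(BeNil())
		return data
	}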

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go`
around lines 261 - 761, Extract a reusable test helper (e.g.,
runIDPConverterTest) to factor the repeated scaffold used across
TestConvertBasicAuthIDP, TestConvertGitHubIDP, TestConvertGitLabIDP,
TestConvertGoogleIDP, TestConvertHTPasswdIDP, TestConvertKeystoneIDP,
TestConvertLDAPIDP and TestConvertRequestHeaderIDP: the helper should create
newTestVolumeMountInfo(), invoke the provided converter function
(convertBasicAuthIDP / convertGitHubIDP / convertGitLabIDP / convertGoogleIDP /
convertHTPasswdIDP / convertKeystoneIDP / convertLDAPIDP /
convertRequestHeaderIDP) with the test config and index, assert expectation on
error vs no error and on data.login/data.challenge, and return the resulting
provider (typed as runtime.Object or interface{}) for per-case assertions;
replace the duplicated scaffolding in each test with a single call to this
helper and then only keep the provider-specific Expect checks (e.g., asserting
URLs, CA paths, secret file names, Legacy flag, Attributes, etc.).
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go (2)

909-917: Assert the full selector map here.

findServiceForMonitor returns the matched Service's selector from scrape_config.go, but this test only checks that a few keys exist. A regression that returned the wrong selector values—or even the ServiceMonitor selector instead of svc.Spec.Selector—would still pass as long as the key names matched. Please store an expectedSelector map in the table and compare it directly.

Suggested tightening
 	tests := []struct {
 		name            string
 		objects         []runtime.Object
 		smName          string
 		selector        metav1.LabelSelector
 		expectedSvc     string
 		expectedErr     bool
-		expectedSelKeys []string
+		expectedSelector map[string]string
 	}{
 		{
 			name: "When service exists with same name as ServiceMonitor, it should find it by direct lookup",
 			objects: []runtime.Object{
 				newService("kube-apiserver", namespace, "kube-apiserver", "client", 6443),
 			},
-			smName:          "kube-apiserver",
-			selector:        metav1.LabelSelector{MatchLabels: map[string]string{"app": "kube-apiserver"}},
-			expectedSvc:     "kube-apiserver",
-			expectedErr:     false,
-			expectedSelKeys: []string{"app"},
+			smName:           "kube-apiserver",
+			selector:         metav1.LabelSelector{MatchLabels: map[string]string{"app": "kube-apiserver"}},
+			expectedSvc:      "kube-apiserver",
+			expectedErr:      false,
+			expectedSelector: map[string]string{"app": "kube-apiserver"},
 		},
@@
 			} else {
 				g.Expect(err).ToNot(HaveOccurred())
 				g.Expect(svcName).To(Equal(tc.expectedSvc))
-				for _, key := range tc.expectedSelKeys {
-					g.Expect(podSelector).To(HaveKey(key))
-				}
+				g.Expect(podSelector).To(Equal(tc.expectedSelector))
 			}
 		})
 	}

Also applies to: 962-970

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go`
around lines 909 - 917, The test for findServiceForMonitor currently only
asserts presence of selector keys; add a full expectedSelector map field to each
table entry and replace the partial key-only checks with an equality comparison
against svc.Spec.Selector (or the returned selector) to ensure values and keys
match exactly; update the table structs to include expectedSelector
map[string]string, populate it for each case, and assert deep equality (not just
key existence) for the selector returned by findServiceForMonitor and similarly
update the related cases around the other test block.

981-1048: Add a case for matching on TargetPort.String().

The helper under test in control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go:282-307 accepts both p.Name == portName and p.TargetPort.String() == portName, but this table only exercises the first form. A case where the service port is named "client" and the lookup asks for "https" would cover the second branch and better protect this refactor.

Suggested test case
 		{
 			name: "When service has a named targetPort, it should resolve from a pod",
 			objects: []runtime.Object{
 				newServiceWithTargetPort("my-svc", namespace, "app", "client", 443, intstr.FromString("https")),
 				newPodWithLabels("my-pod", namespace, "https", 6443, map[string]string{"app": "app"}),
 			},
 			serviceName:  "my-svc",
 			portName:     "client",
 			podSelector:  map[string]string{"app": "app"},
 			expectedPort: 6443,
 			expectedErr:  false,
 		},
+		{
+			name: "When lookup matches the named targetPort, it should resolve from a pod",
+			objects: []runtime.Object{
+				newServiceWithTargetPort("my-svc", namespace, "app", "client", 443, intstr.FromString("https")),
+				newPodWithLabels("my-pod", namespace, "https", 6443, map[string]string{"app": "app"}),
+			},
+			serviceName:  "my-svc",
+			portName:     "https",
+			podSelector:  map[string]string{"app": "app"},
+			expectedPort: 6443,
+			expectedErr:  false,
+		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go`
around lines 981 - 1048, The table-driven tests do not cover the branch where a
ServicePort is matched via p.TargetPort.String() (the helper in scrape_config.go
checks both p.Name == portName and p.TargetPort.String() == portName), so add a
test case to the tests slice that uses newServiceWithTargetPort to create a
Service with Name "client" and TargetPort intstr.FromString("https"), add a
corresponding pod via newPodWithLabels that has the target container port (e.g.
6443), set serviceName to "my-svc" and portName to "https", and expect
expectedPort 6443 and expectedErr false to ensure the TargetPort.String()
matching branch is exercised.
hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go (2)

2206-2209: Strengthen TestClearTransientConditions assertions.

The current checks only validate non-nil outputs. This can miss regressions in condition type/status/reason/message/transition-time content.

Suggested assertion tightening
 			} else {
 				g.Expect(result).ToNot(BeNil())
 				g.Expect(result.applyCfg).ToNot(BeNil())
 				g.Expect(result.applyCfg.Status).ToNot(BeNil())
+				g.Expect(result.applyCfg.Status.Conditions).To(HaveLen(2))
+				g.Expect(*result.applyCfg.Status.Conditions[0].Type).To(Equal(hypershiftv1beta1.ClusterSizeTransitionPending))
+				g.Expect(*result.applyCfg.Status.Conditions[0].Status).To(Equal(metav1.ConditionFalse))
+				g.Expect(*result.applyCfg.Status.Conditions[0].Reason).To(Equal("ClusterSizeTransitioned"))
+				g.Expect(*result.applyCfg.Status.Conditions[1].Type).To(Equal(hypershiftv1beta1.ClusterSizeTransitionRequired))
+				g.Expect(*result.applyCfg.Status.Conditions[1].Status).To(Equal(metav1.ConditionFalse))
+				g.Expect(*result.applyCfg.Status.Conditions[1].Reason).To(Equal(hypershiftv1beta1.AsExpectedReason))
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go`
around lines 2206 - 2209, TestClearTransientConditions currently only asserts
non-nil for result, result.applyCfg and result.applyCfg.Status; tighten it by
asserting that result.applyCfg.Status.Conditions has the expected length and
that each Condition's Type, Status, Reason, Message and LastTransitionTime (or
its non-zero state) match the expected values for the test case. Locate the
test's result variable and its applyCfg.Status.Conditions slice and replace the
loose non-nil checks with explicit checks on Conditions[i].Type,
Conditions[i].Status, Conditions[i].Reason, Conditions[i].Message and that
Conditions[i].LastTransitionTime is set (or equals the expected timestamp) so
regressions in condition content will fail the test.

1991-1991: Use identity assertions for all noOpAction sentinel comparisons.

At lines 1991, 1994, 2134, and 2137, Equal(&noOpAction) checks deep value equality, which can pass for non-sentinel zero-value actions. Since noOpAction is a sentinel, assert identity instead for both positive and negative comparisons.

Suggested changes
-			g.Expect(result).To(Equal(&noOpAction))
+			g.Expect(result).To(BeIdenticalTo(&noOpAction))
-			g.Expect(result).ToNot(Equal(&noOpAction))
+			g.Expect(result).ToNot(BeIdenticalTo(&noOpAction))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go`
at line 1991, Replace deep-equality checks against the sentinel noOpAction with
identity assertions: wherever the tests compare result to &noOpAction (e.g.,
g.Expect(result).To(Equal(&noOpAction))) use Gomega's identity matcher
g.Expect(result).To(BeIdenticalTo(&noOpAction)), and for negative checks use
g.Expect(result).ToNot(BeIdenticalTo(&noOpAction)) or
g.Expect(result).To(Not(BeIdenticalTo(&noOpAction))). Update all occurrences
that reference result and noOpAction so the sentinel is asserted by identity,
not deep equality.
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go (1)

3250-3273: Platform-specific resource test cases need concrete object assertions

TestReconcilePlatformSpecificResources mostly validates “no errors” and has empty verification bodies for None/KubeVirt. That makes routing regressions easy to miss.

Suggested direction
 		{
 			name:         "When platform is AWS, it should reconcile AWS identity webhook resources",
 			platformType: hyperv1.AWSPlatform,
 			verifyObjects: func(g *WithT, c client.Client) {
 				saList := &corev1.ServiceAccountList{}
 				err := c.List(ctx, saList)
 				g.Expect(err).ToNot(HaveOccurred())
+				g.Expect(saList.Items).ToNot(BeEmpty(), "expected AWS platform-specific resources to be created")
 			},
 		},
 		{
 			name:         "When platform is None, it should not create any platform-specific resources",
 			platformType: hyperv1.NonePlatform,
 			verifyObjects: func(g *WithT, c client.Client) {
-				// No platform-specific resources expected
+				saList := &corev1.ServiceAccountList{}
+				g.Expect(c.List(ctx, saList)).To(Succeed())
+				g.Expect(saList.Items).To(BeEmpty())
 			},
 		},
 		{
 			name:         "When platform is KubeVirt, it should not create AWS or Azure resources",
 			platformType: hyperv1.KubevirtPlatform,
 			verifyObjects: func(g *WithT, c client.Client) {
-				// KubeVirt is not in the switch; no platform resources should be created
+				saList := &corev1.ServiceAccountList{}
+				g.Expect(c.List(ctx, saList)).To(Succeed())
+				g.Expect(saList.Items).To(BeEmpty())
 			},
 		},

Also applies to: 3309-3311

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`
around lines 3250 - 3273, Update TestReconcilePlatformSpecificResources'
verifyObjects closures to perform concrete assertions instead of being empty: in
the AWS case (the closure that currently creates saList) explicitly assert that
expected AWS resources exist by listing and asserting non-zero results for types
used by the identity webhook (e.g., corev1.ServiceAccountList and
admissionregistrationv1.MutatingWebhookConfigurationList referenced in the
test), and in the None and KubeVirt cases list those same resource types (e.g.,
ServiceAccountList and MutatingWebhookConfigurationList) and assert their counts
are zero to ensure no platform-specific resources were created; modify the
existing verifyObjects closures (the anonymous funcs passed into
TestReconcilePlatformSpecificResources) to perform these positive/negative
existence checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/cluster/core/create_test.go`:
- Around line 949-1024: The "no-op/default" test cases in TestApplyEtcdConfig
don't explicitly assert the cluster's untouched state; update the test to, when
tc.etcdStorageClass and tc.etcdStorageSize are empty, assert that
cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName remains nil
(or the original default) and that Size.String() equals
hyperv1.DefaultPersistentVolumeEtcdStorageSize.String() (i.e., the size was not
overwritten), and apply the same explicit assertions to the other similar tests
that exercise applyEtcdConfig (and related tests noted in the review) so
regressions that initialize or overwrite fields will fail the tests.
- Around line 921-924: Replace the test fixture string "key=val=ue" in the test
case named "When a value contains an equals sign, it should split only on the
first equals" so it no longer triggers codespell; update items from
[]string{"key=val=ue"} to a harmless alternate that still contains an equals
(for example []string{"key=value=extra"}) and adjust the expected map value from
map[string]string{"key":"val=ue"} to map[string]string{"key":"value=extra"} so
the parsing behavior tested remains identical.

In `@cmd/install/install_test.go`:
- Around line 1052-1094: TestSetupRBAC currently only asserts a minimum object
count via HaveLen which can miss missing admin RBAC resources; update the test
that calls setupRBAC (using Options.EnableAdminRBACGeneration and
AzurePLSManagedIdentityClientID) to explicitly assert the presence of the admin
RBAC objects when EnableAdminRBACGeneration is true (e.g., check for
Role/RoleBinding/ServiceAccount names like "hypershift-client" and
"hypershift-readers" in the returned objects), and change the length assertion
to a floor check (use BeNumerically(">=", tc.minObjectCount) or assert
len(objects) >= tc.minObjectCount) rather than HaveLen so additions won’t break
the test; also keep the existing SA annotation check for
AzurePLSManagedIdentityClientID.
- Around line 1005-1049: The table-driven TestSetupMonitoring currently ignores
expectSLOAlerts and expectDashboards and asserts exact length; update the test
(in TestSetupMonitoring) to 1) use a lower-bound smoke check
(g.Expect(len(objects)).To(BeNumerically(">=", tc.minResourceCount))) instead of
HaveLen, and 2) assert presence/absence of the specific monitoring resources
described by the table: when tc.expectSLOAlerts is true, verify setupMonitoring
returns at least one PrometheusRule-like object (check
object.GetObjectKind().GroupVersionKind().Kind == "PrometheusRule" or
reflect.Type/name contains "PrometheusRule"); when false, assert no such object
exists; when tc.expectDashboards is true, verify a dashboard-like resource is
present (e.g., a Dashboard CR or a ConfigMap whose Name/labels indicate a
dashboard for tc.opts.Namespace), and when false assert none; keep using
NewGomegaWithT and the setupMonitoring/Options values to locate
namespace-specific dashboard checks.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go`:
- Around line 3980-4013: The test currently only asserts an error when the List
interceptor forces a failure; update the "When component list fails..." branch
to also fetch the HCP from the fake client after calling
reconcileControlPlaneVersionStatus and assert that the
HCP.Status.ControlPlaneVersion (and/or related partial-version fields) was
written (i.e., contains a non-empty Version or Partial=true/expected condition)
despite the error. Use the same client c (or c.Status()) to Get the latest
tc.hcp by namespaced name and add Gomega assertions that the status fields have
the expected partial values; keep these checks alongside the existing err
expectation in the test that calls reconcileControlPlaneVersionStatus.

In
`@control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go`:
- Around line 325-326: The test assertions in idp_convert_test.go are checking
for underscored helper names like "idp_cm_0_ca"/"idp_secret_0" but the
production helpers IDPVolumeMountInfo.ConfigMapPath and
IDPVolumeMountInfo.SecretPath generate hyphenated names ("idp-cm-0",
"idp-secret-0"), so update the expectations in the test (references to
provider.RemoteConnectionInfo.CA, provider.RemoteConnectionInfo.Secret, and any
similar assertions) to use the hyphenated forms (e.g., "idp-cm-0-ca",
"idp-secret-0") to match the values produced by
IDPVolumeMountInfo.ConfigMapPath/SecretPath; apply the same change to all other
similar assertions in this test file.
- Line 850: The test currently dereferences vmi.Volumes[0].Secret.DefaultMode
directly which can panic if nil; change the assertion to first assert the
pointer is non-nil (e.g.,
Expect(vmi.Volumes[0].Secret.DefaultMode).ToNot(BeNil())) and then check the
value, or use a pointer-aware matcher such as PointTo to assert the pointed
value equals int32(0640); locate the assertion referencing
vmi.Volumes[0].Secret.DefaultMode in the idp_convert_test.go and replace the
direct dereference with the safer two-step or PointTo-style assertion.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go`:
- Around line 292-327: The test checks CA mount path but misses TLS cert/key
paths; update the test in convertBasicAuthIDP table run to also assert that
provider.RemoteConnectionInfo.CertFile and provider.RemoteConnectionInfo.KeyFile
(from convertBasicAuthIDP) contain the expected mounted secret names when
tc.config.BasicAuth.TLSClientCert/TLSClientKey are set (e.g. contain substrings
like "idp_secret_0_cert" and "idp_secret_0_key" or the naming your mount logic
uses). Locate the assertions after provider :=
data.provider.(*osinv1.BasicAuthPasswordIdentityProvider) and add checks similar
to the existing CA check to validate CertFile and KeyFile mount path wiring.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`:
- Around line 2732-2734: The test currently allows a script that contains "curl"
but lacks "PATCH" to pass because it uses a logical AND; update the assertion in
resources_test.go so it fails if either token is missing by changing the
condition that checks the variable "script" to use a logical OR (i.e., assert
that both "curl" and "PATCH" are present), and keep the same error message
"Check script should PATCH the ConfigMap on success".
- Around line 3197-3234: When tt.expectCloudCleanup is true, add a positive
assertion after calling r.reconcileDeletion that the cloud-cleanup path actually
ran by asserting the returned result signals a requeue (e.g., check
result.RequeueAfter is non-zero/greater than 0); locate the block that creates
kasDeployment and the call to r.reconcileDeletion and insert the assertion about
result.RequeueAfter there so regressions to the cleanup gate are caught.
- Around line 3216-3230: The test currently only checks VAP deletion conditioned
on tt.expectVAPDeleted, which ignores tt.expectVAPBDeleted; update the
assertions to independently verify the binding by checking
manifests.ValidatingAdmissionPolicyBinding(fmt.Sprintf("%s-binding",
registry.AdmissionPolicyNameManagementState)) against guestClient.Get depending
on tt.expectVAPBDeleted (i.e., assert IsNotFound when tt.expectVAPBDeleted is
true), and when neither expectVAPDeleted nor expectVAPBDeleted is true (and
len(tt.existingObjects) > 0) assert that the binding Get does not return an
error, mirroring the existing VAP existence checks.

In
`@hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go`:
- Line 1423: Update the test case name string that currently reads "When min
interval is unparseable, it should allow snapshot creation" to correct the
spelling to "unparsable" so codespell no longer flags it; locate the test case's
name field in snapshot_controller_test.go (the test case struct/name entry) and
replace "unparseable" with "unparsable".

In `@hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go`:
- Around line 534-536: The RoleBinding assertions currently check
roleBinding.RoleRef.Kind and .Name but omit the API group; update the test in
etcd_recovery_test.go to also assert roleBinding.RoleRef.APIGroup equals the
expected API group (e.g. "rbac.authorization.k8s.io") so the RoleRef contract is
fully validated alongside the existing checks for roleBinding.RoleRef.Kind,
roleBinding.RoleRef.Name and roleBinding.Subjects length.
- Around line 517-519: The test currently asserts role.Rules[1].Resources and
Verbs for the second RBAC rule (statefulsets) but misses validating APIGroups;
update the assertion for role.Rules[1] to also assert APIGroups contains the
correct API group for statefulsets (e.g.,
g.Expect(role.Rules[1].APIGroups).To(ConsistOf("apps"))) so the rule is fully
validated alongside the existing Resources and Verbs checks.
- Around line 490-497: The test currently only asserts presence of
EtcdRecoveryActive when tt.expectCondition is true; add a negative assertion for
the false branch by fetching the updated HostedCluster (same pattern using
fakeClient.Get and updatedHC := &hyperv1.HostedCluster{}), then call
findCondition(updatedHC.Status.Conditions, string(hyperv1.EtcdRecoveryActive))
and assert that it is nil when tt.expectCondition is false so the test fails if
the condition is erroneously present.

In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`:
- Around line 788-796: The createOrUpdate test stub (the upsert.CreateOrUpdateFN
assigned to createOrUpdate in network_policies_test.go) currently treats
non-*networkingv1.NetworkPolicy objects as success and swallows unexpected
types; update the stub so it validates the obj type, returns a clear error (and
controllerutil.OperationResultNone) when obj is not a
*networkingv1.NetworkPolicy, and keep the existing behavior of calling the
mutate function, recording into createdPolicies (or the equivalent map) and
returning OperationResultCreated only for valid NetworkPolicy objects; apply the
same change to each identical stub instance (the ones creating createOrUpdate
around createdPolicies at the listed locations).

---

Nitpick comments:
In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go`:
- Around line 924-934: Update the assertions in the test loop to use tt.infraID
and tt.endpointName instead of hardcoded "test-infra" and the combined name;
specifically assert aws.ToString(filters[0].Name) equals
"tag:kubernetes.io/cluster/<tt.infraID>" and assert filters[1].Values equals
[]string{tt.infraID + "-vpce-" + tt.endpointName} so the expectations follow the
inputs used to create filters in vpcEndpointSecurityGroupFilter.

In
`@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go`:
- Around line 1704-1759: Add parallelization to the test TestRecordNamesForCR by
calling t.Parallel() at the start of the top-level test and inside each t.Run
subtest; update the TestRecordNamesForCR function and each anonymous subtest
closure so they invoke t.Parallel(), keeping the existing setup that creates a
fresh fake client and uses AzurePrivateLinkServiceReconciler.recordNamesForCR to
ensure safe concurrent execution.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go`:
- Around line 909-917: The test for findServiceForMonitor currently only asserts
presence of selector keys; add a full expectedSelector map field to each table
entry and replace the partial key-only checks with an equality comparison
against svc.Spec.Selector (or the returned selector) to ensure values and keys
match exactly; update the table structs to include expectedSelector
map[string]string, populate it for each case, and assert deep equality (not just
key existence) for the selector returned by findServiceForMonitor and similarly
update the related cases around the other test block.
- Around line 981-1048: The table-driven tests do not cover the branch where a
ServicePort is matched via p.TargetPort.String() (the helper in scrape_config.go
checks both p.Name == portName and p.TargetPort.String() == portName), so add a
test case to the tests slice that uses newServiceWithTargetPort to create a
Service with Name "client" and TargetPort intstr.FromString("https"), add a
corresponding pod via newPodWithLabels that has the target container port (e.g.
6443), set serviceName to "my-svc" and portName to "https", and expect
expectedPort 6443 and expectedErr false to ensure the TargetPort.String()
matching branch is exercised.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go`:
- Around line 261-761: Extract a reusable test helper (e.g.,
runIDPConverterTest) to factor the repeated scaffold used across
TestConvertBasicAuthIDP, TestConvertGitHubIDP, TestConvertGitLabIDP,
TestConvertGoogleIDP, TestConvertHTPasswdIDP, TestConvertKeystoneIDP,
TestConvertLDAPIDP and TestConvertRequestHeaderIDP: the helper should create
newTestVolumeMountInfo(), invoke the provided converter function
(convertBasicAuthIDP / convertGitHubIDP / convertGitLabIDP / convertGoogleIDP /
convertHTPasswdIDP / convertKeystoneIDP / convertLDAPIDP /
convertRequestHeaderIDP) with the test config and index, assert expectation on
error vs no error and on data.login/data.challenge, and return the resulting
provider (typed as runtime.Object or interface{}) for per-case assertions;
replace the duplicated scaffolding in each test with a single call to this
helper and then only keep the provider-specific Expect checks (e.g., asserting
URLs, CA paths, secret file names, Legacy flag, Attributes, etc.).

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`:
- Around line 3250-3273: Update TestReconcilePlatformSpecificResources'
verifyObjects closures to perform concrete assertions instead of being empty: in
the AWS case (the closure that currently creates saList) explicitly assert that
expected AWS resources exist by listing and asserting non-zero results for types
used by the identity webhook (e.g., corev1.ServiceAccountList and
admissionregistrationv1.MutatingWebhookConfigurationList referenced in the
test), and in the None and KubeVirt cases list those same resource types (e.g.,
ServiceAccountList and MutatingWebhookConfigurationList) and assert their counts
are zero to ensure no platform-specific resources were created; modify the
existing verifyObjects closures (the anonymous funcs passed into
TestReconcilePlatformSpecificResources) to perform these positive/negative
existence checks.

In `@hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go`:
- Around line 22-30: The table-driven test in etcd_recovery_test.go declares an
expectError field that is never set, leaving the error branch untested; either
add a test case that sets expectError: true and injects a client error (e.g.,
simulate API error via the fake client used by the test) to exercise the error
path, or remove the expectError field and any assertions/branches that reference
it from the tests slice and the test loop; update the tests variable and the
corresponding loop logic (the table and any helper that checks expected error
behavior) so the code and test cases are consistent.

In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`:
- Around line 671-700: Extract the repeated test setup into a small helper that
builds the scheme, registers corev1 and configv1, constructs the fake client
with optional objects, and returns the ready-to-use HostedClusterReconciler (and
optionally fakeClient) so individual tests can call it instead of duplicating
the block in tests that exercise
HostedClusterReconciler.getManagementClusterNetwork; update the tests (the
blocks around getManagementClusterNetwork) to call the helper and pass
tc.objects and tc.capabilities, keeping assertions unchanged.

In
`@hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go`:
- Around line 2206-2209: TestClearTransientConditions currently only asserts
non-nil for result, result.applyCfg and result.applyCfg.Status; tighten it by
asserting that result.applyCfg.Status.Conditions has the expected length and
that each Condition's Type, Status, Reason, Message and LastTransitionTime (or
its non-zero state) match the expected values for the test case. Locate the
test's result variable and its applyCfg.Status.Conditions slice and replace the
loose non-nil checks with explicit checks on Conditions[i].Type,
Conditions[i].Status, Conditions[i].Reason, Conditions[i].Message and that
Conditions[i].LastTransitionTime is set (or equals the expected timestamp) so
regressions in condition content will fail the test.
- Line 1991: Replace deep-equality checks against the sentinel noOpAction with
identity assertions: wherever the tests compare result to &noOpAction (e.g.,
g.Expect(result).To(Equal(&noOpAction))) use Gomega's identity matcher
g.Expect(result).To(BeIdenticalTo(&noOpAction)), and for negative checks use
g.Expect(result).ToNot(BeIdenticalTo(&noOpAction)) or
g.Expect(result).To(Not(BeIdenticalTo(&noOpAction))). Update all occurrences
that reference result and noOpAction so the sentinel is asserted by identity,
not deep equality.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 56392441-83ef-4758-9a14-1ab2d72d0d01

📥 Commits

Reviewing files that changed from the base of the PR and between 08325f7 and d345200.

📒 Files selected for processing (21)
  • cmd/cluster/core/create_test.go
  • cmd/install/assets/hypershift_operator_test.go
  • cmd/install/install_test.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
  • hypershift-operator/controllers/etcdbackup/reconciler_test.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go
  • hypershift-operator/controllers/hostedcluster/network_policies_test.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/platform/aws/controller_test.go
  • hypershift-operator/controllers/scheduler/aws/autoscaler_test.go
  • hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes_test.go
  • ignition-server/controllers/local_ignitionprovider_test.go

@bryan-cox bryan-cox changed the title feat: enable gocyclo linter and reduce cyclomatic complexity CNTRLPLANE-3324: Enable gocyclo linter and reduce cyclomatic complexity Apr 23, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 23, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 23, 2026

@bryan-cox: This pull request references CNTRLPLANE-3324 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Enables the gocyclo linter in golangci-lint to enforce a maximum cyclomatic complexity threshold of 30 across the codebase. This improves code maintainability by ensuring functions stay focused and testable.

Changes:

  • Added gocyclo to the enabled linters in .golangci.yml
  • Refactored ~35 files to extract focused helper methods from functions exceeding the complexity threshold
  • Added //nolint:gocyclo exceptions for 4 inherently complex reconciliation functions:
  • HostedClusterReconciler.reconcile (complexity 318) — core HC reconciliation loop
  • HostedClusterReconciler.delete (complexity 41) — HC deletion handler
  • GetPlatform (complexity 35) — platform factory switch
  • NodePoolReconciler.reconcile (complexity 33) — core NP reconciliation loop
  • Fixed all new unparam lint warnings introduced by the refactoring (removed unused parameters, simplified return types)
  • Fixed a missing error check for NewCommonRegistryProvider exposed by the refactoring

Verification:

  • make lint passes with 0 issues
  • make test passes with 0 failures
  • go build ./... compiles cleanly

Which issue(s) this PR fixes:

Special notes for your reviewer:

This is a pure refactoring PR — no behavioral changes. All extracted helpers preserve the exact semantics of the original inline code. The one piece of genuinely new logic is a noOpAction sentinel in the sizing controller to correctly handle the "conditions already match, stop processing" case that was previously handled by early returns in the monolithic function.

The 4 nolint exceptions are for functions whose complexity is inherent to their role as top-level orchestrators. Breaking them down further would obscure the reconciliation flow rather than improve it.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor

  • Broad internal restructuring: many controllers, reconcilers, operators and utilities decomposed into focused helper functions for clearer, more maintainable control flow.

  • Tests

  • Large expansion of unit and integration tests across install, controllers, reconcilers, and utilities to improve coverage and reliability.

  • Chores

  • Static analysis tightened by enabling cyclomatic-complexity checks in linting configuration.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
cmd/cluster/core/create_test.go (1)

1014-1020: ⚠️ Potential issue | 🟡 Minor

No-op/default rows still don’t assert untouched state.

The “leave defaults unchanged / should not modify the cluster” cases still only assert positive paths. Please add explicit else assertions so unintended initialization/overwrites fail tests.

🧪 Minimal tightening
@@
 			} else {
 				g.Expect(err).NotTo(HaveOccurred())
 				if tc.expectStorageClass != nil {
 					g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName).To(Equal(tc.expectStorageClass))
+				} else {
+					g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName).To(BeNil())
 				}
 				if tc.expectSizeString != "" {
 					g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.Size.String()).To(Equal(tc.expectSizeString))
+				} else {
+					g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.Size.String()).
+						To(Equal(hyperv1.DefaultPersistentVolumeEtcdStorageSize.String()))
 				}
 			}
@@
 			if tc.expectOperatorHub {
 				g.Expect(cluster.Spec.Configuration.OperatorHub).NotTo(BeNil())
 				g.Expect(cluster.Spec.Configuration.OperatorHub.DisableAllDefaultSources).To(BeTrue())
+			} else {
+				g.Expect(cluster.Spec.Configuration.OperatorHub).To(BeNil())
 			}
 			if tc.expectCatalogPlacement != "" {
 				g.Expect(cluster.Spec.OLMCatalogPlacement).To(Equal(tc.expectCatalogPlacement))
+			} else {
+				g.Expect(cluster.Spec.OLMCatalogPlacement).To(BeEmpty())
 			}
@@
 			if tc.expectNodeSelector != nil {
 				g.Expect(cluster.Spec.NodeSelector).To(Equal(tc.expectNodeSelector))
+			} else {
+				g.Expect(cluster.Spec.NodeSelector).To(BeNil())
 			}
 			if tc.expectLabels != nil {
 				g.Expect(cluster.Spec.Labels).To(Equal(tc.expectLabels))
+			} else {
+				g.Expect(cluster.Spec.Labels).To(BeNil())
 			}

Also applies to: 1135-1141, 1332-1337

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/core/create_test.go` around lines 1014 - 1020, The test currently
only asserts fields when tc.expectStorageClass or tc.expectSizeString are set;
add explicit else branches so the "no-op/default" cases fail on unintended
changes: after the existing if blocks for
cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName and
Size.String(), add corresponding else clauses that assert
Storage.PersistentVolume.StorageClassName is nil/empty (e.g.,
Expect(...).To(BeNil() or Equal(""))) and that the Size is zero/empty (e.g.,
Expect(...).To(BeZero() or Equal("")) or Expect(...Size.IsZero()).To(BeTrue())).
Do this around the same checks (the blocks referencing
cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName and
.Size.String()) and repeat the pattern for the other occurrences noted (the
similar assertions at the other ranges).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go`:
- Around line 1369-1375: The test reads tt.annotations[...] after
getLastObservedRestartCount mutates pod.Annotations (they share the same map),
so the corrupted-annotation branch can be skipped; fix by capturing the original
annotation value before calling getLastObservedRestartCount (e.g., store
tt.annotations[lastObservedRestartCountAnnotation] or a copy of tt.annotations
into a local variable) and use that captured value in the conditional that
verifies updatedPod.Annotations[lastObservedRestartCountAnnotation] equals "0";
reference getLastObservedRestartCount, lastObservedRestartCountAnnotation,
tt.annotations, pod.Annotations and the fakeClient.Get/updatedPod check when
making the change.

---

Duplicate comments:
In `@cmd/cluster/core/create_test.go`:
- Around line 1014-1020: The test currently only asserts fields when
tc.expectStorageClass or tc.expectSizeString are set; add explicit else branches
so the "no-op/default" cases fail on unintended changes: after the existing if
blocks for cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName
and Size.String(), add corresponding else clauses that assert
Storage.PersistentVolume.StorageClassName is nil/empty (e.g.,
Expect(...).To(BeNil() or Equal(""))) and that the Size is zero/empty (e.g.,
Expect(...).To(BeZero() or Equal("")) or Expect(...Size.IsZero()).To(BeTrue())).
Do this around the same checks (the blocks referencing
cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName and
.Size.String()) and repeat the pattern for the other occurrences noted (the
similar assertions at the other ranges).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: ac5353b0-dcb8-4206-b669-fce20cbe753e

📥 Commits

Reviewing files that changed from the base of the PR and between d345200 and ee0e3be.

📒 Files selected for processing (2)
  • cmd/cluster/core/create_test.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go

@bryan-cox bryan-cox force-pushed the cyclo-magic branch 2 times, most recently from c52964e to 47afd7e on April 23, 2026 at 11:21
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
hypershift-operator/controllers/hostedcluster/etcd_recovery.go (1)

104-109: ⚠️ Potential issue | 🔴 Critical

Non-NotFound errors are silently ignored.

When r.Get returns an error other than NotFound, the function continues with an uninitialized etcdStatefulSet. This will have zero-valued status fields (ReadyReplicas == 0, AvailableReplicas == 0), potentially triggering incorrect recovery behavior on transient API errors.

🐛 Proposed fix to return non-NotFound errors
 if err := r.Get(ctx, crclient.ObjectKeyFromObject(etcdStatefulSet), etcdStatefulSet); err != nil {
     if apierrors.IsNotFound(err) {
         log.Info("etcd statefulset does not yet exist")
         return nil, nil
     }
+    return nil, fmt.Errorf("failed to get etcd statefulset: %w", err)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/etcd_recovery.go` around lines
104 - 109, The current r.Get call in etcd_recovery.go silently ignores
non-NotFound errors and proceeds with an uninitialized etcdStatefulSet; update
the error handling for r.Get(ctx, crclient.ObjectKeyFromObject(etcdStatefulSet),
etcdStatefulSet) so that if apierrors.IsNotFound(err) you log and return (nil,
nil) as now, but for any other err you immediately return the error (e.g.,
return nil, err) instead of continuing; this ensures functions like the caller
that inspect etcdStatefulSet.Status (ReadyReplicas/AvailableReplicas) don’t
operate on zero-valued fields when the API returned a transient error.
hypershift-operator/controllers/hostedcluster/metrics/metrics.go (1)

259-287: ⚠️ Potential issue | 🟠 Major

Add synchronization to protect lastCollectTime from concurrent access.

The Collect method may be invoked concurrently by the Prometheus registry. Without synchronization, overlapping scrapes can race on lastCollectTime, causing missed or duplicated transition duration histogram observations.

Suggested fix
 import (
 	"context"
 	"fmt"
+	"sync"
 	"time"
@@
 type hostedClustersMetricsCollector struct {
 	client.Client
 	clock clock.Clock
+	mu    sync.Mutex
 
 	transitionDurationMetric *prometheus.HistogramVec
 
 	lastCollectTime time.Time
 }
@@
 func (c *hostedClustersMetricsCollector) Collect(ch chan<- prometheus.Metric) {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+
 	currentCollectTime := c.clock.Now()
 	log := ctrllog.Log
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/metrics/metrics.go` around
lines 259 - 287, The Collect method reads and updates the
hostedClustersMetricsCollector.lastCollectTime without synchronization; add a
mutex (e.g., sync.Mutex or sync.RWMutex) as a field on
hostedClustersMetricsCollector (e.g., mu) and use it to protect all concurrent
accesses to lastCollectTime: acquire the lock when reading lastCollectTime
inside Collect and in any other method that reads it (notably
collectTransitionDurationMetrics if it reads lastCollectTime), and hold the lock
briefly when assigning c.lastCollectTime = currentCollectTime at the end of
Collect; release the lock promptly to avoid long-held locks during
listing/metric emission.
♻️ Duplicate comments (14)
cmd/cluster/core/create_test.go (3)

1135-1141: ⚠️ Potential issue | 🟡 Minor

No-op OLM case should explicitly verify untouched fields.

The "should not modify the cluster" row does not assert the negative path, so unexpected initialization of OperatorHub or OLMCatalogPlacement would still pass. (Line 1135 and Line 1139)

✅ Minimal tightening
 			applyOLMConfig(cluster, opts)
 			if tc.expectOperatorHub {
 				g.Expect(cluster.Spec.Configuration.OperatorHub).NotTo(BeNil())
 				g.Expect(cluster.Spec.Configuration.OperatorHub.DisableAllDefaultSources).To(BeTrue())
+			} else {
+				g.Expect(cluster.Spec.Configuration.OperatorHub).To(BeNil())
 			}
 			if tc.expectCatalogPlacement != "" {
 				g.Expect(cluster.Spec.OLMCatalogPlacement).To(Equal(tc.expectCatalogPlacement))
+			} else {
+				g.Expect(cluster.Spec.OLMCatalogPlacement).To(BeEmpty())
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/core/create_test.go` around lines 1135 - 1141, The test doesn't
assert the negative paths for the "should not modify the cluster" case; update
the assertions around cluster.Spec.Configuration.OperatorHub and
cluster.Spec.OLMCatalogPlacement so that when tc.expectOperatorHub is false you
assert OperatorHub is nil (e.g.,
g.Expect(cluster.Spec.Configuration.OperatorHub).To(BeNil())), and when
tc.expectCatalogPlacement is empty you assert cluster.Spec.OLMCatalogPlacement
is empty/zero value (e.g.,
g.Expect(cluster.Spec.OLMCatalogPlacement).To(Equal("" )) ), using the existing
tc.expectOperatorHub and tc.expectCatalogPlacement checks to drive the
conditional expectations.

1015-1020: ⚠️ Potential issue | 🟡 Minor

No-op etcd case still does not assert unchanged state.

For the "leave defaults unchanged" row, the test still skips assertions when expected fields are empty, so a regression that mutates defaults can pass undetected. (Line 1015 and Line 1018)

✅ Minimal tightening
 			} else {
 				g.Expect(err).NotTo(HaveOccurred())
 				if tc.expectStorageClass != nil {
 					g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName).To(Equal(tc.expectStorageClass))
+				} else {
+					g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName).To(BeNil())
 				}
 				if tc.expectSizeString != "" {
 					g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.Size.String()).To(Equal(tc.expectSizeString))
+				} else {
+					g.Expect(cluster.Spec.Etcd.Managed.Storage.PersistentVolume.Size.String()).
+						To(Equal(hyperv1.DefaultPersistentVolumeEtcdStorageSize.String()))
 				}
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/core/create_test.go` around lines 1015 - 1020, The test for the
"leave defaults unchanged" case is skipping assertions when
tc.expectStorageClass is nil and tc.expectSizeString is empty, so regressions
that mutate defaults go unnoticed; update the test to explicitly assert
unchanged state by either (A) setting tc.expectStorageClass and
tc.expectSizeString to the original expected values for that test case, or (B)
add an explicit flag (e.g., tc.expectUnchanged) and when set compare
cluster.Spec.Etcd.Managed.Storage.PersistentVolume.StorageClassName and
.Size.String() against a saved copy of the pre-run cluster state (or original
defaults) instead of skipping the checks; reference the test table entry for the
"leave defaults unchanged" row and the assertions around
cluster.Spec.Etcd.Managed.Storage.PersistentVolume in the test function to
implement the check.

1332-1337: ⚠️ Potential issue | 🟡 Minor

No-op scheduling case should assert nil defaults.

For the "should not modify the cluster" path, the test only checks positive cases, so unintended writes to NodeSelector/Labels can slip through. (Line 1332 and Line 1335)

✅ Minimal tightening
 			applySchedulingConfig(cluster, opts)
 			if tc.expectNodeSelector != nil {
 				g.Expect(cluster.Spec.NodeSelector).To(Equal(tc.expectNodeSelector))
+			} else {
+				g.Expect(cluster.Spec.NodeSelector).To(BeNil())
 			}
 			if tc.expectLabels != nil {
 				g.Expect(cluster.Spec.Labels).To(Equal(tc.expectLabels))
+			} else {
+				g.Expect(cluster.Spec.Labels).To(BeNil())
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/core/create_test.go` around lines 1332 - 1337, The test currently
only asserts cluster.Spec.NodeSelector and cluster.Spec.Labels when
tc.expectNodeSelector/expectLabels are non-nil, allowing unintended mutations to
slip through; update the assertions in the "should not modify the cluster" path
so that when tc.expectNodeSelector is nil you explicitly assert
cluster.Spec.NodeSelector is nil (and similarly when tc.expectLabels is nil
assert cluster.Spec.Labels is nil), otherwise assert equality to
tc.expectNodeSelector / tc.expectLabels; reference the existing symbols
tc.expectNodeSelector, tc.expectLabels, cluster.Spec.NodeSelector and
cluster.Spec.Labels to locate and modify the checks.
control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go (1)

192-196: ⚠️ Potential issue | 🔴 Critical

Fix the BasicAuth TLS client cert/key volume name collision.

Both TLSClientCert and TLSClientKey use "tls-client-key" as the field name in SecretPath. When both are configured, this generates duplicate volume names (idp-secret-{i}-tls-client-key), causing the cert mount to be overwritten or the pod configuration to become invalid.

Note that convertKeystoneIDP (lines 311-316) correctly uses distinct field names: "tls-client-cert" for the cert and "tls-client-key" for the key.

Proposed fix
 	if basicAuthConfig.TLSClientCert.Name != "" {
-		provider.RemoteConnectionInfo.CertFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientCert.Name, "tls-client-key", corev1.TLSCertKey)
+		provider.RemoteConnectionInfo.CertFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientCert.Name, "tls-client-cert", corev1.TLSCertKey)
 	}
 	if basicAuthConfig.TLSClientKey.Name != "" {
 		provider.RemoteConnectionInfo.KeyFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientKey.Name, "tls-client-key", corev1.TLSPrivateKeyKey)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go`
around lines 192 - 196, The TLS client cert and key SecretPath calls under
basicAuthConfig are using the same field name "tls-client-key" causing duplicate
volume names; update the cert call so idpVolumeMounts.SecretPath(i,
basicAuthConfig.TLSClientCert.Name, "tls-client-cert", corev1.TLSCertKey) sets
provider.RemoteConnectionInfo.CertFile while keeping the key call as
SecretPath(..., "tls-client-key", corev1.TLSPrivateKeyKey); locate these in the
same block that sets provider.RemoteConnectionInfo in idp_convert.go and mirror
the distinct field naming used by convertKeystoneIDP.
control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go (1)

152-156: ⚠️ Potential issue | 🔴 Critical

Fix the BasicAuth TLS client cert/key volume name collision.

Same issue as the v2 version: both TLSClientCert and TLSClientKey use "tls-client-key" as the field name. This causes volume name collisions when both are configured.

convertKeystoneIDP (lines 243-248) correctly uses distinct field names.

Proposed fix
 	if basicAuthConfig.TLSClientCert.Name != "" {
-		provider.RemoteConnectionInfo.CertFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientCert.Name, "tls-client-key", corev1.TLSCertKey)
+		provider.RemoteConnectionInfo.CertFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientCert.Name, "tls-client-cert", corev1.TLSCertKey)
 	}
 	if basicAuthConfig.TLSClientKey.Name != "" {
 		provider.RemoteConnectionInfo.KeyFile = idpVolumeMounts.SecretPath(i, basicAuthConfig.TLSClientKey.Name, "tls-client-key", corev1.TLSPrivateKeyKey)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go`
around lines 152 - 156, The TLS client cert/key volume name is colliding because
both calls to idpVolumeMounts.SecretPath use the same field name
"tls-client-key"; update the call that sets
provider.RemoteConnectionInfo.CertFile (in the same conversion function where
basicAuthConfig.TLSClientCert and TLSClientKey are handled) to use a distinct
field name such as "tls-client-cert" (mirror the pattern used in
convertKeystoneIDP) so CertFile uses "tls-client-cert" and KeyFile keeps
"tls-client-key", leaving idpVolumeMounts.SecretPath and
provider.RemoteConnectionInfo.* assignments otherwise unchanged.
hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)

788-796: ⚠️ Potential issue | 🟡 Minor

Fail fast on unexpected object types in test stubs.

These createOrUpdate stubs still treat non-*networkingv1.NetworkPolicy objects as success, which can mask regressions.

Suggested hardening pattern (apply to each stub)
+import "fmt"
...
 createOrUpdate := upsert.CreateOrUpdateFN(func(ctx context.Context, c client.Client, obj client.Object, f controllerutil.MutateFn) (controllerutil.OperationResult, error) {
-	if netPol, ok := obj.(*networkingv1.NetworkPolicy); ok {
-		if err := f(); err != nil {
-			return controllerutil.OperationResultNone, err
-		}
-		createdPolicies[netPol.Name] = netPol
-	}
-	return controllerutil.OperationResultCreated, nil
+	netPol, ok := obj.(*networkingv1.NetworkPolicy)
+	if !ok {
+		return controllerutil.OperationResultNone, fmt.Errorf("unexpected object type: %T", obj)
+	}
+	if err := f(); err != nil {
+		return controllerutil.OperationResultNone, err
+	}
+	createdPolicies[netPol.Name] = netPol
+	return controllerutil.OperationResultCreated, nil
 })

Also applies to: 912-920, 1060-1068, 1139-1147, 1214-1222, 1289-1297

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`
around lines 788 - 796, The test stub createOrUpdate (constructed via
upsert.CreateOrUpdateFN) currently treats any non-*networkingv1.NetworkPolicy as
success and silently ignores type mismatches; change the stub to fail fast by
returning an error when obj is not a *networkingv1.NetworkPolicy (e.g., check
type in the func for createOrUpdate and return
controllerutil.OperationResultNone with a descriptive error if the assertion
fails), and apply the same hardening to the other stubs mentioned (the similar
CreateOrUpdateFN instances that track createdPolicies in this file) so
unexpected object types surface as test failures rather than being treated as
success.
hypershift-operator/controllers/hostedcluster/network_policies.go (1)

115-118: ⚠️ Potential issue | 🔴 Critical

Add nil-safety before iterating management cluster networks.

getManagementClusterNetwork can return nil on Line 117, but reconcileVirtLauncherNetworkPolicy dereferences it on Lines 612 and 616. This can panic for KubeVirt when CapabilityNetworks is unavailable.

Suggested fix
 func reconcileVirtLauncherNetworkPolicy(log logr.Logger, policy *networkingv1.NetworkPolicy, hcluster *hyperv1.HostedCluster, managementClusterNetwork *configv1.Network) error {
 	protocolTCP := corev1.ProtocolTCP
 	protocolUDP := corev1.ProtocolUDP
 	protocolSCTP := corev1.ProtocolSCTP

 	blockedIPv4Networks := []string{}
 	blockedIPv6Networks := []string{}
-	for _, network := range managementClusterNetwork.Spec.ClusterNetwork {
-		blockedIPv4Networks, blockedIPv6Networks = addToBlockedNetworks(network.CIDR, blockedIPv4Networks, blockedIPv6Networks)
-	}
-
-	for _, network := range managementClusterNetwork.Spec.ServiceNetwork {
-		blockedIPv4Networks, blockedIPv6Networks = addToBlockedNetworks(network, blockedIPv4Networks, blockedIPv6Networks)
+	if managementClusterNetwork != nil {
+		for _, network := range managementClusterNetwork.Spec.ClusterNetwork {
+			blockedIPv4Networks, blockedIPv6Networks = addToBlockedNetworks(network.CIDR, blockedIPv4Networks, blockedIPv6Networks)
+		}
+		for _, network := range managementClusterNetwork.Spec.ServiceNetwork {
+			blockedIPv4Networks, blockedIPv6Networks = addToBlockedNetworks(network, blockedIPv4Networks, blockedIPv6Networks)
+		}
 	}

Also applies to: 612-617

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/network_policies.go` around
lines 115 - 118, getManagementClusterNetwork can return nil when
CapabilityNetworks is unavailable; update reconcileVirtLauncherNetworkPolicy to
guard against that by checking the returned *configv1.Network for nil before
dereferencing or iterating its fields (e.g., before accessing network.Spec or
looping over network.Status/NetworkType), and if nil, skip or return early from
the parts that require management-cluster networks; alternatively ensure
getManagementClusterNetwork never returns nil by returning an empty Network
object—refer to getManagementClusterNetwork and
reconcileVirtLauncherNetworkPolicy to apply the nil-check and early-return fix.
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go (1)

627-635: ⚠️ Potential issue | 🔴 Critical

Index out of bounds: length check must precede slice access.

Line 627 accesses output.VpcEndpoints[0] before the empty-slice check at lines 631-634. If AWS returns a successful response with zero endpoints, this causes a panic.

🐛 Proposed fix
-	if err := deleteEndpointIfWrongService(ctx, ec2Client, output.VpcEndpoints[0], awsEndpointService.Status.EndpointServiceName, log); err != nil {
-		return "", nil, err
-	}
-
 	if len(output.VpcEndpoints) == 0 {
 		awsEndpointService.Status.EndpointID = ""
 		return "", nil, fmt.Errorf("endpoint with id %s not found, resetting status", endpointID)
 	}
+
+	if err := deleteEndpointIfWrongService(ctx, ec2Client, output.VpcEndpoints[0], awsEndpointService.Status.EndpointServiceName, log); err != nil {
+		return "", nil, err
+	}
+
 	log.Info("endpoint exists", "endpointID", endpointID)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`
around lines 627 - 635, The code calls deleteEndpointIfWrongService(ctx,
ec2Client, output.VpcEndpoints[0],
awsEndpointService.Status.EndpointServiceName, log) before checking
len(output.VpcEndpoints), which can panic when VpcEndpoints is empty; fix by
first verifying len(output.VpcEndpoints) > 0 and handling the empty case (reset
awsEndpointService.Status.EndpointID, return the formatted error referencing
endpointID) before any access to output.VpcEndpoints, then call
deleteEndpointIfWrongService with output.VpcEndpoints[0] and proceed to log the
existing endpoint.
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go (2)

4008-4012: ⚠️ Potential issue | 🟡 Minor

Verify the partial version status on the forced list-failure path.

The behavior under test here is not just “returns an error”; reconcileControlPlaneVersionStatus is also supposed to persist a Partial ControlPlaneVersion before returning. A broken status patch still passes this test today.

Suggested assertion
 			originalHCP := tc.hcp.DeepCopy()
 			err := r.reconcileControlPlaneVersionStatus(t.Context(), tc.hcp, originalHCP, releaseImage)
 			if tc.expectError {
 				g.Expect(err).To(HaveOccurred())
+				if tc.name == "When component list fails, it should set partial version and return error" {
+					updated := &hyperv1.HostedControlPlane{}
+					g.Expect(c.Get(t.Context(), client.ObjectKeyFromObject(tc.hcp), updated)).To(Succeed())
+					g.Expect(updated.Status.ControlPlaneVersion.Desired.Image).To(Equal(resolvedRef.String()))
+					g.Expect(updated.Status.ControlPlaneVersion.History).ToNot(BeEmpty())
+					g.Expect(updated.Status.ControlPlaneVersion.History[0].State).To(Equal(configv1.PartialUpdate))
+				}
 				return
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go`
around lines 4008 - 4012, The test currently only asserts an error was returned
from reconcileControlPlaneVersionStatus but must also verify that a Partial
ControlPlaneVersion was persisted before the error; after calling
reconcileControlPlaneVersionStatus with originalHCP, tc.hcp and releaseImage
(and when tc.expectError is true) fetch the updated HCP from the fake client and
assert its Status.ControlPlaneVersion (or the ControlPlaneVersion slice/field
used by reconcileControlPlaneVersionStatus) contains a new entry with Version
equal to the releaseImage version and State/Type set to Partial (or the
enum/string the controller uses) — add this assertion immediately after
expecting the error to ensure the forced-list-failure path writes the partial
status.

2780-2900: ⚠️ Potential issue | 🟡 Minor

Assert EtcdSnapshotRestored in the restore-snapshot case.

This case is named around restore completion, but it only verifies EtcdAvailable. A regression in the restore-status branch would still pass.

Suggested assertion
 			cond := meta.FindStatusCondition(tc.hcp.Status.Conditions, string(hyperv1.EtcdAvailable))
 			g.Expect(cond).ToNot(BeNil())
 			g.Expect(cond.Type).To(Equal(tc.expectedCondition.Type))
 			g.Expect(cond.Status).To(Equal(tc.expectedCondition.Status))
 			g.Expect(cond.Reason).To(Equal(tc.expectedCondition.Reason))
 			g.Expect(cond.ObservedGeneration).To(Equal(tc.expectedCondition.ObservedGeneration))
 			if tc.expectedCondition.Message != "" {
 				g.Expect(cond.Message).To(Equal(tc.expectedCondition.Message))
 			}
+
+			if tc.hcp.Spec.Etcd.Managed != nil && len(tc.hcp.Spec.Etcd.Managed.Storage.RestoreSnapshotURL) > 0 {
+				restored := meta.FindStatusCondition(tc.hcp.Status.Conditions, string(hyperv1.EtcdSnapshotRestored))
+				g.Expect(restored).ToNot(BeNil())
+				g.Expect(restored.Status).To(Equal(metav1.ConditionTrue))
+				g.Expect(restored.ObservedGeneration).To(Equal(tc.hcp.Generation))
+			}
 		})
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go`
around lines 2780 - 2900, The test case for restore-snapshot should also assert
the EtcdSnapshotRestored condition: in the test loop after calling
r.reconcileEtcdStatus(t.Context(), tc.hcp) (the
HostedControlPlaneReconciler.reconcileEtcdStatus test), when
tc.hcp.Spec.Etcd.Managed.Storage.RestoreSnapshotURL is set and the
StatefulSet/Pod readiness objects are present, add an assertion that
meta.FindStatusCondition(tc.hcp.Status.Conditions,
string(hyperv1.EtcdSnapshotRestored)) is not nil and that its Status equals
metav1.ConditionTrue (and check ObservedGeneration matches tc.hcp.Generation if
desired); place this check alongside the existing EtcdAvailable assertions so
the restore-specific expectation fails if restore-status logic regresses.
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go (1)

386-392: ⚠️ Potential issue | 🔴 Critical

Prevent UnauthorizedOperation from falling through to the error return.

This still returns destroyErr for UnauthorizedOperation because the else only belongs to the DependencyViolation check. That can leave the finalizer stuck even after logging that deletion is being skipped.

Suggested fix
-			if code == "UnauthorizedOperation" {
-				r.Log.Error(destroyErr, "Skipping AWS default security group deletion because of unauthorized operation.")
-			}
-			if code == "DependencyViolation" {
-				r.Log.Error(destroyErr, "Skipping AWS default security group deletion because of dependency violation.")
-			} else {
-				return ctrl.Result{}, fmt.Errorf("failed to delete AWS default security group: %w", destroyErr)
-			}
+			switch code {
+			case "UnauthorizedOperation":
+				r.Log.Error(destroyErr, "Skipping AWS default security group deletion because of unauthorized operation.")
+			case "DependencyViolation":
+				r.Log.Error(destroyErr, "Skipping AWS default security group deletion because of dependency violation.")
+			default:
+				return ctrl.Result{}, fmt.Errorf("failed to delete AWS default security group: %w", destroyErr)
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`
around lines 386 - 392, The two separate if checks for error codes allow the
"UnauthorizedOperation" case to fall through into the final error return; change
the branching around the AWS default security group deletion to use an if / else
if / else (or explicitly return after logging) so that when code ==
"UnauthorizedOperation" or code == "DependencyViolation" you only log via
r.Log.Error(destroyErr, ...) and then return ctrl.Result{}, nil (or otherwise
skip returning the fmt.Errorf), and only execute the fmt.Errorf return in the
final else; reference the variables and calls in this block (code, destroyErr,
r.Log.Error, ctrl.Result{}, fmt.Errorf) to locate and update the logic.
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go (3)

3232-3234: ⚠️ Potential issue | 🟡 Minor

Add a positive assertion for the cloud-cleanup branch.

The expectCloudCleanup cases still pass without proving that destroyCloudResources actually ran. A direct assertion on a cleanup side effect, like the CloudResourcesDestroyed condition, would catch cleanup-gate regressions.

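A hedged sketch of such an assertion, reusing tt.expectCloudCleanup and the CloudResourcesDestroyed condition named in this comment; the client and fixture names are assumptions about the test harness:

			if tt.expectCloudCleanup {
				// Assumption: the reconciled HostedControlPlane can be re-read through the
				// test's fake client (here "c"); adjust names to the actual fixture.
				updated := &hyperv1.HostedControlPlane{}
				g.Expect(c.Get(t.Context(), client.ObjectKeyFromObject(tt.hcp), updated)).To(Succeed())
				destroyed := meta.FindStatusCondition(updated.Status.Conditions, string(hyperv1.CloudResourcesDestroyed))
				g.Expect(destroyed).ToNot(BeNil(), "cleanup path should set the CloudResourcesDestroyed condition")
			}
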
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`
around lines 3232 - 3234, Test currently only asserts no-requeue when
expectCloudCleanup is false; add a positive assertion for the branch where
tt.expectCloudCleanup is true by asserting the cleanup side-effect performed
(e.g., verify the CloudResourcesDestroyed condition was set/updated and/or that
destroyCloudResources was invoked). Locate the table-driven test using
tt.expectCloudCleanup and add an assertion after reconciliation that when
tt.expectCloudCleanup == true the status contains the CloudResourcesDestroyed
condition (or the equivalent condition name used in the test) and/or that the
mocked destroyCloudResources recorder was called, so the test proves the cleanup
path executed.

3216-3230: ⚠️ Potential issue | 🟡 Minor

Verify the binding independently from the policy.

The table has separate expectVAPDeleted and expectVAPBDeleted flags, but the binding assertion is still gated by expectVAPDeleted. A binding-only regression won't be caught.

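One possible shape for the independent check, built from the symbols named in the prompt below (guestClient, manifests.ValidatingAdmissionPolicyBinding, registry.AdmissionPolicyNameManagementState); the surrounding harness wiring is an assumption:

			// Gate the binding assertion on its own flag rather than expectVAPDeleted.
			vapb := manifests.ValidatingAdmissionPolicyBinding(fmt.Sprintf("%s-binding", registry.AdmissionPolicyNameManagementState))
			getErr := guestClient.Get(t.Context(), client.ObjectKeyFromObject(vapb), vapb)
			if tt.expectVAPBDeleted {
				g.Expect(apierrors.IsNotFound(getErr)).To(BeTrue(), "binding should be deleted")
			} else {
				g.Expect(getErr).ToNot(HaveOccurred(), "binding should still exist")
			}
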
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`
around lines 3216 - 3230, The binding deletion check is incorrectly tied to
expectVAPDeleted; update the test to assert the ValidatingAdmissionPolicyBinding
independently using the expectVAPBDeleted flag: when expectVAPBDeleted is true
call guestClient.Get on
manifests.ValidatingAdmissionPolicyBinding(fmt.Sprintf("%s-binding",
registry.AdmissionPolicyNameManagementState)) and assert
apierrors.IsNotFound(getErr), and when expectVAPBDeleted is false assert Get
returns no error; keep the existing manifests.ValidatingAdmissionPolicy checks
controlled by expectVAPDeleted.

2732-2734: ⚠️ Potential issue | 🟡 Minor

Make the PATCH assertion fail when PATCH is missing.

An earlier assertion already verifies that the script contains curl, so this && condition only fails when both substrings are absent; a script that has curl but omits PATCH still passes.

Suggested fix
-	if !strings.Contains(script, "curl") && !strings.Contains(script, "PATCH") {
+	if !strings.Contains(script, "PATCH") {
 		t.Error("Check script should PATCH the ConfigMap on success")
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`
around lines 2732 - 2734, The test currently uses if !strings.Contains(script,
"curl") && !strings.Contains(script, "PATCH") which only fails when both are
missing; change the check so the test fails when either is missing (e.g. use a
logical OR or split into two assertions) so that absence of "PATCH" alone will
trigger t.Error; update the condition referencing the script variable and the
t.Error call in the test function to assert both "curl" and "PATCH" are present.
🧹 Nitpick comments (8)
control-plane-operator/controllers/azureprivatelinkservice/controller_test.go (2)

1841-1844: Prefer order-insensitive assertions for deleted record sets.

At Line 1843, using Equal couples this test to deletion order rather than behavior. ConsistOf is safer here.

Suggested change
-			if tt.expectDeleteCalled {
-				g.Expect(mockRecords.deletedRecordNames).To(Equal(tt.expectedDeletedRecords))
-			}
+			if tt.expectDeleteCalled {
+				g.Expect(mockRecords.deletedRecordNames).To(ConsistOf(tt.expectedDeletedRecords))
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go`
around lines 1841 - 1844, The test currently asserts deletion order by comparing
mockRecords.deletedRecordNames with tt.expectedDeletedRecords using Equal;
change this to an order-insensitive assertion (e.g., Gomega's ConsistOf) so the
test checks that the same set of record names were deleted regardless of order;
update the assertion that reads
g.Expect(mockRecords.deletedRecordNames).To(Equal(tt.expectedDeletedRecords)) to
use ConsistOf (or an equivalent order-insensitive matcher) while keeping the
surrounding conditional logic that checks mockRecords.deleteCalled.

1850-1905: Strengthen mapping assertions beyond count-only validation.

At Line 1904, HaveLen alone can miss wrong targets. Consider asserting NamespacedName contents to verify correct mapping semantics.

Suggested change
 	tests := []struct {
 		name           string
 		hcp            *hyperv1.HostedControlPlane
 		plsCRs         []client.Object
 		expectRequests int
+		expectedNNs    []types.NamespacedName
 	}{
 		{
 			name: "When HCP has the Azure PLS finalizer and PLS CRs exist, it should return requests for all PLS CRs",
@@
 			plsCRs: []client.Object{
 				newTestAzurePLS(t, "private-router", "test-ns"),
 				newTestAzurePLS(t, "oauth-openshift", "test-ns"),
 			},
 			expectRequests: 2,
+			expectedNNs: []types.NamespacedName{
+				{Name: "private-router", Namespace: "test-ns"},
+				{Name: "oauth-openshift", Namespace: "test-ns"},
+			},
 		},
@@
 			requests := mapFn(ctx, tt.hcp)
 			g.Expect(requests).To(HaveLen(tt.expectRequests))
+			if len(tt.expectedNNs) > 0 {
+				actual := make([]types.NamespacedName, 0, len(requests))
+				for _, req := range requests {
+					actual = append(actual, req.NamespacedName)
+				}
+				g.Expect(actual).To(ConsistOf(tt.expectedNNs))
+			}
 		})
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go`
around lines 1850 - 1905, The test currently only checks the count of requests
returned by mapHCPToAzurePLS; update the assertion to validate the actual
NamespacedName targets so wrong mappings aren't missed: after calling mapFn(ctx,
tt.hcp) build an expected slice of reconcile.Request (or NamespacedName) from
tt.plsCRs (using each object's GetName and GetNamespace, e.g., for
newTestAzurePLS entries "private-router", "oauth-openshift") and replace the
HaveLen check with a Gomega assertion that the returned requests contain exactly
those expected requests (e.g., Expect(requests).To(ConsistOf(...)) or
Expect(requests).To(Equal(...))) so the test verifies both count and specific
mapped objects; locate this change around the test using
AzurePrivateLinkServiceReconciler.mapHCPToAzurePLS and the local variable
requests.
control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go (1)

830-851: Path assertions are correct; minor improvement for pointer safety.

The path assertions using underscores (e.g., idp_cm_2_ca, idp_secret_3_client-secret) correctly match the mount path format from IDPVolumeMountInfo, while the volume name assertions using hyphens (e.g., idp-cm-2-ca, idp-secret-3-client-secret) correctly match the volume naming format. These tests accurately verify both aspects of the implementation.

The pointer dereference on line 850 could be made safer, though SecretPath always sets DefaultMode via ptr.To[int32](0640):

Safer assertion pattern
 	g.Expect(vmi.Volumes[0].Secret.SecretName).To(Equal("my-secret"))
-	g.Expect(*vmi.Volumes[0].Secret.DefaultMode).To(Equal(int32(0640)))
+	g.Expect(vmi.Volumes[0].Secret.DefaultMode).ToNot(BeNil())
+	g.Expect(*vmi.Volumes[0].Secret.DefaultMode).To(Equal(int32(0640)))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go`
around lines 830 - 851, The pointer dereference in
TestIDPVolumeMountInfo_SecretPath is unsafe; before dereferencing
vmi.Volumes[0].Secret.DefaultMode (from SecretPath) add an assertion that the
pointer is non-nil (e.g., Expect(...).ToNot(BeNil())) and then assert its value,
or use a pointer-aware matcher (PointTo/Equal) to check the int32(0640) value;
update the test referencing TestIDPVolumeMountInfo_SecretPath and the SecretPath
result so the DefaultMode pointer is validated for nil before comparing its
dereferenced value.
hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go (1)

1380-1466: Assert corrupted-annotation cleanup side effect, not just booleans.

At Line 1461 you validate return flags, but the corrupted-input path also patches the Pod to remove lastSnapshotTimeAnnotation. Adding that assertion would harden this test against regressions.

🔧 Suggested test hardening
 func TestCheckSnapshotInterval(t *testing.T) {
 	tests := []struct {
 		name                 string
 		annotations          map[string]string
 		minInterval          string
 		expectedShouldCreate bool
 		expectedSkip         bool
+		verifyAnnotationRemoved bool
 	}{
@@
 		{
 			name: "When last snapshot time annotation is corrupted, it should allow snapshot creation",
 			annotations: map[string]string{
 				lastSnapshotTimeAnnotation: "invalid-time",
 			},
 			minInterval:          "1h",
 			expectedShouldCreate: true,
 			expectedSkip:         false,
+			verifyAnnotationRemoved: true,
 		},
@@
 			shouldSnapshot, skipReconcile := reconciler.checkSnapshotInterval(context.Background(), pod, spec, logr.Discard())
 			g.Expect(shouldSnapshot).To(Equal(tt.expectedShouldCreate))
 			g.Expect(skipReconcile).To(Equal(tt.expectedSkip))
+
+			if tt.verifyAnnotationRemoved {
+				updatedPod := &corev1.Pod{}
+				err := fakeClient.Get(context.Background(), types.NamespacedName{Name: pod.Name, Namespace: pod.Namespace}, updatedPod)
+				g.Expect(err).ToNot(HaveOccurred())
+				_, exists := updatedPod.Annotations[lastSnapshotTimeAnnotation]
+				g.Expect(exists).To(BeFalse(), "Corrupted last snapshot time annotation should be removed")
+			}
 		})
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go`
around lines 1380 - 1466, The corrupted-annotation test case should also assert
the side-effect that checkSnapshotInterval removes the invalid
lastSnapshotTimeAnnotation from the Pod; update TestCheckSnapshotInterval to,
for the scenario where annotations[lastSnapshotTimeAnnotation] ==
"invalid-time", after calling reconciler.checkSnapshotInterval retrieve the Pod
from reconciler.client (fake client) and assert that the annotations map no
longer contains lastSnapshotTimeAnnotation (or that it's empty/nil). This
ensures SnapshotReconciler.checkSnapshotInterval's cleanup behavior is tested in
addition to the boolean flags.
cmd/infra/azure/create.go (1)

167-313: Add operation context to propagated errors.

Several branches now return bare err (for example Line 177, Line 198, Line 227, Line 290, Line 303, Line 309). Wrapping these with action context will make CLI failures much easier to triage.

Proposed pattern
- return "", "", "", err
+ return "", "", "", fmt.Errorf("failed to get base domain ID for %q: %w", o.BaseDomain, err)
- return err
+ return fmt.Errorf("failed to create private DNS zone link: %w", err)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/infra/azure/create.go` around lines 167 - 313, Several branches in
createNetworkResources, handleIdentitiesAndRBAC, and createDNSAndLBResources
return raw errors; wrap those errors with contextual messages. For each call to
netMgr.GetBaseDomainID,
azureutil.GetNameAndResourceGroupFromNetworkSecurityGroupID,
netMgr.CreateSecurityGroup, netMgr.CreateVirtualNetwork,
rbacMgr.AssignControlPlaneRoles, rbacMgr.AssignDataPlaneRoles,
rbacMgr.AssignWorkloadIdentities, netMgr.CreatePrivateDNSZone,
netMgr.CreatePrivateDNSZoneLink, netMgr.CreatePublicIPAddressForLB, and
netMgr.CreateLoadBalancer, replace returning bare err with fmt.Errorf("failed to
<action> [resource/context]: %w", err) where <action> describes the operation
(e.g., "get base domain ID", "parse NSG ID", "create network security group",
"create virtual network", "assign control plane roles", "create private DNS
zone", "create public IP for LB", etc.) so the functions createNetworkResources,
handleIdentitiesAndRBAC, and createDNSAndLBResources propagate errors with clear
operation context.
hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)

810-865: Add a nil-management-network KubeVirt test case.

TestReconcilePlatformNetworkPolicies currently always passes a non-nil managementClusterNetwork; add a case with platformType=KubevirtPlatform and managementClusterNetwork=nil to protect against regressions in the nil path.

Also applies to: 925-939

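A hedged sketch of such a table entry, reusing the field names cited in the prompt below; the expected outcome is an assumption about the intended nil-path behavior:

		{
			name:                     "When platform is KubeVirt and managementClusterNetwork is nil, it should not panic",
			platformType:             hyperv1.KubevirtPlatform,
			kubevirtCredentials:      nil,
			managementClusterNetwork: nil,
			version:                  "4.15.0",
			// Assumption: the virt-launcher policy is still reconciled without a
			// management cluster network; flip this if the intended behavior differs.
			expectVirtLauncher: true,
		},
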
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go`
around lines 810 - 865, Add a test case to TestReconcilePlatformNetworkPolicies
to exercise the nil managementClusterNetwork path: in the testCases slice add an
entry with name indicating "KubeVirt with nil managementClusterNetwork",
platformType set to hyperv1.KubevirtPlatform, kubevirtCredentials left nil,
managementClusterNetwork explicitly nil, version set (e.g. "4.15.0"), and the
expected outcome (expectVirtLauncher true if you want the virt-launcher policy
created when no credentials are present). Update the parallel KubeVirt-related
test block(s) (the other testCases region referenced around the later lines) the
same way so the nil-management-cluster-network code path is covered.
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go (1)

1238-1242: t.Parallel() should be called before test setup.

The call to t.Parallel() at line 1241 occurs after NewGomegaWithT(t) at line 1240. Per Go testing conventions, t.Parallel() should be the first statement in a test or subtest to ensure proper parallel execution setup.

🔧 Suggested fix
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
-			g := NewGomegaWithT(t)
 			t.Parallel()
+			g := NewGomegaWithT(t)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go`
around lines 1238 - 1242, Move the t.Parallel() call to the very start of each
subtest so parallel execution is configured before any setup; specifically, in
the table-driven loop where you call t.Run for each case, call t.Parallel() as
the first statement inside the subtest function (before NewGomegaWithT(t) and
any other setup) so that the NewGomegaWithT(t) initialization occurs after the
subtest has been marked parallel.
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go (1)

680-687: Duplicate error logging.

Lines 680 and 686 both log the same error. Consider removing one to avoid log noise.

♻️ Proposed fix
 	if err != nil {
-		log.Error(err, "failed to modify vpc endpoint", "id", endpointID, "addSubnets", addedSubnet, "removeSubnets", removedSubnet, "addSG", addedSG)
 		msg := err.Error()
 		var apiErr smithy.APIError
 		if errors.As(err, &apiErr) {
 			msg = apiErr.ErrorCode()
 		}
-		log.Error(err, "failed to modify vpc endpoint")
+		log.Error(err, "failed to modify vpc endpoint", "id", endpointID, "addSubnets", addedSubnet, "removeSubnets", removedSubnet, "addSG", addedSG)
 		return fmt.Errorf("failed to modify vpc endpoint: %s", msg)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`
around lines 680 - 687, The code logs the same error twice in the modify VPC
endpoint path; remove the redundant log call so the error is only logged once
with full context (keep the detailed log that includes endpointID, addedSubnet,
removedSubnet, addedSG and remove the second log.Error call). Update the block
around the modify call (referencing variables endpointID, addedSubnet,
removedSubnet, addedSG and the log variable) to log once, then build the msg
from the error (using errors.As to extract smithy.APIError) and return the
formatted error as currently done.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 6139b117-5481-4551-a6a9-00051a22f5b8

📥 Commits

Reviewing files that changed from the base of the PR and between ee0e3be and c52964e.

📒 Files selected for processing (58)
  • .golangci.yml
  • cmd/cluster/core/create.go
  • cmd/cluster/core/create_test.go
  • cmd/fix/dr_oidc_iam.go
  • cmd/infra/aws/create.go
  • cmd/infra/azure/create.go
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/assets/hypershift_operator_test.go
  • cmd/install/install.go
  • cmd/install/install_test.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • control-plane-operator/main.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • hypershift-operator/controllers/etcdbackup/reconciler_test.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
  • hypershift-operator/controllers/hostedcluster/metrics/metrics.go
  • hypershift-operator/controllers/hostedcluster/network_policies.go
  • hypershift-operator/controllers/hostedcluster/network_policies_test.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/platform/aws/controller.go
  • hypershift-operator/controllers/platform/aws/controller_test.go
  • hypershift-operator/controllers/scheduler/aws/autoscaler.go
  • hypershift-operator/controllers/scheduler/aws/autoscaler_test.go
  • hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes.go
  • hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes_test.go
  • hypershift-operator/main.go
  • ignition-server/controllers/local_ignitionprovider.go
  • ignition-server/controllers/local_ignitionprovider_test.go
  • test/e2e/util/dump/journals.go
  • test/e2e/util/reqserving/verifycp.go
  • test/e2e/util/reqserving/verifyenv.go
  • test/e2e/util/util.go
✅ Files skipped from review due to trivial changes (7)
  • .golangci.yml
  • hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • cmd/install/assets/hypershift_operator_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go
  • cmd/install/assets/hypershift_operator.go
  • cmd/cluster/core/create.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
  • control-plane-operator/main.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
  • cmd/install/install_test.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller.go
  • cmd/install/install.go

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/fix/dr_oidc_iam.go (1)

452-468: ⚠️ Potential issue | 🟠 Major

Fail fast when docs are missing in manual mode.

With the new oidcDocsExist && exists gate, the manual flow now falls through into Step 3 whenever the IAM provider exists but the S3 docs do not. That path can create/reconfigure the bucket and only then fail in ensureOIDCDocuments because regenerating docs requires --hc-name. Since this command explicitly advertises manual recovery when the HostedCluster is not accessible, please reject that combination before any live mutations.

Possible guard
  oidcDocsExist, providerARN, exists, err := o.checkOIDCState(ctx, s3Client, iamClient)
  if err != nil {
  	return err
  }
+ if !oidcDocsExist && o.HostedClusterName == "" && !o.DryRun {
+ 	return fmt.Errorf("OIDC documents are missing; rerun with --hc-name/--hc-namespace so they can be regenerated from the cluster signing key")
+ }
  if oidcDocsExist && exists && !o.ForceRecreate {
  	fmt.Println("\nAll OIDC components are in place - no action needed!")
  	return nil
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/fix/dr_oidc_iam.go` around lines 452 - 468, The code allows mutating S3
when the IAM provider exists but OIDC docs are missing, which in "manual" mode
(no HostedCluster name/--hc-name) will later fail; add an early guard before
calling ensureOIDCBucket: if provider exists (exists == true) && oidcDocsExist
== false && o.ForceRecreate == false && the command is in
manual/no-hosted-cluster mode (e.g. o.HCName is empty or equivalent), return an
error instructing the user to supply --hc-name or use --force-recreate instead
of proceeding. Place this check after checkOIDCState(...) and before
ensureOIDCBucket(...), referencing checkOIDCState, oidcDocsExist, exists,
o.ForceRecreate, and ensureOIDCBucket/ensureOIDCDocuments.
♻️ Duplicate comments (8)
cmd/infra/aws/create.go (2)

385-411: ⚠️ Potential issue | 🟠 Major

Ensure temporary VPC cleanup runs on all exit paths.

If CreatePrivateZone fails after createVPC, initialVPC is leaked because deletion only happens at the end. Register cleanup with defer immediately after VPC creation.

Proposed fix
-func (o *CreateInfraOptions) createDNSZones(ctx context.Context, l logr.Logger, clusterCreatorEC2Client awsapi.EC2API, route53Client, vpcOwnerRoute53Client awsapi.ROUTE53API, result *CreateInfraOutput) error {
+func (o *CreateInfraOptions) createDNSZones(ctx context.Context, l logr.Logger, clusterCreatorEC2Client awsapi.EC2API, route53Client, vpcOwnerRoute53Client awsapi.ROUTE53API, result *CreateInfraOutput) (err error) {
 	privateZoneClient := vpcOwnerRoute53Client
 	var initialVPC string
-	var err error
 	if o.PrivateZonesInClusterAccount {
 		privateZoneClient = route53Client
 		if initialVPC, err = o.createVPC(ctx, l, clusterCreatorEC2Client); err != nil {
 			return err
 		}
+		defer func() {
+			if initialVPC == "" {
+				return
+			}
+			if cleanupErr := o.deleteVPC(ctx, l, clusterCreatorEC2Client, initialVPC); cleanupErr != nil {
+				if err == nil {
+					err = cleanupErr
+				} else {
+					l.Error(cleanupErr, "failed to delete temporary VPC", "vpcID", initialVPC)
+				}
+			}
+		}()
 	}
@@
-	if initialVPC != "" {
-		if err := o.deleteVPC(ctx, l, clusterCreatorEC2Client, initialVPC); err != nil {
-			return err
-		}
-	}
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/infra/aws/create.go` around lines 385 - 411, createDNSZones may leak
initialVPC if createVPC succeeds but a subsequent CreatePrivateZone fails; after
calling createVPC (inside the o.PrivateZonesInClusterAccount branch) immediately
register a defer that calls o.deleteVPC(ctx, l, clusterCreatorEC2Client,
initialVPC) only when initialVPC is non-empty, and ensure the deferred delete
does not overwrite the function's returned error (log or wrap the delete error
instead). In other words, after createVPC returns set initialVPC and then defer
a safe cleanup that references initialVPC and handles/delete errors without
clobbering the outer error, leaving the rest of createDNSZones (calls to
CreatePrivateZone) unchanged.

255-256: ⚠️ Potential issue | 🔴 Critical

Pass a public subnet into the proxy-host path.

Line 428 currently launches the proxy host into result.Zones[0].SubnetID (private subnet in normal flow). That breaks public IP/SSH-based setup for non-PublicOnly clusters.

Proposed fix
-	if err := o.createProxyResources(ctx, l, ec2Client, result); err != nil {
+	if err := o.createProxyResources(ctx, l, ec2Client, result, publicSubnetIDs[0]); err != nil {
 		return nil, err
 	}
-func (o *CreateInfraOptions) createProxyResources(ctx context.Context, l logr.Logger, ec2Client awsapi.EC2API, result *CreateInfraOutput) error {
+func (o *CreateInfraOptions) createProxyResources(ctx context.Context, l logr.Logger, ec2Client awsapi.EC2API, result *CreateInfraOutput, publicSubnetID string) error {
@@
-		proxyResult, err := o.createProxyHost(ctx, l, ec2Client, result.Zones[0].SubnetID, sgGroupID, o.EnableSecureProxy)
+		proxyResult, err := o.createProxyHost(ctx, l, ec2Client, publicSubnetID, sgGroupID, o.EnableSecureProxy)
 		if err != nil {
 			return fmt.Errorf("failed to create proxy host: %w", err)
 		}

Also applies to: 413-438

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/infra/aws/create.go` around lines 255 - 256, The proxy host is being
launched into result.Zones[0].SubnetID (a private subnet) which prevents
obtaining a public IP/SSH access; update createProxyResources (and the
proxy-host creation path in the same block that iterates result.Zones) to use
the zone's public subnet field instead — e.g., choose Zone.PublicSubnetID (or
the equivalent public-subnet property on each element of result.Zones) when
provisioning the proxy host unless the cluster is explicitly PublicOnly; ensure
the subnet ID passed to the proxy host creation call is the public subnet for
that zone so the instance gets a public IP.
hypershift-operator/controllers/hostedcluster/metrics/metrics.go (1)

267-287: ⚠️ Potential issue | 🟠 Major

Return immediately after the HostedCluster list fails.

After the error on Line 268, Collect still emits zeroed aggregates and advances lastCollectTime on Line 287. That turns a transient API failure into “0 clusters” and can permanently drop transition observations from the failed interval.

Suggested fix
 	hclusters := &hyperv1.HostedClusterList{}
 	if err := c.List(context.Background(), hclusters); err != nil {
 		log.Error(err, "failed to list hosted clusters while collecting metrics")
+		return
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/metrics/metrics.go` around
lines 267 - 287, The HostedCluster list error handling in Collect is incomplete:
after c.List(context.Background(), hclusters) fails you must return immediately
instead of continuing to emit zeroed aggregates and updating c.lastCollectTime;
update the Collect function so that when the List call returns an error (the
log.Error(err, "failed to list hosted clusters while collecting metrics") path)
the function exits (return) and does not call emitAggregatedMetrics,
c.transitionDurationMetric.Collect, c.collectPerClusterMetrics, or update
c.lastCollectTime, preserving previous metrics and transition observations.
hypershift-operator/controllers/etcdbackup/reconciler.go (1)

165-168: ⚠️ Potential issue | 🟠 Major

Clean up temporary resources on job-creation NotFound path.

At Line 165, this path marks backup failed but leaves RBAC/NetworkPolicy created earlier in createResourcesAndJob until a later reconcile. That leaves elevated access longer than needed.

Suggested fix
 	if err := r.createBackupJob(ctx, backup, hcp); err != nil {
 		if apierrors.IsNotFound(err) {
+			if cleanupErr := r.cleanupResources(ctx, backup); cleanupErr != nil {
+				logger.Error(cleanupErr, "failed to cleanup resources after backup job creation failure")
+			}
 			return r.setFailedConditionAndUpdate(ctx, backup, hyperv1.BackupFailedReason, err.Error())
 		}
 		return ctrl.Result{}, fmt.Errorf("failed to create backup Job: %w", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go` around lines 165 -
168, The NotFound error path in the createBackupJob call marks the Backup failed
but leaves temporary RBAC/NetworkPolicy created earlier by
createResourcesAndJob; modify the NotFound branch so before returning from
setFailedConditionAndUpdate you clean up those temporary resources (e.g., invoke
the same cleanup logic used on other failure paths or add a helper like
deleteTemporaryResourcesForBackup(ctx, backup) that deletes the
Role/RoleBinding/NetworkPolicy and ignores NotFound errors), ensuring the
cleanup is idempotent and errors from cleanup are logged/handled before calling
setFailedConditionAndUpdate.
control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go (1)

821-825: ⚠️ Potential issue | 🟠 Major

Add namespace filter to List operation to prevent cross-namespace deletion.

Line 823 lists services with externalPrivateServiceLabel across all namespaces. Since services are created in hcp.Namespace (Lines 805-808), the deletion loop at Lines 829-833 could inadvertently remove services belonging to other HCPs in different namespaces that share the same label.

♻️ Proposed fix
 	privateExternalServices := &corev1.ServiceList{}
-	if err := r.List(ctx, privateExternalServices, client.HasLabels{externalPrivateServiceLabel}); err != nil {
+	if err := r.List(ctx, privateExternalServices, client.InNamespace(hcp.Namespace), client.HasLabels{externalPrivateServiceLabel}); err != nil {
 		return fmt.Errorf("cannot list private external services: %w", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go`
around lines 821 - 825, The List call that fetches privateExternalServices uses
only client.HasLabels{externalPrivateServiceLabel} and thus queries across all
namespaces; update the r.List invocation to include a namespace filter e.g. add
client.InNamespace(hcp.Namespace) alongside the existing client.HasLabels option
so privateExternalServices only contains services in hcp.Namespace (this ensures
the subsequent deletion loop over privateExternalServices.Items only affects
services in the target HCP namespace).
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go (2)

583-586: ⚠️ Potential issue | 🟠 Major

Keep this helper in collect-and-continue mode on registry read failures.

Returning here skips ingress reconciliation and OIDC secret sync for the whole pass. That reintroduces fail-fast behavior inside a helper that otherwise aggregates errors.

Suggested fix
 	if err := r.client.Get(ctx, client.ObjectKeyFromObject(registryConfig), registryConfig); err != nil {
 		if !apierrors.IsNotFound(err) {
-			return []error{fmt.Errorf("failed to get registry config: %w", err)}
+			errs = append(errs, fmt.Errorf("failed to get registry config: %w", err))
 		}
 	} else {
 		registryConfigExists = true
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`
around lines 583 - 586, The helper currently returns immediately when
r.client.Get(ctx, client.ObjectKeyFromObject(registryConfig), registryConfig)
fails (non-NotFound), which short-circuits the rest of the pass; instead, change
this to collect the error and continue: on non-NotFound errors append
fmt.Errorf("failed to get registry config: %w", err) to the helper's
error-collection slice (e.g., errs) and avoid returning, so ingress
reconciliation and OIDC secret sync still run for this pass while the error is
aggregated.

427-429: ⚠️ Potential issue | 🟠 Major

Don't mask accumulated errors behind the recovery requeue.

If recovery asks for a requeue, this branch exits before errs is aggregated, so failures from earlier reconciliation steps are silently dropped.

Suggested fix
-	if result, err := r.reconcileClusterRecovery(ctx, log, hcp, errs); err != nil || result.RequeueAfter > 0 {
-		return result, err
-	}
-
-	return ctrl.Result{}, utilerrors.NewAggregate(errs)
+	if result, err := r.reconcileClusterRecovery(ctx, log, hcp, errs); err != nil {
+		return result, err
+	} else if result.RequeueAfter > 0 {
+		if len(errs) == 0 {
+			return result, nil
+		}
+		return ctrl.Result{}, utilerrors.NewAggregate(errs)
+	}
+
+	return ctrl.Result{}, utilerrors.NewAggregate(errs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`
around lines 427 - 429, The current early return when
reconcileClusterRecovery(ctx, log, hcp, errs) requests a requeue or returns an
error drops previously accumulated errors in errs; instead, call
reconcileClusterRecovery, append its error (if any) to the errs slice, and
before returning on result.RequeueAfter > 0 aggregate and return errs (e.g.,
using errors.NewAggregate or the project's error-aggregation helper) so earlier
reconciliation failures are not masked; reference reconcileClusterRecovery, the
result variable, and errs when making this change.
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go (1)

826-830: ⚠️ Potential issue | 🟠 Major

Clear Status.KubeadminPassword when the secret is missing.

On IsNotFound, this leaves any previously published ref untouched, so status can continue pointing at a deleted kubeadmin-password secret.

Suggested fix
 	if err := r.Get(ctx, client.ObjectKeyFromObject(kubeadminPassword), kubeadminPassword); err != nil {
 		if !apierrors.IsNotFound(err) {
 			return fmt.Errorf("failed to get kubeadmin password: %w", err)
 		}
+		hostedControlPlane.Status.KubeadminPassword = nil
 	} else {
 		hostedControlPlane.Status.KubeadminPassword = &corev1.LocalObjectReference{
 			Name: kubeadminPassword.Name,
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`
around lines 826 - 830, When r.Get(ctx,
client.ObjectKeyFromObject(kubeadminPassword), kubeadminPassword) returns
apierrors.IsNotFound, clear the published reference on the HostedControlPlane
status (e.g. set hostedControlPlane.Status.KubeadminPassword = "" or nil) and
persist that change with r.Status().Update(ctx, hostedControlPlane), handling
any update error; keep the existing behavior for non-NotFound errors and for the
else branch where the secret exists. Ensure you reference the kubeadminPassword
variable, r.Get call, and Status.KubeadminPassword field when making the change.
🧹 Nitpick comments (3)
cmd/cluster/core/create.go (1)

347-352: Avoid duplicate KAS DNS validation in construction path.

KubeAPIServerDNSName is already validated in validateCapabilities (Line 838+). Re-validating here increases drift risk; keep one validation source and just assign here.

Refactor suggestion
 if len(opts.KubeAPIServerDNSName) > 0 {
-	if err := validation.IsDNS1123Subdomain(opts.KubeAPIServerDNSName); len(err) > 0 {
-		return nil, fmt.Errorf("KubeAPIServerDNSName failed DNS validation: %s", strings.Join(err[:], " "))
-	}
 	prototype.Cluster.Spec.KubeAPIServerDNSName = opts.KubeAPIServerDNSName
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/core/create.go` around lines 347 - 352, Remove the duplicate DNS
validation here: instead of calling validation.IsDNS1123Subdomain on
opts.KubeAPIServerDNSName in the create path, simply assign
opts.KubeAPIServerDNSName to prototype.Cluster.Spec.KubeAPIServerDNSName when
non-empty; rely on the existing validation already performed in
validateCapabilities (which validates KubeAPIServerDNSName), so delete the
validation block around opts.KubeAPIServerDNSName in the create logic and keep
only the assignment guard (if len(opts.KubeAPIServerDNSName) > 0 {
prototype.Cluster.Spec.KubeAPIServerDNSName = opts.KubeAPIServerDNSName }).
hypershift-operator/controllers/hostedcluster/network_policies_test.go (1)

686-686: Standardize on t.Context() for consistency.

This test file mixes context.Background() (lines 119, 561, 1445) with t.Context() (lines 686, 798, 925, 1070, 1149, 1224, 1299). Since the project targets Go 1.25.3 (which supports t.Context() without issue) and t.Context() is already used elsewhere in the codebase, consider using t.Context() consistently throughout this file for modern Go testing practices.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/network_policies_test.go` at
line 686, Replace uses of context.Background() in this test file with
t.Context() for consistency; locate the test functions that call
reconciler.getManagementClusterNetwork and other helper/verification calls and
change their ctx argument from context.Background() to t.Context() so all tests
use the same testing context; ensure imports remain unchanged and run `go test
./...` to confirm no context-related failures.
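
A minimal sketch of the convention being suggested, assuming Go 1.24+ where testing.T.Context() is available; the test name and body are hypothetical, not code from network_policies_test.go:

package example

import "testing"

// TestWithTestContext illustrates the t.Context() convention: the returned
// context is tied to the test's lifetime and is cancelled automatically before
// cleanup runs, so context.Background() and a manual cancel are unnecessary.
func TestWithTestContext(t *testing.T) {
	t.Parallel()

	ctx := t.Context()

	// In the real tests this ctx would be passed to helpers such as
	// reconciler.getManagementClusterNetwork(ctx, ...).
	if ctx.Err() != nil {
		t.Fatalf("context should still be live inside the test body: %v", ctx.Err())
	}
}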
control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go (1)

1238-1242: Minor: Inconsistent t.Parallel() placement.

In this test, NewGomegaWithT(t) is called before t.Parallel(), while other table-driven tests in this file (e.g., lines 653, 680) create Gomega after declaring the test parallel. Consider reordering for consistency:

♻️ Suggested reorder
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
-			g := NewGomegaWithT(t)
 			t.Parallel()
+			g := NewGomegaWithT(t)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go`
around lines 1238 - 1242, Move the call to t.Parallel() to run before creating
the Gomega instance so the test goroutine is marked parallel immediately;
specifically, inside the table-driven loop over tests (the t.Run closure that
currently calls NewGomegaWithT(t) then t.Parallel()), call t.Parallel() first
and then create g := NewGomegaWithT(t) to match the other table-driven tests and
avoid racey Gomega setup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/cluster/core/create.go`:
- Around line 373-381: The parseKeyValuePairs function currently accepts entries
like "=value" which produce empty keys; update parseKeyValuePairs to validate
that pair[0] (the key) is non-empty after splitting and return a clear error
(e.g., fmt.Errorf("invalid %s: %s (empty key)", kind, s)) when it is empty so
CLI parsing of --annotations/--labels fails fast and prevents invalid metadata
downstream.
- Around line 468-476: The code currently returns early when opts.PausedUntil ==
"true", so cluster.Spec.PausedUntil never gets set; change the logic in the
handler that processes opts.PausedUntil so that an empty string still returns
nil, but when opts.PausedUntil == "true" you set cluster.Spec.PausedUntil =
&opts.PausedUntil and return nil; otherwise validate with
time.Parse(time.RFC3339, opts.PausedUntil) and on success assign
cluster.Spec.PausedUntil = &opts.PausedUntil (and on parse error return the
formatted error). Ensure you update the block referencing opts.PausedUntil and
cluster.Spec.PausedUntil accordingly.
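
For illustration, a minimal Go sketch of the empty-key validation described in the first item above; the function shape and the kind parameter are assumptions, not the actual signature in cmd/cluster/core/create.go:

package core

import (
	"fmt"
	"strings"
)

// parseKeyValuePairs (sketch) parses "key=value" entries and rejects entries
// with an empty key (e.g. "=value") so invalid --annotations/--labels input
// fails fast at CLI parsing time instead of producing empty-string keys.
func parseKeyValuePairs(entries []string, kind string) (map[string]string, error) {
	out := make(map[string]string, len(entries))
	for _, s := range entries {
		pair := strings.SplitN(s, "=", 2)
		if len(pair) != 2 {
			return nil, fmt.Errorf("invalid %s: %s (expected key=value)", kind, s)
		}
		if pair[0] == "" {
			return nil, fmt.Errorf("invalid %s: %s (empty key)", kind, s)
		}
		out[pair[0]] = pair[1]
	}
	return out, nil
}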

In `@cmd/fix/dr_oidc_iam.go`:
- Around line 496-510: The S3 probe errors are being swallowed by
checkOIDCDocumentsExist; change checkOIDCDocumentsExist to return (bool, error)
and update checkOIDCState to call it accordingly (use oidcDocsExist, err :=
o.checkOIDCDocumentsExist(ctx, s3Client)); if err != nil, return that error
immediately unless it's an explicit NotFound/NoSuchKey condition which you
should translate to oidcDocsExist=false; preserve existing messaging around
o.ForceRecreate and only treat absence (not transient/access errors) as missing
so you do not proceed to mutate bucket/provider on S3 probe failures; update any
other callers of checkOIDCDocumentsExist to handle the new (bool, error)
signature.
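
A hedged sketch of the (bool, error) shape described above; the prober interface and the isNotFound classifier are placeholders standing in for the real S3 client and error handling in cmd/fix/dr_oidc_iam.go:

package example

// objectProber abstracts the S3 HEAD-style probe used to check whether an
// OIDC document exists; the real code would use its existing S3 client.
type objectProber interface {
	// Head returns nil if the object exists, an error otherwise.
	Head(bucket, key string) error
}

// checkOIDCDocumentsExist reports whether all expected documents are present.
// Only a genuine not-found is treated as absence; transient or access errors
// are returned to the caller so the command does not mutate the bucket or
// provider based on a failed probe.
func checkOIDCDocumentsExist(p objectProber, bucket string, keys []string, isNotFound func(error) bool) (bool, error) {
	for _, key := range keys {
		if err := p.Head(bucket, key); err != nil {
			if isNotFound(err) {
				return false, nil
			}
			return false, err
		}
	}
	return true, nil
}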

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 801-805: The code currently skips calling
controlPlaneComponentsAvailable once HostedControlPlaneAvailable is true
(variables availableCondition / alreadyAvailable), which prevents flipping
availability back when components later fail; remove the guard that checks
alreadyAvailable and always call r.controlPlaneComponentsAvailable(ctx,
hostedControlPlane), then use its returned componentsErr /
componentsNotAvailableMsg to set or update the HostedControlPlaneAvailable
condition (setting False with the returned message/error when components are
unavailable and True when they are all available) so availability is
re-evaluated on each reconciliation.
- Around line 724-759: When reconcileInfrastructureStatus (called as
infraStatus, err := r.reconcileInfrastructureStatus(...)) returns an error or
infraStatus.IsReady() is false, clear any previously-set infrastructure-derived
fields so stale endpoints aren't advertised: explicitly reset
hostedControlPlane.Status.ControlPlaneEndpoint (empty Host and zero Port) and
hostedControlPlane.Status.OAuthCallbackURLTemplate to empty strings in the error
branch and the "not ready" branch (the else block), leaving the current
assignments only in the infraStatus.IsReady() branch; ensure these resets occur
before setting newCondition and logging so the status accurately reflects the
lack of a reachable control plane.
- Around line 587-590: The reconcile adds a finalizer to hostedControlPlane
using controllerutil.AddFinalizer and then patches with r.Patch using
originalHCP as the merge base, but subsequent optimistic-lock status patches
(e.g., in reconcileControlPlaneVersionStatus) still use the stale
originalHCP/resourceVersion; fix by either returning immediately after
successfully adding and patching the finalizer (so later work runs on next
reconcile with fresh object) or refresh the patch base by reloading the
hostedControlPlane (e.g., DeepCopyFrom the live object or re-Get) before any
further MergeFromWithOptions/optimistic-lock status patches so their
resourceVersion is current.
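
For the finalizer item above, a minimal controller-runtime sketch of the "refresh the patch base" option; the helper name, parameters, and error messages are illustrative, not the controller's actual code:

package example

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ensureFinalizerFresh adds the finalizer via a merge patch and then re-reads
// the object, so that later optimistic-lock status patches are built against
// the current resourceVersion rather than the pre-patch copy.
func ensureFinalizerFresh(ctx context.Context, c client.Client, obj client.Object, finalizer string) error {
	if controllerutil.ContainsFinalizer(obj, finalizer) {
		return nil
	}
	base := obj.DeepCopyObject().(client.Object)
	controllerutil.AddFinalizer(obj, finalizer)
	if err := c.Patch(ctx, obj, client.MergeFrom(base)); err != nil {
		return fmt.Errorf("failed to add finalizer: %w", err)
	}
	// The patch bumped resourceVersion; refresh before building any
	// MergeFromWithOptions(..., MergeFromWithOptimisticLock{}) status patch.
	if err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj); err != nil {
		return fmt.Errorf("failed to refresh object after finalizer patch: %w", err)
	}
	return nil
}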

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`:
- Around line 773-789: The error-wrapping strings are using the wrong service
name ("openshift apiserver") for the OAuth-specific helpers; update the
fmt.Errorf messages in the reconcile block around
reconcileOpenshiftOAuthAPIServerAPIServices, the CreateOrUpdate call for
openshiftOAuthAPIServerService, and reconcileOpenshiftOAuthAPIServerEndpoints so
they reference "openshift oauth apiserver" (or "openshift oauth apiserver
service/endpoints" as appropriate) to match the actual reconcile functions and
variables (reconcileOpenshiftOAuthAPIServerAPIServices,
openshiftOAuthAPIServerService, reconcileOpenshiftOAuthAPIServerEndpoints).

In `@hypershift-operator/controllers/hostedcluster/etcd_recovery.go`:
- Around line 48-55: The code calls handleExistingEtcdRecoveryJob when
jobStatus.exists but then always proceeds to detectAndTriggerEtcdRecovery, which
can overwrite terminal EtcdRecoveryActive=False and retrigger recovery; change
the control flow so that when jobStatus.exists you short-circuit after
handleExistingEtcdRecoveryJob: if handleExistingEtcdRecoveryJob returns an error
propagate it, otherwise return immediately (do not call
detectAndTriggerEtcdRecovery) to avoid re-entering creation logic for already
completed/failed recovery jobs; update the block around jobStatus.exists,
handleExistingEtcdRecoveryJob, and the subsequent call to
detectAndTriggerEtcdRecovery accordingly.
- Around line 92-98: The current gating only updates the condition when
meta.FindStatusCondition(hcluster.Status.Conditions,
string(hyperv1.EtcdRecoveryActive)) returns nil or oldCondition.Status !=
condition.Status, which misses changes in Reason/Message; change the predicate
so you call meta.SetStatusCondition(&hcluster.Status.Conditions, condition) (and
then r.Client.Status().Update(ctx, hcluster)) whenever oldCondition is nil OR
any of oldCondition.Status, oldCondition.Reason, or oldCondition.Message differs
from condition.Status, condition.Reason, or condition.Message respectively, so
the EtcdRecoveryActive condition is replaced when reason or message changes as
well.
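
A small sketch of the condition-update predicate described in the last item; the helper is hypothetical and only illustrates comparing Status, Reason, and Message before replacing the condition:

package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// conditionNeedsUpdate returns true when the condition is new or when its
// Status, Reason, or Message differs from the currently stored condition,
// so reason/message-only changes are not silently dropped.
func conditionNeedsUpdate(conditions []metav1.Condition, desired metav1.Condition) bool {
	old := meta.FindStatusCondition(conditions, desired.Type)
	if old == nil {
		return true
	}
	return old.Status != desired.Status ||
		old.Reason != desired.Reason ||
		old.Message != desired.Message
}

When it returns true, the caller would call meta.SetStatusCondition(&hcluster.Status.Conditions, desired) and issue a single status update, matching the flow suggested above.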


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 9dfb8731-f9fa-4ee6-97da-1708a77b39c0

📥 Commits

Reviewing files that changed from the base of the PR and between c52964e and 47afd7e.

📒 Files selected for processing (58)
  • .golangci.yml
  • cmd/cluster/core/create.go
  • cmd/cluster/core/create_test.go
  • cmd/fix/dr_oidc_iam.go
  • cmd/infra/aws/create.go
  • cmd/infra/azure/create.go
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/assets/hypershift_operator_test.go
  • cmd/install/install.go
  • cmd/install/install_test.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • control-plane-operator/main.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • hypershift-operator/controllers/etcdbackup/reconciler_test.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
  • hypershift-operator/controllers/hostedcluster/metrics/metrics.go
  • hypershift-operator/controllers/hostedcluster/network_policies.go
  • hypershift-operator/controllers/hostedcluster/network_policies_test.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/platform/aws/controller.go
  • hypershift-operator/controllers/platform/aws/controller_test.go
  • hypershift-operator/controllers/scheduler/aws/autoscaler.go
  • hypershift-operator/controllers/scheduler/aws/autoscaler_test.go
  • hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes.go
  • hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes_test.go
  • hypershift-operator/main.go
  • ignition-server/controllers/local_ignitionprovider.go
  • ignition-server/controllers/local_ignitionprovider_test.go
  • test/e2e/util/dump/journals.go
  • test/e2e/util/reqserving/verifycp.go
  • test/e2e/util/reqserving/verifyenv.go
  • test/e2e/util/util.go
✅ Files skipped from review due to trivial changes (8)
  • .golangci.yml
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
  • cmd/infra/azure/create.go
  • cmd/install/assets/hypershift_operator_test.go
  • cmd/cluster/core/create_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • control-plane-operator/main.go
🚧 Files skipped from review as they are similar to previous changes (10)
  • control-plane-operator/controllers/azureprivatelinkservice/controller_test.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
  • cmd/install/install.go
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/install_test.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go

Comment thread cmd/cluster/core/create.go
Comment thread cmd/cluster/core/create.go
Comment thread cmd/fix/dr_oidc_iam.go
Comment thread hypershift-operator/controllers/hostedcluster/etcd_recovery.go
Comment thread hypershift-operator/controllers/hostedcluster/etcd_recovery.go
@bryan-cox bryan-cox force-pushed the cyclo-magic branch 2 times, most recently from 9e90018 to ad30a1c on April 23, 2026 11:53
@bryan-cox
Member Author

/pipeline required

@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwbotbot

cwbotbot commented Apr 23, 2026

Test Results

e2e-aws

e2e-aks

@openshift-ci openshift-ci Bot removed the lgtm and needs-rebase labels May 7, 2026
@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

/test images

@cblecker
Member

cblecker commented May 8, 2026

/lgtm

@bryan-cox
Member Author

Hey @cblecker 👋🏻 - do you mind adding the /lgtm again? It didn't take on Friday.

@cblecker
Member

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label May 11, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@bryan-cox
Member Author

/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified label May 11, 2026
@openshift-ci-robot

@bryan-cox: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

/retest

1 similar comment
@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

/test e2e-aws

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 11, 2026

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

fixture.go:333: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
fixture.go:340: Failed to clean up 3 remaining resources for guest cluster (TestCreateClusterPrivateWithRouteKAS)
fixture.go:340: Failed to clean up 9 remaining resources for guest cluster (TestKarpenter)

Summary

Both test failures (TestCreateClusterPrivateWithRouteKAS and TestKarpenter) occurred exclusively in their /Teardown subtests due to AWS infrastructure resource cleanup timeouts. All functional subtests (ValidateHostedCluster, Main, EnsureHostedCluster) passed for both tests. The teardown failures are caused by AWS resources (EBS volumes and NLBs) not being deleted within the deadline — a known flaky infrastructure issue unrelated to the PR's gocyclo linter and cyclomatic complexity refactoring changes, which do not touch any teardown, fixture, or cluster destroy logic.

Root Cause

The root cause is an AWS infrastructure resource cleanup timeout during hosted cluster teardown — a known flaky condition in HyperShift e2e tests:

  1. TestCreateClusterPrivateWithRouteKAS/Teardown (1372s): The hosted cluster private-2bjjd was destroyed successfully (hypershift_framework.go:574: Destroyed cluster), but 3 AWS resources could not be deleted within the deadline:

    • 1 NLB (openshift-ingress/router-default)
    • 2 EBS volumes (for CAPI machines ml4xp and wp2dp)
  2. TestKarpenter/Teardown (1383s): The hosted cluster karpenter-tp84g initially failed to finalize (hostedcluster wasn't finalized, aborting delete: context deadline exceeded), was eventually destroyed on retry, but 9 AWS resources remained:

    • 7 EBS volumes (from various Karpenter nodepools: on-demand, version-test, capacity-reservation-test, instance-profile-test, arbitrary-subnet-test, plus 2 CAPI machine volumes)
    • 1 NLB (openshift-ingress/router-default)
    • Additionally, bastion destroy failed because the HostedCluster CR was already deleted

This failure is NOT related to PR #8309. The PR exclusively modifies .golangci.yml configuration and refactors controller/operator code to reduce cyclomatic complexity. It does not touch fixture.go, hypershift_framework.go, or any teardown/destroy logic. The test/e2e/util/dump/journals.go file listed in the PR had 0 additions and 0 deletions (likely a formatting-only change).

Recommendations
  1. Retry the job — These are infrastructure teardown flakes, not functional test failures. All 516 non-skipped, non-teardown tests passed. The PR changes are safe.

  2. Known flake pattern — The "Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded" pattern is a well-known HyperShift e2e flake caused by AWS eventual consistency in resource deletion (EBS volumes and NLBs are slow to detach and delete).

  3. No action needed on PR CNTRLPLANE-3324: Enable gocyclo linter and reduce cyclomatic complexity #8309 — The gocyclo linter enablement and complexity refactoring are unrelated to AWS infrastructure cleanup.

Evidence
  • Failed tests: TestCreateClusterPrivateWithRouteKAS/Teardown (1372s), TestKarpenter/Teardown (1383s)
  • Failure type: AWS infra resource cleanup timeout (context deadline exceeded)
  • Leftover resources: 3 resources (private-2bjjd: 1 NLB + 2 EBS), 9 resources (karpenter-tp84g: 1 NLB + 7 EBS + bastion)
  • All functional subtests: PASSED — ValidateHostedCluster ✓, Main ✓, EnsureHostedCluster ✓ for both tests
  • Total test results: 541 tests, 25 skipped, 4 failures (teardown-only)
  • PR #8309 scope: .golangci.yml + controller/operator cyclomatic complexity refactoring
  • PR touches teardown code?: No — does not modify fixture.go, hypershift_framework.go, or destroy logic
  • Cluster destroy status: Both clusters were eventually destroyed successfully (hypershift_framework.go:574)
  • Error source: fixture.go:333 — infra resource wait timeout; fixture.go:340 — residual resource report

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

@bryan-cox: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 6198e0f into openshift:main May 11, 2026
43 checks passed
@bryan-cox bryan-cox deleted the cyclo-magic branch May 11, 2026 19:23
bryan-cox added a commit to bryan-cox/hypershift that referenced this pull request May 12, 2026
Restore and improve comments that document hidden constraints, edge
cases, and design decisions across files affected by the gocyclo
complexity refactor in PR openshift#8309.

Changes:
- Add missing requeue comment in AWS private link deletion flow
- Restore OAuth IDP conversion comments: GitLab OIDC requirement,
  Keystone ID, IBM Cloud special case, ID claim safety, challenge
  flow validation constraint
- Fix comment formatting (//Handle -> // Handle) in v2 OAuth IDP
- Add ARO-HCP OpenAPI spec link for Azure resource type constant
- Restore CAPI label propagation doc comment with TODO and JIRA link
  (HOSTEDCP-971) explaining why labels/taints bypass rolling upgrades
- Restore globalPS DaemonSet scheduling comment

Signed-off-by: Bryan Cox <brcox@redhat.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/api: Indicates the PR includes changes for the API
  • area/cli: Indicates the PR includes changes for CLI
  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • area/platform/aws: PR/issue for AWS (AWSPlatform) platform
  • area/platform/azure: PR/issue for Azure (AzurePlatform) platform
  • area/testing: Indicates the PR includes changes for e2e testing
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria
