Skip to content

CNTRLPLANE-3222: Port v1 lifecycle tests to v2 Ginkgo framework#8527

Draft
bryan-cox wants to merge 9 commits into
openshift:mainfrom
bryan-cox:CNTRLPLANE-3222
Draft

CNTRLPLANE-3222: Port v1 lifecycle tests to v2 Ginkgo framework#8527
bryan-cox wants to merge 9 commits into
openshift:mainfrom
bryan-cox:CNTRLPLANE-3222

Conversation

@bryan-cox
Copy link
Copy Markdown
Member

@bryan-cox bryan-cox commented May 15, 2026

Summary

Ports 4 remaining v1 lifecycle tests to the v2 Ginkgo framework with platform-agnostic labels, enabling them to run in the Azure self-managed CI pipeline (and any future platform CI).

  • NodePool autoscaling (nodepool-autoscaling): scale-up/down and multi-NodePool balancing
  • NodePool lifecycle (nodepool-lifecycle): 13 sub-tests covering machineconfig, NTO, upgrades (replace/in-place/rolling), previous releases, mirror configs, trust bundles, performance profiles, auto-repair, and disk encryption
  • Control plane upgrade (control-plane-upgrade): N-1 → latest release upgrade with component and version rollout verification
  • Etcd chaos (etcd-chaos): 5 sub-tests for single member recovery, random/all member kills, WAL corruption, and missing member recovery

Also includes:

  • Widened 12 e2eutil helper functions from *testing.T to testing.TB for Ginkgo compatibility (GinkgoTB() returns testing.TB, not *testing.T)
  • create-guests binary for CI cluster creation (wraps hypershift CLI, waits for availability)
  • 6 new env var registrations for release images, Azure creds, and DES ID

Design decisions

  • Upgrade + etcd-chaos share a single HA cluster (created at N-1, upgraded, then chaos-tested)
  • Autoscaling + lifecycle run on the existing public cluster
  • AutoRepair and DiskEncryption are skeleton tests (need cloud SDK dependencies)
  • Functions using t.Run() (e.g. EnsureNoCrashingPods) remain as TODOs since testing.TB doesn't expose Run()

Companion PR needed: openshift/release changes for CI wiring (create/destroy/dump scripts, label filters, job config)

Test plan

  • make verify passes
  • make e2e builds clean (v1 backwards-compatible after testing.TB change)
  • make e2ev2 builds clean
  • go build -tags e2ev2 ./test/e2e/v2/cmd/create-guests/ compiles
  • go vet -tags e2ev2 ./test/e2e/v2/... clean
  • CI: /test e2e-azure-v2-self-managed (existing tests still pass)
  • CI: full lifecycle test run after release repo companion PR lands

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests

    • Expanded e2e coverage with suites for control-plane upgrades, etcd chaos, nodepool autoscaling, and comprehensive nodepool lifecycle scenarios; registered many new test cases and helpers.
  • New Features

    • Added v2 e2e helper programs to create, run, destroy, and collect dumps from guest clusters; introduced a test orchestrator to run groups of tests concurrently and sequence upgrade/chaos flows.
  • Chores

    • Added Makefile targets, Docker image updates, and new environment variables to support the v2 e2e workflows.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 15, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 15, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 15, 2026

@bryan-cox: This pull request references CNTRLPLANE-3222 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Ports 4 remaining v1 lifecycle tests to the v2 Ginkgo framework with platform-agnostic labels, enabling them to run in the Azure self-managed CI pipeline (and any future platform CI).

  • NodePool autoscaling (nodepool-autoscaling): scale-up/down and multi-NodePool balancing
  • NodePool lifecycle (nodepool-lifecycle): 13 sub-tests covering machineconfig, NTO, upgrades (replace/in-place/rolling), previous releases, mirror configs, trust bundles, performance profiles, auto-repair, and disk encryption
  • Control plane upgrade (control-plane-upgrade): N-1 → latest release upgrade with component and version rollout verification
  • Etcd chaos (etcd-chaos): 5 sub-tests for single member recovery, random/all member kills, WAL corruption, and missing member recovery

Also includes:

  • Widened 12 e2eutil helper functions from *testing.T to testing.TB for Ginkgo compatibility (GinkgoTB() returns testing.TB, not *testing.T)
  • create-guests binary for CI cluster creation (wraps hypershift CLI, waits for availability)
  • 6 new env var registrations for release images, Azure creds, and DES ID

Design decisions

  • Upgrade + etcd-chaos share a single HA cluster (created at N-1, upgraded, then chaos-tested)
  • Autoscaling + lifecycle run on the existing public cluster
  • AutoRepair and DiskEncryption are skeleton tests (need cloud SDK dependencies)
  • Functions using t.Run() (e.g. EnsureNoCrashingPods) remain as TODOs since testing.TB doesn't expose Run()

Companion PR needed: openshift/release changes for CI wiring (create/destroy/dump scripts, label filters, job config)

Test plan

  • make verify passes
  • make e2e builds clean (v1 backwards-compatible after testing.TB change)
  • make e2ev2 builds clean
  • go build -tags e2ev2 ./test/e2e/v2/cmd/create-guests/ compiles
  • go vet -tags e2ev2 ./test/e2e/v2/... clean
  • CI: /test e2e-azure-v2-self-managed (existing tests still pass)
  • CI: full lifecycle test run after release repo companion PR lands

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/needs-area area/ai Indicates the PR includes changes related to AI - Claude agents, Cursor rules, etc. area/api Indicates the PR includes changes for the API area/ci-tooling Indicates the PR includes changes for CI or tooling area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/control-plane-pki-operator Indicates the PR includes changes for the control plane PKI operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 15, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds v2 Ginkgo E2E suites and orchestration tooling: control-plane upgrade, etcd chaos (five scenarios), nodepool autoscaling, and nodepool lifecycle tests. Widened e2e helper signatures to testing.TB, registered additional lifecycle env vars, added e2ev2 CLI binaries (create-guests, run-tests, destroy-guests, dump-guests), and wired Makefile and Dockerfile to build and package those binaries.

Sequence Diagram(s)

sequenceDiagram
  participant CreateCLI as create-guests
  participant HypershiftCLI as hypershift
  participant MgmtAPI as ManagementCluster API
  participant HostedCluster as HostedCluster CR

  CreateCLI->>HypershiftCLI: exec "hypershift create cluster azure" (args per variant)
  HypershiftCLI->>MgmtAPI: submit HostedCluster CR / related resources
  MgmtAPI->>HostedCluster: create/update HostedCluster
  HostedCluster->>MgmtAPI: status updates (Available, Version History)
  MgmtAPI->>CreateCLI: status observed via controller-runtime watch
Loading

Possibly related PRs

  • openshift/hypershift#7152: Prior work on the v2 ginkgo-based e2e-v2 test infrastructure used by the new orchestration and env var wiring.

Suggested reviewers

  • sjenning
  • jparrill
🚥 Pre-merge checks | ✅ 7 | ❌ 5

❌ Failed checks (5 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.09% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning One Expect assertion lacks a meaningful error message in control_plane_upgrade_test.go:68. Additionally, cleanup patterns are inconsistent between tests—some use defer while others use DeferCleanup. Add assertion message to line 68 Expect call. Standardize cleanup patterns to use DeferCleanup consistently throughout tests, which is idiomatic for Ginkgo v2.
Microshift Test Compatibility ⚠️ Warning Tests use MicroShift-unavailable APIs (MachineConfig, KubeletConfig, PerformanceProfile, etcd pods) without protection mechanisms like [Skipped:MicroShift] or [apigroup:...] labels. Add [Skipped:MicroShift] or [apigroup:...] tags to tests using unavailable APIs: MachineConfig, KubeletConfig, PerformanceProfile, and etcd chaos tests.
Single Node Openshift (Sno) Test Compatibility ⚠️ Warning Tests assume multi-node/HA clusters without SNO skip guards: NodePoolRollingUpgradeTest (2 replicas), AutoscalingTests (3-4 nodes), EtcdChaosTests (multi-member etcd). Add [Skipped:SingleReplicaTopology] labels or exutil.IsSingleNode() skip checks to: NodePoolRollingUpgradeTest, AutoscalingScaleUpDownTest, AutoscalingBalancingTest, all EtcdChaosTests (5 tests).
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning The test in nodepool_lifecycle_test.go (lines 1283, 1354) hardcodes external registry.access.redhat.com image pulls, failing IPv6-only disconnected CI requirements. Replace hardcoded registry.access.redhat.com/ubi9/ubi-minimal:latest with environment variable or cluster-internal registry reference for DaemonSet verification images to support disconnected environments.
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: porting v1 lifecycle tests to v2 Ginkgo framework. It is specific, concise, and directly relates to the primary objective.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All Ginkgo test titles are stable. No dynamic information (UUIDs, timestamps, pod/node names, IP addresses) found in test names. All titles use static semantic descriptions.
Topology-Aware Scheduling Compatibility ✅ Passed Not applicable: PR contains only test code and build system changes. Check requires "deployment manifests, operator code, or controllers"—none present here.
Ote Binary Stdout Contract ✅ Passed No violations found. test-e2e-v2 redirects all process-level output to stderr via PrintEnvVarHelp() to os.Stderr and BeforeSuite() zap logging. New test files have no problematic stdout writes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8527 May 15, 2026 13:13 Inactive
@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.10%. Comparing base (10fd799) to head (b6a8b20).
⚠️ Report is 20 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8527      +/-   ##
==========================================
+ Coverage   40.07%   40.10%   +0.02%     
==========================================
  Files         751      755       +4     
  Lines       92863    93018     +155     
==========================================
+ Hits        37215    37303      +88     
- Misses      52956    53016      +60     
- Partials     2692     2699       +7     

see 9 files with indirect coverage changes

Flag Coverage Δ
cmd-support 34.28% <ø> (-0.03%) ⬇️
cpo-hostedcontrolplane 40.58% <ø> (+0.02%) ⬆️
cpo-other 40.14% <ø> (ø)
hypershift-operator 50.61% <ø> (+0.09%) ⬆️
other 31.54% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🧹 Nitpick comments (9)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go (2)

1543-1544: 💤 Low value

Misleading %w in log message.

The %w verb is a fmt directive for error wrapping, but r.Log.Error() doesn't perform formatting on the message string. The error is already passed as the first argument and will be logged correctly; the %w is printed literally.

Suggested fix
 if err != nil {
-    r.Log.Error(err, "failed to remove service ca annotation and secret: %w")
+    r.Log.Error(err, "failed to remove service ca annotation and secret")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`
around lines 1543 - 1544, The log call in hostedcontrolplane_controller.go uses
the fmt verb "%w" in the message string (`r.Log.Error(err, "failed to remove
service ca annotation and secret: %w")`), which is meaningless for the logger
and will be printed literally; fix it by removing the "%w" from the message so
the error is passed as the first argument and logged correctly (e.g.,
`r.Log.Error(err, "failed to remove service ca annotation and secret")`), or
alternatively keep the error and add structured context via key/value pairs to
the same `r.Log.Error` call.

1721-1729: 💤 Low value

Same %w issue and unnecessary intermediate variable.

The log messages at lines 1722 and 1755 have the same misleading %w issue as flagged earlier. Additionally, the z variable at lines 1725-1726 and 1757-1759 is unnecessary—the reconcile function result can be returned directly.

Suggested fixes
 if err := removeServiceCAAnnotationAndSecret(ctx, r.Client, AzureDiskCsiDriverOperatorService, AzureDiskCsiDriverOperatorServingCert); err != nil {
-    r.Log.Error(err, "failed to remove service ca annotation and secret: %w")
+    r.Log.Error(err, "failed to remove service ca annotation and secret")
 }
 if _, err := createOrUpdate(ctx, r, AzureDiskCsiDriverOperatorServingCert, func() error {
-    z := pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(AzureDiskCsiDriverOperatorServingCert, rootCASecret, p.OwnerRef)
-    return z
+    return pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(AzureDiskCsiDriverOperatorServingCert, rootCASecret, p.OwnerRef)
 }); err != nil {

Apply the same pattern at lines 1754-1759 for the Azure File CSI driver operator.

Also applies to: 1752-1762

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`
around lines 1721 - 1729, The r.Log.Error calls use a printf-style %w
placeholder incorrectly and should be called as r.Log.Error(err, "failed to
remove service ca annotation and secret") (i.e., pass the error as the first
argument to r.Log.Error without formatting), and the reconciliation closures
create an unnecessary intermediate variable (z); replace constructs like "z :=
pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(...); return z"
with "return
pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(...)" inside the
createOrUpdate call (apply the same fixes for AzureDiskCsiDriverOperator* and
the analogous AzureFileCsiDriverOperator* usages, and for functions
removeServiceCAAnnotationAndSecret, createOrUpdate, and
pki.ReconcileAzure...ServingCertSecret).
api/karpenter/v1/kubelet_config.go (1)

50-50: 💤 Low value

Consider simplifying or documenting the complex eviction threshold validation.

The XValidation rule on line 50 is correct but extremely complex (>300 characters). It validates that evictionSoft >= evictionHard for matching signals, handling both percentage strings and quantity types. While the logic is sound, this level of complexity in a single CEL expression makes it difficult to:

  • Review for correctness
  • Debug when validation fails
  • Maintain if requirements change

Consider adding an inline comment above the validation explaining the high-level intent, or explore whether this could be split into separate validation rules (though CEL may not permit that given the cross-field nature).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/karpenter/v1/kubelet_config.go` at line 50, The XValidation CEL
expression on the kubebuilder annotation that enforces evictionSoft >=
evictionHard for matching signals is correct but unreadable; add a clear inline
comment directly above the annotation describing the high-level intent (e.g.,
"Ensure evictionSoft thresholds are greater than or equal to evictionHard for
the same signal, handling percentage strings and resource quantities") and
briefly note the two cases handled (percentage strings vs. resource quantities),
or alternatively factor the check into multiple, named XValidation annotations
if feasible; reference the annotation (the +kubebuilder:validation:XValidation
line) and the related fields evictionSoft and evictionHard so reviewers can find
and understand the rule quickly.
api/hypershift/v1beta1/hostedcluster_types.go (1)

676-681: ⚡ Quick win

Clarify pull secret propagation mechanism for in-place updates.

Lines 678-680 state that updating the Secret's data in place "does not trigger that rollout," but then immediately say the changes "will still propagate the updated credentials down to the guest cluster and kubelet config" for AWS/Azure Replace strategy. This appears contradictory—how do the credentials propagate without triggering a rollout?

Consider clarifying:

  • Does "propagate" mean existing nodes are updated in place without replacement?
  • Is this via a separate controller watching the Secret?
  • Does "rollout" specifically mean node replacement, while propagation is a different mechanism?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/hypershift/v1beta1/hostedcluster_types.go` around lines 676 - 681, The
comment is ambiguous about how Secret data changes "do not trigger that rollout"
yet "will still propagate" for AWS/Azure Replace NodePools; update the paragraph
(the NodePool/pull secret description) to explicitly state that “rollout” refers
to NodePool replacement/in-place upgrade actions (node recreation), and that
updating the referenced Secret's data in-place does not automatically trigger
node replacement, but cloud-provider-specific controllers or credential
reconciliation (e.g., the cloud provider credential controller or kubelet secret
refresh mechanisms) will propagate updated credentials to the guest cluster and
kubelet config for AWS/Azure Replace strategy; mention whether propagation is
in-place (no node replacement) and name the controller or mechanism if
available, or note “a separate controller/watch reconciler” if unspecified, so
readers understand the distinct behaviors for secret updates vs. NodePool
rollouts.
control-plane-operator/controllers/azureprivatelinkservice/controller.go (1)

863-881: 💤 Low value

Good refactoring of deletion logic into focused helpers.

The extraction of deletion logic into dedicated helper functions improves maintainability and testability. Using azPLS.Status.DNSZoneName to avoid dependency on the HostedControlPlane during deletion is a solid design decision.

The verbose log at line 873 when DNSZoneName is not set uses log.V(1), which may not be visible by default. Consider whether this should be a regular log.Info given that it represents a potentially unexpected condition during cleanup.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/azureprivatelinkservice/controller.go`
around lines 863 - 881, The log for the case where azPLS.Status.DNSZoneName is
empty uses log.V(1).Info which may be too quiet for an unexpected cleanup
condition; update the logging in the controller deletion path (the dnsZoneName
check in the reconcile/finalizer routine that calls deleteDNSResources and
deleteBaseDomainResources) to use log.Info (or a higher visibility) instead of
log.V(1).Info so the message "DNSZoneName not set in status, skipping DNS
cleanup" is visible by default when DNSZoneName is missing during deletion.
control-plane-operator/controllers/azureprivatelinkservice/controller_test.go (1)

1747-1774: ⚡ Quick win

Consolidate duplicated errMsgQualifier coverage.

Line 1747 adds TestDNSZoneConfigErrMsgQualifier, but this is functionally duplicating TestErrMsgQualifier later in the file (same inputs and expected outputs). Keeping one avoids drift and redundant maintenance.

Suggested cleanup
-func TestDNSZoneConfigErrMsgQualifier(t *testing.T) {
-	t.Parallel()
-	tests := []struct {
-		name      string
-		logPrefix string
-		expected  string
-	}{
-		{
-			name:      "When logPrefix is empty, it should return empty string",
-			logPrefix: "",
-			expected:  "",
-		},
-		{
-			name:      "When logPrefix is set, it should return prefix followed by a space",
-			logPrefix: "base domain",
-			expected:  "base domain ",
-		},
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			t.Parallel()
-			g := NewGomegaWithT(t)
-			cfg := dnsZoneConfig{logPrefix: tt.logPrefix}
-			g.Expect(cfg.errMsgQualifier()).To(Equal(tt.expected))
-		})
-	}
-}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go`
around lines 1747 - 1774, Remove the duplicated unit test
TestDNSZoneConfigErrMsgQualifier and keep the existing TestErrMsgQualifier to
avoid redundant coverage; locate the duplicate test function named
TestDNSZoneConfigErrMsgQualifier in the controller_test.go file (it constructs
dnsZoneConfig{logPrefix: ...} and calls cfg.errMsgQualifier()) and delete that
entire test block so only the original TestErrMsgQualifier remains exercising
dnsZoneConfig.errMsgQualifier.
contrib/repo_metrics/weekly_pr_report.py (2)

965-973: ⚡ Quick win

Add strict=True to zip() for safety.

Line 965 zips repos_to_fetch and results which should have the same length by construction, but adding strict=True makes this assumption explicit and catches bugs if the invariant is violated.

Note: strict= requires Python 3.10+.

🔒 Proposed fix
-        for (owner, name), result in zip(repos_to_fetch, results):
+        for (owner, name), result in zip(repos_to_fetch, results, strict=True):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@contrib/repo_metrics/weekly_pr_report.py` around lines 965 - 973, The loop
that pairs repos_to_fetch with results uses zip(repos_to_fetch, results) and
assumes equal lengths; change it to zip(repos_to_fetch, results, strict=True) to
enforce that invariant at runtime and fail fast if lengths differ; update the
loop in the block that assigns to repo_fetch_status and repo_prs (the for
(owner, name), result in zip(...) loop) so it uses strict=True, ensuring any
mismatch between repos_to_fetch and results is caught immediately.

402-403: 💤 Low value

Consider using immutable types for class constants.

Lines 402-403 use mutable list and set for class constants that are never modified. Converting to tuple and frozenset would make the immutability explicit and silence the static analysis warning.

♻️ Proposed refactor
-    BOT_PATTERNS = ['-bot', '-robot', '[bot]']
-    BOT_LOGINS = {'coderabbitai', 'hypershift-jira-solve-ci'}
+    BOT_PATTERNS = ('-bot', '-robot', '[bot]')
+    BOT_LOGINS = frozenset({'coderabbitai', 'hypershift-jira-solve-ci'})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@contrib/repo_metrics/weekly_pr_report.py` around lines 402 - 403,
BOT_PATTERNS and BOT_LOGINS are declared as mutable list/set but never modified;
change BOT_PATTERNS = ['-bot', '-robot', '[bot]'] to an immutable tuple
BOT_PATTERNS = ('-bot', '-robot', '[bot]') and change BOT_LOGINS =
{'coderabbitai', 'hypershift-jira-solve-ci'} to a frozenset: BOT_LOGINS =
frozenset({'coderabbitai', 'hypershift-jira-solve-ci'}), and update any
membership checks (e.g., "in BOT_LOGINS" or iterating BOT_PATTERNS) which will
continue to work without other code changes.
.claude/commands/pr-report.md (1)

732-755: 💤 Low value

Consider documenting alternative environment variable names for backward compatibility.

The Python script (lines 69-70) accepts both JIRA_API_TOKEN/JIRA_USERNAME and JIRA_TOKEN/JIRA_EMAIL, but the documentation only mentions the latter. Consider adding a note about the alternative names for users with existing configurations.

📝 Suggested documentation addition

Add a note under the environment variables table:

 | `GITHUB_TOKEN` | No | GitHub token (falls back to `gh auth token` if not set) |
+
+**Note:** For backward compatibility, the script also accepts `JIRA_USERNAME` (alias for `JIRA_EMAIL`) and `JIRA_API_TOKEN` (alias for `JIRA_TOKEN`).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/commands/pr-report.md around lines 732 - 755, Update the environment
variables docs to mention the alternative legacy names accepted by the script:
add a brief note under the variables table stating that JIRA_TOKEN/JIRA_EMAIL
are equivalent to the older JIRA_API_TOKEN/JIRA_USERNAME (the script checks both
pairs around lines where env vars are read), so existing configurations using
JIRA_API_TOKEN or JIRA_USERNAME will continue to work; reference the variable
names JIRA_TOKEN, JIRA_EMAIL, JIRA_API_TOKEN, and JIRA_USERNAME in the note.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/commands/pr-report.md:
- Around line 115-119: The detection patterns in .claude/commands/pr-report.md
don’t match the actual runtime messages; update the checks to use partial or
regex matching against the real strings — e.g., match "Fetching Jira data via
REST API" (prefix of "Fetching Jira data via REST API ({len(tickets)}
tickets)...") and "Jira credentials not set" (prefix of "Jira credentials not
set (need JIRA_USERNAME/JIRA_API_TOKEN or JIRA_EMAIL/JIRA_TOKEN), loading from
cache") instead of the exact old phrases, so the script recognizes both messages
with variable content.

In `@api/.golangci.yml`:
- Around line 322-342: The review asks to fix the API types instead of adding
kubeapilinter exclusions: update the KubeletConfiguration and
OpenshiftEC2NodeClassSpec types to satisfy kube-apilinter rules—specifically
change CPUCFSQuota (in KubeletConfiguration) from a *bool to a string enum with
meaningful constants; add a kubebuilder:validation:MinProperties marker where
the KubeletConfiguration object is used (referenced by
OpenshiftEC2NodeClassSpec.Kubelet) and make OpenshiftEC2NodeClassSpec.Kubelet a
pointer if it must be optional to avoid the optionalfields warning; replace map
values that use Go Durations (EvictionSoftGracePeriod) with integer types whose
names include units (e.g., EvictionSoftGracePeriodSeconds) and, for the nomaps
warnings on SystemReserved, KubeReserved, EvictionHard, EvictionSoft, refactor
map fields into well-defined structured types or resource-list fields (or
concrete slice/struct representations) so they no longer use raw maps; run make
api-lint-fix and iterate until kube-apilinter passes.

In `@api/karpenter/v1/kubelet_config.go`:
- Around line 196-212: HasTypedFields currently lists typed fields manually
(KubeletConfiguration.HasTypedFields) and can get out of sync with the struct,
causing IsZero/omitzero bugs; replace the manual checks with a reflection-based
implementation or add a unit test that fails if any exported non-overflow struct
field is not checked: implement HasTypedFields by reflecting over
KubeletConfiguration, skipping overflow/XXX fields and detecting any non-zero
typed field (or alternatively write a test that compares the set of field names
in kubeletConfigKnownFields/reflection against the ones asserted in
HasTypedFields to force updates), and ensure IsZero/omitzero behavior remains
correct for all current and future typed fields.

In `@cmd/cluster/core/create.go`:
- Around line 476-484: The code currently returns early when opts.PausedUntil ==
"true" so HostedCluster.Spec.PausedUntil is never set; change the logic in the
function handling opts.PausedUntil to: if opts.PausedUntil is empty return nil;
if opts.PausedUntil == "true" set cluster.Spec.PausedUntil = &opts.PausedUntil
and return nil; otherwise parse opts.PausedUntil with time.RFC3339, return a
wrapped error on parse failure and set cluster.Spec.PausedUntil =
&opts.PausedUntil on success (update the unit test expectation for
HostedCluster.Spec.PausedUntil when "--pausedUntil=true" accordingly).

In `@cmd/fix/dr_oidc_iam.go`:
- Around line 524-568: Add unit tests covering the new OIDC helper branch logic:
write table-driven unit tests for ensureOIDCDocuments and ensureOIDCProvider
that exercise combinations of DryRun true/false, ForceRecreate true/false, and
existing-state flags (oidcDocsExist and exists) to assert correct behavior
(no-op, dry-run messages, calls into generateAndUploadOIDCDocuments,
deleteOIDCProviderIfExists, and createOIDCProvider). Mock/stub client
interactions (k8sClient, s3Client, iamClient) and the helper methods
generateAndUploadOIDCDocuments, deleteOIDCProviderIfExists, createOIDCProvider
to verify they are or are not invoked, and assert returned errors are
propagated/wrapped as expected; include tests for success and error propagation
paths.
- Around line 496-511: The check currently collapses all S3 HeadObject failures
into a false result; change checkOIDCDocumentsExist to return (bool, error)
instead of bool, have it return (false, nil) only for a genuine NotFound/404 and
return (false, err) for AccessDenied/network/transient errors so callers can
distinguish causes, then update checkOIDCState to handle the propagated error
from checkOIDCDocumentsExist (inspect error != nil to log/return the error or
abort rather than printing "documents missing") while preserving the existing
branches for oidcDocsExist and o.ForceRecreate; update any call sites of
checkOIDCDocumentsExist accordingly.

In `@Containerfile.cli`:
- Line 22: The Docker base image reference "FROM
registry.redhat.io/ubi9/nginx-124:latest" is using the mutable :latest tag;
update that FROM line to pin to a specific immutable version tag or image digest
(e.g., a released version tag or sha256 digest) so builds are reproducible and
stable—locate the FROM statement in Containerfile.cli and replace ":latest" with
the chosen version tag or digest, then commit the change and optionally add a
short comment noting the pinned version.

In `@contrib/repo_metrics/weekly_pr_report.py`:
- Around line 144-145: In _fetch_json replace the blocking time.sleep(30) with
an await asyncio.sleep(30) so the coroutine yields to the event loop; update the
call to await asyncio.sleep(30) in the same fallback path that currently does
return await self._fetch_json(...), and ensure asyncio is imported at the module
top if not already present.

In `@control-plane-operator/controllers/azureprivatelinkservice/controller.go`:
- Around line 863-1053: The new deletion helpers lack unit tests; add
table-driven unit tests for deleteDNSResources, deleteVNetLink, deleteDNSZone,
deleteBaseDomainResources, deleteBaseDomainARecords, and deletePrivateEndpoint
(and exercise hasSiblingCR behavior) that cover: (1) Azure API failures
returning errors and asserting the wrapper returns the expected formatted error,
(2) NotFound responses are treated as success (no error), (3) correct resource
name derivation by asserting the Azure client is called with expected names
(e.g., kasARecordName/appsARecordName,
vnetLinkName(crName)/baseDomainVNetLinkName, privateEndpointName(crName), and
base domain A record prefixes), and (4) sibling-CR detection flow where
deleteBaseDomainResources deletes or skips the base zone/records depending on
hasSiblingCR; implement mocks/stubs for RecordSets, VirtualNetworkLinks,
PrivateDNSZones, PrivateEndpoints, and the hasSiblingCR dependency and use
context timeouts/poller behavior to simulate BeginDelete/PollUntilDone success,
error, and nil-poller cases.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go`:
- Around line 61-72: The helper assertCertVolumeCount should fail fast when the
returned slice lengths don't match expectations to avoid downstream panics: in
function assertCertVolumeCount (which calls certVolumesFromMonitors) replace the
t.Errorf checks for volume/mount count mismatches with t.Fatalf (or otherwise
return immediately with t.Fatalf) so the test stops on precondition failures
rather than allowing callers to index invalid slices; keep the existing error
check for certVolumesFromMonitors as-is and ensure callers can rely on asserted
counts when the helper returns.

---

Nitpick comments:
In @.claude/commands/pr-report.md:
- Around line 732-755: Update the environment variables docs to mention the
alternative legacy names accepted by the script: add a brief note under the
variables table stating that JIRA_TOKEN/JIRA_EMAIL are equivalent to the older
JIRA_API_TOKEN/JIRA_USERNAME (the script checks both pairs around lines where
env vars are read), so existing configurations using JIRA_API_TOKEN or
JIRA_USERNAME will continue to work; reference the variable names JIRA_TOKEN,
JIRA_EMAIL, JIRA_API_TOKEN, and JIRA_USERNAME in the note.

In `@api/hypershift/v1beta1/hostedcluster_types.go`:
- Around line 676-681: The comment is ambiguous about how Secret data changes
"do not trigger that rollout" yet "will still propagate" for AWS/Azure Replace
NodePools; update the paragraph (the NodePool/pull secret description) to
explicitly state that “rollout” refers to NodePool replacement/in-place upgrade
actions (node recreation), and that updating the referenced Secret's data
in-place does not automatically trigger node replacement, but
cloud-provider-specific controllers or credential reconciliation (e.g., the
cloud provider credential controller or kubelet secret refresh mechanisms) will
propagate updated credentials to the guest cluster and kubelet config for
AWS/Azure Replace strategy; mention whether propagation is in-place (no node
replacement) and name the controller or mechanism if available, or note “a
separate controller/watch reconciler” if unspecified, so readers understand the
distinct behaviors for secret updates vs. NodePool rollouts.

In `@api/karpenter/v1/kubelet_config.go`:
- Line 50: The XValidation CEL expression on the kubebuilder annotation that
enforces evictionSoft >= evictionHard for matching signals is correct but
unreadable; add a clear inline comment directly above the annotation describing
the high-level intent (e.g., "Ensure evictionSoft thresholds are greater than or
equal to evictionHard for the same signal, handling percentage strings and
resource quantities") and briefly note the two cases handled (percentage strings
vs. resource quantities), or alternatively factor the check into multiple, named
XValidation annotations if feasible; reference the annotation (the
+kubebuilder:validation:XValidation line) and the related fields evictionSoft
and evictionHard so reviewers can find and understand the rule quickly.

In `@contrib/repo_metrics/weekly_pr_report.py`:
- Around line 965-973: The loop that pairs repos_to_fetch with results uses
zip(repos_to_fetch, results) and assumes equal lengths; change it to
zip(repos_to_fetch, results, strict=True) to enforce that invariant at runtime
and fail fast if lengths differ; update the loop in the block that assigns to
repo_fetch_status and repo_prs (the for (owner, name), result in zip(...) loop)
so it uses strict=True, ensuring any mismatch between repos_to_fetch and results
is caught immediately.
- Around line 402-403: BOT_PATTERNS and BOT_LOGINS are declared as mutable
list/set but never modified; change BOT_PATTERNS = ['-bot', '-robot', '[bot]']
to an immutable tuple BOT_PATTERNS = ('-bot', '-robot', '[bot]') and change
BOT_LOGINS = {'coderabbitai', 'hypershift-jira-solve-ci'} to a frozenset:
BOT_LOGINS = frozenset({'coderabbitai', 'hypershift-jira-solve-ci'}), and update
any membership checks (e.g., "in BOT_LOGINS" or iterating BOT_PATTERNS) which
will continue to work without other code changes.

In
`@control-plane-operator/controllers/azureprivatelinkservice/controller_test.go`:
- Around line 1747-1774: Remove the duplicated unit test
TestDNSZoneConfigErrMsgQualifier and keep the existing TestErrMsgQualifier to
avoid redundant coverage; locate the duplicate test function named
TestDNSZoneConfigErrMsgQualifier in the controller_test.go file (it constructs
dnsZoneConfig{logPrefix: ...} and calls cfg.errMsgQualifier()) and delete that
entire test block so only the original TestErrMsgQualifier remains exercising
dnsZoneConfig.errMsgQualifier.

In `@control-plane-operator/controllers/azureprivatelinkservice/controller.go`:
- Around line 863-881: The log for the case where azPLS.Status.DNSZoneName is
empty uses log.V(1).Info which may be too quiet for an unexpected cleanup
condition; update the logging in the controller deletion path (the dnsZoneName
check in the reconcile/finalizer routine that calls deleteDNSResources and
deleteBaseDomainResources) to use log.Info (or a higher visibility) instead of
log.V(1).Info so the message "DNSZoneName not set in status, skipping DNS
cleanup" is visible by default when DNSZoneName is missing during deletion.

In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 1543-1544: The log call in hostedcontrolplane_controller.go uses
the fmt verb "%w" in the message string (`r.Log.Error(err, "failed to remove
service ca annotation and secret: %w")`), which is meaningless for the logger
and will be printed literally; fix it by removing the "%w" from the message so
the error is passed as the first argument and logged correctly (e.g.,
`r.Log.Error(err, "failed to remove service ca annotation and secret")`), or
alternatively keep the error and add structured context via key/value pairs to
the same `r.Log.Error` call.
- Around line 1721-1729: The r.Log.Error calls use a printf-style %w placeholder
incorrectly and should be called as r.Log.Error(err, "failed to remove service
ca annotation and secret") (i.e., pass the error as the first argument to
r.Log.Error without formatting), and the reconciliation closures create an
unnecessary intermediate variable (z); replace constructs like "z :=
pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(...); return z"
with "return
pki.ReconcileAzureDiskCsiDriverOperatorMetricsServingCertSecret(...)" inside the
createOrUpdate call (apply the same fixes for AzureDiskCsiDriverOperator* and
the analogous AzureFileCsiDriverOperator* usages, and for functions
removeServiceCAAnnotationAndSecret, createOrUpdate, and
pki.ReconcileAzure...ServingCertSecret).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 36192739-e1b9-421b-85fb-9abe5b9a217a

📥 Commits

Reviewing files that changed from the base of the PR and between 10fd799 and beea747.

⛔ Files ignored due to path filters (37)
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterUpdateAcceptRisks.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ClusterVersionOperatorConfiguration.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDC.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUIDAndExtraClaimMappings.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ExternalOIDCWithUpstreamParity.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HCPEtcdBackup.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/HyperShiftOnlyDynamicResourceAllocation.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/ImageStreamImportMode.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/KMSEncryptionProvider.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/hostedclusters.hypershift.openshift.io/TLSAdherence.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/karpenter/v1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • client/applyconfiguration/karpenter/v1/kubeletconfiguration.go is excluded by !client/**
  • client/applyconfiguration/karpenter/v1/openshiftec2nodeclassspec.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/hostedclusters-Hypershift-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_Private_KAS_LoadBalancer.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_Private_Route.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_PublicAndPrivate_Route.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/infra/testdata/zz_fixture_TestReconcileInfrastructure_AWS_Public_Route.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/AROSwift/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/GCP/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • karpenter-operator/controllers/karpenter/assets/karpenter.hypershift.openshift.io_openshiftec2nodeclasses.yaml is excluded by !karpenter-operator/controllers/karpenter/assets/*.yaml
  • karpenter-operator/controllers/karpenter/assets/zz_generated.crd-manifests/openshiftec2nodeclasses.crd.yaml is excluded by !**/zz_generated.crd-manifests/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1/karpenter_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1/kubelet_config.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/karpenter/v1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
📒 Files selected for processing (156)
  • .claude/agents/api-sme.md
  • .claude/commands/pr-report.md
  • .claude/rules/webhook-validation.md
  • .github/workflows/dependabot-commit-fix-reusable.yaml
  • .github/workflows/docs-deploy.yaml
  • .golangci.yml
  • .tekton/hypershift-gomaxprocs-webhook-pull-request.yaml
  • .tekton/hypershift-gomaxprocs-webhook-push.yaml
  • AGENTS.md
  • CLAUDE.md
  • CLAUDE.md
  • Containerfile.cli
  • Containerfile.control-plane
  • Containerfile.operator
  • Makefile
  • api/.golangci.yml
  • api/AGENTS.md
  • api/CLAUDE.md
  • api/hypershift/v1beta1/hostedcluster_types.go
  • api/karpenter/v1/karpenter_types.go
  • api/karpenter/v1/kubelet_config.go
  • api/karpenter/v1/kubelet_config_test.go
  • cmd/cluster/core/create.go
  • cmd/cluster/core/create_test.go
  • cmd/fix/dr_oidc_iam.go
  • cmd/infra/aws/create.go
  • cmd/infra/azure/create.go
  • cmd/infra/gcp/iam-bindings.json
  • cmd/infra/gcp/iam.go
  • cmd/infra/gcp/iam_test.go
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/assets/hypershift_operator_test.go
  • cmd/install/install.go
  • cmd/install/install_render.go
  • cmd/install/install_render_test.go
  • cmd/install/install_test.go
  • contrib/ai/adding-marketplace-plugins.md
  • contrib/repo_metrics/weekly_pr_report.py
  • control-plane-operator/CLAUDE.md
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go
  • control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller_test.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller.go
  • control-plane-operator/controllers/azureprivatelinkservice/controller_test.go
  • control-plane-operator/controllers/gcpprivateserviceconnect/observer.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go
  • control-plane-operator/controllers/hostedcontrolplane/kas/service.go
  • control-plane-operator/controllers/hostedcontrolplane/kas/service_test.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert.go
  • control-plane-operator/controllers/hostedcontrolplane/oauth/idp_convert_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/control-plane-pki-operator/role.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/auth.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/auth_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/deployment_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/metrics_proxy/scrape_config_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth/idp_convert_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/drainer/drainer.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/globalps/globalps.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/globalps/setup.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/globalps/setup_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/nodecount/controller.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/kas/admissionpolicies.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/registry/admissionpolicies.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • control-plane-operator/main.go
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
  • docs/content/how-to/ci/docs-preview.md
  • docs/content/how-to/ci/github-actions.md
  • docs/content/how-to/common/global-pull-secret.md
  • docs/content/how-to/gcp/configure-image-registry.md
  • docs/content/how-to/gcp/create-gcp-hosted-cluster.md
  • docs/content/how-to/gcp/index.md
  • docs/content/how-to/kubevirt/configuring-vm-with-jsonpatch.md
  • docs/mkdocs.yml
  • hack/github-actions-runner/README.md
  • hack/github-actions-runner/cache-warming-cronjob.yaml
  • hack/kubelet-ratcheting-gen/main.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • hypershift-operator/controllers/etcdbackup/reconciler_test.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery.go
  • hypershift-operator/controllers/hostedcluster/etcd_recovery_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/platform.go
  • hypershift-operator/controllers/hostedcluster/karpenter.go
  • hypershift-operator/controllers/hostedcluster/karpenter_test.go
  • hypershift-operator/controllers/hostedcluster/metrics/metrics.go
  • hypershift-operator/controllers/hostedcluster/network_policies.go
  • hypershift-operator/controllers/hostedcluster/network_policies_test.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_controller_test.go
  • hypershift-operator/controllers/hostedclustersizing/hostedclustersizing_validation_controller.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/metrics/metrics.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/platform/aws/controller.go
  • hypershift-operator/controllers/platform/aws/controller_test.go
  • hypershift-operator/controllers/scheduler/aws/autoscaler.go
  • hypershift-operator/controllers/scheduler/aws/autoscaler_test.go
  • hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes.go
  • hypershift-operator/controllers/scheduler/aws/dedicated_request_serving_nodes_test.go
  • hypershift-operator/controllers/scheduler/azure/controller.go
  • hypershift-operator/main.go
  • ignition-server/controllers/local_ignitionprovider.go
  • ignition-server/controllers/local_ignitionprovider_test.go
  • karpenter-operator/controllers/karpenter/assets/tests/openshiftec2nodeclasses.karpenter.hypershift.openshift.io/stable.openshiftec2nodeclasses.kubelet-field-promotion.testsuite.yaml
  • karpenter-operator/controllers/karpenter/assets/tests/openshiftec2nodeclasses.karpenter.hypershift.openshift.io/stable.openshiftec2nodeclasses.kubelet-ratcheting.testsuite.yaml
  • karpenter-operator/controllers/karpenter/assets/tests/openshiftec2nodeclasses.karpenter.hypershift.openshift.io/stable.openshiftec2nodeclasses.kubelet.testsuite.yaml
  • karpenter-operator/controllers/karpenter/karpenter_controller.go
  • karpenter-operator/controllers/karpenter/karpenter_controller_test.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller.go
  • karpenter-operator/controllers/karpenterignition/karpenterignition_controller_test.go
  • karpenter-operator/controllers/nodeclass/ec2_nodeclass_controller.go
  • karpenter-operator/controllers/nodeclass/karpenter_util.go
  • karpenter-operator/controllers/nodeclass/karpenter_util_test.go
  • support/api/scheme.go
  • support/controlplane-component/CLAUDE.md
  • support/karpenter/karpenter.go
  • support/karpenter/karpenter_test.go
  • support/podspec/containers.go
  • support/podspec/containers_test.go
  • support/validations/authentication.go
  • support/validations/authentication_test.go
  • test/e2e/create_cluster_test.go
  • test/e2e/karpenter_kubelet_checker_pod.yaml
  • test/e2e/karpenter_test.go
  • test/e2e/util/dump/journals.go
  • test/e2e/util/hypershift_framework.go
  • test/e2e/util/reqserving/verifycp.go
  • test/e2e/util/reqserving/verifyenv.go
  • test/e2e/util/util.go
  • test/e2e/v2/backuprestore/cleanup.go
  • test/e2e/v2/cmd/create-guests/main.go
  • test/e2e/v2/internal/env_vars.go
  • test/e2e/v2/tests/backup_restore_test.go
  • test/e2e/v2/tests/control_plane_upgrade_test.go
  • test/e2e/v2/tests/etcd_chaos_test.go
  • test/e2e/v2/tests/hosted_cluster_image_registry_test.go
  • test/e2e/v2/tests/nodepool_autoscaling_test.go
  • test/e2e/v2/tests/nodepool_lifecycle_test.go
  • test/integration/control_plane_pki_operator.go
💤 Files with no reviewable changes (9)
  • control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/drainer/drainer.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/registry/admissionpolicies.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/nodecount/controller.go
  • control-plane-operator/controllers/gcpprivateserviceconnect/observer.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/node/node.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/globalps/globalps.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/kas/admissionpolicies.go

Comment thread .claude/commands/pr-report.md
Comment thread api/.golangci.yml
Comment on lines +196 to +212
// HasTypedFields reports whether any explicitly defined struct fields are set.
// This is used by IsZero, but is separate so we can differentiate the zero case
// from "only overflow fields set". This must be kept in sync with the typed
// fields in KubeletConfiguration.
func (k KubeletConfiguration) HasTypedFields() bool {
return k.MaxPods != 0 ||
k.PodsPerCore != 0 ||
k.SystemReserved != nil ||
k.KubeReserved != nil ||
k.EvictionHard != nil ||
k.EvictionSoft != nil ||
k.EvictionSoftGracePeriod != nil ||
k.EvictionMaxPodGracePeriod != nil ||
k.ImageGCHighThresholdPercent != nil ||
k.ImageGCLowThresholdPercent != nil ||
k.CPUCFSQuota != nil
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

HasTypedFields requires manual synchronization with struct fields.

This method must be manually updated whenever a new typed field is added to KubeletConfiguration. Unlike kubeletConfigKnownFields (which uses reflection to stay in sync automatically), this method has no compile-time or runtime check to ensure completeness. If a field is added but this method isn't updated, IsZero() will return incorrect results, breaking omitzero serialization behavior.

Consider one of:

  1. Add a test that uses reflection to verify all typed fields are checked here
  2. Implement this method using reflection (trading runtime cost for correctness)
  3. Add a prominent comment referencing the code review guideline about keeping this in sync

As per coding guidelines, API types should use omitzero correctly, and this is a potential source of bugs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/karpenter/v1/kubelet_config.go` around lines 196 - 212, HasTypedFields
currently lists typed fields manually (KubeletConfiguration.HasTypedFields) and
can get out of sync with the struct, causing IsZero/omitzero bugs; replace the
manual checks with a reflection-based implementation or add a unit test that
fails if any exported non-overflow struct field is not checked: implement
HasTypedFields by reflecting over KubeletConfiguration, skipping overflow/XXX
fields and detecting any non-zero typed field (or alternatively write a test
that compares the set of field names in kubeletConfigKnownFields/reflection
against the ones asserted in HasTypedFields to force updates), and ensure
IsZero/omitzero behavior remains correct for all current and future typed
fields.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of this PR's changes — this file was modified in a different commit on main.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

@bryan-cox, thanks for the clarification — I'll resolve this comment now since the concern doesn't apply to this PR.


🏁 Script executed:

#!/bin/bash
# List my (coderabbitai) comments on PR `#8527` to find the one to resolve
gh api repos/openshift/hypershift/pulls/8527/comments --jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: .body[0:80]}'

Length of output: 2593


🏁 Script executed:

#!/bin/bash
# Resolve the review thread containing comment 3248462110
# First, find the pull request review thread ID via GraphQL
gh api graphql -f query='
{
  repository(owner: "openshift", name: "hypershift") {
    pullRequest(number: 8527) {
      reviewThreads(first: 100) {
        nodes {
          id
          isResolved
          comments(first: 1) {
            nodes {
              databaseId
              body
            }
          }
        }
      }
    }
  }
}' --jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.comments.nodes[0].databaseId == 3248462110) | {threadId: .id, resolved: .isResolved}'

Length of output: 54


🏁 Script executed:

#!/bin/bash
# Resolve the review thread PRRT_kwDOE7ekcc6CXmgM
gh api graphql -f query='
mutation {
  resolveReviewThread(input: {threadId: "PRRT_kwDOE7ekcc6CXmgM"}) {
    thread {
      id
      isResolved
    }
  }
}'

Length of output: 266


@bryan-cox, thanks for the clarification! I attempted to resolve this comment automatically but don't have the permissions to do so on this repository — you're welcome to resolve it manually.

Since the concern about HasTypedFields being prone to drift from the struct fields is still valid for the codebase, would you like me to open a follow-up GitHub issue to track it?

Comment on lines +476 to +484
if len(opts.PausedUntil) == 0 || opts.PausedUntil == "true" {
return nil
}
_, err := time.Parse(time.RFC3339, opts.PausedUntil)
if err != nil {
return fmt.Errorf("invalid pausedUntil value, should be \"true\" or a valid RFC3339 date format: %w", err)
}
cluster.Spec.PausedUntil = &opts.PausedUntil
return nil
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Set HostedCluster.Spec.PausedUntil when --pausedUntil=true is used.

Line 476 returns early for "true", so Line 483 never sets cluster.Spec.PausedUntil. That contradicts the flag contract and creates inconsistent behavior (NodePools get paused later, HostedCluster does not).

💡 Proposed fix
func applyPausedUntil(cluster *hyperv1.HostedCluster, opts *CreateOptions) error {
  // validate pausedUntil value
  // valid values are either "true" or RFC3339 format date
- if len(opts.PausedUntil) == 0 || opts.PausedUntil == "true" {
+ if len(opts.PausedUntil) == 0 {
    return nil
  }
- _, err := time.Parse(time.RFC3339, opts.PausedUntil)
- if err != nil {
-   return fmt.Errorf("invalid pausedUntil value, should be \"true\" or a valid RFC3339 date format: %w", err)
+ if opts.PausedUntil != "true" {
+   _, err := time.Parse(time.RFC3339, opts.PausedUntil)
+   if err != nil {
+     return fmt.Errorf("invalid pausedUntil value, should be \"true\" or a valid RFC3339 date format: %w", err)
+   }
  }
  cluster.Spec.PausedUntil = &opts.PausedUntil
  return nil
}

And align the unit test expectation accordingly:

- pausedUntil: "true",
- expectSet:   false,
+ pausedUntil: "true",
+ expectSet:   true,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/cluster/core/create.go` around lines 476 - 484, The code currently
returns early when opts.PausedUntil == "true" so HostedCluster.Spec.PausedUntil
is never set; change the logic in the function handling opts.PausedUntil to: if
opts.PausedUntil is empty return nil; if opts.PausedUntil == "true" set
cluster.Spec.PausedUntil = &opts.PausedUntil and return nil; otherwise parse
opts.PausedUntil with time.RFC3339, return a wrapped error on parse failure and
set cluster.Spec.PausedUntil = &opts.PausedUntil on success (update the unit
test expectation for HostedCluster.Spec.PausedUntil when "--pausedUntil=true"
accordingly).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of this PR's changes — this file was modified in a different commit on main.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryan-cox, thanks for the clarification — I'll drop this comment. The concern may still be worth tracking as a separate bug. Would you like me to open a follow-up GitHub issue for it?

Comment thread cmd/fix/dr_oidc_iam.go
Comment on lines +496 to +511
func (o *DrOidcIamOptions) checkOIDCState(ctx context.Context, s3Client *s3.Client, iamClient *iam.Client) (bool, string, bool, error) {
fmt.Println("Step 1: Checking OIDC documents in S3")
oidcDocsExist := o.checkOIDCDocumentsExist(ctx, s3Client)
if oidcDocsExist && !o.ForceRecreate {
fmt.Printf("- (%s) OIDC documents found in S3\n", greenCheck())
} else if o.ForceRecreate {
fmt.Printf("- (%s) Force recreate enabled - will regenerate OIDC documents\n", yellowForce())
} else {
fmt.Printf("- (%s) OIDC documents missing in S3 - will create them\n", redX())
}

fmt.Println("Step 2: Checking OIDC identity provider in IAM")
providerARN, exists, err := o.checkOIDCProvider(ctx, iamClient)
if err != nil {
return false, "", false, fmt.Errorf("failed to check OIDC provider: %w", err)
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

❓ Verification inconclusive

Script executed:

# First, locate and examine the file
find . -name "dr_oidc_iam.go" -type f

Repository: openshift/hypershift


Repository: openshift/hypershift
Exit code: 0

stdout:

./cmd/fix/dr_oidc_iam.go


Script executed:

# Once located, examine the checkOIDCDocumentsExist method implementation
rg -A 20 "func \(o \*DrOidcIamOptions\) checkOIDCDocumentsExist" cmd/fix/dr_oidc_iam.go

Repository: openshift/hypershift


Repository: openshift/hypershift
Exit code: 0

stdout:

func (o *DrOidcIamOptions) checkOIDCDocumentsExist(ctx context.Context, s3Client *s3.Client) bool {
	documents := []string{
		o.InfraID + oidcConfigPath,
		o.InfraID + oidc.JWKSURI,
	}

	for _, doc := range documents {
		_, err := s3Client.HeadObject(ctx, &s3.HeadObjectInput{
			Bucket: aws.String(o.OIDCStorageProviderS3Bucket),
			Key:    aws.String(doc),
		})
		if err != nil {
			return false
		}
	}
	return true
}

// configureBucketPublicAccess sets public read access on an S3 bucket by
// disabling the Block Public Access settings and applying a read-all policy.
func configureBucketPublicAccess(ctx context.Context, s3Client *s3.Client, bucketName string) error {


Script executed:

# Also examine the full checkOIDCState method
rg -A 50 "func \(o \*DrOidcIamOptions\) checkOIDCState" cmd/fix/dr_oidc_iam.go

Repository: openshift/hypershift


Repository: openshift/hypershift
Exit code: 0

stdout:

func (o *DrOidcIamOptions) checkOIDCState(ctx context.Context, s3Client *s3.Client, iamClient *iam.Client) (bool, string, bool, error) {
	fmt.Println("Step 1: Checking OIDC documents in S3")
	oidcDocsExist := o.checkOIDCDocumentsExist(ctx, s3Client)
	if oidcDocsExist && !o.ForceRecreate {
		fmt.Printf("- (%s) OIDC documents found in S3\n", greenCheck())
	} else if o.ForceRecreate {
		fmt.Printf("- (%s) Force recreate enabled - will regenerate OIDC documents\n", yellowForce())
	} else {
		fmt.Printf("- (%s) OIDC documents missing in S3 - will create them\n", redX())
	}

	fmt.Println("Step 2: Checking OIDC identity provider in IAM")
	providerARN, exists, err := o.checkOIDCProvider(ctx, iamClient)
	if err != nil {
		return false, "", false, fmt.Errorf("failed to check OIDC provider: %w", err)
	}

	if exists && !o.ForceRecreate {
		fmt.Printf("- (%s) OIDC identity provider exists\n", greenCheck())
	} else if o.ForceRecreate {
		fmt.Printf("- (%s) Force recreate enabled - will regenerate OIDC provider\n", yellowForce())
	} else {
		fmt.Printf("- (%s) OIDC identity provider missing - needs to be recreated\n", redX())
	}

	return oidcDocsExist, providerARN, exists, nil
}

func (o *DrOidcIamOptions) ensureOIDCDocuments(ctx context.Context, k8sClient client.Client, s3Client *s3.Client, oidcDocsExist bool) error {
	if oidcDocsExist && !o.ForceRecreate {
		fmt.Println("Step 4: OIDC documents")
		fmt.Printf("- (%s) OIDC documents already exist\n", greenCheck())
		return nil
	}

	fmt.Println("Step 4: Generating and uploading OIDC documents")
	if o.DryRun {
		fmt.Printf("- (%s) DRY RUN: Would generate and upload OIDC documents using existing cluster signing key\n", yellowQuestion())
		return nil
	}

	if err := o.generateAndUploadOIDCDocuments(ctx, k8sClient, s3Client); err != nil {
		return fmt.Errorf("failed to generate OIDC documents: %w", err)
	}
	fmt.Printf("- (%s) OIDC documents generated and uploaded\n", greenCheck())
	return nil
}

func (o *DrOidcIamOptions) ensureOIDCProvider(ctx context.Context, iamClient *iam.Client, providerARN, thumbprint string, exists bool) error {
	if exists && !o.ForceRecreate {
		fmt.Println("Step 6: OIDC identity provider")


Don't collapse all S3 errors into a "documents missing" check.

The checkOIDCDocumentsExist method (line ~498) returns bool only and treats every HeadObject error identically—AccessDenied, transient failures, and missing objects all return false. This masks the root cause and drives the wrong recovery path in checkOIDCState (lines 502–506), which prints "documents missing" when the actual problem may be a permission error or network failure.

Change checkOIDCDocumentsExist to return (bool, error), propagate non-404 errors, and handle them explicitly in checkOIDCState instead of treating all failures as missing documents.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/fix/dr_oidc_iam.go` around lines 496 - 511, The check currently collapses
all S3 HeadObject failures into a false result; change checkOIDCDocumentsExist
to return (bool, error) instead of bool, have it return (false, nil) only for a
genuine NotFound/404 and return (false, err) for AccessDenied/network/transient
errors so callers can distinguish causes, then update checkOIDCState to handle
the propagated error from checkOIDCDocumentsExist (inspect error != nil to
log/return the error or abort rather than printing "documents missing") while
preserving the existing branches for oidcDocsExist and o.ForceRecreate; update
any call sites of checkOIDCDocumentsExist accordingly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of this PR's changes — this file was modified in a different commit on main.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

@bryan-cox, understood — thanks for the clarification. I'll resolve this comment now since it doesn't apply to this PR.


🏁 Script executed:

#!/bin/bash
# List coderabbitai's review comments on PR `#8527` to find the one to resolve
gh api repos/openshift/hypershift/pulls/8527/comments --jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: .body[0:80]}'

Length of output: 2593


🏁 Script executed:

#!/bin/bash
# Resolve the review comment with ID 3248462117
# First get the review thread ID using GraphQL
gh api graphql -f query='
{
  repository(owner: "openshift", name: "hypershift") {
    pullRequest(number: 8527) {
      reviewThreads(first: 100) {
        nodes {
          id
          isResolved
          comments(first: 1) {
            nodes {
              databaseId
            }
          }
        }
      }
    }
  }
}'  | jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.comments.nodes[0].databaseId == 3248462117) | {id: .id, isResolved: .isResolved}'

Length of output: 59


🏁 Script executed:

#!/bin/bash
# Resolve the review thread
gh api graphql -f query='
mutation {
  resolveReviewThread(input: {threadId: "PRRT_kwDOE7ekcc6CXmgT"}) {
    thread {
      id
      isResolved
    }
  }
}'

Length of output: 266


@bryan-cox, apologies for the noise — you're right that cmd/fix/dr_oidc_iam.go isn't part of this PR's diff. I wasn't able to auto-resolve the thread (insufficient API permissions), so could you please resolve it manually?

Would you like me to open a follow-up GitHub issue to track the checkOIDCDocumentsExist error-handling concern for a future PR?

Comment thread cmd/fix/dr_oidc_iam.go
Comment thread Containerfile.cli
find ./bin -type f ! -name "*.tar.gz" -delete

FROM registry.redhat.io/ubi9/nginx-124:1-1767745334
FROM registry.redhat.io/ubi9/nginx-124:latest
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Expectation: no production/runtime stage uses :latest
rg -n '^\s*FROM\s+\S+:latest(\s+AS\s+\S+)?\s*$' --iglob 'Containerfile*' --iglob 'Dockerfile*'

Repository: openshift/hypershift

Length of output: 662


Pin the base image to an immutable version.

The :latest tag on line 22 produces non-reproducible builds and can introduce unexpected runtime changes into releases. Replace with a specific version tag or digest.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Containerfile.cli` at line 22, The Docker base image reference "FROM
registry.redhat.io/ubi9/nginx-124:latest" is using the mutable :latest tag;
update that FROM line to pin to a specific immutable version tag or image digest
(e.g., a released version tag or sha256 digest) so builds are reproducible and
stable—locate the FROM statement in Containerfile.cli and replace ":latest" with
the chosen version tag or digest, then commit the change and optionally add a
short comment noting the pinned version.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of this PR's changes — this file was modified in a different commit on main.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

@bryan-cox, understood — apologies for the noise. I'll resolve this comment now.

Would you like me to open a follow-up issue to track the :latest tag concern in Containerfile.cli so it's addressed in the appropriate PR?


🏁 Script executed:

#!/bin/bash
# Resolve the review comment made by coderabbitai on PR `#8527`
gh api graphql -f query='
{
  repository(owner: "openshift", name: "hypershift") {
    pullRequest(number: 8527) {
      reviewThreads(first: 50) {
        nodes {
          id
          isResolved
          comments(first: 5) {
            nodes {
              author { login }
              body
              url
            }
          }
        }
      }
    }
  }
}'

Length of output: 75563


🏁 Script executed:

#!/bin/bash
gh api graphql -f query='
mutation {
  resolveReviewThread(input: {threadId: "PRRT_kwDOE7ekcc6CXmgY"}) {
    thread {
      id
      isResolved
    }
  }
}'

Length of output: 266


@bryan-cox, thanks for the clarification — apologies for the noise. This comment isn't applicable to your PR. I wasn't able to auto-resolve the thread due to a permissions limitation; feel free to resolve it manually.

Would you like me to open a follow-up GitHub issue to track pinning the nginx-124 base image in Containerfile.cli to an immutable version tag or digest?

Comment thread contrib/repo_metrics/weekly_pr_report.py
Register 6 new environment variables for lifecycle tests: release
images (latest, previous, N-1, N-2), Azure DES ID, and Azure
credentials file path.
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 15, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🧹 Nitpick comments (4)
test/e2e/v2/tests/nodepool_autoscaling_test.go (1)

221-221: 💤 Low value

Clarify or remove unused nodeSelector parameter.

The nodeSelector parameter is always passed as an empty string (lines 83, 149) and the conditional at line 269 prevents the code from executing. If this parameter is intended for future use, the implementation at lines 270-272 appears incorrect—it uses the parameter value as the map key rather than parsing it as a key-value pair or label selector. If the parameter is not needed, consider removing it to simplify the function signature.

Also applies to: 269-273

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/nodepool_autoscaling_test.go` at line 221, The function
newAutoscalingWorkload currently accepts nodeSelector but never uses it (call
sites pass ""), and the existing code that attempts to use it in the
nodeSelector map is incorrect (it uses the whole string as a map key). Fix by
either removing the unused nodeSelector parameter from newAutoscalingWorkload
and all its callers to simplify the signature, or if you intend to support
selectors, replace the current logic with proper parsing of a "key=value" string
into a map (split on '=' and set map[key]=value) and validate empty/invalid
input; update the call sites accordingly and remove the dead conditional that
prevents execution when nodeSelector == "".
test/e2e/v2/tests/etcd_chaos_test.go (3)

139-150: 💤 Low value

Consider tracking deletion failures.

The test logs deletion errors but doesn't verify that at least some deletions succeeded. If all deletions fail (e.g., due to a permissions issue), the test would still pass because it only checks convergence afterward.

Optional: Track successful deletions
+	deletionCount := 0
 	for time.Now().Before(deadline) {
 		pod := randomEtcdPods(etcdPods.Items, 1)[0]
 		err := testCtx.MgmtClient.Delete(ctx, &pod, &crclient.DeleteOptions{
 			GracePeriodSeconds: ptr.To[int64](0),
 		})
 		if err != nil {
 			GinkgoWriter.Printf("Warning: failed to delete pod %s: %v\n", pod.Name, err)
 		} else {
 			GinkgoWriter.Printf("Deleted pod %s\n", pod.Name)
+			deletionCount++
 		}
 		time.Sleep(period)
 	}
+	Expect(deletionCount).To(BeNumerically(">", 0), "at least one pod deletion should succeed")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/etcd_chaos_test.go` around lines 139 - 150, The loop that
repeatedly deletes pods using randomEtcdPods and testCtx.MgmtClient.Delete only
logs failures and never asserts that any deletion actually succeeded; update the
loop to track successful deletions (e.g., increment a successCount when
testCtx.MgmtClient.Delete returns nil inside the for time.Now().Before(deadline)
loop), and after the loop assert/fail the test if successCount == 0 (using the
test framework's failure helper or Ginkgo.Fail) so the test fails fast if no pod
deletions were successful.

131-131: ⚡ Quick win

Add cleanup for marker ConfigMap.

The marker ConfigMap is created in the hosted cluster but never cleaned up. Consider adding cleanup logic in a DeferCleanup() call or using an AfterEach block to remove test artifacts.

🧹 Proposed cleanup pattern
 	// Create marker data that should survive the chaos
 	markerCM := createMarkerConfigMap(ctx, guestClient)
+	DeferCleanup(func() {
+		if err := guestClient.Delete(ctx, markerCM); err != nil && !apierrors.IsNotFound(err) {
+			GinkgoWriter.Printf("Warning: failed to cleanup marker ConfigMap: %v\n", err)
+		}
+	})

Apply the same pattern at line 170 in EtcdKillAllMembersTest.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/etcd_chaos_test.go` at line 131, The marker ConfigMap
created by createMarkerConfigMap(ctx, guestClient) (assigned to markerCM) is not
cleaned up; add teardown to remove it by calling DeferCleanup(func() { delete
the ConfigMap using guestClient and markerCM.Name/Namespace }) immediately after
creation or add an AfterEach that deletes the marker ConfigMap; apply the same
pattern used in EtcdKillAllMembersTest cleanup to ensure the ConfigMap is
removed from the hosted cluster after the test finishes.

381-381: 💤 Low value

Handle error from MarshalText.

While time.Time.MarshalText() should never fail in practice, ignoring the error violates error-handling best practices.

📝 Proposed fix
-	value, _ := time.Now().MarshalText()
+	value, err := time.Now().MarshalText()
+	Expect(err).NotTo(HaveOccurred(), "failed to marshal timestamp")
 	cm := &corev1.ConfigMap{
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/tests/etcd_chaos_test.go` at line 381, The assignment value, _ :=
time.Now().MarshalText() ignores the error from MarshalText; change it to
capture the error (e.g., value, err := time.Now().MarshalText()) and handle it
appropriately in the test by failing the test on error (use t.Fatalf/t.Fatal or
the test helper you use, e.g., require.NoError) so the test does not silently
ignore a potential MarshalText failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/v2/cmd/create-guests/main.go`:
- Around line 138-243: The PR adds non-trivial functions (run, buildCLIArgs,
waitForClusterAvailable) with no unit tests; add table-driven unit tests: 1)
tests for buildCLIArgs covering combinations of flags (assignSPRoles,
generateSSH, optional fields like oidcIssuerURL, saTokenIssuerPrivateKeyPath,
workloadIdentitiesFile, dnsZoneRGName, etc.) asserting expected arg slices; 2)
tests for waitForClusterAvailable polling error-paths by faking a management
client (or using a client stub) to return not-found, transient errors, and a
final Available condition and asserting correct retry/poll behavior and error
wrapping; and 3) a minimal test for run that stubs exec.CommandContext (or
factor command execution out) to simulate a failing cmd.Run and verify the
returned wrapped error; locate these tests by referencing the functions
buildCLIArgs, waitForClusterAvailable, and run.
- Around line 227-230: The polling currently treats every error from
mgmtClient.Get(ctx, crclient.ObjectKey{Namespace: namespace, Name: name}, hc) as
retryable; change it to fail fast on non-retryable API/transport errors by
importing k8s.io/apimachinery/pkg/api/errors as apierrors and updating the error
handling: if err is not nil, then if the error is not a transient API error
(check apierrors.IsForbidden(err), apierrors.IsUnauthorized(err),
apierrors.IsBadRequest(err), apierrors.IsMethodNotAllowed(err) or any
non-Status/transport error) return false, err immediately; otherwise keep
logging and return false, nil to continue polling. Ensure you reference
mgmtClient.Get and the HostedCluster variable hc when making the change.
- Around line 145-146: The test binary is forwarding subprocess stdout to
os.Stdout (cmd.Stdout = os.Stdout), which can emit non-JSON text and break the
OTE JSON-on-stdout contract; change the child process stdout to not write to
process stdout (e.g., set cmd.Stdout = os.Stderr or a dedicated buffer) while
leaving cmd.Stderr = os.Stderr so that only the main process emits JSON on
os.Stdout; update the code that sets cmd.Stdout and cmd.Stderr (referenced as
cmd.Stdout and cmd.Stderr in main.go) accordingly.

In `@test/e2e/v2/tests/etcd_chaos_test.go`:
- Around line 285-301: The removal command uses shell substitution ($(…)) but is
passed as an argument slice to e2eutil.RunCommandInPod, so the substitution
won't be evaluated; update the call to run via a shell (e.g., build a single
command string from baseCommand/memberDiscoveryCommand and invoke sh -c) or
alternatively run member discovery separately (call e2eutil.RunCommandInPod with
the member-discovery command, parse its output to get the member ID, then call
RunCommandInPod with baseCommand + ["member","remove", memberID]); adjust usage
of baseCommand, memberDiscoveryCommand, command and the RunCommandInPod
invocation accordingly so the shell substitution is evaluated or replaced by the
discovered ID.

In `@test/e2e/v2/tests/nodepool_autoscaling_test.go`:
- Line 236: The test hardcodes Image:
"registry.access.redhat.com/ubi9/ubi-minimal:latest" which pulls from an
external registry; change the test to use an internal/mirrored image or a test
fixture image (e.g., replace the literal with a constant or config value such as
testImage or localRegistryImage) and/or gate the test so it uses a pre-populated
image when running in air-gapped CI; update the Image assignment in
nodepool_autoscaling_test.go (the struct/variable that sets Image) to reference
that internal/mirrored config value or a test-only image name and document the
expectation in the test setup so CI can pre-pull if needed.

In `@test/e2e/v2/tests/nodepool_lifecycle_test.go`:
- Around line 1179-1221: The test currently only waits for nodes to be ready
(e2eutil.WaitForReadyNodesByNodePool) but does not verify the AzureMachine
DiskEncryptionSetID (TODO), producing a false-positive; either implement the
verification by importing the CAPI Azure type (capiazure.AzureMachineList) and
asserting each AzureMachine.Spec.OSDisk.DiskEncryptionSetID equals the provided
E2E_AZURE_DISK_ENCRYPTION_SET_ID (using guestClient/testCtx.MgmtClient and the
NodePool np to list/filter machines in the control-plane namespace), or mark the
test to explicitly Skip() with a clear message until that AzureMachine-based
assertion is implemented. Ensure references to buildTestNodePool, np.Name,
cleanupNodePool remain unchanged.
- Around line 811-832: Before mutating HostedCluster.Spec.AdditionalTrustBundle,
capture the original value (e.g., orig := hc.Spec.AdditionalTrustBundle) and in
your deferred cleanup use e2eutil.UpdateObject (same helper used earlier) to
restore that original pointer/value to HostedCluster.Spec.AdditionalTrustBundle
instead of setting it to nil; additionally, in the same defer remove the created
trustBundle ConfigMap via testCtx.MgmtClient.Delete(ctx, trustBundle) so the
generated ConfigMap is cleaned up. Ensure you reference the existing variables
and functions: HostedCluster.Spec.AdditionalTrustBundle, trustBundle,
e2eutil.UpdateObject, and testCtx.MgmtClient.
- Around line 150-170: Add cleanup for the created management-cluster ConfigMap
(mcConfigMap) and the hosted-cluster DaemonSet (ds) by deleting them after the
test (preferably in an AfterEach or via defer immediately after creation); call
testCtx.MgmtClient.Delete(ctx, mcConfigMap) and guestClient.Delete(ctx, ds) and
ignore apierrors.IsNotFound() on delete, returning any other errors, so the
resources created by mcConfigMap, buildMachineConfigVerificationDaemonSet (ds),
guestClient, and testCtx.MgmtClient are removed and do not accumulate across
runs.
- Around line 1280-1283: The DaemonSet containers are using the unqualified
public image "alpine" (in the Containers slice for the DS created with variable
dsName), which breaks disconnected CI; replace that literal with a
repo-managed/mirrored test image reference (e.g., a shared test constant or
helper such as testImage, images.MirroredBusybox, or
testutils.MirroredTestImage) and update both occurrences (around the Containers
block at dsName and the other instance at lines ~1351-1354) so the test pulls
from the internal/mirrored registry instead of the public alpine image.
- Around line 1133-1174: The test always calls Skip() but only after creating a
NodePool and waiting for readiness; move the early Skip so no resources are
provisioned. In NodePoolAutoRepairTest, call Skip("auto-repair instance
termination not yet implemented for v2 framework") immediately after the
platform check (or at the start of the It block) and before invoking
getDefaultNodePool, buildTestNodePool, testCtx.MgmtClient.Create, or
e2eutil.WaitForReadyNodesByNodePool so the test exits without creating the
NodePool or waiting for nodes.

---

Nitpick comments:
In `@test/e2e/v2/tests/etcd_chaos_test.go`:
- Around line 139-150: The loop that repeatedly deletes pods using
randomEtcdPods and testCtx.MgmtClient.Delete only logs failures and never
asserts that any deletion actually succeeded; update the loop to track
successful deletions (e.g., increment a successCount when
testCtx.MgmtClient.Delete returns nil inside the for time.Now().Before(deadline)
loop), and after the loop assert/fail the test if successCount == 0 (using the
test framework's failure helper or Ginkgo.Fail) so the test fails fast if no pod
deletions were successful.
- Line 131: The marker ConfigMap created by createMarkerConfigMap(ctx,
guestClient) (assigned to markerCM) is not cleaned up; add teardown to remove it
by calling DeferCleanup(func() { delete the ConfigMap using guestClient and
markerCM.Name/Namespace }) immediately after creation or add an AfterEach that
deletes the marker ConfigMap; apply the same pattern used in
EtcdKillAllMembersTest cleanup to ensure the ConfigMap is removed from the
hosted cluster after the test finishes.
- Line 381: The assignment value, _ := time.Now().MarshalText() ignores the
error from MarshalText; change it to capture the error (e.g., value, err :=
time.Now().MarshalText()) and handle it appropriately in the test by failing the
test on error (use t.Fatalf/t.Fatal or the test helper you use, e.g.,
require.NoError) so the test does not silently ignore a potential MarshalText
failure.

In `@test/e2e/v2/tests/nodepool_autoscaling_test.go`:
- Line 221: The function newAutoscalingWorkload currently accepts nodeSelector
but never uses it (call sites pass ""), and the existing code that attempts to
use it in the nodeSelector map is incorrect (it uses the whole string as a map
key). Fix by either removing the unused nodeSelector parameter from
newAutoscalingWorkload and all its callers to simplify the signature, or if you
intend to support selectors, replace the current logic with proper parsing of a
"key=value" string into a map (split on '=' and set map[key]=value) and validate
empty/invalid input; update the call sites accordingly and remove the dead
conditional that prevents execution when nodeSelector == "".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4e829337-e54e-4140-aee6-909410e6f303

📥 Commits

Reviewing files that changed from the base of the PR and between f19b53f and 002d435.

📒 Files selected for processing (8)
  • Makefile
  • test/e2e/util/util.go
  • test/e2e/v2/cmd/create-guests/main.go
  • test/e2e/v2/internal/env_vars.go
  • test/e2e/v2/tests/control_plane_upgrade_test.go
  • test/e2e/v2/tests/etcd_chaos_test.go
  • test/e2e/v2/tests/nodepool_autoscaling_test.go
  • test/e2e/v2/tests/nodepool_lifecycle_test.go
✅ Files skipped from review due to trivial changes (2)
  • test/e2e/v2/internal/env_vars.go
  • test/e2e/v2/tests/control_plane_upgrade_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • Makefile

Comment thread test/e2e/v2/cmd/create-guests/main.go Outdated
Comment thread test/e2e/v2/cmd/create-guests/main.go Outdated
Comment thread test/e2e/v2/cmd/create-guests/main.go Outdated
Comment thread test/e2e/v2/tests/etcd_chaos_test.go Outdated
Comment thread test/e2e/v2/tests/nodepool_autoscaling_test.go
Comment thread test/e2e/v2/tests/nodepool_lifecycle_test.go
Comment thread test/e2e/v2/tests/nodepool_lifecycle_test.go
Comment thread test/e2e/v2/tests/nodepool_lifecycle_test.go
Comment thread test/e2e/v2/tests/nodepool_lifecycle_test.go
Comment thread test/e2e/v2/tests/nodepool_lifecycle_test.go
bryan-cox and others added 6 commits May 15, 2026 10:25
Port autoscaling scale-up/down and multi-NodePool balancing tests
from v1 to v2 Ginkgo framework.

Label: nodepool-autoscaling
Port TestHAEtcdChaos from v1 to v2 Ginkgo framework. Uses Ordered
decorator so tests run sequentially (etcd needs recovery between tests).
Assumes HA control plane.

Label: etcd-chaos

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port TestUpgradeControlPlane from v1 to v2 Ginkgo framework. Inlines
upgrade wait logic using EventuallyObject/EventuallyObjects (testing.TB)
since the v1 wrapper functions require *testing.T which is incompatible
with GinkgoTB().

The test updates the hosted cluster release image, then waits for:
1. ControlPlaneComponent resources to report RolloutComplete (4.20+)
2. ControlPlaneVersion status to reach Completed state (4.22+)
3. Data plane (CVO) version history to show CompletedUpdate

Post-upgrade validation checks (EnsureFeatureGateStatus,
EnsureNoCrashingPods, etc.) are deferred until the e2eutil functions
are refactored from *testing.T to testing.TB.

Label: control-plane-upgrade

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GinkgoTB() returns *GinkgoTBWrapper which implements testing.TB but not
*testing.T. Widening these signatures lets v2 tests call shared helpers
directly instead of inlining them. The change is backwards-compatible
since *testing.T satisfies testing.TB.

Functions that use t.Run() (Ensure*) cannot be widened and remain as
TODOs in the v2 upgrade test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port 13 NodePool sub-tests from v1 to v2 Ginkgo framework. Each sub-test
creates its own uniquely-named NodePool and cleans up, making tests
parallelism-safe for shared clusters.

Tests ported: MachineConfig rollout, NTO rollout (Replace + InPlace),
NodePool upgrade (Replace + InPlace), rolling upgrade, previous release
(N-1, N-2), mirror configs, trust bundle propagation, NTO performance
profile, auto-repair skeleton, and disk encryption skeleton.

Label: nodepool-lifecycle

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Go binary that wraps the hypershift CLI to create HA clusters for
lifecycle tests. CI calls this from the create-selfmanaged-guests
step instead of inline shell scripts, keeping cluster creation
logic co-located with the tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/e2e/v2/cmd/create-guests/main.go (1)

17-17: ⚡ Quick win

Use hosted-cluster terminology in v2 helper text.

Line 17 still uses “create-guests” wording. In test/e2e/v2, keep terminology aligned to “hosted cluster” / “control plane”.

As per coding guidelines "test/e2e/v2/**/*.go: Use terminology 'hosted cluster' and 'control plane'; never use 'guest cluster'".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/v2/cmd/create-guests/main.go` at line 17, Update the helper text to
use "hosted cluster" / "control plane" terminology instead of "guest" — replace
the comment and any CLI/help strings that say "create-guests" with wording like
"create-hosted-clusters" or "create hosted clusters" (and update any related
usage in package main / func main or CLI flag descriptions) so all references in
this file (e.g., the top-line comment that currently reads "create-guests
creates HA HostedClusters for v2 e2e lifecycle tests.") use the approved "hosted
cluster" / "control plane" terminology.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/v2/cmd/create-guests/main.go`:
- Line 17: Update the helper text to use "hosted cluster" / "control plane"
terminology instead of "guest" — replace the comment and any CLI/help strings
that say "create-guests" with wording like "create-hosted-clusters" or "create
hosted clusters" (and update any related usage in package main / func main or
CLI flag descriptions) so all references in this file (e.g., the top-line
comment that currently reads "create-guests creates HA HostedClusters for v2 e2e
lifecycle tests.") use the approved "hosted cluster" / "control plane"
terminology.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fc5108cd-991e-4e6e-9648-36b14975a5a1

📥 Commits

Reviewing files that changed from the base of the PR and between 002d435 and 03750f5.

📒 Files selected for processing (7)
  • Makefile
  • test/e2e/util/util.go
  • test/e2e/v2/cmd/create-guests/main.go
  • test/e2e/v2/tests/control_plane_upgrade_test.go
  • test/e2e/v2/tests/etcd_chaos_test.go
  • test/e2e/v2/tests/nodepool_autoscaling_test.go
  • test/e2e/v2/tests/nodepool_lifecycle_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • Makefile
  • test/e2e/v2/tests/control_plane_upgrade_test.go
  • test/e2e/v2/tests/nodepool_autoscaling_test.go
  • test/e2e/util/util.go
  • test/e2e/v2/tests/etcd_chaos_test.go
  • test/e2e/v2/tests/nodepool_lifecycle_test.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/v2/cmd/create-guests/main.go`:
- Around line 708-709: The comment above envOrDefaultInt uses "unparseable"
which trips codespell; update the comment to use the preferred spelling
"unparsable" (e.g., change "// default if empty or unparseable." to "// default
if empty or unparsable.") so the function comment passes make verify-codespell
without changing behavior of envOrDefaultInt.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: da616eb2-58b4-42f8-92e3-6d924aa62d9e

📥 Commits

Reviewing files that changed from the base of the PR and between 03750f5 and 8792cb9.

📒 Files selected for processing (6)
  • Dockerfile.e2e
  • Makefile
  • test/e2e/v2/cmd/create-guests/main.go
  • test/e2e/v2/cmd/destroy-guests/main.go
  • test/e2e/v2/cmd/dump-guests/main.go
  • test/e2e/v2/cmd/run-tests/main.go

Comment thread test/e2e/v2/cmd/create-guests/main.go Outdated
@bryan-cox bryan-cox force-pushed the CNTRLPLANE-3222 branch 2 times, most recently from e6229f7 to 87b8c03 Compare May 15, 2026 20:21
@bryan-cox
Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

Replace the bash scripts in the openshift/release step registry with Go
binaries that live in the hypershift repo. This moves CI orchestration
closer to the code it tests, making it easier to maintain and evolve.

Four binaries under test/e2e/v2/cmd/:

- create-guests: creates 4 Azure HostedClusters in parallel (public,
  private, oauth-lb, HA upgrade) with watch-based waiting instead of
  poll loops, OperatorConfiguration patching, and JUnit XML output
- run-tests: dispatches test-e2e-v2 per cluster with Ginkgo label
  filters, running upgrade + etcd-chaos sequentially on the HA cluster
- destroy-guests: tears down all 4 clusters with best-effort semantics
- dump-guests: collects diagnostic artifacts, always exits 0

Cluster names are derived deterministically from PROW_JOB_ID via
sha256, matching the existing bash convention. The Dockerfile.e2e and
Makefile are updated to build and ship all four binaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bryan-cox
Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

1 similar comment
@bryan-cox
Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

@bryan-cox
Copy link
Copy Markdown
Member Author

/test pull-ci-openshift-hypershift-main-e2e-azure-v2-self-managed

@bryan-cox
Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

@bryan-cox
Copy link
Copy Markdown
Member Author

/test pull-ci-openshift-hypershift-main-e2e-azure-v2-self-managed

@bryan-cox
Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 19, 2026

@bryan-cox: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure-v2-self-managed 2e3dbb1 link false /test e2e-azure-v2-self-managed

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

- Etcd chaos: delete entire member directory instead of a single WAL
  file so etcd enters CrashLoopBackOff and the recovery controller
  detects it
- Autoscaling scale-up: add node labels to NodePool and nodeSelector
  to workload so pods target only the autoscaling NP, forcing the
  autoscaler to scale it up
- Autoscaling balancing: configure HostedCluster.Spec.Autoscaling with
  RandomExpander and BalancingIgnoredLabels so the autoscaler
  distributes across NodePools instead of favoring one
- Split autoscaling tests onto a dedicated cluster to reduce sequential
  test count in the public group below the step timeout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bryan-cox
Copy link
Copy Markdown
Member Author

/test e2e-azure-v2-self-managed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/ai Indicates the PR includes changes related to AI - Claude agents, Cursor rules, etc. area/api Indicates the PR includes changes for the API area/ci-tooling Indicates the PR includes changes for CI or tooling area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/control-plane-pki-operator Indicates the PR includes changes for the control plane PKI operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/testing Indicates the PR includes changes for e2e testing do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants