Skip to content

CNTRLPLANE-3330: ci: improve unit test speed with sharding and parallelism#8330

Merged
openshift-merge-bot[bot] merged 2 commits intoopenshift:mainfrom
bryan-cox:improve-ut-speed
Apr 27, 2026
Merged

CNTRLPLANE-3330: ci: improve unit test speed with sharding and parallelism#8330
openshift-merge-bot[bot] merged 2 commits intoopenshift:mainfrom
bryan-cox:improve-ut-speed

Conversation

@bryan-cox
Copy link
Copy Markdown
Member

@bryan-cox bryan-cox commented Apr 24, 2026

What this PR does / why we need it:

Reduces CI unit test wall time from ~14 minutes to ~6 minutes by:

  1. Sharding tests across 5 parallel GitHub Actions matrix jobs (cpo-hostedcontrolplane, cpo-other, hypershift-operator, cmd-support, other)
  2. Adding t.Parallel() to the heaviest test files (101 tests parallelized across 6 files)
  3. Persisting the Go build cache with actions/cache for warm cache speedup
  4. Skipping unnecessary runs when only contrib, .github, or docs files change

Performance Results

Configuration cpo-hcp cpo-other hypershift-operator cmd-support other Wall Time vs Baseline
Baseline (single job) ~14 min
Cold cache (first run) 5m 23s 7m 47s 4m 38s 8m 57s 8m 52s 8m 57s 36% faster
Warm cache (subsequent runs) 5m 20s 4m 24s 4m 27s 6m 19s 5m 28s 6m 19s 55% faster

Follow-up work to eliminate cache upload/download overhead (~2-3 min/shard) via EFS PV: CNTRLPLANE-3329

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3330

Special notes for your reviewer:

  • Codecov integration verified: all 5 shard flags upload correctly and coverage merges properly
  • Tests using t.Setenv or global state mutation were intentionally left sequential
  • The test-shard Makefile target depends on generate because mock files (*_mock.go) are gitignored
  • workflow_dispatch events always run tests since there is no base commit to diff against

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Apr 24, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 24, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release approved Indicates a PR has been approved by an approver from all required OWNERS files. area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Apr 24, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a new setup job that computes a run_tests flag using git diff (ignoring docs/ and using ${{ github.event.before }}..${{ github.sha }} for non-PR events), runs make generate and uploads generated artifacts when needed. The prior single test job is replaced by a conditional sharded test matrix that runs make test-shard per shard, emits shard-specific coverage files (cover-<shard>.out), and uploads each to Codecov with the shard name as flags. A test-shard Make target and numerous t.Parallel() additions in Go tests were added.

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions
    participant Setup as setup Job
    participant Git as Git (diff)
    participant Artifact as upload-artifact
    participant Matrix as test Job (sharded matrix)
    participant Make as make test-shard
    participant Codecov as Codecov

    GH->>Setup: workflow triggered
    Setup->>Git: run git diff (exclude docs/, use before..sha)
    Git-->>Setup: changed_files (true|false)
    alt changed_files == true
        Setup->>Setup: run make generate
        Setup->>Artifact: upload generated files
        Setup-->>GH: set run_tests = 'true'
        GH->>Matrix: start sharded test matrix (run_tests == 'true')
        par shards
            Matrix->>Make: make test-shard (TEST_PACKAGES, COVER_PROFILE=cover-<shard>.out)
            Make-->>Matrix: produce cover-<shard>.out
        end
        Matrix->>Codecov: upload cover-<shard>.out with flags=<shard>
        Codecov-->>GH: report results
    else changed_files == false
        Setup-->>GH: set run_tests = 'false'
        GH-->>Matrix: skip tests
    end
Loading
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.27% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR modifies only standard Go test files using func Test* pattern with static test names, not Ginkgo DSL tests, so stable test name requirements do not apply.
Test Structure And Quality ✅ Passed PR modifies standard Go tests (func TestXxx) by adding t.Parallel() calls; custom check requires Ginkgo test patterns which are not present in changed files.
Microshift Test Compatibility ✅ Passed This PR does not introduce any new Ginkgo e2e tests; it only adds t.Parallel() calls to existing Go unit tests.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR adds t.Parallel() calls to 101 existing unit tests without introducing new Ginkgo e2e tests, making custom check inapplicable.
Topology-Aware Scheduling Compatibility ✅ Passed This pull request does not introduce any scheduling constraints that would affect Kubernetes cluster topology compatibility.
Ote Binary Stdout Contract ✅ Passed PR only adds t.Parallel() calls to test functions and updates CI configuration; no process-level code (main, init, TestMain, suite setup) that could write to stdout was introduced.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR modifies only existing Go unit tests by adding t.Parallel() calls; no new Ginkgo e2e tests are introduced.
Title check ✅ Passed The title clearly and specifically summarizes the main change: improving unit test speed through test sharding and parallelism, which aligns with the changeset that introduces CI matrix sharding, t.Parallel() calls across test files, and a new test-shard make target.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/test.yaml (1)

29-41: ⚠️ Potential issue | 🟠 Major

HEAD~1 misses commits in multi-commit pushes; use github.event.before...github.sha but handle new branches.

On push and workflow_dispatch, using HEAD~1 only inspects the last commit. A multi-commit push can therefore set run_tests=false even when earlier commits changed Go code. The suggested fix using ${{ github.event.before }}...${{ github.sha }} is the correct approach, but it has an unhandled edge case: when pushing to a new branch, github.event.before is set to the zero SHA (0000000000000000000000000000000000000000), which git diff cannot resolve as a valid object. The fix should either replace the zero SHA with the empty tree SHA (4b825dc642cb6eb9a060e54bf8d69288fbee4904) or add explicit handling for new branches. Additionally, consider forcing run_tests=true for workflow_dispatch since manual triggers should reliably run tests.

Suggested change
       - name: Check for non-contrib changes
         id: changes
         run: |
+          BEFORE="${{ github.event.before }}"
+          if [ "$BEFORE" = "0000000000000000000000000000000000000000" ]; then
+            BEFORE="4b825dc642cb6eb9a060e54bf8d69288fbee4904"
+          fi
+          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
+            echo "run_tests=true" >> "$GITHUB_OUTPUT"
+            exit 0
+          elif [ "${{ github.event_name }}" = "pull_request" ]; then
             FILES=$(git diff --name-only origin/${{ github.base_ref }}...HEAD)
           else
-            FILES=$(git diff --name-only HEAD~1)
+            FILES=$(git diff --name-only "$BEFORE"...${{ github.sha }})
           fi
           if echo "$FILES" | grep -qvE '^(contrib|\.github)/'; then
             echo "run_tests=true" >> "$GITHUB_OUTPUT"
           else
             echo "run_tests=false" >> "$GITHUB_OUTPUT"
           fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yaml around lines 29 - 41, The "Check for non-contrib
changes" step uses HEAD~1 and misses multi-commit pushes; change the git diff
invocation to use the range github.event.before...github.sha and handle the
new-branch zero SHA by mapping github.event.before ==
"0000000000000000000000000000000000000000" to the empty-tree SHA
(4b825dc642cb6eb9a060e54bf8d69288fbee4904) before running git diff on FILES, and
also force run_tests=true when github.event_name == "workflow_dispatch" so
manual triggers always run tests; update the FILES variable assignment logic and
preserve the existing contrib/.github filter and run_tests output behavior.
🧹 Nitpick comments (2)
control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go (1)

231-342: Isolate subtest state in TestReconcileOLM to avoid order-coupled behavior.

errs, hcp, and reconciler/client state are shared across table rows. Even without subtest parallelism, this makes cases interdependent and harder to debug. Create fresh state per t.Run.

Refactor sketch
 func TestReconcileOLM(t *testing.T) {
 	t.Parallel()
-	var errs []error
-	hcp := fakeHCP()
-	hcp.Namespace = "openshift-operator-lifecycle-manager"
+	baseHCP := fakeHCP()
+	baseHCP.Namespace = "openshift-operator-lifecycle-manager"
 	fakeCPService := manifests.OLMPackageServerControlPlaneService(hcp.Namespace)
 	...
 	for _, tc := range testCases {
+		tc := tc
 		t.Run(tc.name, func(t *testing.T) {
 			g := NewWithT(t)
-			errs = append(errs, r.reconcileOLM(ctx, hcp, pullSecret)...)
+			errs := []error{}
+			hcp := baseHCP.DeepCopy()
+			cpClient := fake.NewClientBuilder().WithScheme(api.Scheme).WithObjects(rootCA, fakeCPService, hcp).Build()
+			hcClient := fake.NewClientBuilder().WithScheme(api.Scheme).WithObjects(rootCA, pullSecret).Build()
+			r := &reconciler{client: hcClient, cpClient: cpClient, CreateOrUpdateProvider: &simpleCreateOrUpdater{}, rootCA: "fake", ImageMetaDataProvider: &imageMetaDataProvider}
+			errs = append(errs, r.reconcileOLM(ctx, hcp, pullSecret)...)
 			hcp.Spec.Configuration = tc.hcpClusterConfig
 			hcp.Spec.OLMCatalogPlacement = tc.olmCatalogPlacement
 			errs = append(errs, r.reconcileOLM(ctx, hcp, pullSecret)...)
 			g.Expect(errs).To(BeEmpty(), "unexpected errors")
 			...
 		})
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`
around lines 231 - 342, The test shares mutable state across table rows (errs,
hcp, cpClient/hcCLient and the reconciler r) causing order-dependent failures;
fix by creating fresh state inside each t.Run: initialize errs (or use a local
slice), clone or recreate hcp (use fakeHCP()), build new fake clients (cpClient
and hcCLient) and a new reconciler (r) and imageMetaDataProvider per subtest
before calling r.reconcileOLM; ensure you set hcp.Spec.Configuration and
hcp.Spec.OLMCatalogPlacement on the subtest-local hcp and use the subtest-local
r.client for the Get() assertion so each case is fully isolated.
Makefile (1)

363-370: test-shard no longer matches test's prerequisites.

test runs generate first, while the new CI entrypoint skips it. If sharded CI is supposed to preserve the old make test behavior, add the same prerequisite here so both paths validate the same source tree.

Suggested change
 .PHONY: test-shard
-test-shard:
+test-shard: generate
 	`@echo` "Running shard tests for packages: $(TEST_PACKAGES)"
 	$(GO) test -race -parallel=$(NUM_CORES) -count=1 -timeout=30m $(TEST_PACKAGES) -coverprofile $(COVER_PROFILE)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 363 - 370, The test-shard Makefile target skips the
generate prerequisite that test runs, causing inconsistent source validation;
update the test-shard target to depend on the same generate prerequisite (i.e.,
make test-shard depend on the generate target) so that the generate step runs
before invoking the test-shard recipe (ensure the target name test-shard and the
prerequisite generate are used exactly as in the Makefile).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test.yaml:
- Around line 49-60: The shard matrix omits packages (e.g., ./api/... and the
module root .) so some packages no longer build/test; update the matrix in the
strategy.include to add missing package groups (for example add entries like
shard: api with packages: ./api/... and shard: root with packages: .) or replace
the static include with a dynamic generation mechanism that derives shards from
go list ./... (invoke go list ./... and partition results into jobs) so the
matrix becomes exhaustive; update references in the matrix configuration (the
existing "matrix: include:" block and shard names such as
control-plane-operator, hypershift-operator, cmd-support, other) accordingly.
- Around line 70-79: The workflow now emits shard-specific coverage files
(cover-${{ matrix.shard }}.out) but .testcoverage.yml still expects cover.out;
to restore compatibility, add a step after the "Run tests" job that copies or
symlinks cover-${{ matrix.shard }}.out to cover.out (or alternatively
produce/merge a cover.out file) so tools reading .testcoverage.yml:8 find the
expected file; update the "Run tests" -> produced artifact handling (or the
"Upload to Codecov" step inputs) to continue using cover-${{ matrix.shard }}.out
while ensuring cover.out is created for local/go-test-coverage validation.

---

Outside diff comments:
In @.github/workflows/test.yaml:
- Around line 29-41: The "Check for non-contrib changes" step uses HEAD~1 and
misses multi-commit pushes; change the git diff invocation to use the range
github.event.before...github.sha and handle the new-branch zero SHA by mapping
github.event.before == "0000000000000000000000000000000000000000" to the
empty-tree SHA (4b825dc642cb6eb9a060e54bf8d69288fbee4904) before running git
diff on FILES, and also force run_tests=true when github.event_name ==
"workflow_dispatch" so manual triggers always run tests; update the FILES
variable assignment logic and preserve the existing contrib/.github filter and
run_tests output behavior.

---

Nitpick comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go`:
- Around line 231-342: The test shares mutable state across table rows (errs,
hcp, cpClient/hcCLient and the reconciler r) causing order-dependent failures;
fix by creating fresh state inside each t.Run: initialize errs (or use a local
slice), clone or recreate hcp (use fakeHCP()), build new fake clients (cpClient
and hcCLient) and a new reconciler (r) and imageMetaDataProvider per subtest
before calling r.reconcileOLM; ensure you set hcp.Spec.Configuration and
hcp.Spec.OLMCatalogPlacement on the subtest-local hcp and use the subtest-local
r.client for the Get() assertion so each case is fully isolated.

In `@Makefile`:
- Around line 363-370: The test-shard Makefile target skips the generate
prerequisite that test runs, causing inconsistent source validation; update the
test-shard target to depend on the same generate prerequisite (i.e., make
test-shard depend on the generate target) so that the generate step runs before
invoking the test-shard recipe (ensure the target name test-shard and the
prerequisite generate are used exactly as in the Makefile).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c91a4fed-ae4d-4d72-af3d-38ceb5c47f70

📥 Commits

Reviewing files that changed from the base of the PR and between 3e0e06f and 589c983.

📒 Files selected for processing (8)
  • .github/workflows/test.yaml
  • Makefile
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller_test.go
  • control-plane-operator/controllers/hostedcontrolplane/infra/infra_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go

Comment thread .github/workflows/test.yaml Outdated
Comment thread .github/workflows/test.yaml
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/test.yaml (1)

52-60: ⚠️ Potential issue | 🟠 Major

Shard package partition still looks non-exhaustive.

The matrix package groups still don’t include some scopes (notably ./api/... and module root .), so this job no longer fully mirrors prior go test ./... coverage.

Minimal fix while keeping 4 shards
           - shard: other
-            packages: ./karpenter-operator/... ./control-plane-pki-operator/... ./contrib/... ./ignition-server/... ./pkg/... ./dnsresolver/... ./product-cli/... ./client/... ./test/integration/... ./availability-prober/... ./konnectivity-socks5-proxy/... ./kubernetes-default-proxy/... ./kubevirtexternalinfra/... ./etcd-defrag/...
+            packages: ./karpenter-operator/... ./control-plane-pki-operator/... ./contrib/... ./ignition-server/... ./pkg/... ./dnsresolver/... ./product-cli/... ./client/... ./test/integration/... ./availability-prober/... ./konnectivity-socks5-proxy/... ./kubernetes-default-proxy/... ./kubevirtexternalinfra/... ./etcd-defrag/... ./api/... .
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yaml around lines 52 - 60, The matrix shards are
missing the module root and api packages, so update the package lists (e.g., in
the "other" shard or an appropriate shard like "cmd-support") to include both
./api/... and ./ (module root) so the matrix mirrors go test ./...; modify the
include entry that lists packages (such as the "other" shard's packages) to add
./api/... and ./ so tests cover those scopes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test.yaml:
- Around line 35-36: The FILES assignment currently uses a single-commit diff
via "git diff --name-only HEAD~1" which misses changes on multi-commit pushes;
update the command that sets FILES (the line using git diff) to use the push SHA
range instead (use the GitHub push range variables such as
github.event.before..github.sha) so all commits in the push are compared; keep
the rest of the logic intact and ensure the variable name FILES and the git diff
invocation are the only parts changed.

---

Duplicate comments:
In @.github/workflows/test.yaml:
- Around line 52-60: The matrix shards are missing the module root and api
packages, so update the package lists (e.g., in the "other" shard or an
appropriate shard like "cmd-support") to include both ./api/... and ./ (module
root) so the matrix mirrors go test ./...; modify the include entry that lists
packages (such as the "other" shard's packages) to add ./api/... and ./ so tests
cover those scopes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 97e102ab-aafb-4ad6-a12e-96d95e56efc3

📥 Commits

Reviewing files that changed from the base of the PR and between 589c983 and d2da2fa.

📒 Files selected for processing (1)
  • .github/workflows/test.yaml

Comment thread .github/workflows/test.yaml Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/test.yaml (1)

32-36: ⚠️ Potential issue | 🟠 Major

Use the push SHA range instead of HEAD~1.

Line 35 still only diffs the last commit on push, so a multi-commit push can incorrectly skip the whole test matrix when earlier commits changed code.

Suggested fix
           if [ "${{ github.event_name }}" = "pull_request" ]; then
             FILES=$(git diff --name-only origin/${{ github.base_ref }}...HEAD)
           else
-            FILES=$(git diff --name-only HEAD~1)
+            FILES=$(git diff --name-only "${{ github.event.before }}" "${{ github.sha }}")
           fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yaml around lines 32 - 36, The push branch diff uses
HEAD~1 which only checks the last commit; update the push branch case so FILES
is computed from the push SHA range instead: replace the FILES assignment in the
else branch (where it currently sets FILES=$(git diff --name-only HEAD~1)) with
a diff between ${{ github.before }} and ${{ github.sha }} (e.g. FILES=$(git diff
--name-only ${{ github.before }}...${{ github.sha }})), preserving the existing
if on github.event_name and handling the pull_request branch as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test.yaml:
- Line 37: The workflow guard currently skips the whole job when only files
matching the grep pattern '^(contrib|\.github|docs)/' changed, which allows
contrib package changes to bypass tests; update the conditional that checks
FILES (the if echo "$FILES" | grep -qvE '...') to remove "contrib" from the
exclusion (i.e., stop excluding contrib from the grep -qvE test) so changes
under contrib will cause this workflow to run; keep the rest of the regex for
.github and docs intact and ensure the other shard’s invocation of ./contrib/...
(around the call at line 60) remains compatible.

---

Duplicate comments:
In @.github/workflows/test.yaml:
- Around line 32-36: The push branch diff uses HEAD~1 which only checks the last
commit; update the push branch case so FILES is computed from the push SHA range
instead: replace the FILES assignment in the else branch (where it currently
sets FILES=$(git diff --name-only HEAD~1)) with a diff between ${{ github.before
}} and ${{ github.sha }} (e.g. FILES=$(git diff --name-only ${{ github.before
}}...${{ github.sha }})), preserving the existing if on github.event_name and
handling the pull_request branch as-is.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a6a4044d-bad9-4aef-b7a2-8c2d83432cbc

📥 Commits

Reviewing files that changed from the base of the PR and between d2da2fa and 797cb9f.

📒 Files selected for processing (1)
  • .github/workflows/test.yaml

Comment thread .github/workflows/test.yaml
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 36.32%. Comparing base (ae7a018) to head (a4b3600).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8330      +/-   ##
==========================================
+ Coverage   36.08%   36.32%   +0.23%     
==========================================
  Files         767      764       -3     
  Lines       93486    92886     -600     
==========================================
  Hits        33737    33737              
+ Misses      57042    56442     -600     
  Partials     2707     2707              

see 3 files with indirect coverage changes

Flag Coverage Δ
cmd-support 30.02% <ø> (?)
cpo-hostedcontrolplane 37.05% <ø> (?)
cpo-other 35.69% <ø> (?)
hypershift-operator 47.99% <ø> (?)
other 27.68% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
.github/workflows/test.yaml (1)

37-37: ⚠️ Potential issue | 🟠 Major

Don't exclude contrib/ from the trigger guard.

Line 60 still runs ./contrib/..., so a contrib-only change can currently skip the entire unit-test workflow.

Suggested fix
-          if echo "$FILES" | grep -qvE '^(contrib|\.github|docs)/'; then
+          if echo "$FILES" | grep -qvE '^(\.github|docs)/'; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yaml at line 37, The guard currently excludes
contrib/ by matching the pattern in the if statement that checks FILES with grep
-qvE; update that grep pattern (the line using if echo "$FILES" | grep -qvE
'^(contrib|\.github|docs)/') to stop excluding contrib — e.g., remove "contrib"
from the alternation so the pattern becomes '^(\\.github|docs)/' (or otherwise
ensure contrib/ changes are treated as non-skippable) so contrib-only changes
will not skip the unit-test workflow.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test.yaml:
- Around line 61-70: The workflow sets GOCACHE and GOMODCACHE but never
restores/saves them, so caches are lost each run; add cache restore and save
steps (using actions/cache) surrounding the test step for the directories
pointed to by GOCACHE and GOMODCACHE (or switch to actions/setup-go with caching
enabled) so subsequent runs reuse Go build/module caches; specifically update
the job that contains the actions/checkout and the "Run tests" step and add
actions/cache steps that key on go-version and module checksum and target the
/tmp/go-build-cache and /tmp/go-mod-cache paths used by the env variables,
ensuring caches are restored before running make test-shard and saved after
tests complete.
- Around line 32-35: The workflow misses handling for workflow_dispatch which
leaves github.event.before undefined and yields an empty FILES; add an explicit
branch for when github.event_name == "workflow_dispatch" and set FILES there
instead of falling through to the else: compute the diff against a sensible
default base (for example origin/main or origin/${{ github.base_ref }} if
present) or fetch the default branch and compare origin/<default>...HEAD so
FILES gets a valid list; update the conditional block that sets FILES (the
existing if/else using github.event_name and the FILES variable) to include this
new branch.

---

Duplicate comments:
In @.github/workflows/test.yaml:
- Line 37: The guard currently excludes contrib/ by matching the pattern in the
if statement that checks FILES with grep -qvE; update that grep pattern (the
line using if echo "$FILES" | grep -qvE '^(contrib|\.github|docs)/') to stop
excluding contrib — e.g., remove "contrib" from the alternation so the pattern
becomes '^(\\.github|docs)/' (or otherwise ensure contrib/ changes are treated
as non-skippable) so contrib-only changes will not skip the unit-test workflow.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dcf0aabf-72bd-4c3f-b1df-4b1e1f316e86

📥 Commits

Reviewing files that changed from the base of the PR and between 797cb9f and 13da612.

📒 Files selected for processing (2)
  • .github/workflows/test.yaml
  • Makefile
🚧 Files skipped from review as they are similar to previous changes (1)
  • Makefile

Comment thread .github/workflows/test.yaml Outdated
Comment thread .github/workflows/test.yaml
@bryan-cox bryan-cox force-pushed the improve-ut-speed branch 2 times, most recently from 75b93bf to 47dd2ab Compare April 24, 2026 12:55
@bryan-cox bryan-cox changed the title ci: improve unit test speed with sharding and parallelism CNTRLPLANE-3330: ci: improve unit test speed with sharding and parallelism Apr 24, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 24, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 24, 2026

@bryan-cox: This pull request references CNTRLPLANE-3330 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Shard unit tests into 4 parallel CI matrix jobs (control-plane-operator, hypershift-operator, cmd-support, other) to reduce wall-clock time
  • Add Go build caching via actions/setup-go to avoid recompiling unchanged packages across CI runs
  • Add t.Parallel() to 101 test functions across the 6 largest test files
  • Each shard uploads coverage independently; Codecov merges them via flags

Test plan

  • Verify all 4 shards pass in CI
  • Verify Codecov receives and merges coverage from all shards
  • Compare total CI wall-clock time before/after

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests

  • Enabled broad parallel execution for many test suites and subtests to speed up test runs.

  • Chores

  • CI now detects relevant changes (ignoring docs), conditionally runs tests, generates and uploads test artifacts, and shards test execution.

  • Coverage is uploaded per-shard with shard tags; build tooling gained a parameterized per-shard test target and configurable coverage output.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox bryan-cox force-pushed the improve-ut-speed branch 2 times, most recently from 6e3982c to 9130d14 Compare April 24, 2026 15:09
@bryan-cox bryan-cox marked this pull request as ready for review April 24, 2026 15:19
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2026
@openshift-ci openshift-ci Bot requested review from clebs and enxebre April 24, 2026 15:21
@bryan-cox
Copy link
Copy Markdown
Member Author

/verified by @bryan-cox

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 24, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@bryan-cox: This PR has been marked as verified by @bryan-cox.

Details

In response to this:

/verified by @bryan-cox

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@clebs
Copy link
Copy Markdown
Member

clebs commented Apr 27, 2026

/lgtm

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2048714747523633152 | Cost: $1.9682264999999997 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD a6e1d11 and 1 for PR HEAD 9130d14 in total

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2048737892141895680 | Cost: $1.8725199999999997 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci
Copy link
Copy Markdown

hypershift-jira-solve-ci Bot commented Apr 27, 2026

I now have all the evidence needed to produce the final report. The failure is clear: a single test (TestCreateCluster) failed due to AWS EC2 API rate limiting (HTTP 503, RequestLimitExceeded) on the CreateVpcEndpoint operation for AWS account 820196288204. This is an infrastructure/environment issue completely unrelated to the PR changes (which only modify unit test sharding in GitHub Actions and the Makefile).

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
  • Build ID: 2048737892141895680
  • Target: e2e-aws
  • PR: #8330CNTRLPLANE-3330: ci: improve unit test speed with sharding and parallelism
  • Failed Step: e2e-aws-hypershift-aws-run-e2e-nested
  • Test Results: 426 tests total, 25 skipped, 1 failure (TestCreateCluster)

Test Failure Analysis

Error

hypershift_framework.go:501: failed to create cluster, tearing down: failed to create infra:
cannot create VPC S3 endpoint: operation error EC2: CreateVpcEndpoint, exceeded maximum number
of attempts, 11, https response error StatusCode: 503, RequestID: 41ae8c00-2d53-4db9-a85f-d2fdca5a2e61,
api error RequestLimitExceeded: Request limit exceeded. Account 820196288204 has been throttled
on ec2:CreateVpcEndpoint because it exceeded its request rate limit.

Summary

The TestCreateCluster e2e test failed because the shared AWS CI account (820196288204) hit the EC2 API rate limit for the CreateVpcEndpoint operation. After 11 retry attempts, all returned HTTP 503 with RequestLimitExceeded. This is an AWS-side throttling issue caused by too many concurrent CI jobs sharing the same AWS account — it is not related to the PR changes, which only modify unit test sharding/parallelism in the GitHub Actions workflow and Makefile, and do not touch any e2e test code, infrastructure provisioning, or VPC endpoint logic.

Root Cause

AWS EC2 API rate limiting on the shared CI account.

The TestCreateCluster test requires provisioning a new HyperShift hosted cluster on AWS, which includes creating VPC endpoints (specifically an S3 VPC endpoint). The AWS account 820196288204 used by CI hit the ec2:CreateVpcEndpoint rate limit, returning HTTP 503 RequestLimitExceeded on all 11 retry attempts.

This is a transient infrastructure issue:

  • The AWS EC2 API enforces per-account rate limits on control plane operations like CreateVpcEndpoint
  • When multiple CI jobs run concurrently against the same AWS account, they can collectively exhaust the API rate limit
  • The 11-retry backoff was insufficient to wait out the throttling window

Why this is unrelated to PR #8330:

  • The PR modifies only: .github/workflows/test.yaml (GitHub Actions unit test sharding), Makefile (adding test-shard target), and 6 *_test.go unit test files
  • No e2e test code, infrastructure provisioning code, or HyperShift framework code is changed
  • The hypershift_framework.go:501 error originates from unchanged infrastructure provisioning code
  • All other 425 tests (400 passed, 25 skipped) completed successfully, confirming the PR has no impact on e2e test behavior
Recommendations
  1. Retest the job — This is a transient AWS throttling issue. Simply re-running the job (/retest) when the AWS account has lower concurrent load should resolve it.
  2. No code changes needed — The PR changes (unit test sharding) are completely orthogonal to the e2e test failure. The PR can be merged once the e2e job passes on a retest.
  3. Long-term mitigation (for the HyperShift CI team): Consider implementing more aggressive exponential backoff with jitter for AWS VPC endpoint creation, or request an AWS API rate limit increase for the CI account 820196288204.
Evidence
Evidence Detail
Failed test TestCreateCluster (1 of 426 tests)
Error type AWS EC2 API rate limiting (RequestLimitExceeded)
AWS API call ec2:CreateVpcEndpoint (VPC S3 endpoint)
HTTP status 503 Service Unavailable
AWS Account 820196288204 (shared CI account)
Retry attempts 11 (all exhausted)
AWS Request ID 41ae8c00-2d53-4db9-a85f-d2fdca5a2e61
Error source hypershift_framework.go:501 (unchanged code)
PR files changed .github/workflows/test.yaml, Makefile, 6 unit test files
PR touches e2e code No — only unit test sharding infrastructure
Other tests 400 passed, 25 skipped — all healthy
Test duration 105.32s (including 40s teardown)

@bryan-cox
Copy link
Copy Markdown
Member Author

/override "ci/prow/e2e-aws"

Not related to this PR; no need to keep wasting resources since its not related.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 27, 2026

@bryan-cox: Overrode contexts on behalf of bryan-cox: ci/prow/e2e-aws

Details

In response to this:

/override "ci/prow/e2e-aws"

Not related to this PR; no need to keep wasting resources since its not related.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

bryan-cox and others added 2 commits April 27, 2026 12:14
Split unit tests into 5 parallel GitHub Actions matrix shards
(cpo-hostedcontrolplane, cpo-other, hypershift-operator, cmd-support,
other) to reduce CI wall time from ~14 min to ~6 min.

Add a test-shard Makefile target that runs tests for a specified set
of packages with coverage output. Persist the Go build cache across
runs using actions/cache to speed up warm cache builds.

Add change detection to skip test jobs when only contrib, .github,
or docs files are modified. Handle workflow_dispatch events by
always running tests since there is no base commit to diff against.

Upload per-shard coverage profiles to Codecov with shard-specific
flags for proper coverage merging.

JIRA: CNTRLPLANE-3330

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add t.Parallel() to tests in the heaviest test files to improve
test execution speed within each shard. Tests that use t.Setenv
or modify global state are left sequential.

Files modified:
- hostedcluster_controller_test.go (39 tests parallelized)
- hostedcontrolplane_controller_test.go (8 tests parallelized)
- infra_test.go (7 tests parallelized)
- nodepool_controller_test.go (15 tests parallelized)
- capi_test.go (13 tests parallelized)
- resources_test.go (19 tests parallelized)

JIRA: CNTRLPLANE-3330

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Apr 27, 2026
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 27, 2026
@clebs
Copy link
Copy Markdown
Member

clebs commented Apr 27, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 27, 2026
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@bryan-cox
Copy link
Copy Markdown
Member Author

/verified by ut

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 27, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@bryan-cox: This PR has been marked as verified by ut.

Details

In response to this:

/verified by ut

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox
Copy link
Copy Markdown
Member Author

/override ci/prow/e2e-aks-4-22
/override ci/prow/e2e-aws-4-22
/override ci/prow/e2e-aks
/override ci/prow/e2e-aws
/override ci/prow/e2e-aws-upgrade-hypershift-operator
/override ci/prow/e2e-azure-self-managed
/override ci/prow/e2e-kubevirt-aws-ovn-reduced
/override ci/prow/e2e-v2-aws

@bryan-cox
Copy link
Copy Markdown
Member Author

Overriding, this is just UT changes and I only had to rebase because of verify-workflows. No need to rerun all those tests and waste resources.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 27, 2026

@bryan-cox: Overrode contexts on behalf of bryan-cox: ci/prow/e2e-aks, ci/prow/e2e-aks-4-22, ci/prow/e2e-aws, ci/prow/e2e-aws-4-22, ci/prow/e2e-aws-upgrade-hypershift-operator, ci/prow/e2e-azure-self-managed, ci/prow/e2e-kubevirt-aws-ovn-reduced, ci/prow/e2e-v2-aws

Details

In response to this:

/override ci/prow/e2e-aks-4-22
/override ci/prow/e2e-aws-4-22
/override ci/prow/e2e-aks
/override ci/prow/e2e-aws
/override ci/prow/e2e-aws-upgrade-hypershift-operator
/override ci/prow/e2e-azure-self-managed
/override ci/prow/e2e-kubevirt-aws-ovn-reduced
/override ci/prow/e2e-v2-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@bryan-cox
Copy link
Copy Markdown
Member Author

/override ci/prow/e2e-aks-4-22
/override ci/prow/e2e-aws-4-22
/override ci/prow/e2e-aks
/override ci/prow/e2e-aws
/override ci/prow/e2e-aws-upgrade-hypershift-operator
/override ci/prow/e2e-azure-self-managed
/override ci/prow/e2e-kubevirt-aws-ovn-reduced
/override ci/prow/e2e-v2-aws

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 27, 2026

@bryan-cox: Overrode contexts on behalf of bryan-cox: ci/prow/e2e-aks, ci/prow/e2e-aks-4-22, ci/prow/e2e-aws, ci/prow/e2e-aws-4-22, ci/prow/e2e-aws-upgrade-hypershift-operator, ci/prow/e2e-azure-self-managed, ci/prow/e2e-kubevirt-aws-ovn-reduced, ci/prow/e2e-v2-aws

Details

In response to this:

/override ci/prow/e2e-aks-4-22
/override ci/prow/e2e-aws-4-22
/override ci/prow/e2e-aks
/override ci/prow/e2e-aws
/override ci/prow/e2e-aws-upgrade-hypershift-operator
/override ci/prow/e2e-azure-self-managed
/override ci/prow/e2e-kubevirt-aws-ovn-reduced
/override ci/prow/e2e-v2-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@bryan-cox
Copy link
Copy Markdown
Member Author

/override "Red Hat Konflux / enterprise-contract-mce-50 / hypershift-release-mce-50"

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 27, 2026

@bryan-cox: Overrode contexts on behalf of bryan-cox: Red Hat Konflux / enterprise-contract-mce-50 / hypershift-release-mce-50

Details

In response to this:

/override "Red Hat Konflux / enterprise-contract-mce-50 / hypershift-release-mce-50"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit e7e828e into openshift:main Apr 27, 2026
43 of 44 checks passed
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 27, 2026

@bryan-cox: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@bryan-cox bryan-cox deleted the improve-ut-speed branch April 27, 2026 17:31
@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2048805409073926144 | Cost: $2.5443867499999997 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2048806801842900993 | Cost: $2.2605464999999993 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants