ci: parallelize e2e container image and wheel builds #539
Conversation
❌ Deploy Preview for jumpstarter-docs failed.
📝 Walkthrough

CI was restructured to produce controller/operator images and Python wheels in parallel build jobs; e2e and compatibility jobs now consume these artifacts, use SKIP_BUILD/PREBUILT_WHEELS_DIR toggles, and scripts/Makefile were adjusted to support prebuilt artifact installation and build skipping.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as "GitHub Actions"
    participant Runner as "CI Runner"
    participant Artifact as "Artifact Store"
    participant Docker as "Docker Engine"
    participant E2E as "e2e setup scripts"
    GH->>Runner: start parallel build jobs (controller / operator / wheels)
    Runner->>Docker: build per-arch images & package artifacts
    Runner->>Artifact: upload image tarballs and wheel files
    GH->>Runner: trigger e2e & compat jobs (needs: builds)
    Runner->>Artifact: download tarballs & wheels
    Runner->>Docker: docker load image tarballs
    Runner->>E2E: run setup with SKIP_BUILD / PREBUILT_WHEELS_DIR
    E2E->>E2E: install prebuilt wheels into venv (or fallback)
    E2E->>Docker: deploy controller/operator (skip local build if SKIP_BUILD)
    E2E->>Runner: execute tests
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/e2e.yaml:
- Around line 66-72: The cache key for the "Cache controller image" step
currently omits the controller's build recipe; update the key generation (the
key value in the cache step with id "cache") to include the Makefile that
controls the image build (add 'controller/Makefile' to the hashFiles list) so
changes to the controller image recipe or defaults invalidate the cache used by
the make -C controller docker-build flow.
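A minimal sketch of what the corrected cache step could look like — the step name and `id: cache` come from the comment, but the `path`, key prefix, and globbed file patterns are assumptions:

```yaml
- name: Cache controller image
  id: cache
  uses: actions/cache@v4
  with:
    path: .cache/controller-image.tar   # hypothetical artifact location
    # Including controller/Makefile in hashFiles means any change to the
    # image build recipe or its defaults invalidates the cached image.
    key: controller-image-${{ matrix.arch }}-${{ hashFiles('controller/**/*.go', 'controller/Dockerfile', 'controller/Makefile') }}
```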
In `@controller/Makefile`:
- Around line 196-198: The Makefile currently skips the build-operator target
when SKIP_BUILD is set, which breaks operator deployments that still apply
deploy/operator/dist/install.yaml; change the conditional so build-operator
always runs when METHOD is set to "operator" (i.e., only skip build-operator if
SKIP_BUILD is set AND METHOD != operator). Update the conditional around the
build-operator invocation (referencing SKIP_BUILD and the build-operator target)
so ./hack/deploy_with_operator.sh can rely on a fresh install.yaml;
alternatively add an explicit prerequisite or unconditional invocation for
build-operator when METHOD=operator.
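One way to express that condition in GNU Make, sketched against the variable names in the comment (the surrounding targets and recipe wiring in `controller/Makefile` are assumptions):

```make
# Hypothetical sketch: build-operator is skipped only when SKIP_BUILD is
# set AND the deploy method is not "operator". Operator deploys always
# regenerate deploy/operator/dist/install.yaml.
ifeq ($(METHOD),operator)
  OPERATOR_BUILD_DEP := build-operator
else ifeq ($(SKIP_BUILD),)
  OPERATOR_BUILD_DEP := build-operator
else
  OPERATOR_BUILD_DEP :=
endif

deploy: $(OPERATOR_BUILD_DEP)
```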
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ec512512-8c0c-439a-ab54-504e030c5fe1
📒 Files selected for processing (4)
- .github/workflows/e2e.yaml
- controller/Makefile
- e2e/compat/setup.sh
- e2e/setup-e2e.sh
6b9a43c to c0bc380
34b4d70 to 96b2893
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/e2e.yaml:
- Around line 171-172: The workflow currently lists matrix job IDs in the needs
field (e.g., e2e-tests needs: [build-controller-image, build-operator-image,
build-python-wheels]) which causes downstream jobs to wait for all matrix
variants; update the workflow to remove matrix job IDs and instead create
separate per-architecture job IDs for the image builds (e.g.,
build-controller-image-amd64, build-controller-image-arm64,
build-operator-image-amd64, build-operator-image-arm64), change the e2e-tests
and e2e-compat-old-client jobs to depend only on the matching per-arch job ID
(e.g., e2e-tests[amd64] needs build-controller-image-amd64 and
build-operator-image-amd64 and build-python-wheels if needed), and ensure any
other references to build-controller-image or build-operator-image are updated
to the new non-matrix job IDs so each downstream job only waits for the
corresponding architecture build.
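A hedged sketch of the suggested per-arch wiring — job names follow the comment, while the runner labels, the operator jobs (shown only by analogy), and the step contents are assumptions:

```yaml
jobs:
  build-controller-image-amd64:
    runs-on: ubuntu-24.04
    # ...build and upload the amd64 controller image tarball

  build-controller-image-arm64:
    runs-on: ubuntu-24.04-arm64
    # ...build and upload the arm64 controller image tarball

  # build-operator-image-{amd64,arm64} would be defined analogously.

  # Because `needs` cannot reference matrix context, each e2e variant is
  # split per architecture and waits only on its own builds.
  e2e-tests-amd64:
    needs: [build-controller-image-amd64, build-operator-image-amd64, build-python-wheels]
    # ...
  e2e-tests-arm64:
    needs: [build-controller-image-arm64, build-operator-image-arm64, build-python-wheels]
    # ...
```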
In `@e2e/compat/setup.sh`:
- Around line 170-180: The script currently treats PREBUILT_WHEELS_DIR being set
but empty/missing the same as unset; change the logic so that if
PREBUILT_WHEELS_DIR is defined (non-empty) you verify the directory exists and
contains .whl files and otherwise fail fast with a clear error and exit 1.
Update the branch around PREBUILT_WHEELS_DIR in setup.sh: when
PREBUILT_WHEELS_DIR is set, check -d "${PREBUILT_WHEELS_DIR}" and that a glob
like "${PREBUILT_WHEELS_DIR}"/*.whl actually expands to files; on failure call
your logging error helper (e.g., log_info/log_error) and exit 1; only allow the
fallback to make sync when PREBUILT_WHEELS_DIR is unset.
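The fail-fast check described above can be sketched as a self-contained demo — `PREBUILT_WHEELS_DIR` and the error-logging helper mirror the script's names, but the function, temp directory, and dummy wheel file are stand-ins:

```shell
#!/usr/bin/env bash
# Demonstrates failing fast when PREBUILT_WHEELS_DIR is set but the
# directory is missing or contains no .whl files, as suggested above.
set -u

log_error() { echo "ERROR: $*" >&2; }

check_wheels_dir() {
    if [ -n "${PREBUILT_WHEELS_DIR:-}" ]; then
        if [ ! -d "${PREBUILT_WHEELS_DIR}" ]; then
            log_error "PREBUILT_WHEELS_DIR does not exist: ${PREBUILT_WHEELS_DIR}"
            return 1
        fi
        # A glob that matches nothing stays literal, so test the first match.
        set -- "${PREBUILT_WHEELS_DIR}"/*.whl
        if [ ! -e "$1" ]; then
            log_error "PREBUILT_WHEELS_DIR contains no .whl files: ${PREBUILT_WHEELS_DIR}"
            return 1
        fi
    fi
    return 0
}

tmp="$(mktemp -d)"
PREBUILT_WHEELS_DIR="$tmp"
check_wheels_dir || echo "empty dir rejected"

touch "$tmp/demo-0.1-py3-none-any.whl"
check_wheels_dir && echo "dir with wheels accepted"
rm -rf "$tmp"
```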
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ba16dcc7-25b6-4412-99c7-c9d1f36276e2
📒 Files selected for processing (5)
- .github/workflows/e2e.yaml
- controller/Makefile
- e2e/compat/setup.sh
- e2e/setup-e2e.sh
- e2e/tests.bats
💤 Files with no reviewable changes (1)
- e2e/tests.bats
✅ Files skipped from review due to trivial changes (2)
- controller/Makefile
- e2e/setup-e2e.sh
0e67e92 to ef0001a
e2e/setup-e2e.sh (Outdated)

```shell
install_jumpstarter() {
    log_info "Installing jumpstarter..."

    cd "$REPO_ROOT"
    if [ -n "${PREBUILT_WHEELS_DIR:-}" ]; then
        if [ ! -d "${PREBUILT_WHEELS_DIR}" ]; then
            log_error "PREBUILT_WHEELS_DIR is set but directory does not exist: ${PREBUILT_WHEELS_DIR}"
            exit 1
        fi
        local whl_count
        whl_count=$(find "${PREBUILT_WHEELS_DIR}" -maxdepth 1 -name '*.whl' | wc -l)
        if [ "$whl_count" -eq 0 ]; then
            log_error "PREBUILT_WHEELS_DIR contains no .whl files: ${PREBUILT_WHEELS_DIR}"
            exit 1
        fi
        log_info "Installing from pre-built wheels in ${PREBUILT_WHEELS_DIR} (${whl_count} wheels)..."
        cd python
        uv venv .venv --python 3.12
        uv pip install --python .venv/bin/python "${PREBUILT_WHEELS_DIR}"/*.whl
        cd ..
    else
        cd python
        make sync
        cd ..
    fi
    log_info "✓ Jumpstarter python installed"
}
```
[MEDIUM] The install_jumpstarter function here is nearly identical to the one in e2e/compat/setup.sh (lines 166-193). Both implement the same PREBUILT_WHEELS_DIR validation, wheel counting, venv creation, pip install, and make sync fallback. Only the log messages differ slightly. This kind of duplication risks drift where a bug fix in one copy gets missed in the other.
Consider extracting the shared logic into a sourced helper (e.g. e2e/lib/install.sh) and sourcing it from both scripts.
AI-generated, human reviewed
Fixed (by Cursor) — extracted the shared logic into e2e/lib/install.sh which is now sourced by both e2e/setup-e2e.sh and e2e/compat/setup.sh.
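The shared-helper pattern can be sketched as a minimal, self-contained demo — `e2e/lib/install.sh` is simulated with a temp file, and the function body is a stub rather than the real install logic:

```shell
#!/usr/bin/env bash
# Demo of the "single shared sourced helper" pattern suggested above.
set -euo pipefail

libdir="$(mktemp -d)"

# Stand-in for e2e/lib/install.sh: the one shared copy of the logic.
cat > "$libdir/install.sh" <<'EOF'
install_jumpstarter() {
    echo "installing jumpstarter (shared helper)"
}
EOF

# Both e2e/setup-e2e.sh and e2e/compat/setup.sh would source the helper
# instead of each defining a near-identical copy of the function.
. "$libdir/install.sh"
install_jumpstarter

rm -rf "$libdir"
```

With one copy of the function, a bug fix lands in both scripts automatically instead of drifting between duplicates.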
```shell
get_expected_sha256() {
    case "$1" in
        cfssl_linux_amd64) echo "ff4d3a1387ea3e1ee74f4bb8e5ffe9cbab5bee43c710333c206d14199543ebdf" ;;
        cfssl_linux_arm64) echo "bc1a0b3a33ab415f3532af1d52cad7c9feec0156df2069f1cbbb64255485f108" ;;
        cfssl_darwin_amd64) echo "6625b252053d9499bf26102b8fa78d7f675de56703d0808f8ff6dcf43121fa0c" ;;
        cfssl_darwin_arm64) echo "9a38b997ac23bc2eed89d6ad79ea5ae27c29710f66fdabdff2aa16eaaadc30d4" ;;
        cfssljson_linux_amd64) echo "09fbcb7a3b3d6394936ea61eabff1e8a59a8ac3b528deeb14cf66cdbbe9a534f" ;;
        cfssljson_linux_arm64) echo "a389793bc2376116fe2fff996b4a2f772a59a4f65048a5cfb4789b2c0ea4a7c9" ;;
        cfssljson_darwin_amd64) echo "1529a7a163801be8cf7d7a347b0346cc56cc8f351dbc0131373b6fb76bb4ab64" ;;
        yq_linux_amd64) echo "75d893a0d5940d1019cb7cdc60001d9e876623852c31cfc6267047bc31149fa9" ;;
        yq_linux_arm64) echo "90fa510c50ee8ca75544dbfffed10c88ed59b36834df35916520cddc623d9aaa" ;;
        yq_darwin_amd64) echo "6e399d1eb466860c3202d231727197fdce055888c5c7bec6964156983dd1559d" ;;
        yq_darwin_arm64) echo "45a12e64d4bd8a31c72ee1b889e81f1b1110e801baad3d6f030c111db0068de0" ;;
        *) echo "" ;;
    esac
}
```
[MEDIUM] The get_expected_sha256() function provides checksums for all platform/arch combinations of cfssl (4 entries), yq (4 entries), and cfssljson (3 entries), but is missing the cfssljson_darwin_arm64 entry. This is the only gap across all three tools. On macOS ARM64, the missing hash causes the native arm64 download to silently fail (empty hash), triggering a Rosetta amd64 fallback. If Rosetta is unavailable, it falls back to go install, which bypasses checksum verification entirely.
Add the missing entry by obtaining the SHA256 from the cfssl v1.6.5 GitHub release for cfssljson_1.6.5_darwin_arm64:
cfssljson_darwin_arm64) echo "<sha256-from-release-page>" ;;
AI-generated, human reviewed
this version does not exist, that's why we don't provide it, but we can add an invalid one just in case.
"does-not-exist" for example.
Fixed (by Cursor) — added a cfssljson_darwin_arm64) echo "no-prebuilt-binary-available" ;; entry to make it explicit that no binary exists for that platform, rather than silently falling through to an empty string.
```shell
    log_info "✓ Dependencies installed"
}
```
[MEDIUM] The bats infrastructure was removed from the setup scripts (good!), but four .bats test files still exist in the repo: e2e/tests-hooks.bats, e2e/tests-direct-listener.bats, e2e/compat/tests-old-client.bats, and e2e/compat/tests-old-controller.bats. Those files require bats-support and bats-assert which are no longer installed by any setup script, and the test runners (run_ginkgo) never invoke bats. They are effectively dead code now.
Consider removing the four orphaned .bats files as part of this cleanup.
AI-generated, human reviewed
Fixed (by Cursor) — deleted all 4 orphaned .bats files: e2e/tests-hooks.bats, e2e/tests-direct-listener.bats, e2e/compat/tests-old-client.bats, and e2e/compat/tests-old-controller.bats.
e2e/setup-e2e.sh (Outdated)

```shell
kubectl -n dex delete secret dex-tls --ignore-not-found
kubectl -n dex create secret tls dex-tls \
    --cert=server.pem \
    --key=server-key.pem
```
[LOW] The dex-tls secret is recreated with a delete-then-create pattern. If kubectl create fails after the delete (e.g. transient API server error), the old secret is gone with no rollback. Other resources in this same PR already use the safer dry-run-and-apply pattern (see the namespace and service account lines).
For consistency and safety, you could use:
kubectl -n dex create secret tls dex-tls --cert=server.pem --key=server-key.pem --dry-run=client -o yaml | kubectl apply -f -
AI-generated, human reviewed
Fixed (by Cursor) — replaced the delete-then-create pattern with the safer --dry-run=client -o yaml | kubectl apply -f - approach, consistent with the other resources in the script.
e2e/setup-e2e.sh (Outdated)

```shell
fi

log_info "✓ Dependencies installed"
# ...
if curl -fsSL -o "$dest" "$url" 2>/dev/null && verify_sha256 "$dest" "$expected_hash"; then
```
[LOW] curl -fsSL -o "$dest" "$url" 2>/dev/null discards all curl stderr output (connection errors, DNS failures, TLS errors, HTTP error messages). When all download fallbacks fail and the final go install also fails, the most informative diagnostic (the curl error) has been thrown away.
Consider removing 2>/dev/null and letting -fsSL handle output control, or capturing curl stderr and logging it at warn level on failure.
AI-generated, human reviewed
Fixed (by Cursor) — removed 2>/dev/null so curl errors are preserved for debugging when all download fallbacks fail.
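The stderr-capture variant can be sketched as follows — `log_warn`, the `download` wrapper, and the deliberately bad `file://` URL are stand-ins, not names from the repo:

```shell
#!/usr/bin/env bash
# Sketch of preserving curl's error text instead of discarding it
# with 2>/dev/null.
set -u

log_warn() { echo "WARN: $*"; }

download() {
    local dest="$1" url="$2" err
    # -o sends the body to $dest, so stdout is unused; 2>&1 inside the
    # command substitution captures only curl's stderr (the error text).
    if err="$(curl -fsSL -o "$dest" "$url" 2>&1)"; then
        return 0
    fi
    log_warn "download failed for $url: $err"
    return 1
}

# A file:// URL that cannot exist, so the failure path runs offline.
download /tmp/demo-out.bin "file:///path/that/does/not/exist" \
    || echo "fallback would run here"
```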
Split the monolithic e2e jobs into parallel build + test stages to reduce wall-clock time. Container images (controller, operator) and Python wheels are now built in dedicated jobs and shared via GitHub Actions artifacts.

Build jobs:
- build-controller-image (matrix: amd64/arm64) with cache
- build-operator-image (matrix: amd64/arm64) with cache + install.yaml
- build-python-wheels (single job) with cache

Test jobs download pre-built artifacts instead of rebuilding:
- e2e-tests depends on all three build jobs
- e2e-compat-old-controller depends only on python-wheels
- e2e-compat-old-client depends on controller-image + python-wheels

Supporting changes:
- controller/Makefile: SKIP_BUILD flag to skip docker-build/build-operator
- e2e/setup-e2e.sh: PREBUILT_WHEELS_DIR to install from pre-built wheels
- e2e/compat/setup.sh: SKIP_BUILD + PREBUILT_WHEELS_DIR support

Made-with: Cursor
ef0001a to 6ed9150
Actionable comments posted: 1
🧹 Nitpick comments (2)
e2e/setup-e2e.sh (1)

276-288: Word splitting on `$make_parallel` is intentional but could be clearer.

Shellcheck flags SC2086 because `$make_parallel` is unquoted, but this is intentional: the variable contains `-j5 --output-sync=target`, which must split into separate arguments. Consider using an array for clarity, though the current approach works.

♻️ Optional: Use array for cleaner argument handling

```diff
 # Use parallel make on Linux (GNU Make 4+); macOS ships an old make without --output-sync
-local make_parallel=""
+local -a make_parallel=()
 if [ "$(uname -s)" = "Linux" ]; then
-    make_parallel="-j5 --output-sync=target"
+    make_parallel=(-j5 --output-sync=target)
 fi
 log_info "Deploying controller with CA certificate using $METHOD..."
 if [ "$METHOD" = "operator" ]; then
     OPERATOR_USE_DEX=true DEX_CA_FILE="$REPO_ROOT/ca.pem" METHOD=$METHOD \
-        make -C controller deploy $make_parallel
+        make -C controller deploy "${make_parallel[@]}"
 else
     EXTRA_VALUES="--values $REPO_ROOT/.e2e/values.kind.yaml" METHOD=$METHOD \
-        make -C controller deploy $make_parallel
+        make -C controller deploy "${make_parallel[@]}"
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@e2e/setup-e2e.sh` around lines 276-288: The unquoted variable make_parallel is intentionally relied on for word splitting (to pass multiple make args) but triggers SC2086; replace the string variable with an array (e.g., `make_parallel=()` or `make_parallel=(-j5 --output-sync=target)`) and then expand it safely as `"${make_parallel[@]}"` in the deploy invocation(s) so the `make -C controller deploy` command receives separate arguments; update both branches where make is called (the operator branch with OPERATOR_USE_DEX/DEX_CA_FILE/METHOD and the else branch with EXTRA_VALUES/METHOD) to use the array expansion.

e2e/lib/install.sh (1)

10-10: Add error handling to `cd` commands or use subshells.

Shellcheck correctly flags these `cd` commands (SC2164). While the sourcing scripts use `set -e`, the behavior of `set -e` inside functions is nuanced: command failures in conditionals or pipelines may not trigger an exit. Adding an explicit `|| exit 1` makes the intent clear and protects against subtle bugs.

♻️ Proposed fix using subshells (cleaner, avoids cd-back)

```diff
 install_jumpstarter() {
     log_info "Installing jumpstarter..."
-    cd "$REPO_ROOT"
+    cd "$REPO_ROOT" || exit 1
     if [ -n "${PREBUILT_WHEELS_DIR:-}" ]; then
         if [ ! -d "${PREBUILT_WHEELS_DIR}" ]; then
             log_error "PREBUILT_WHEELS_DIR is set but directory does not exist: ${PREBUILT_WHEELS_DIR}"
             exit 1
         fi
         local whl_count
         whl_count=$(find "${PREBUILT_WHEELS_DIR}" -maxdepth 1 -name '*.whl' | wc -l)
         if [ "$whl_count" -eq 0 ]; then
             log_error "PREBUILT_WHEELS_DIR contains no .whl files: ${PREBUILT_WHEELS_DIR}"
             exit 1
         fi
         log_info "Installing from pre-built wheels in ${PREBUILT_WHEELS_DIR} (${whl_count} wheels)..."
-        cd python
-        uv venv .venv --python 3.12
-        uv pip install --python .venv/bin/python "${PREBUILT_WHEELS_DIR}"/*.whl
-        cd ..
+        (
+            cd python || exit 1
+            uv venv .venv --python 3.12
+            uv pip install --python .venv/bin/python "${PREBUILT_WHEELS_DIR}"/*.whl
+        )
     else
-        cd python
-        make sync
-        cd ..
+        (
+            cd python || exit 1
+            make sync
+        )
     fi
     log_info "✓ Jumpstarter python installed"
 }
```

Also applies to: 23-23, 28-28
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@e2e/lib/install.sh` at line 10, The cd commands (e.g., cd "$REPO_ROOT" and the other cd invocations at the same script) lack explicit error handling and can silently fail under certain set -e semantics; update each cd "$REPO_ROOT" (and the cd calls referenced on lines 23 and 28) to either run in a subshell or append || exit 1 to each cd so the script exits immediately on failure, ensuring failures are not ignored by function/conditional contexts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/e2e.yaml:
- Around line 134-165: The build-python-wheels job produces amd64-only wheels
which are uploaded as the python-wheels artifact and later consumed by arm64
runners (e.g., e2e-tests, e2e-compat-old-controller, e2e-compat-old-client),
causing pip install failures for platform-specific packages like grpcio; fix by
building wheels for both architectures (add a matrix to build-python-wheels
using strategy.matrix with ubuntu-24.04 and ubuntu-24.04-arm64 runners or run
the job twice for amd64 and arm64), ensure the artifact name (python-wheels)
includes the architecture in the key so artifacts don’t collide, and update
consumers (e2e-tests and the compat jobs) to download the matching arch-specific
python-wheels artifact or, alternatively, detect arm64 and install with
--no-binary :all: if you prefer source builds (but ensure build deps are
provided).
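A hedged sketch of the matrix variant — the runner labels, step contents, and wheel output path are assumptions taken from or built around the comment:

```yaml
build-python-wheels:
  strategy:
    matrix:
      include:
        - arch: amd64
          runner: ubuntu-24.04
        - arch: arm64
          runner: ubuntu-24.04-arm64   # label per the comment; verify against available runners
  runs-on: ${{ matrix.runner }}
  steps:
    # ...checkout and build wheels...
    - uses: actions/upload-artifact@v4
      with:
        # Arch-specific artifact name so the two matrix legs don't collide;
        # consumers download python-wheels-${arch} for their architecture.
        name: python-wheels-${{ matrix.arch }}
        path: dist/*.whl
```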
---
Nitpick comments:
In `@e2e/lib/install.sh`:
- Line 10: The cd commands (e.g., cd "$REPO_ROOT" and the other cd invocations
at the same script) lack explicit error handling and can silently fail under
certain set -e semantics; update each cd "$REPO_ROOT" (and the cd calls
referenced on lines 23 and 28) to either run in a subshell or append || exit 1
to each cd so the script exits immediately on failure, ensuring failures are not
ignored by function/conditional contexts.
In `@e2e/setup-e2e.sh`:
- Around line 276-288: The unquoted variable make_parallel is intentionally
relied on for word splitting (to pass multiple make args) but triggers SC2086;
replace the string variable with an array (e.g., make_parallel=() or
make_parallel=(-j5 --output-sync=target)) and then expand it safely as
"${make_parallel[@]}" in the deploy invocation(s) so the make -C controller
deploy command receives separate arguments; update both branches where make is
called (the operator branch with OPERATOR_USE_DEX/DEX_CA_FILE/METHOD and the
else branch with EXTRA_VALUES/METHOD) to use the array expansion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 75784fd1-d7c6-46ab-ab7a-505dbe7a6feb
📒 Files selected for processing (10)
- .github/workflows/e2e.yaml
- controller/Makefile
- e2e/compat/setup.sh
- e2e/compat/tests-old-client.bats
- e2e/compat/tests-old-controller.bats
- e2e/lib/install.sh
- e2e/setup-e2e.sh
- e2e/tests-direct-listener.bats
- e2e/tests-hooks.bats
- e2e/tests.bats
💤 Files with no reviewable changes (5)
- e2e/compat/tests-old-controller.bats
- e2e/compat/tests-old-client.bats
- e2e/tests-hooks.bats
- e2e/tests.bats
- e2e/tests-direct-listener.bats
Summary
- Parallel build jobs: Controller image, operator image, and Python wheels are now built in separate parallel jobs instead of being rebuilt in every test job
- Artifact sharing: Build outputs are saved as GitHub Actions artifacts (container tarballs, wheel files) and downloaded by test jobs that need them
- Caching: Each build job uses `actions/cache` keyed on source file hashes, so unchanged artifacts are reused across workflow runs
- Pre-compiled/hashed binaries: Instead of compiling yq/cfssl/etc., we try to download the existing binaries first, keeping a hash of each version's binaries to avoid supply-chain issues
Architecture
Dependency optimization
- `e2e-compat-old-controller` only waits for `build-python-wheels` (uses quay.io for controller)
- `e2e-compat-old-client` only waits for `build-controller-image` + `build-python-wheels` (no operator needed)

Supporting changes

- `controller/Makefile`: New `SKIP_BUILD` flag to skip `docker-build` and `build-operator` in the `deploy` target
- `e2e/setup-e2e.sh`: New `PREBUILT_WHEELS_DIR` env var to install from pre-built wheels instead of `make sync`
- `e2e/compat/setup.sh`: `SKIP_BUILD` support in `deploy_new_controller()` + `PREBUILT_WHEELS_DIR` in `install_jumpstarter()`

All changes are backward-compatible: without env vars set, behavior is identical to before.
Test plan
- `build-controller-image` job produces valid tarballs for both architectures
- `build-operator-image` job produces valid tarballs for both architectures
- `build-python-wheels` produces installable wheels
- `e2e-tests` correctly loads images and passes
- `e2e-compat-old-controller` passes with pre-built wheels
- `e2e-compat-old-client` passes with pre-built controller image + wheels
- Backward compatibility: `make e2e-setup && make e2e-run` without env vars

Made with Cursor