WIP: Add k3s-based CAPZ e2e test (no IPI cluster needed) #79611
marek-veber wants to merge 5 commits into
|
Skipping CI for Draft Pull Request. |
|
Walkthrough
Adds a k3s-based CAPZ e2e test: a command script to bootstrap/run k3s and tests, a step reference with nested-podman and resources, a workflow to run/post the step, CI job registration in stolostron config, and an OWNERS file for the step registry.
Changes: CAPZ k3s e2e testing
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 11 passed | ❌ 1 failed (1 warning)
|
/test all |
|
@marek-veber, Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
Actionable comments posted: 2
🧹 Nitpick comments (1)
ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-commands.sh (1)
52-64: ⚡ Quick win: Surface k3s diagnostics on readiness timeout.
When k3s fails to become ready, the script exits with only a fatal message and no diagnostics, which will make debugging the very thing this PR is validating much harder. Consider capturing k3s server output to a file and dumping it (plus `kubectl get nodes` / `kubectl get pods -A`) into `${ARTIFACT_DIR}` before exiting.
🔧 Proposed change
 k3s server \
   --disable=traefik \
   --snapshotter=native \
   --write-kubeconfig="${K3S_KUBECONFIG}" \
   --write-kubeconfig-mode=644 \
   --kubelet-arg="eviction-hard=imagefs.available<1%,nodefs.available<1%" \
   --kubelet-arg="eviction-minimum-reclaim=imagefs.available=1%,nodefs.available=1%" \
-  &
+  >"${ARTIFACT_DIR}/k3s-server.log" 2>&1 &
 K3S_PID=$!
 ...
 if [ "${READY}" != true ]; then
   echo "[k3s] FATAL: k3s did not become ready within 300s"
+  KUBECONFIG="${K3S_KUBECONFIG}" kubectl get nodes -o wide >"${ARTIFACT_DIR}/k3s-nodes.txt" 2>&1 || true
+  KUBECONFIG="${K3S_KUBECONFIG}" kubectl get pods -A -o wide >"${ARTIFACT_DIR}/k3s-pods.txt" 2>&1 || true
   exit 1
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-commands.sh` around lines 52 - 64, When the readiness loop using KUBECONFIG="${K3S_KUBECONFIG}" and the READY flag times out, capture diagnostics before exiting: redirect the k3s server logs into a file under ${ARTIFACT_DIR} (e.g., k3s-server.log), and run KUBECONFIG="${K3S_KUBECONFIG}" kubectl get nodes --no-headers and kubectl get pods -A, saving their outputs into separate files in ${ARTIFACT_DIR}; place this dump in the failure branch where the script currently checks if [ "${READY}" != true ] and before the echo/exit so artifacts are preserved for debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ci-operator/config/stolostron/capi-tests/stolostron-capi-tests-main.yaml`:
- Around line 19-57: The CI config added two new test entries ("capz-e2e" and
"capz-k3s-e2e") under tests which requires regenerating downstream ProwJob
artifacts; run the repository's generation target (make update) to regenerate
files under ci-operator/jobs/, review and stage the changed generated artifacts,
and commit those updated artifacts alongside the change that introduced the new
tests so the new capz-e2e and capz-k3s-e2e jobs are present in the committed
outputs.
In `@ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-commands.sh`:
- Line 16: The script currently enables xtrace via the literal line "set -o
xtrace"; remove that default and ensure the script starts with the required safe
shell flags "set -euo pipefail" instead, and if you need -x for debugging, gate
enabling xtrace behind a DEBUG (or CI_DEBUG) environment variable (e.g., test if
DEBUG is set and then call "set -o xtrace"), making sure kubeconfig handling and
kubectl/k3s calls no longer run with -x by default.
📒 Files selected for processing (4)
- ci-operator/config/stolostron/capi-tests/stolostron-capi-tests-main.yaml
- ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-commands.sh
- ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-ref.yaml
- ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-workflow.yaml
export USE_KUBECONFIG="${K3S_KUBECONFIG}"

source openshift-ci/capz-test-env.sh
set -o xtrace
Avoid enabling xtrace by default.
Per coding guidelines, step-registry command scripts should run with set -euo pipefail and only enable -x (here set -o xtrace) when actively debugging. Since this script also exports/handles a k3s kubeconfig path and runs kubectl/k3s commands, leaving xtrace on by default risks leaking command arguments into job logs and increases log noise. Consider removing it (or gating it behind a debug flag) before merging out of WIP.
🔧 Proposed change
-source openshift-ci/capz-test-env.sh
-set -o xtrace
+source openshift-ci/capz-test-env.sh
+# Enable tracing only when debugging:
+# set -o xtrace
As per coding guidelines: "Step registry script files must use set -euo pipefail (without -x) as default and only enable -x when actively debugging".
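The debug gate suggested in this comment could be sketched as follows. This is a minimal illustration, not the PR's actual code: the `DEBUG` variable name comes from the review prompt, while the helper name `enable_debug_trace` is hypothetical.

```shell
#!/usr/bin/env bash
# Safe defaults for a step-registry command script: no -x by default.
set -euo pipefail

# Hypothetical helper: turn on xtrace only when DEBUG is set to a
# non-empty value, so kubeconfig paths and kubectl/k3s arguments are
# not echoed into job logs on normal runs.
enable_debug_trace() {
  if [[ -n "${DEBUG:-}" ]]; then
    set -o xtrace
  fi
}

enable_debug_trace
```

With this shape, a normal run stays quiet, and tracing is turned on only when the job environment sets `DEBUG` to a non-empty value.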
|
/test all |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: marek-veber
Needs approval from an approver in each of these files: |
|
|
/test all |
|
/retest |
|
/pj-rehearse capz-k3s-e2e |
|
@marek-veber: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@marek-veber: job(s): capz-k3s-e2e either don't exist or were not found to be affected, and cannot be rehearsed |
|
/pj-rehearse pull-ci-stolostron-capi-tests-main-capz-k3s-e2e |
|
@marek-veber: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
… teardown ref
- Add missing OWNERS file for k3s-e2e step registry
- Add grace_period to ref (required when using trap)
- Remove non-existent capz-test-teardown reference
- Remove non-existent capz-e2e job (missing step registrations)
- Fix cluster profile: azure4 -> openshift-org-azure
- Remove undeclared DEPLOY_CHARTS env override (set in script)
- Regenerate ci-operator config and job configs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/pj-rehearse pull-ci-stolostron-capi-tests-main-capz-k3s-e2e |
|
@marek-veber: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse pull-ci-stolostron-capi-tests-main-capz-k3s-e2e |
|
@marek-veber: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse pull-ci-stolostron-capi-tests-main-capz-k3s-e2e |
|
@marek-veber: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
CI pods run as non-root with read-only /usr/local/bin, causing curl write error (exit 23). Use /tmp and prepend to PATH.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-commands.sh`:
- Around line 32-34: Replace downloading to and prepending the world-writable
/tmp with a secure dedicated temp directory and make the download fail fast:
create a private dir via mktemp -d, download K3S from K3S_URL into that dir
using curl with fail-on-error flags (e.g. -f/--fail and -L), make the downloaded
binary executable (the file referenced as /tmp/k3s in the diff), and update PATH
to include only that temp dir (not /tmp) so the temp dir precedes system bins;
ensure you also clean up the temp dir on exit or error.
📒 Files selected for processing (1)
ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-commands.sh
curl -sLo /tmp/k3s "${K3S_URL}"
chmod +x /tmp/k3s
export PATH="/tmp:${PATH}"
Avoid prepending world-writable /tmp to PATH.
On Line 34, putting /tmp first in PATH allows command shadowing risk (for tools invoked later like kubectl/go). Use a dedicated temp dir and fail fast on HTTP errors.
🔧 Proposed change
-curl -sLo /tmp/k3s "${K3S_URL}"
-chmod +x /tmp/k3s
-export PATH="/tmp:${PATH}"
+K3S_BIN_DIR="$(mktemp -d)"
+K3S_BIN="${K3S_BIN_DIR}/k3s"
+curl -fsSLo "${K3S_BIN}" "${K3S_URL}"
+chmod 0755 "${K3S_BIN}"
+export PATH="${K3S_BIN_DIR}:${PATH}"
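The proposed change above does not show the cleanup step that the review also asks for. A minimal sketch of download plus cleanup, assuming the `curl -fsSLo` flags and the `K3S_BIN_DIR` name from the suggestion; the function names `install_to_private_dir` and `cleanup_bin_dir` are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Download a binary into a private temp directory and prepend only that
# directory to PATH. Args: $1 = URL, $2 = file name for the binary.
install_to_private_dir() {
  local url="$1" name="$2"
  K3S_BIN_DIR="$(mktemp -d)"   # created with owner-only permissions
  # -f makes curl exit non-zero on HTTP errors instead of saving the error page.
  curl -fsSLo "${K3S_BIN_DIR}/${name}" "${url}"
  chmod 0755 "${K3S_BIN_DIR}/${name}"
  export PATH="${K3S_BIN_DIR}:${PATH}"
}

# Remove the temp directory if it was created.
cleanup_bin_dir() {
  if [[ -n "${K3S_BIN_DIR:-}" && -d "${K3S_BIN_DIR}" ]]; then
    rm -rf "${K3S_BIN_DIR}"
  fi
}
trap cleanup_bin_dir EXIT
```

A call such as `install_to_private_dir "${K3S_URL}" k3s` would replace the three `/tmp` lines. Since the step already uses a trap (per the `grace_period` note in an earlier commit), `cleanup_bin_dir` would more likely be called from that existing handler than installed as a second `EXIT` trap, because a new `trap ... EXIT` replaces the previous one.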
|
/pj-rehearse pull-ci-stolostron-capi-tests-main-capz-k3s-e2e |
|
@marek-veber: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
CI pods run as random non-root UID. k3s requires --rootless flag in this case. Also install slirp4netns dependency, use writable data-dir, and detect early if the k3s process dies.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
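The "detect early if the k3s process dies" part of this commit is not shown in the conversation. One way such a check is commonly written is a polling loop that tests the server PID on every iteration. The sketch below is an assumption about the shape, not the commit's actual code; `wait_until_ready` and its return codes are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Poll until the check command succeeds, the watched process exits,
# or the timeout elapses.
# Args: $1 = PID to watch, $2 = timeout in whole seconds,
#       $3 = poll interval in whole seconds, rest = check command.
# Returns: 0 = ready, 1 = timed out, 2 = watched process is gone.
wait_until_ready() {
  local pid="$1" timeout_s="$2" interval_s="$3"
  shift 3
  local waited=0
  while (( waited < timeout_s )); do
    # kill -0 sends no signal; it only tests that the PID still exists.
    if ! kill -0 "${pid}" 2>/dev/null; then
      return 2
    fi
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "${interval_s}"
    waited=$(( waited + interval_s ))
  done
  return 1
}
```

In the step script this might be invoked as `wait_until_ready "${K3S_PID}" 300 5 kubectl get nodes`, with a return code of 2 letting the job fail immediately instead of waiting out the full 300s.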
|
/pj-rehearse pull-ci-stolostron-capi-tests-main-capz-k3s-e2e |
|
@marek-veber: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
[REHEARSALNOTIFIER]
Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
|
/pj-rehearse abort |
|
@marek-veber: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse pull-ci-stolostron-capi-tests-main-capz-k3s-e2e |
|
@marek-veber: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@marek-veber: The following test failed, say
Summary
- `capz-k3s-e2e` Prow job that uses k3s as the management cluster instead of provisioning a full IPI OpenShift cluster
- `nested_podman` capabilities (SYS_ADMIN + NET_ADMIN)

Purpose
Draft PR to validate whether `nested_podman` provides enough capabilities for k3s to start inside a Prow test pod.

New files
- `ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-commands.sh`: starts k3s, deploys controllers, runs e2e tests
- `ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-ref.yaml`: step ref with `nested_podman: true`
- `ci-operator/step-registry/capz/test/k3s-e2e/capz-test-k3s-e2e-workflow.yaml`: workflow without IPI

Why k3s instead of Kind?
Kind-in-Podman fails because containerd inside the Kind node can't do overlay mounts 3 levels deep (Docker → Podman → containerd → overlay = `failed to mount rootfs component: invalid argument`). k3s runs everything as a single process with `--snapshotter=native`, avoiding nested container issues entirely.

🤖 Generated with Claude Code
Summary by CodeRabbit
This PR adds a lightweight Prow CI job to stolostron/capi-tests: capz-k3s-e2e. Instead of provisioning an IPI OpenShift workload cluster, the job starts a k3s management cluster inside the test pod (requires nested_podman with SYS_ADMIN + NET_ADMIN), deploys CAPI/CAPZ/ASO controllers into it, and runs CAPZ e2e phases 01–08. The goal is to avoid IPI provisioning (~40 minutes saved) and reduce Azure compute costs. The PR is a draft to validate whether nested_podman provides sufficient capabilities for k3s to start reliably inside Prow test pods.
Practical impact / what changed
Repositories / CI surface affected
Rationale
Status / notes