NO-JIRA: slim and harden the shared-ingress HAProxy runtime #8399

Open
tuxerrante wants to merge 7 commits into openshift:main from tuxerrante:tuxerrante/shared-ingress-ubi-micro

@tuxerrante tuxerrante commented May 2, 2026

What this PR does / why we need it:

This PR reduces the shared-ingress HAProxy image footprint and runtime attack surface without leaving the Red Hat image supply chain.

The runtime image currently starts from the full ubi10/ubi base even though the production container only needs HAProxy and its runtime libraries. This PR switches that flow to a pinned multi-stage build that installs HAProxy into an installroot in a pinned ubi10/ubi builder and copies only that payload into a pinned ubi10/ubi-micro final image.

It also removes the unused socat RPM from the image inputs, hardens the HAProxy container security context in the shared-ingress deployment, and adds a portable smoke test that exercises the same startup contract the controller uses in-cluster.

Problem this solves

The previous image pulled in many packages that are not needed by the shared-ingress runtime path. That inflated the image, increased scanner noise, and left avoidable packages such as Python- and Vim-related content in a critical ingress-facing image.

By trimming the image down to the HAProxy runtime payload we expect:

  • lower image pull and extraction latency during rollout because the runtime image is about 52.6% smaller in this environment
  • lower vulnerability surface and less scanner noise because unused packages are no longer shipped in the final image
  • a more restrictive pod security posture via read-only rootfs, dropped capabilities, allowPrivilegeEscalation: false, and RuntimeDefault seccomp
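
For local experiments, the pod-level settings in the last bullet have rough equivalents on the podman or docker command line. The sketch below only prints the flags. The mapping is an approximation for local testing, not what the controller applies in-cluster, and `hardened_run_flags` is a hypothetical helper name that does not exist in the repo.

```shell
#!/bin/sh
# hardened_run_flags: print container-runtime flags that approximate the
# hardened SecurityContext. Hypothetical helper for local experiments only.
hardened_run_flags() {
    # readOnlyRootFilesystem: true     -> --read-only
    # capabilities.drop: ["ALL"]       -> --cap-drop=ALL
    # allowPrivilegeEscalation: false  -> --security-opt=no-new-privileges
    # seccompProfile: RuntimeDefault   -> no flag needed: the runtime's default
    #                                     seccomp profile applies unless overridden
    printf '%s\n' --read-only --cap-drop=ALL --security-opt=no-new-privileges
}
```

A local run could then look like `podman run $(hardened_run_flags) <image>`, with the image and remaining arguments supplied by the caller.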

Local image comparison used for this change

  • Baseline image: registry.access.redhat.com/ubi10/ubi:10.0-1753787353
    • size: 226035753 bytes
    • grype findings: 0 critical, 29 high, 193 medium, 142 low
  • Proposed image built from this branch:
    • size: 107109829 bytes
    • grype findings: 0 critical, 0 high, 45 medium, 22 low
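
The 52.6% figure quoted earlier follows directly from the two sizes in this comparison. A minimal sketch of the arithmetic, assuming `awk` is available; the function name `size_reduction_pct` is illustrative and not part of the repo:

```shell
#!/bin/sh
# size_reduction_pct BASE NEW: print the relative size reduction in percent,
# rounded to one decimal place. Hypothetical helper, not part of this PR.
size_reduction_pct() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (base - new) * 100 / base }'
}

# Sizes in bytes, copied from the comparison above.
size_reduction_pct 226035753 107109829
```

With these inputs the helper prints 52.6, since (226035753 - 107109829) / 226035753 is roughly 0.526.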

Which issue(s) this PR fixes:

Fixes #8398

Special notes for your reviewer:

  • This is a >200 line change, so I first opened issue #8398 (shared-ingress: move HAProxy image to pinned UBI micro runtime) to comply with the contributing guidance for larger PRs.
  • The smoke test is intended to be useful on both Linux and macOS hosts. It uses the repo's container runtime detection and verifies HAProxy startup with the same mounted config path, writable runtime socket path, and read-only root filesystem used by the controller.
  • I used NO-JIRA in the title because this change is linked to a GitHub issue rather than a Jira ticket.
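
The repo's own runtime detection is not reproduced in this description. As a rough sketch of the idea, assuming it amounts to picking the first of podman or docker found on PATH (`detect_runtime` is a hypothetical name, and the real logic may differ):

```shell
#!/bin/sh
# detect_runtime [CANDIDATE...]: print the first candidate found on PATH.
# Defaults to trying podman, then docker. Hypothetical helper; the repo's
# actual detection logic may differ.
detect_runtime() {
    if [ "$#" -eq 0 ]; then
        set -- podman docker
    fi
    for candidate in "$@"; do
        if command -v "${candidate}" >/dev/null 2>&1; then
            printf '%s\n' "${candidate}"
            return 0
        fi
    done
    echo "no container runtime found (tried: $*)" >&2
    return 1
}
```

A script can then capture the result once, for example `runtime="$(detect_runtime)"`, and use `"${runtime}"` for every build and run step so the same commands work on Linux and macOS hosts.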

Validation

  • go test ./hypershift-operator/controllers/sharedingress
  • make test-shared-ingress-smoke
  • make pre-commit
  • go test ./hypershift-operator/controllers/nodepool -run TestResolveHAProxyImage

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

    • Added a runnable smoke test and a Makefile target to invoke it.
  • Security

    • Hardened the shared ingress router containers to run with reduced privileges and stricter runtime profiles.
  • Tests

    • Added unit tests validating the router’s hardened security settings.
  • Chores

    • Converted the shared ingress image to a multi-stage build and removed an unnecessary runtime package.
  • Documentation

    • Updated build and local validation instructions for the shared ingress image.

tuxerrante added 2 commits May 2, 2026 10:32
Move the shared-ingress image to a pinned UBI micro runtime and
remove the unused socat package from the locked RPM inputs.

Harden the HAProxy pod security context so the shared-ingress
deployment matches the reduced runtime assumptions.

Signed-off-by: Alessandro Affinito <aaffinit@redhat.com>
Assisted-by: GPT-5.4 (via Cursor)
Made-with: Cursor

Validate the shared-ingress image with the controller's real
HAProxy command, mounted config and runtime directories, and
read-only root filesystem.

Keep the smoke test useful on macOS and Linux hosts by using
the repo's container runtime detection for podman or docker.

Signed-off-by: Alessandro Affinito <aaffinit@redhat.com>
Assisted-by: GPT-5.4 (via Cursor)
Made-with: Cursor
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot

@tuxerrante: This pull request explicitly references no jira issue.

Details

In response to this:

Validation

  • go test ./hypershift-operator/controllers/sharedingress
  • make test-shared-ingress-smoke
  • make pre-commit is still running in the local worktree as this draft is opened
  • go test ./hypershift-operator/controllers/nodepool -run TestResolveHAProxyImage still fails in a clean checkout because support/releaseinfo mocks are not generated yet (releaseinfo.NewMockProviderWithRegistryOverrides missing), which appears to be independent of this change

Made with Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 2, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 2, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 2, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@tuxerrante
Author

/auto-cc
/test all

@coderabbitai
Contributor

coderabbitai Bot commented May 2, 2026

Walkthrough

Converted the shared-ingress image to a multi-stage Containerfile using a pinned UBI builder and a pinned UBI-micro runtime, removed socat from RPM inputs and the lockfile, and added a smoke-test script shared-ingress/smoke-test.sh plus a Makefile phony target test-shared-ingress-smoke. Hardened the private-router container SecurityContext (disallow privilege escalation, read-only root filesystem, drop all capabilities, seccomp RuntimeDefault) and set config-generator to drop all capabilities and disallow privilege escalation. Added unit tests asserting these SecurityContext changes.

Sequence Diagram(s)

sequenceDiagram
  actor DevShell as Developer Shell
  participant ContainerRuntime as Container Runtime (podman/docker)
  participant HAProxy as HAProxy (in container)
  participant HostFS as Host filesystem (mounted config & socket)

  DevShell->>ContainerRuntime: build image (multi-stage Containerfile)
  DevShell->>HostFS: create temp config dir and runtime socket dir
  DevShell->>ContainerRuntime: run container (read-only rootfs, mount config & runtime, publish port)
  ContainerRuntime->>HAProxy: start haproxy with mounted config and admin socket
  HAProxy->>HostFS: create admin socket in mounted runtime dir
  DevShell->>HAProxy: poll /haproxy_ready endpoint
  HAProxy-->>DevShell: respond 200 when ready
  DevShell->>ContainerRuntime: inspect container state (expect running)
  DevShell->>ContainerRuntime: stop & remove container (cleanup)
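
The polling step in the diagram can be sketched as a small retry loop. This is an illustration rather than the contents of shared-ingress/smoke-test.sh; the helper name `wait_until` and the attempt count and port variable in the usage note are assumptions.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY COMMAND...: run COMMAND until it succeeds,
# sleeping DELAY seconds after each failed try. Returns 1 if every
# attempt fails. Hypothetical helper, not taken from the repo.
wait_until() {
    attempts="$1"
    delay="$2"
    shift 2
    i=0
    while [ "${i}" -lt "${attempts}" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "${delay}"
    done
    return 1
}
```

A readiness check along the lines of the diagram might then be `wait_until 30 1 curl -fsS -o /dev/null "http://127.0.0.1:${port}/haproxy_ready"`, where `${port}` stands for whichever host port the script publishes. `curl -f` exits non-zero on HTTP error responses, so the loop only stops early once the endpoint answers successfully.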
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main changes: slimming down the shared-ingress HAProxy image and hardening its security context, which are the primary objectives of this PR. |
| Linked Issues check | ✅ Passed | All linked issue #8398 objectives are met: multi-stage pinned UBI build, socat removal, security hardening, smoke test, and HAProxy runtime contract preservation. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the stated objectives: Containerfile/rpms updates reduce footprint, security hardening in router.go/tests, Makefile test target, and smoke test support the issue requirements. |
| Stable And Deterministic Test Names | ✅ Passed | The pull request contains no Ginkgo tests; all tests use Go's standard testing framework with static, descriptive test names. |
| Test Structure And Quality | ✅ Passed | PR contains standard Go testing patterns and Bash integration scripts, not Ginkgo tests, making the Ginkgo-specific custom check not applicable. |
| Microshift Test Compatibility | ✅ Passed | PR adds standard Go unit tests in router_test.go, not Ginkgo e2e tests. Custom check targets Ginkgo patterns like It(), Describe(), Context(), or When(), which are absent. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR does not add any new Ginkgo e2e tests. The changes include unit tests using Go's standard testing.T framework with t.Run() subtests, not Ginkgo, and a standalone bash smoke-test script for local validation. SNO compatibility check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR does not introduce new scheduling constraints; all affinity rules, replica count, and PDB configuration pre-date this PR and focus exclusively on security hardening. |
| Ote Binary Stdout Contract | ✅ Passed | The PR does not violate the OTE Binary Stdout Contract. Go test code contains no process-level stdout writes, and smoke-test.sh is a standalone bash script, not an OTE binary. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The PR does not add any Ginkgo e2e tests; it adds standard Go unit tests and a bash smoke test script. |


@openshift-ci openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 2, 2026
@tuxerrante
Copy link
Copy Markdown
Author

/label area/hypershift-operator area/documentation area/testing

@openshift-ci openshift-ci Bot requested review from muraee and sdminonne May 2, 2026 08:49
@codecov

codecov Bot commented May 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 37.40%. Comparing base (302e42b) to head (36c4903).
⚠️ Report is 34 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8399      +/-   ##
==========================================
+ Coverage   37.19%   37.40%   +0.21%     
==========================================
  Files         750      751       +1     
  Lines       91791    91821      +30     
==========================================
+ Hits        34139    34348     +209     
+ Misses      55002    54838     -164     
+ Partials     2650     2635      -15     
| Files with missing lines | Coverage Δ |
| --- | --- |
| ...shift-operator/controllers/sharedingress/router.go | 53.50% <100.00%> (+2.13%) ⬆️ |

... and 39 files with indirect coverage changes

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 32.56% <ø> (+0.59%) ⬆️ |
| cpo-hostedcontrolplane | 36.48% <ø> (+0.03%) ⬆️ |
| cpo-other | 37.73% <ø> (ø) |
| hypershift-operator | 47.89% <100.00%> (+0.04%) ⬆️ |
| other | 27.77% <ø> (ø) |

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
hypershift-operator/controllers/sharedingress/router.go (2)

188-210: ⚡ Quick win

config-generator container has no SecurityContext

While this PR focuses on hardening the private-router container, the config-generator sidecar is left with no SecurityContext, meaning it runs with default (potentially root) permissions and unrestricted capabilities. Consider applying at least allowPrivilegeEscalation: false and capabilities.drop: ["ALL"] here as well, as a follow-up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/sharedingress/router.go` around lines 188 -
210, The config-generator sidecar created by buildConfigGeneratorContainer lacks
a SecurityContext; update the returned function so it sets c.SecurityContext to
a non-nil &corev1.SecurityContext with at least AllowPrivilegeEscalation=false
and Capabilities.Drop containing "ALL" (use corev1.Capability("ALL") or
corev1.Capability values) to mirror hardening applied to the private-router
container; ensure you set the SecurityContext on the container variable c within
buildConfigGeneratorContainer so it is applied when the PodSpec is built.

157-168: Add RunAsNonRoot for Kubernetes Restricted PSS compliance

The SecurityContext correctly sets AllowPrivilegeEscalation=false, ReadOnlyRootFilesystem=true, Capabilities.Drop=["ALL"], and SeccompProfile=RuntimeDefault. However, the Kubernetes "Restricted" pod security standard also requires runAsNonRoot=true. Without it, the container can start as UID 0 if the image lacks a non-root USER directive.

Proposed addition
 		c.SecurityContext = &corev1.SecurityContext{
 			AllowPrivilegeEscalation: ptr.To(false),
+			RunAsNonRoot:             ptr.To(true),
 			ReadOnlyRootFilesystem:   ptr.To(true),
 			Capabilities: &corev1.Capabilities{
 				Drop: []corev1.Capability{
 					"ALL",
 				},
 			},
 			SeccompProfile: &corev1.SeccompProfile{
 				Type: corev1.SeccompProfileTypeRuntimeDefault,
 			},
 		}

First, verify the HAProxy UBI-micro image defines a non-root USER; if it does not, setting RunAsNonRoot=true could cause pod startup failures in environments that strictly enforce pod security standards.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/sharedingress/router.go` around lines 157 -
168, SecurityContext for the router container lacks RunAsNonRoot which is
required for Kubernetes "Restricted" PSS; update the SecurityContext set on
c.SecurityContext in router.go to include RunAsNonRoot: ptr.To(true) (and verify
the HAProxy UBI-micro image uses a non-root USER to avoid startup failures),
i.e., add RunAsNonRoot to the existing SecurityContext alongside
AllowPrivilegeEscalation, ReadOnlyRootFilesystem, Capabilities.Drop and
SeccompProfile so pods comply with Restricted policy.
shared-ingress/smoke-test.sh (1)

73-84: ⚡ Quick win

Drop --rm; the EXIT trap already handles cleanup

With --rm, if HAProxy crashes before the readiness check passes (lines 87–95), the container is auto-removed immediately. The subsequent "${runtime}" logs "${container_name}" calls on lines 91, 99, and 106 then fail silently (swallowed by || true), leaving no diagnostic output for CI. The trap cleanup EXIT on line 18 already guarantees cleanup under all exit paths, making --rm both redundant and actively harmful for debugging.

♻️ Proposed fix
 "${runtime}" run -d \
-    --rm \
     --name "${container_name}" \
     --read-only \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared-ingress/smoke-test.sh` around lines 73 - 84, Remove the dangerous --rm
flag from the container startup command so CI can inspect logs if HAProxy
crashes; locate the docker run invocation that uses "${runtime}" run -d --rm
--name "${container_name}" ... and delete only the --rm token, leaving the rest
(publish, volumes, entrypoint, image_tag, args) intact—cleanup is already
handled by the existing trap cleanup (trap cleanup EXIT), so keep that trap and
the subsequent "${runtime}" logs "${container_name}" calls as-is to preserve
post-failure diagnostics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ad066b4a-d1b9-4c8b-81db-02437bec4a82

📥 Commits

Reviewing files that changed from the base of the PR and between 302e42b and 52c1790.

📒 Files selected for processing (8)
  • Makefile
  • hypershift-operator/controllers/sharedingress/router.go
  • hypershift-operator/controllers/sharedingress/router_test.go
  • shared-ingress/Containerfile
  • shared-ingress/README.md
  • shared-ingress/rpms.in.yaml
  • shared-ingress/rpms.lock.yaml
  • shared-ingress/smoke-test.sh
💤 Files with no reviewable changes (2)
  • shared-ingress/rpms.in.yaml
  • shared-ingress/rpms.lock.yaml

Resolve the shared-ingress RPM lockfile as an empty-root transaction so
Konflux hermetic builds prefetch the full dependency closure needed by
`dnf --installroot /rootfs`.

Add the UBI BaseOS repository and disable weak dependencies in the RPM
input so the lockfile matches the installroot build used by the
Containerfile.

Signed-off-by: Alessandro Affinito <aaffinit@redhat.com>
Assisted-by: GPT-5.4 (via Cursor)
Made-with: Cursor
@tuxerrante
Author

tuxerrante commented May 2, 2026

The root cause of the failing shared-ingress Konflux build was a mismatch between the new empty-root dnf --installroot /rootfs transaction and the old RPM lock input, which had been resolved for a layered image.

I pushed a follow-up commit that:

  • resolves shared-ingress/rpms.lock.yaml as a bare-root transaction
  • adds the UBI BaseOS repo to shared-ingress/rpms.in.yaml
  • disables weak dependencies in the RPM input so the lockfile matches the Containerfile

Local verification on the updated branch:

  • podman build -f shared-ingress/Containerfile -t localhost/hypershift-shared-ingress:fix ./shared-ingress
  • make test-shared-ingress-smoke

Create the merged-/usr lib64 symlink before running the empty-root DNF
transaction so the filesystem RPM does not fail when /rootfs/lib64 is
present as a directory in buildah-based CI.

Keep the installroot layout aligned with the installed UBI 10 filesystem
package while preserving the smaller ubi-micro runtime image.

Signed-off-by: Alessandro Affinito <aaffinit@redhat.com>
Assisted-by: GPT-5.4 (via Cursor)
Made-with: Cursor
@tuxerrante
Author

I pushed a second follow-up after reproducing the remaining installroot failure locally.

The key detail is that the filesystem RPM aborts the transaction if /rootfs/lib64 exists as a directory instead of the merged-/usr symlink it expects. Precreating /rootfs/lib64 -> usr/lib64 before dnf --installroot /rootfs makes the same transaction succeed locally.
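
The precreation step can be sketched as below. This is an illustration of the idea under the assumption that the installroot is still empty; the helper name `prepare_installroot` is hypothetical and the exact commands in shared-ingress/Containerfile may differ.

```shell
#!/bin/sh
# prepare_installroot ROOT: pre-create the merged-/usr lib64 symlink so that
# ROOT/lib64 is a symlink to usr/lib64 rather than a directory.
# Hypothetical helper; assumes ROOT/lib64 does not exist yet.
prepare_installroot() {
    root="$1"
    if [ -e "${root}/lib64" ] && [ ! -L "${root}/lib64" ]; then
        echo "${root}/lib64 already exists and is not a symlink" >&2
        return 1
    fi
    mkdir -p "${root}/usr/lib64"
    # Relative target, so the link stays valid when the tree is copied
    # into the final image.
    [ -L "${root}/lib64" ] || ln -s usr/lib64 "${root}/lib64"
}
```

Calling `prepare_installroot /rootfs` before the DNF transaction leaves `/rootfs/lib64` pointing at `usr/lib64`, which is the layout described above.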

Fresh local verification on the updated head:

  • podman build -f shared-ingress/Containerfile -t localhost/hypershift-shared-ingress:fix2 ./shared-ingress
  • make test-shared-ingress-smoke

Konflux injects local RPM repositories from the lockfile, but those generated
repo definitions do not carry an importable gpgkey. Disable DNF repo GPG checks
for the installroot transaction so the hermetic build can consume the
checksum-verified local RPM mirror.

Retain the pinned package list and the smaller ubi-micro runtime image while
matching the offline repo behavior exercised in CI.

Signed-off-by: Alessandro Affinito <aaffinit@redhat.com>
Assisted-by: GPT-5.4 (via Cursor)
Made-with: Cursor
@tuxerrante
Author

I pushed another targeted follow-up after reproducing the Konflux-style RPM path locally.

What the local hermetic repro showed:

  • the generated local RPM repos work, but dnf --installroot fails before the transaction when repo GPG checking is enabled
  • the Hermeto-style .repo entries point at the local mirror and keep gpgcheck=1, but they do not provide an importable gpgkey
  • the same offline install succeeds once DNF is invoked with --nogpgcheck
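
Put together, the installroot transaction discussed in this thread might be invoked roughly as follows. The exact command line in shared-ingress/Containerfile is not shown here, so the package name, the weak-dependency option on the DNF side, and the flag ordering are assumptions. The helper only prints the command so it can be inspected without running DNF.

```shell
#!/bin/sh
# installroot_cmd ROOT PACKAGE...: print a dnf command line for an
# empty-root install. Hypothetical helper; it does not run dnf.
installroot_cmd() {
    root="$1"
    shift
    # --installroot: install into ROOT instead of the build container itself
    # --nogpgcheck: skip GPG checks (the local mirror ships no importable key)
    # install_weak_deps=False: leave out weak dependencies
    echo "dnf --installroot ${root} --nogpgcheck --setopt=install_weak_deps=False install -y $*"
}

# "haproxy" is an assumed package name used for illustration.
installroot_cmd /rootfs haproxy
```

Skipping GPG checks is only reasonable here because, as noted above, the hermetic build consumes a checksum-verified local mirror pinned by the lockfile.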

Fresh local verification on the updated head:

  • podman build -f shared-ingress/Containerfile -t localhost/hypershift-shared-ingress:fix3 ./shared-ingress
  • make test-shared-ingress-smoke
  • offline repro against a local Hermeto-style RPM mirror built from shared-ingress/rpms.lock.yaml

Harden the config-generator sidecar by dropping all capabilities and disabling
privilege escalation, and keep smoke-test containers around long enough to
preserve logs when startup fails.

Avoid changing RunAsNonRoot here because the current shared-ingress runtime
image does not yet declare a non-root USER and that would risk breaking
existing deployments.

Signed-off-by: Alessandro Affinito <aaffinit@redhat.com>
Assisted-by: GPT-5.4 (via Cursor)
Made-with: Cursor
@tuxerrante
Author

I pushed one more small follow-up for the outstanding CodeRabbit nitpicks.

Included in this update:

  • removed --rm from shared-ingress/smoke-test.sh so failed smoke-test runs preserve container logs until the EXIT trap cleans up
  • added a minimal SecurityContext to the config-generator sidecar (allowPrivilegeEscalation: false, capabilities.drop: ["ALL"])
  • added unit coverage for the sidecar hardening
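
The reason an EXIT trap can replace --rm is that the trap runs on every exit path, but only after the script has had a chance to collect diagnostics. A minimal sketch of that pattern, using a scratch directory in place of a container (the helper name `run_and_cleanup` is hypothetical):

```shell
#!/bin/sh
# run_and_cleanup COMMAND...: run COMMAND in a subshell with a scratch
# directory that an EXIT trap removes on every exit path.
# Hypothetical helper illustrating the trap pattern only.
run_and_cleanup() (
    workdir="$(mktemp -d)"
    cleanup() {
        # In the smoke test this is where the container would be stopped
        # and removed, after any logs have been printed.
        rm -rf "${workdir}"
    }
    trap cleanup EXIT
    echo "${workdir}"
    "$@"
    status=$?
    # Diagnostics (for example container logs) can still be collected here,
    # because nothing has been auto-removed yet.
    exit "${status}"
)
```

Whether the wrapped command succeeds or fails, the scratch directory is gone once the subshell exits, and the command's exit status is preserved for the caller.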

I intentionally did not add RunAsNonRoot to the HAProxy container in this follow-up. The current shared-ingress runtime image path still does not declare a non-root USER, so forcing RunAsNonRoot here would be a behavior change that could break pod startup rather than a no-risk hardening tweak.

Fresh local verification on the updated head:

  • go test ./hypershift-operator/controllers/sharedingress
  • make test-shared-ingress-smoke
  • make lint-fix
  • make pre-commit reached the repo verify step and stopped only because the tree was intentionally dirty with these staged changes

@tuxerrante tuxerrante marked this pull request as ready for review May 2, 2026 10:17
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 2, 2026
@openshift-ci openshift-ci Bot requested a review from sjenning May 2, 2026 10:18
@muraee
Contributor

muraee commented May 4, 2026

/test e2e-aks

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2051225930411544576 | Cost: $2.74 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

@tuxerrante: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aks | 2cd1c60 | link | true | /test e2e-aks |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@cwbotbot

cwbotbot commented May 4, 2026

Test Results

e2e-aks

Failed Tests

Total failed tests: 3

  • TestAzureScheduler
  • TestAzureScheduler/ValidateHostedCluster
  • TestAzureScheduler/ValidateHostedCluster/EnsureNoCrashingPods

@hypershift-jira-solve-ci

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

TestAzureScheduler/ValidateHostedCluster/EnsureNoCrashingPods (0.11s)
  util.go:817: Container manager in pod cluster-api-84798b49d5-pkpjr has a restartCount > 0 (1)

Summary

The sole failure is TestAzureScheduler/ValidateHostedCluster/EnsureNoCrashingPods — a validation check that asserts no HCP control-plane containers have restarted. The cluster-api pod's manager container restarted once at 09:28 UTC due to a transient TLS handshake timeout connecting to the Kubernetes API server (net/http: TLS handshake timeout). The container recovered immediately and has been running normally since 09:29 UTC. All 3 reported failures (TestAzureScheduler, ValidateHostedCluster, EnsureNoCrashingPods) are the same cascading parent-child test hierarchy. All other tests (TestCreateCluster, TestUpgradeControlPlane, TestNodePool, TestAutoscaling, TestCreateClusterCustomConfig) passed EnsureNoCrashingPods. This failure is unrelated to the PR changes — the PR modifies only shared-ingress/HAProxy files, and the shared-ingress router pods are healthy with zero restarts.

Root Cause

The cluster-api-84798b49d5-pkpjr pod in namespace e2e-clusters-r4k8z-azure-scheduler-v2qs5 experienced a single container restart due to a transient network issue during pod startup:

  1. The manager container started at 09:28:46Z and attempted to contact the Kubernetes API server at https://10.0.0.1:443/version
  2. The TLS handshake timed out after ~10 seconds: "Unable to start manager" err="failed to get the Kubernetes version: Get \"https://10.0.0.1:443/version?timeout=32s\": net/http: TLS handshake timeout"
  3. The container exited with code 1 at 09:28:56Z
  4. Kubernetes restarted the container, which successfully started at 09:29:12Z and has been running healthy ever since

This is a transient infrastructure issue — a brief network blip during pod initialization that prevented the TLS handshake from completing within the timeout window. The cluster-api component has zero crash toleration in the EnsureNoCrashingPods test, so even a single restart (even if recovered) causes a test failure.

This failure is completely unrelated to PR #8399, which only modifies:

  • shared-ingress/Containerfile (HAProxy base image change)
  • hypershift-operator/controllers/sharedingress/router.go (security context hardening)
  • shared-ingress/rpms.in.yaml / rpms.lock.yaml (RPM dependencies)
  • Other shared-ingress files (tests, docs, smoke-test)

The shared-ingress router pods (router-96545bb56-4pphg, router-96545bb56-dtqvz) are both healthy with restartCount: 0 and phase: Running.

Recommendations
  1. Retry the job — This is a transient infrastructure flake. The cluster-api container restart was caused by a one-time TLS handshake timeout during pod startup, not by any code change in this PR.

  2. No code changes needed — The PR's shared-ingress/HAProxy changes are not involved in this failure. The failing component (cluster-api manager container) is entirely separate from the shared-ingress stack.

  3. Consider adding cluster-api to crash tolerations — The cluster-api manager container has zero crash toleration in EnsureNoCrashingPods. A single transient restart (e.g., TLS timeout during startup) causes a hard test failure even though the container recovers immediately. Adding a toleration of 1 (similar to aws-ebs-csi-driver-controller or network-node-identity) would reduce false-positive flakes from transient network issues. This would be a separate improvement unrelated to this PR.
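Recommendation 3 can be sketched as a restart check with a per-component toleration. This is an illustration under stated assumptions, not the actual EnsureNoCrashingPods code: the `containerStatus` type, the `crashingContainers` function, and keying tolerations by a component name are all hypothetical, and the real check in util.go may be structured differently.

```go
package main

// containerStatus is a hypothetical, trimmed-down stand-in for the fields of
// a Kubernetes container status that matter for this check.
type containerStatus struct {
	Component    string // e.g. "cluster-api" (assumed key for tolerations)
	Container    string // e.g. "manager"
	RestartCount int32
}

// crashingContainers returns every container whose restart count exceeds the
// toleration configured for its component. A component missing from the map
// has a toleration of zero, so any restart is reported.
func crashingContainers(statuses []containerStatus, tolerations map[string]int32) []containerStatus {
	var offenders []containerStatus
	for _, s := range statuses {
		if s.RestartCount > tolerations[s.Component] {
			offenders = append(offenders, s)
		}
	}
	return offenders
}
```

With an empty toleration map, the single restart of the cluster-api manager container is reported, which matches the failure above. With `map[string]int32{"cluster-api": 1}` the same status passes, while a second restart would still fail the check.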

Evidence

| Evidence | Detail |
| --- | --- |
| Failed test | TestAzureScheduler/ValidateHostedCluster/EnsureNoCrashingPods |
| Failing pod | cluster-api-84798b49d5-pkpjr in namespace e2e-clusters-r4k8z-azure-scheduler-v2qs5 |
| Container | manager, restartCount: 1, exitCode: 1 |
| Crash reason | net/http: TLS handshake timeout connecting to https://10.0.0.1:443/version |
| Crash time | Started 09:28:46Z, failed 09:28:56Z, recovered 09:29:12Z |
| Recovery | Container is ready: true, running since 09:29:12Z |
| Other tests | All 10+ other EnsureNoCrashingPods checks PASSED (different HCP namespaces) |
| PR scope | Only shared-ingress/HAProxy files modified; zero overlap with cluster-api |
| Shared-ingress health | Both router pods healthy, restartCount: 0, phase: Running |
| Total results | 342 tests, 40 skipped, 3 failures (all same TestAzureScheduler hierarchy) |

Contributor

@jparrill jparrill left a comment


Dropped some comments. Thanks!

Comment thread shared-ingress/Containerfile
Comment thread hypershift-operator/controllers/sharedingress/router.go
Comment thread hypershift-operator/controllers/sharedingress/router_test.go Outdated
Comment thread hypershift-operator/controllers/sharedingress/router_test.go Outdated
Comment thread hypershift-operator/controllers/sharedingress/router_test.go Outdated
Comment thread hypershift-operator/controllers/sharedingress/router_test.go Outdated
Document the hermetic RPM and router capability decisions inline so the
hardening changes remain easier to review and maintain.

Refactor the shared-ingress controller tests to use table-driven gomega
assertions and the shared podspec container lookup helper requested in
review.

Signed-off-by: Alessandro Affinito <aaffinit@redhat.com>
Assisted-by: GPT-5.4 (via Cursor)
Made-with: Cursor
@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tuxerrante
Once this PR has been reviewed and has the lgtm label, please ask for approval from jparrill. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Labels

area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.


Development

Successfully merging this pull request may close these issues.

shared-ingress: move HAProxy image to pinned UBI micro runtime

5 participants