OCPBUGS-84955: bootstrap serving certs at hypershift operator startup#8545
OCPBUGS-84955: bootstrap serving certs at hypershift operator startup#8545clebs wants to merge 3 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@clebs: This pull request references Jira Issue OCPBUGS-84955, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR adds EnsureWebhookCerts to bootstrap webhook CA and serving TLS Secrets and invokes it at operator startup when a cert directory is configured. Installer-side generation and propagation of webhook CA bundles is removed: webhook asset structs drop CABundle and built webhook ClientConfig.CABundle values are nil. The manager-serving-cert volume is marked optional and installer tests are updated accordingly. Sequence Diagram(s)sequenceDiagram
participant Operator as hypershift-operator (run)
participant Ensure as webhookcerts.EnsureWebhookCerts
participant KubeAPI as Kubernetes API (Secrets)
Operator->>Ensure: call EnsureWebhookCerts(namespace, serviceName)
Ensure->>KubeAPI: GET Secret ServingCertSecretName
alt serving secret present and valid
KubeAPI-->>Ensure: Secret with cert and key
Ensure-->>Operator: return (no-op)
else missing/invalid
Ensure->>Ensure: GenerateInitialWebhookCerts()
Ensure->>KubeAPI: CREATE/UPDATE CA Secret
Ensure->>KubeAPI: CREATE/UPDATE serving TLS Secret
KubeAPI-->>Ensure: created/updated Secrets
Ensure-->>Operator: return (success)
end
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go`:
- Around line 402-410: The create+update path can race because when c.Create
returns AlreadyExists you then call c.Update with existingSecret that lacks the
current ResourceVersion; re-fetch the secret from the API server into
existingSecret (e.g., call c.Get(ctx, namespacedName, existingSecret)) inside
the apierrors.IsAlreadyExists branch, copy servingSecret.Data into the freshly
fetched existingSecret, and then call c.Update(ctx, existingSecret); if Update
returns a conflict, retry the get/copy/update a few times or return the conflict
so callers can retry—update the logic around c.Create,
apierrors.IsAlreadyExists, existingSecret, and c.Update accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d0f83117-6c21-45d7-98cf-e7eb54954d13
📒 Files selected for processing (3)
cmd/install/assets/hypershift_operator.gohypershift-operator/controllers/webhookcerts/webhookcerts_controller.gohypershift-operator/main.go
|
/jira refresh |
|
@clebs: This pull request references Jira Issue OCPBUGS-84955, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8545 +/- ##
==========================================
+ Coverage 40.40% 40.42% +0.01%
==========================================
Files 755 755
Lines 93235 93273 +38
==========================================
+ Hits 37675 37706 +31
- Misses 52858 52862 +4
- Partials 2702 2705 +3
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
| return writeCertFiles(certDir, servingSecret.Data[corev1.TLSCertKey], servingSecret.Data[corev1.TLSPrivateKeyKey]) | ||
| } | ||
|
|
||
| func writeCertFiles(certDir string, cert, key []byte) error { |
There was a problem hiding this comment.
why do we need to write to disk as well?
There was a problem hiding this comment.
Because when secrets don't exist yet, volume mount has nothing to deliver. Pod started with Optional: true on the volume, so k8s mounts an empty dir. EnsureWebhookCerts creates the secrets via API, but
Kubernetes doesn't hot-reload secret volumes into an already-running container mid-startup. The webhook server needs certs on disk now to bind the TLS listener.
On next pod restart, secret exists and it works as it normally does therefore EnsureWebhookCerts returns early (no-op path).
There was a problem hiding this comment.
controller-runtime certWatcher used by the webhook-server should take care of reloading the certs when the secret exist or changed (it uses fsnotify).
Have you tested without this?
There was a problem hiding this comment.
I have not, I will test it out.
There was a problem hiding this comment.
@muraee I have removed the writing to disk form the bootstrapping as well as the bootstrapping in the CLI.
All installation paths work, PTAL!
|
please also remove the code from the CLI generating those secrets. hypershift/cmd/install/install.go Lines 759 to 768 in ca519ed |
|
@muraee I left the |
7da2d17 to
d1b3a26
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/install/assets/hypershift_operator.go (1)
856-865:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftUse a writable volume for
--cert-dir; Secret volumes are always read-only.Marking
manager-serving-certas optional removes the hard dependency on the Secret existing at startup, but Kubernetes Secret-backed volumes remain read-only regardless of theoptionalflag. The operator cannot bootstrap-generate certs and write them to/var/run/secrets/serving-cert. Use a writable volume (e.g.,emptyDir) for--cert-dirand mount the Secret separately if read-only cert access is needed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/install/assets/hypershift_operator.go` around lines 856 - 865, The current code appends a Secret volume named "serving-cert" and sets --cert-dir to /var/run/secrets/serving-cert, but Secret volumes are read-only so the operator cannot write bootstrap certs there; change to create a writable volume (e.g., an emptyDir) for the path passed to --cert-dir and, if the Secret must be provided for initial certs, mount the Secret at a different read-only path (or copy from the Secret mount into the emptyDir at startup). Update the logic that appends to volumes (replace or add an emptyDir Volume with the same mount path) and keep the args append of "--cert-dir=/var/run/secrets/serving-cert" pointing to the writable mount, while mounting "manager-serving-cert" Secret elsewhere if needed.
🧹 Nitpick comments (1)
hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go (1)
454-480: ⚡ Quick winAdd a case for “CA exists, serving secret invalid” to lock bootstrap behavior.
This is the highest-risk branch for cert consistency. Add a test ensuring serving cert regeneration uses/pairs correctly with the persisted CA.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go` around lines 454 - 480, Add a new test case to webhookcerts_controller_test.go that sets up an existing CA secret (named CASecretName) with valid cert/key data but an invalid/empty serving secret (ServingCertSecretName with empty Data or missing TLS keys), then call EnsureWebhookCerts(t.Context(), cl, "hypershift", "operator") and assert that: 1) the serving secret is regenerated (updatedServingSecret.Data[corev1.TLSCertKey] and [corev1.TLSPrivateKeyKey] are not empty), and 2) the CA secret remains the persisted CA (updatedCASecret.Data[certs.CASignerCertMapKey] and [certs.CASignerKeyMapKey] are unchanged/unchanged-from-original), ensuring EnsureWebhookCerts uses the existing CA rather than re-bootstraping it.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/install/install_test.go`:
- Around line 562-589: The subtest "When webhooks are enabled it should not
render webhook cert secrets" can pass vacuously because no sensitive inputs
(pull secret) are provided; update the test to reuse the temp pull-secret setup
used in the table-driven case and supply it to the Options passed to
RenderHyperShiftOperator (ensure RenderSensitive: true and include the pull
secret field used by Options), then after decoding documents assert that at
least one Secret exists whose name is not "webhook-serving-ca" or
"manager-serving-cert" (and fail if zero non-webhook Secrets are found) so the
test actually exercises sensitive-secret rendering.
In
`@hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go`:
- Around line 448-451: The test is asserting CA secret contents using TLS keys
(corev1.TLSCertKey/corev1.TLSPrivateKeyKey) which can be nil and hide
regressions; update the assertions on updatedCASecret (and compare to caSecret)
to use the CA signer data keys instead (e.g. "ca.crt" and "ca.key" or the
repository's CA signer constants if defined) so you compare the actual CA signer
cert and key rather than TLS keys.
In `@hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go`:
- Around line 390-406: The current flow calls GenerateInitialWebhookCerts to
produce caSecret and servingSecret and then proceeds even if c.Create(ctx,
caSecret) returns AlreadyExists, which risks persisting a serving cert generated
from a transient CA; change the logic so that when c.Create(ctx, caSecret)
returns an AlreadyExists error you do NOT apply the servingSecret created from
the transient CA. Instead, retrieve the persisted CA secret (e.g., get
existingSecret by name), regenerate the serving cert using the persisted CA (or
use the support/upsert helper to safely reconcile both secrets), and then upsert
the serving secret (use support/upsert to create or update servingSecret) so the
serving cert always matches the stored CA; update references around
GenerateInitialWebhookCerts, caSecret, servingSecret, existingSecret, c.Create
and c.Update accordingly.
---
Outside diff comments:
In `@cmd/install/assets/hypershift_operator.go`:
- Around line 856-865: The current code appends a Secret volume named
"serving-cert" and sets --cert-dir to /var/run/secrets/serving-cert, but Secret
volumes are read-only so the operator cannot write bootstrap certs there; change
to create a writable volume (e.g., an emptyDir) for the path passed to
--cert-dir and, if the Secret must be provided for initial certs, mount the
Secret at a different read-only path (or copy from the Secret mount into the
emptyDir at startup). Update the logic that appends to volumes (replace or add
an emptyDir Volume with the same mount path) and keep the args append of
"--cert-dir=/var/run/secrets/serving-cert" pointing to the writable mount, while
mounting "manager-serving-cert" Secret elsewhere if needed.
---
Nitpick comments:
In
`@hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go`:
- Around line 454-480: Add a new test case to webhookcerts_controller_test.go
that sets up an existing CA secret (named CASecretName) with valid cert/key data
but an invalid/empty serving secret (ServingCertSecretName with empty Data or
missing TLS keys), then call EnsureWebhookCerts(t.Context(), cl, "hypershift",
"operator") and assert that: 1) the serving secret is regenerated
(updatedServingSecret.Data[corev1.TLSCertKey] and [corev1.TLSPrivateKeyKey] are
not empty), and 2) the CA secret remains the persisted CA
(updatedCASecret.Data[certs.CASignerCertMapKey] and [certs.CASignerKeyMapKey]
are unchanged/unchanged-from-original), ensuring EnsureWebhookCerts uses the
existing CA rather than re-bootstraping it.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3c7b830c-437f-4fea-a95f-fba702cfa9cc
📒 Files selected for processing (6)
cmd/install/assets/hypershift_operator.gocmd/install/install.gocmd/install/install_test.gohypershift-operator/controllers/webhookcerts/webhookcerts_controller.gohypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.gohypershift-operator/main.go
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go (1)
375-376:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRe-fetch the serving Secret before the
AlreadyExistsupdate path.
EnsureWebhookCertsruns before leader election, and webhook-enabled installs default to two operator replicas. If one pod sees NotFound at Line 376 and another pod creates the Secret first, Line 428 reuses a zero-valueexistingSecret, so theUpdategoes out without the currentresourceVersionand the loser can fail startup.🔧 Minimal fix
if createErr := c.Create(ctx, servingSecret); createErr != nil { if !apierrors.IsAlreadyExists(createErr) { return fmt.Errorf("failed to create serving cert secret: %w", createErr) } + existingSecret = &corev1.Secret{} + if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: ServingCertSecretName}, existingSecret); err != nil { + return fmt.Errorf("failed to get existing serving cert secret: %w", err) + } + existingSecret.Type = servingSecret.Type existingSecret.Data = servingSecret.Data if err := c.Update(ctx, existingSecret); err != nil { return fmt.Errorf("failed to update serving cert secret: %w", err) } }As per coding guidelines, “Use
support/upsert/for safe resource creation and updates”.Also applies to: 424-430
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go` around lines 375 - 376, EnsureWebhookCerts currently reads existingSecret once and may reuse a zero-value existingSecret in the AlreadyExists update path; re-fetch the serving Secret (use c.Get for ServingCertSecretName into existingSecret) immediately before performing the Update in the AlreadyExists branch so you have the latest resourceVersion, or switch to the support/upsert helper to perform a safe create-or-update for ServingCertSecretName; update the code paths that call c.Get/Update (references: EnsureWebhookCerts, existingSecret, ServingCertSecretName, c.Get, Update) to always read the current Secret before updating.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/install/install.go`:
- Around line 820-823: The install-time apply is dropping caBundle from rendered
CRD/webhook manifests (in setupCRDs) which can clear the live bundle when
applying; change setupCRDs (and the similar logic around the other render/apply
block at 949-957) so that when not in render-only/template mode it preserves the
live caBundle by reading the existing resource(s) before apply and copying the
existing .webhooks[*].clientConfig.caBundle (and any CRD conversion webhook
caBundle) into the manifest being applied, or skip removing caBundle entirely
for non-render installs; ensure this logic runs only for real applies (not
render-only) and reference setupCRDs and the apply/render code paths to
implement the fetch-and-merge behavior.
In
`@hypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.go`:
- Around line 454-480: Add a test case that covers the branch where the
persisted CA already exists but the serving cert is missing or empty: create a
pre-seeded CA secret (name CASecretName) with cert/key data
(certs.CASignerCertMapKey and certs.CASignerKeyMapKey), but either omit or
create an empty serving secret (name ServingCertSecretName), call
EnsureWebhookCerts with the fake client, and assert the function does not error
and that the regenerated serving secret contains TLS cert and key
(corev1.TLSCertKey, corev1.TLSPrivateKeyKey) whose chain/signature corresponds
to the existing CA; also assert the CA secret was not overwritten (its cert/key
remain unchanged) to exercise the Create(...CA...) -> AlreadyExists path in
EnsureWebhookCerts.
---
Duplicate comments:
In `@hypershift-operator/controllers/webhookcerts/webhookcerts_controller.go`:
- Around line 375-376: EnsureWebhookCerts currently reads existingSecret once
and may reuse a zero-value existingSecret in the AlreadyExists update path;
re-fetch the serving Secret (use c.Get for ServingCertSecretName into
existingSecret) immediately before performing the Update in the AlreadyExists
branch so you have the latest resourceVersion, or switch to the support/upsert
helper to perform a safe create-or-update for ServingCertSecretName; update the
code paths that call c.Get/Update (references: EnsureWebhookCerts,
existingSecret, ServingCertSecretName, c.Get, Update) to always read the current
Secret before updating.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 13eb008a-01d6-4178-bea4-654aea103a10
📒 Files selected for processing (6)
cmd/install/assets/hypershift_operator.gocmd/install/install.gocmd/install/install_test.gohypershift-operator/controllers/webhookcerts/webhookcerts_controller.gohypershift-operator/controllers/webhookcerts/webhookcerts_controller_test.gohypershift-operator/main.go
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: clebs, muraee The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm |
|
Scheduling tests matching the |
The webhook server requires TLS certs on disk before it can start listening. Previously, certs were only created via `hypershift install` manifests, meaning the operator would fail to start if the serving cert secret was not yet present like it is the case when running `hyperhsift install render` without the `--render-sensitive` flag. This is used in a scenario where the rendered manifests are pushed to a gitops workflow and, therefore, secrets can not be rendered. By adding bootstrap cert generation at startup, the operator is self-sufficient: if the secret exists, the volume mount delivers certs normally; if it is missing or empty, certs are generated, persisted, and written to disk. The secret volume is now marked optional so the pod can start without it. Signed-off-by: Borja Clemente <bclement@redhat.com>
Test the new certs secret bootstrapping logic in the hypershift-operator. Signed-off-by: Borja Clemente <bclement@redhat.com>
Cert secret bootstrapping now lives in the hypershift_operator and the CLI no longer needs to create the secrets. Signed-off-by: Borja Clemente <bclement@redhat.com>
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/lgtm |
|
Scheduling tests matching the |
Test Resultse2e-aws
e2e-aks
|
|
/retest |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/test e2e-aws-4-22 e2e-aks-4-22 e2e-aws |
|
/test e2e-aws |
1 similar comment
|
/test e2e-aws |
|
@clebs: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
All 3 pods started at approximately 09:51:21-25Z (within seconds of each other), all have 0 restarts. The DaemonSet uses The critical question: is this PR-related? The PR changes:
None of these changes touch the Let me now compile the final report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe Root CauseThe During the first sub-test's readiness check, the DaemonSet was transitioning from generation 2 to generation 4 (as shown by The hosted cluster dump (captured after the test) shows all 3 pods ( This is a transient timing/infrastructure flake — one node was slower than others at processing the DaemonSet rolling update (potentially due to image pull latency, volume mount delays, or node-level scheduling contention from parallel test workloads running on the same management cluster). The fact that later readiness checks in the same test ("first check" at line 3214, "second check" at line 3237) all passed confirms the issue was temporary. PR #8545 is not implicated: It modifies webhook cert bootstrapping ( Recommendations
Evidence
|
What this PR does / why we need it:
The webhook server requires TLS certs on disk before it can start listening. Previously, certs were only created via
hypershift installmanifests, meaning the operator would fail to start if the serving cert secret was not yet present like it is the case when runninghyperhsift install renderwithout the--render-sensitiveflag.This is used in a scenario where the rendered manifests are pushed to a gitops workflow and, therefore, secrets can not be rendered.
By adding bootstrap cert generation at startup, the operator is self-sufficient: if the secret exists, the volume mount delivers certs normally; if it is missing or empty, certs are generated, persisted, and written to disk. The secret volume is now marked optional so the pod can start without it.
Which issue(s) this PR fixes:
Followup to OCPBUGS-84955
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit
New Features
Bug Fixes
Tests