OCPBUGS-83604: fix(kubevirt): filter link-local addresses from EndpointSlice endpoints#8264
Conversation
KubeVirt VMs can report link-local addresses (169.254.0.0/16, fe80::/10) in Machine.Status.Addresses. Kubernetes 1.33 added validation that rejects link-local addresses in EndpointSlices, causing the HCCO machine reconciler to fail when creating EndpointSlices for guest LoadBalancer shadow services. The early return on error also prevented processing of subsequent KCCM shadow services. Filter out link-local addresses using netip.Addr.IsLinkLocalUnicast() before populating EndpointSlice endpoints. Signed-off-by: Oren Cohen <ocohen@redhat.com> Assisted-by: Claude
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
📝 WalkthroughWalkthroughThe changes modify the KubeVirt passthrough 🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/cc @qinqon |
|
@orenc1: This pull request references Jira Issue OCPBUGS-83604, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@orenc1: This pull request references Jira Issue OCPBUGS-83604, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine.go (1)
130-130: Consider debug verbosity for per-address skip logs.Line 130 logs every filtered address at
Info;log.V(1).Info(...)would preserve observability with less default log noise in larger clusters.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine.go` at line 130, The per-address skip log currently uses log.Info which is too noisy; change the call to use debug verbosity by replacing log.Info("Skipping link-local address for EndpointSlice", "address", machineAddress.Address, "machine", machine.Name) with log.V(1).Info(...) so the message is emitted at verbosity level 1; update the single invocation located in machine.go where the code logs the skipped link-local address (references: machineAddress.Address and machine.Name) so it retains the same keys and values but uses log.V(1).Info for lower default log noise.control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine_test.go (1)
480-497: Optional: add an all-link-local-only machine scenario.A case where machines have only link-local internal IPs would lock in expected behavior for empty endpoint lists and guard against future regressions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine_test.go` around lines 480 - 497, Add a new test case in machine_test.go that covers machines with only link-local internal IPs: create a machinesAllLinkLocal (or similar) fixture with machines whose only InternalIPs are link-local, use the existing pairOfVirtualMachines and services (defaultIngressService, kccmService), and add a case named like "When machines have only link-local addresses it should produce no EndpointSlice endpoints" that sets machines: machinesAllLinkLocal, virtualMachines: pairOfVirtualMachines, services: []corev1.Service{defaultIngressService, kccmService}, expectedServices: the same services, expectedIngressEndpointSlices: an empty []discoveryv1.EndpointSlice{}, and hcp: kubevirtHCP; ensure the new case follows the same structure as the existing link-local test so the test harness exercises the empty-endpoints behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine_test.go`:
- Around line 480-497: Add a new test case in machine_test.go that covers
machines with only link-local internal IPs: create a machinesAllLinkLocal (or
similar) fixture with machines whose only InternalIPs are link-local, use the
existing pairOfVirtualMachines and services (defaultIngressService,
kccmService), and add a case named like "When machines have only link-local
addresses it should produce no EndpointSlice endpoints" that sets machines:
machinesAllLinkLocal, virtualMachines: pairOfVirtualMachines, services:
[]corev1.Service{defaultIngressService, kccmService}, expectedServices: the same
services, expectedIngressEndpointSlices: an empty []discoveryv1.EndpointSlice{},
and hcp: kubevirtHCP; ensure the new case follows the same structure as the
existing link-local test so the test harness exercises the empty-endpoints
behavior.
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine.go`:
- Line 130: The per-address skip log currently uses log.Info which is too noisy;
change the call to use debug verbosity by replacing log.Info("Skipping
link-local address for EndpointSlice", "address", machineAddress.Address,
"machine", machine.Name) with log.V(1).Info(...) so the message is emitted at
verbosity level 1; update the single invocation located in machine.go where the
code logs the skipped link-local address (references: machineAddress.Address and
machine.Name) so it retains the same keys and values but uses log.V(1).Info for
lower default log noise.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: ed44e3cb-a064-4406-a97d-1708545b5cd3
📒 Files selected for processing (2)
control-plane-operator/hostedclusterconfigoperator/controllers/machine/machine.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/machine/machine_test.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8264 +/- ##
=======================================
Coverage 35.61% 35.62%
=======================================
Files 767 767
Lines 93333 93336 +3
=======================================
+ Hits 33245 33248 +3
Misses 57399 57399
Partials 2689 2689
🚀 New features to boost your workflow:
|
|
/lgtm |
|
Scheduling tests matching the |
|
/cc @bryan-cox |
|
/verified by e2e |
|
@qinqon: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira cherry-pick release-4.21 |
|
/jira help |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
All 7 failures cascade from just 2 leaf failures ( Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe Root CauseTransient packageserver startup timeout during hosted cluster bootstrapping (flaky, unrelated to PR) The packageserver container timed out with
The ~2 minute window between the packageserver starting (10:40:05Z) and failing (10:42:07Z) indicates the kube-apiserver was not fully ready despite the availability-prober having completed. After the restart, the kube-apiserver was fully available and the packageserver connected successfully. Why this is NOT related to PR #8264:
Recommendations
Evidence
|
Test Resultse2e-aws
e2e-aks
|
|
/retest-required |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jparrill, Nirshal, orenc1, qinqon The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/jira backport release-4.22,release-4.21,release-4.20 |
|
@qinqon: The following backport issues have been created:
Queuing cherrypicks to the requested branches to be created after this PR merges: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@openshift-ci-robot: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@orenc1: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@orenc1: Jira Issue Verification Checks: Jira Issue OCPBUGS-83604 Jira Issue OCPBUGS-83604 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@openshift-ci-robot: new pull request created: #8270 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@openshift-ci-robot: new pull request created: #8271 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@openshift-ci-robot: new pull request created: #8272 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Summary
The HCCO machine controller fails to create EndpointSlices for KubeVirt guest LoadBalancer shadow services on Kubernetes 1.33+ clusters. This is because link-local addresses (
169.254.0.0/16,fe80::/10) reported inMachine.Status.Addressesare now rejected by EndpointSlice validation.Problem
KubeVirt VMs can report link-local addresses (e.g.
169.254.0.2,fe80::1) asInternalIPentries inMachine.Status.Addresses. ThereconcileKubevirtPassthroughServicefunction collected allInternalIPaddresses without filtering and placed them into EndpointSlice endpoints.Kubernetes 1.33 introduced validation that rejects link-local addresses in EndpointSlices, causing errors like:
Because the reconciler returns early on error, the failure on the ingress passthrough service also prevented processing of any subsequent KCCM shadow services.
Fix
Filter out link-local addresses using
netip.Addr.IsLinkLocalUnicast()(which covers both IPv4169.254.0.0/16and IPv6fe80::/10) before populating EndpointSlice endpoints. Filtered addresses are logged for observability.Test plan
192.168.1.x,2001:db8::x) and link-local (169.254.0.2,169.254.169.254,fe80::1,fe80::dead:beef) addressesWhat this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes https://redhat.atlassian.net/browse/OCPBUGS-83604
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit
Bug Fixes
Tests