endpoints not configured with MinimizeIPTablesRestore #121362
For an externalTrafficPolicy Local service which is broken by it, it looks like this:
and it ends there; the jump to the SVL chain is missing (it is added after a restart):
What is interesting is that the "has no local endpoints" DROP rule is also not added. That rule is generated from the full endpoint set and is always written, even in partial writes, and this IMO indicates a problem with the endpoint change tracking.
/sig network
/cc @danwinship
/assign
Is that happening a lot? It shouldn't...
Not a lot: during high cluster activity of O(10) endpoint changes per second, about 0.1/s total on a 300-node cluster. But that is not the problem here; it is just one of the ways a full sync is triggered, which fixes the missing endpoints.
@juliantaylor I think it would be useful if you increase the logging in kube-proxy, print timestamps in your script, and upload:
The iptables snippet I pasted for the Local service is the main part: no jump to the local chain, no local chain, and notably no "no local endpoints" DROP. Only the cluster chain is not included, but that's just a bunch of endpoints with the local one also missing; I can provide the full thing for the service next week. Timestamps in the script comparing endpoints to the existing iptables configuration are not relevant, as it's a persistent situation: once it did not configure an endpoint, this will not change until something triggers a new endpoint change on the service or it does a full sync. The script can be run at any time and it reports the inconsistent configuration. I don't see any interesting logging in the related code of kube-proxy (based on the missing DROP, the proxier is probably fine and the issue is likely in the EndpointsChangeTracker), but I'll see what I can do.
Let me rephrase: it will be important to correlate the partial syncs with the update events and the failures, to see if that can give us a hint on why the reconcile loop omits that chain.
Right; the EXT chain is stale. At some point in time there are actually no endpoints, and it writes out a DROP rule and an EXT chain with no jump to the SVL chain. Then an endpoint appears, and it removes the DROP rule but for some reason doesn't rewrite the EXT chain.
Exactly. The reason for that is that the DROP rule is part of the code that is always rewritten, but the EXT chain is skipped if it thinks nothing changed: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L1200. This is not specific to the local chains; cluster chains have the same problem, they are just less noticeable, as a couple of nodes missing some endpoints seldom has an observable effect.
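To make the failure mode concrete, here is a minimal, self-contained model of that skip optimization (all names are hypothetical stand-ins; the real check is the tryPartialSync/serviceChanged/endpointsChanged test behind the link above): if the change trackers miss an update, the stale chain survives every subsequent partial sync.

package main

import "fmt"

func main() {
	programmed := map[string]int{"default/svc": 2} // endpoint count currently "in iptables"
	desired := map[string]int{"default/svc": 2}    // endpoint count from the API

	sync := func(tryPartialSync bool, changed map[string]bool) {
		for svc, endpoints := range desired {
			if tryPartialSync && !changed[svc] {
				continue // like natChains = skippedNatChains: keep the existing chains verbatim
			}
			programmed[svc] = endpoints // rewrite the SVC/EXT/SVL chains from scratch
		}
	}

	desired["default/svc"] = 3             // an endpoint appears...
	sync(true, map[string]bool{})          // ...but the tracker failed to flag the service
	fmt.Println(programmed["default/svc"]) // still 2: stale until something forces a full sync
}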
OK, so here are my thoughts so far...
I'm leaning toward (2), though (3) still worries me...
Also, @juliantaylor, @aojea, etc, having other people double-check my logic would be good...
I can't really judge the logic much, but I have been able to get a kube-proxy log at -v=4 with the situation occurring. I would be curious if you can reproduce this; we have 16 clusters of various sizes and at least one missing endpoint in each of them, though they are all version 1.26 with the feature gate on and the following iptables kube-proxy config:
For the logs: the missing endpoint in this case is 100.67.147.16:8443 of an externalTrafficPolicy Cluster service with 3 endpoints. The service has many ports, and multiple services point to the same pod (but this setup does not seem to be required for the issue; it is also seen on simpler services). Maybe it is interesting that the service port name
The missing endpoint's pod transitions (add 2 hours to get the same timezone as the kube-proxy logs):
Full log from pod creation to ready, and the iptables of the port:
Interesting. We know the pod is ready at 11:41:10 and it was scheduled at 11:40:39.
At 11:41:00.250555 there is an update, probably the kubelet setting the podIPs?
Two syncs after that
and
we missed 28 endpoints, and numNATRules increased; this is the sync that actually flushed the service-yyy-helper.
I'm worried if
I have so far not managed to reproduce the issue on a 1.27.7 cluster; maybe there was some bugfix I missed.
Hm... #114181 made us only add the initial jumps to [...]. Nothing else seems relevant...
It's curious that there happens to be a failed partial sync in the log snippet you attached:
That hash corresponds to the [...] (Also, the fact that it's a service chain that's missing means that the problem isn't (just) a bug in [...].) After that fails, it appears to correctly run a full sync:
(There's no explicit indication that this is a full sync, but those numNATChains/numNATRules values are much larger than anywhere else in the log.) So between 11:40:55 and 11:41:10 it manages to lose sync again. 🙁
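For reference, kube-proxy derives those per-service chain names by hashing the service port name, along the lines of portProtoHash/servicePortChainName in pkg/proxy/iptables/proxier.go. A sketch (the service port used here is just an illustrative input) that shows how a KUBE-SVC-... hash in a log line can be mapped back to a service:

package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
)

// portProtoHash mirrors how kube-proxy shortens a service port name into the
// 16-character suffix used in KUBE-SVC-/KUBE-SEP- chain names.
func portProtoHash(servicePortName, protocol string) string {
	hash := sha256.Sum256([]byte(servicePortName + protocol))
	encoded := base32.StdEncoding.EncodeToString(hash[:])
	return encoded[:16]
}

func main() {
	// hypothetical service port: "namespace/name:port" plus protocol
	fmt.Println("KUBE-SVC-" + portProtoHash("default/kubernetes:https", "tcp"))
}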
But how can we get more NAT rules with the same number of Services and fewer Endpoints?
IOW,
Let's wait for feedback on a supported version then.
/triage accepted
Sorry for the delay. I have now also seen it on a cluster running version 1.27.7.
I have deployed an instrumented kube-proxy to show the inconsistent internal state of kube-proxy without any iptables involvement. I applied the following patch onto the release-1.28 branch (41092d6). It tracks how many endpoints it knows from CategorizeEndpoints: when PendingChanges reports changes it records how many endpoints it had, and when it does not, it compares against the recorded count and logs if there is a difference.

diff --git a/pkg/proxy/iptables/proxier.go b/pkg/proxy/iptables/proxier.go
index ca781b44b5e..02b357caf88 100644
--- a/pkg/proxy/iptables/proxier.go
+++ b/pkg/proxy/iptables/proxier.go
@@ -160,6 +160,8 @@ type Proxier struct {
serviceChanges *proxy.ServiceChangeTracker
mu sync.Mutex // protects the following fields
+ prevCluster map[string]int
+ prevLocal map[string]int
svcPortMap proxy.ServicePortMap
endpointsMap proxy.EndpointsMap
nodeLabels map[string]string
@@ -273,6 +275,8 @@ func NewProxier(ipFamily v1.IPFamily,
serviceChanges: proxy.NewServiceChangeTracker(newServiceInfo, ipFamily, recorder, nil),
endpointsMap: make(proxy.EndpointsMap),
endpointsChanges: proxy.NewEndpointChangeTracker(hostname, newEndpointInfo, ipFamily, recorder, nil),
+ prevCluster: make(map[string]int),
+ prevLocal: make(map[string]int),
needFullSync: true,
syncPeriod: syncPeriod,
iptables: ipt,
@@ -769,6 +773,7 @@ func (proxier *Proxier) syncProxyRules() {
return
}
+
// The value of proxier.needFullSync may change before the defer funcs run, so
// we need to keep track of whether it was set at the *start* of the sync.
tryPartialSync := !proxier.needFullSync
@@ -1202,6 +1207,25 @@ func (proxier *Proxier) syncProxyRules() {
if tryPartialSync && !serviceChanged.Has(svcName.NamespacedName.String()) && !endpointsChanged.Has(svcName.NamespacedName.String()) {
natChains = skippedNatChains
natRules = skippedNatRules
+ prevendpointscl, _ := proxier.prevCluster[svcName.NamespacedName.String()]
+ prevendpointslo, _ := proxier.prevLocal[svcName.NamespacedName.String()]
+ // klog.ErrorS(nil, "Skipped", "serviceName", svcName.NamespacedName.String(), "cluster endpoints", len(clusterEndpoints), "pcl", prevendpointscl, "local endpoints", len(localEndpoints), "plo", prevendpointslo)
+ if prevendpointscl != len(clusterEndpoints) {
+ klog.ErrorS(nil, "ERROR", "serviceName", svcName.NamespacedName.String(), "cluster endpoints", len(clusterEndpoints), "!=", prevendpointscl)
+ }
+ if prevendpointslo != len(localEndpoints) {
+ klog.ErrorS(nil, "ERROR", "serviceName", svcName.NamespacedName.String(), "local endpoints", len(localEndpoints), "!=", prevendpointslo)
+ }
+ } else {
+ prevendpointscl, ok := proxier.prevCluster[svcName.NamespacedName.String()]
+ prevendpointslo, _ := proxier.prevLocal[svcName.NamespacedName.String()]
+ if !ok {
+ prevendpointscl = -1
+ prevendpointslo = -1
+ }
+ proxier.prevCluster[svcName.NamespacedName.String()] = len(clusterEndpoints)
+ proxier.prevLocal[svcName.NamespacedName.String()] = len(localEndpoints)
+ klog.ErrorS(nil, "Change", "serviceName", svcName.NamespacedName.String(), "cluster endpoints", len(clusterEndpoints), "pcl", prevendpointscl, "local endpoints", len(localEndpoints), "plo", prevendpointslo, "total", len(proxier.prevCluster))
}
// Set up internal traffic handling.
diff --git a/pkg/util/iptables/iptables.go b/pkg/util/iptables/iptables.go
index 0d8135e3297..4e382bfa319 100644
--- a/pkg/util/iptables/iptables.go
+++ b/pkg/util/iptables/iptables.go
@@ -421,16 +421,6 @@ func (runner *runner) restoreInternal(args []string, data []byte, flush FlushFla
fullArgs := append(runner.restoreWaitFlag, args...)
iptablesRestoreCmd := iptablesRestoreCommand(runner.protocol)
klog.V(4).InfoS("Running", "command", iptablesRestoreCmd, "arguments", fullArgs)
- cmd := runner.exec.Command(iptablesRestoreCmd, fullArgs...)
- cmd.SetStdin(bytes.NewBuffer(data))
- b, err := cmd.CombinedOutput()
- if err != nil {
- pErr, ok := parseRestoreError(string(b))
- if ok {
- return pErr
- }
- return fmt.Errorf("%w: %s", err, b)
- }
return nil
}
Running this in a small cluster and restarting pods triggered the error after a couple of minutes of removing pods. Here the kyverno/kyverno-svc service has one endpoint replaced: it sees the change from 2->3, then another 2->2 change, but then clusterEndpoints contains 3 endpoints while PendingChanges() reports false, so it logs the ERROR.
I may very well have overlooked some case where storing the previous endpoints needed to be done, but if not, this should show that the problem is somewhere in the code involved in determining if an endpoint has changed.
I'm not going to be able to dig into this myself for the next few days, but if you could get similar logs at V(4), with the additional logging from #122048, then that might provide enough info to figure it out from the logs. (I still can't see how
Added more logging, including readiness of endpoints (true/false) in endpointslicecache.go. I'm pretty sure the issue is a missing synchronization between [...]. Here are the annotated logs; this issue is also trivial to reproduce with a couple of minutes' patience.
Yep:
kubernetes/pkg/proxy/iptables/proxier.go, lines 797 to 802 in 55f2bc1
kubernetes/pkg/proxy/iptables/proxier.go, lines 568 to 588 in 55f2bc1
What??? I could swear I checked that there was a lock there 🤦 Adding locks to those functions shouldn't cause any deadlocks, so you could test adding
[...]
to the start of [...]. I think the right fix, though, is probably to remove the [...]
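A self-contained model of that race, with hypothetical stand-in names (the real code paths are the event handlers and the syncProxyRules lines linked above): the tracker's own lock makes each individual call atomic, but the sync loop's snapshot-then-consume is two separate critical sections, so an update landing in between is applied to the map yet never reported as changed.

package main

import (
	"fmt"
	"sync"
)

type changeTracker struct {
	lock  sync.Mutex
	items map[string]int // service -> new endpoint count
}

func (t *changeTracker) update(svc string, endpoints int) {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.items[svc] = endpoints
}

// pendingChanges snapshots which services changed, like PendingChanges().
func (t *changeTracker) pendingChanges() map[string]bool {
	t.lock.Lock()
	defer t.lock.Unlock()
	changed := map[string]bool{}
	for svc := range t.items {
		changed[svc] = true
	}
	return changed
}

// applyTo consumes all accumulated changes, like endpointsMap.Update().
func (t *changeTracker) applyTo(endpointsMap map[string]int) {
	t.lock.Lock()
	defer t.lock.Unlock()
	for svc, n := range t.items {
		endpointsMap[svc] = n
	}
	t.items = map[string]int{}
}

func main() {
	t := &changeTracker{items: map[string]int{}}
	endpointsMap := map[string]int{"default/svc": 2} // programmed by an earlier sync

	changed := t.pendingChanges() // sync loop: snapshot is empty
	t.update("default/svc", 3)    // handler fires between the two calls
	t.applyTo(endpointsMap)       // sync loop: consumes the update, clears the tracker

	// The map now has 3 endpoints, but the service was never in the changed
	// snapshot, so its chains are skipped; the next sync sees no pending
	// changes either. The update is lost until a full sync.
	fmt.Println(changed["default/svc"], endpointsMap["default/svc"], len(t.items))
	// Output: false 3 0
}

Taking the proxier's own mutex at the start of the handlers, as suggested above, would prevent them from interleaving with a sync that is in progress.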
What happened?
To test out the MinimizeIPTablesRestore feature, which is enabled by default in 1.27, we enabled the setting in 1.26 and discovered that it does not always configure endpoints properly.
While the version in 1.26 is still alpha, I have not found any significant changes, bug reports, or backports in the changelog regarding this.
It seems that sometimes kube-proxy does not see that the endpoints of a service need to be updated, despite the API EndpointSlice having changed. This is typically not very visible with externalTrafficPolicy Cluster services, as some other endpoint is still configured, but in the case of low-replica services or externalTrafficPolicy Local services it can lead to service disruption: the only endpoint on the local node does not get configured, while other components see the update correctly and direct traffic there.
For example, this two-pod deployment, with both pods ready for hours:
and the chain on one node only contains one of the ready endpoints:
This situation remains until either the pod readiness changes, a partial sync fails and it does a full resync, or kube-proxy is restarted (which also does a full sync).
The kube-proxy logs show nothing unusual, just lots of partial syncs (which in our cluster take about 1-2 seconds, while a full sync takes 20-40 seconds).
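The fallback on partial-sync failure mentioned above works roughly like this sketch (assumed names, loosely paraphrasing the needFullSync handling in pkg/proxy/iptables/proxier.go, not the verbatim source):

package main

import (
	"errors"
	"fmt"
)

type proxier struct{ needFullSync bool }

func (p *proxier) restore(partial bool) error {
	if partial {
		return errors.New("iptables-restore failed") // simulate a failed partial restore
	}
	return nil
}

func (p *proxier) syncProxyRules() {
	tryPartialSync := !p.needFullSync
	success := false
	defer func() {
		if !success {
			// the change trackers were already consumed, so the state needed
			// for another partial sync is gone: force a full rewrite next time
			p.needFullSync = true
		}
	}()
	if err := p.restore(tryPartialSync); err != nil {
		return
	}
	p.needFullSync = false
	success = true
}

func main() {
	p := &proxier{}
	p.syncProxyRules()          // partial sync fails...
	fmt.Println(p.needFullSync) // true: the next sync will be a full sync
	p.syncProxyRules()          // ...which succeeds
	fmt.Println(p.needFullSync) // false
}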
What did you expect to happen?
On every endpoint change, the iptables rules are configured accordingly.
How can we reproduce it (as minimally and precisely as possible)?
We see this in clusters with 1500 services, 14000 pods, and 300 nodes, on most nodes, for a small handful of the services.
The cluster is running v1.26.9
I have written a very crude script to check for the inconsistencies and fix them via a proxy restart:
https://gist.github.com/juliantaylor/996e0255809cb2077b66dc034ec47f55
I imagine reproducing it is tricky, but maybe someone can run the script in their clusters, perhaps also on a newer version, to see if it is still a problem.
Note: due to races it can report errors while things in the cluster are changing, but if a deployment behind an IP is stable and still shows up, kube-proxy has not configured it properly. (One could add a check on the ready-state transition time in the script to avoid this.)
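The linked gist is a shell script; purely as an illustration, the core check amounts to something like the following (hypothetical helper, assuming the programmed endpoints show up as DNAT targets in iptables-save output):

package main

import (
	"fmt"
	"strings"
)

// missingEndpoints reports every ready endpoint IP that does not appear as a
// DNAT destination anywhere in the iptables-save output.
func missingEndpoints(readyIPs []string, iptablesSave string) []string {
	var missing []string
	for _, ip := range readyIPs {
		if !strings.Contains(iptablesSave, "--to-destination "+ip+":") {
			missing = append(missing, ip)
		}
	}
	return missing
}

func main() {
	save := `-A KUBE-SEP-AAAA -p tcp -m tcp -j DNAT --to-destination 100.96.1.10:8443`
	fmt.Println(missingEndpoints([]string{"100.96.1.10", "100.96.2.11"}, save))
	// [100.96.2.11] -> endpoint not programmed by kube-proxy
}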
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)