
endpoints not configured with MinimizeIPTablesRestore #121362

Closed
juliantaylor opened this issue Oct 19, 2023 · 29 comments · Fixed by #122204
Assignees
Labels
area/kube-proxy kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@juliantaylor
Contributor

juliantaylor commented Oct 19, 2023

What happened?

To test out the MinimizeIPTablesRestore feature, which is enabled by default in 1.27, we enabled the setting in 1.26 and discovered that it does not always configure endpoints properly.
While the feature is still alpha in 1.26, I have not found any significant changes, bug reports, or backports in the changelog regarding this.

It seems that sometimes kube-proxy does not notice that the endpoints of a service need to be updated even though the API EndpointSlice has changed. This is typically not very visible with externalTrafficPolicy Cluster services, as some other endpoint is still configured, but for low-replica services or externalTrafficPolicy Local services it can lead to service disruption: the only endpoint on the local node does not get configured, while other components see the update correctly and direct traffic there.

For example, this two-pod deployment has had both pods ready for hours:

$ kubectl -n ns get pods pod1 pod2 -o yaml | grep -B 2 Ready
      lastTransitionTime: "2023-10-19T10:21:19Z"
      status: "True"
      type: Ready
--
      lastTransitionTime: "2023-10-19T10:21:19Z"
      status: "True"
      type: ContainersReady
--
      lastTransitionTime: "2023-10-19T10:20:19Z"
      status: "True"
      type: Ready
--
      lastTransitionTime: "2023-10-19T10:20:19Z"
      status: "True"
      type: ContainersReady

and the chain on one node only contains one of the ready endpoints:

KUBE-SERVICES -d 10.78.44.163/32 -p tcp -m comment --comment "svc:serviceport loadbalancer IP" -m tcp --dport 4000 -j KUBE-EXT-APW7J2FZSAAP4Z43
KUBE-EXT-APW7J2FZSAAP4Z43 -m comment --comment "masquerade traffic for svc:serviceport external destinations" -j KUBE-MARK-MASQ
KUBE-EXT-APW7J2FZSAAP4Z43 -j KUBE-SVC-APW7J2FZSAAP4Z43
KUBE-SVC-APW7J2FZSAAP4Z43 ! -s 100.64.0.0/13 -d 100.72.16.186/32 -p tcp -m comment --comment "svc:serviceport cluster IP" -m tcp --dport 4000 -j KUBE-MARK-MASQ
KUBE-SVC-APW7J2FZSAAP4Z43 -j KUBE-SEP-JDEM23RCGZCIYUJI
KUBE-SEP-JDEM23RCGZCIYUJI -s 100.64.131.130/32 -j KUBE-MARK-MASQ
KUBE-SEP-JDEM23RCGZCIYUJI -p tcp -m tcp -j DNAT --to-destination 100.64.131.130:4000

This situation persists until either the pod readiness changes, a partial sync fails and it does a full resync, or kube-proxy is restarted (which also does a full sync).

The kube-proxy logs show nothing unusual, just lots of partial syncs (which in our cluster take about 1-2 seconds, while a full sync takes 20-40 seconds).

What did you expect to happen?

On every endpoint change, the iptables rules are configured accordingly.

How can we reproduce it (as minimally and precisely as possible)?

We see this in clusters with 1500 services, 14000 pods, and 300 nodes, on most nodes for a small handful of the services.
The cluster is running v1.26.9.
I have written a very crude script to check for the inconsistencies and fix them via a proxy restart:
https://gist.github.com/juliantaylor/996e0255809cb2077b66dc034ec47f55
I imagine reproducing it is tricky, but maybe someone can run the script in their clusters, ideally also on a newer version, to see if it is still a problem.
Note that due to races it can report errors while things in the cluster are changing, but if a deployment behind an IP is stable and still shows up, kube-proxy has not configured it properly. (One could add a check on the ready-state transition time in the script to avoid this.)

./check-iptables-nodes.sh $(kubectl get nodes | awk '{print $1}')
kworker-be-prod-iz2-001
kworker-be-prod-iz2-003
kworker-be-prod-iz2-004
10.78.44.54 cep 1 2 
Cluster service ERROR
10.78.44.53 cep 1 2 
Cluster service ERROR
kubectl -n kube-system delete pod kube-proxy-f3cabbf48bbe893419216de2439c9ae609ae6c8a-llvqk
kworker-be-prod-iz2-005
kworker-be-prod-iz2-006
10.78.44.53 cep 1 2 
Cluster service ERROR
kubectl -n kube-system delete pod kube-proxy-f3cabbf48bbe893419216de2439c9ae609ae6c8a-5sd9r
kworker-be-prod-iz2-008
10.78.44.54 cep 1 2 
Cluster service ERROR
10.78.44.49 cep 1 2 
Cluster service ERROR
10.78.44.53 cep 1 2 
Cluster service ERROR
10.78.44.82 cep 1 2 
Cluster service ERROR

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.9", GitCommit:"d1483fdf7a0578c83523bc1e2212a606a44fd71d", GitTreeState:"clean", BuildDate:"2023-09-13T11:25:26Z", GoVersion:"go1.20.8", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

bare metal

OS version

linux 5.15 flatcar stable kube-proxy hyperkube image

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@juliantaylor juliantaylor added the kind/bug Categorizes issue or PR as related to a bug. label Oct 19, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 19, 2023
@juliantaylor
Contributor Author

juliantaylor commented Oct 19, 2023

For an externalTrafficPolicy Local service that is broken by this, it looks like the following:

KUBE-SERVICES -d 10.78.45.54/32 -p tcp -m comment --comment "ghelper/traefik2-service:http loadbalancer IP" -m tcp --dport 80 -j KUBE-EXT-O5TNKVKDWS2IHWWC
KUBE-EXT-O5TNKVKDWS2IHWWC -s 100.64.0.0/13 -m comment --comment "pod traffic for ghelper/traefik2-service:http external destinations" -j KUBE-SVC-O5TNKVKDWS2IHWWC
KUBE-EXT-O5TNKVKDWS2IHWWC -m comment --comment "masquerade LOCAL traffic for ghelper/traefik2-service:http external destinations" -m addrtype --src-type LOCAL -j KUBE-MARK-MASQ
KUBE-EXT-O5TNKVKDWS2IHWWC -m comment --comment "route LOCAL traffic for ghelper/traefik2-service:http external destinations" -m addrtype --src-type LOCAL -j KUBE-SVC-O5TNKVKDWS2IHWWC

and it ends there; missing is the jump to the SVL chain (added after a restart):

... this is missing
KUBE-EXT-O5TNKVKDWS2IHWWC -j KUBE-SVL-O5TNKVKDWS2IHWWC

What is interesting is that the "has no local endpoints" DROP rule is also not added: that rule is generated from the full endpoint set, is always written even in partial syncs, and hasExternalEndpoints is true.

This imo indicates a problem with the endpointsChanged function:

if tryPartialSync && !serviceChanged.Has(svcName.NamespacedName.String()) && !endpointsChanged.Has(svcName.NamespacedName.String()) {

@uablrek
Contributor

uablrek commented Oct 19, 2023

/sig network
/area kube-proxy

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. area/kube-proxy and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 19, 2023
@aojea
Member

aojea commented Oct 19, 2023

/cc @danwinship

@danwinship
Contributor

/assign

a partial sync fails and it does a full resync

Is that happening a lot? It shouldn't...

@juliantaylor
Contributor Author

juliantaylor commented Oct 20, 2023

/assign

a partial sync fails and it does a full resync

Is that happening a lot? It shouldn't...

Not a lot: during high cluster activity of O(10) endpoint changes per second, about 0.1/s total on a 300-node cluster. But that is not the problem here, just one of the ways a full sync is triggered, which fixes the missing endpoints.

@aojea
Member

aojea commented Oct 20, 2023

@juliantaylor I think it would be useful if you increase the logging in kube-proxy, print timestamps in your script, and upload:

  • kube-proxy logs with verbosity -v4
  • output of your script with timestamps
  • one of the problematic iptables dumps

@juliantaylor
Contributor Author

juliantaylor commented Oct 20, 2023

The iptables snippet I pasted for the Local service is the main part: no jump to the local chain, no local chain, and notably no "no local endpoints" DROP. Only the cluster chain is not included, but that is just a bunch of endpoints with the local one also missing; I can provide the full thing for the service next week.

Timestamps in the script comparing endpoints to the existing iptables configuration are not relevant, as it is a persistent situation: once an endpoint has not been configured, this does not change until something triggers a new endpoint change on the service or a full sync runs. The script can be run at any time and it reports the inconsistent configuration.

I don't see any interesting logging in the related kube-proxy code (based on the missing DROP rule the proxier is probably fine and the issue is likely in the EndpointsChangeTracker), but I'll see what I can do.

@aojea
Member

aojea commented Oct 20, 2023

Let me rephrase: it will be important to correlate the partial syncs with the update events and the failures, to see if that gives us a hint on why the reconcile loop omits that chain.

@danwinship
Contributor

no jump to the local chain, no local chain and notably no "no local endpoints" DROP.

right; the EXT chain is stale. At some point in time there are actually no endpoints, and it writes out a DROP rule and an EXT chain with no jump to the SVL chain. Then an endpoint appears, and it removes the DROP rule but for some reason doesn't rewrite the EXT chain.

@juliantaylor
Contributor Author

juliantaylor commented Oct 20, 2023

no jump to the local chain, no local chain and notably no "no local endpoints" DROP.

right; the EXT chain is stale. At some point in time there are actually no endpoints, and it writes out a DROP rule and an EXT chain with no jump to the SVL chain. Then an endpoint appears, and it removes the DROP rule but for some reason doesn't rewrite the EXT chain.

Exactly. The reason for that is that the DROP rule is part of the code that is always rewritten, but the EXT chain is skipped if it thinks nothing changed: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L1200
So allEndpoints is filled correctly, it sets hasExternalEndpoints=true and removes the DROP rule, but then the service is missing from the endpointsChanged set returned by PendingChanges, so it skips writing the local chain.

This is not specific to the local chains; the cluster chains have the same problem, they are just less noticeable since a couple of nodes missing some endpoints seldom has an observable effect.
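To make the mechanism concrete, here is a minimal, self-contained sketch of that behavior (hypothetical names and made-up helper output, not the real proxier code): rules derived from the freshly computed state are always regenerated, while the per-service chains are only rewritten when the service shows up in the pending-changes set.

package main

import "fmt"

// writeRules sketches the partial-sync behavior described above: rules derived
// from the full, freshly computed state (like the "no local endpoints" DROP
// rule) are regenerated on every sync, while the per-service KUBE-EXT/KUBE-SVL
// chains are only rewritten when the service is in the changed set. If the
// change tracker misses an update, those chains stay stale.
func writeRules(tryPartialSync bool, changed map[string]bool, localEndpoints map[string][]string) {
	for svc, eps := range localEndpoints {
		if len(eps) > 0 {
			fmt.Printf("%s: remove the 'has no local endpoints' DROP rule (recomputed every sync)\n", svc)
		}
		if tryPartialSync && !changed[svc] {
			fmt.Printf("%s: skip rewriting KUBE-EXT/KUBE-SVL chains -> they stay stale\n", svc)
			continue
		}
		fmt.Printf("%s: rewrite KUBE-EXT jump to KUBE-SVL for %v\n", svc, eps)
	}
}

func main() {
	// An endpoint became ready, but the service is missing from the
	// pending-changes set: the DROP rule goes away, the chains are not rewritten.
	writeRules(true, map[string]bool{}, map[string][]string{
		"ghelper/traefik2-service": {"100.64.131.130:4000"},
	})
}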

@danwinship
Contributor

This imo indicates a problem with the endpointsChanged function:

(endpointsChanged is a set; the function there is just endpointsChanged.Has(). So the problem would be "a problem with the contents of the endpointsChanged set".)

OK, so here are my thoughts so far...

  1. If there's a bug in EndpointsChangeTracker.PendingChanges(), such that endpointsChanged ends up not including some changed endpoints despite the tracker having noticed that they changed and triggering a resync, then we'd fail to rewrite the rules for those endpoints. Alternatively, the same could be true for ServiceChangeTracker.PendingChanges()/servicesChanged.
    • This seems unlikely: the logic for deciding what's in endpointsChanged/servicesChanged is pretty closely tied to the logic for triggering syncs. It seems like if the tracker noticed the change at all, it ought to show up in endpointsChanged/servicesChanged. Or at least, if there was a bug here, it ought to be pretty obvious, not something rare/subtle.
  2. If there's a bug in EndpointsChangeTracker more generally such that it fails to trigger syncs on some EndpointSlice changes, then the endpoint also wouldn't show up in endpointsChanged and so we wouldn't rewrite its chains. (Or again, the problem could be with ServiceChangeTracker rather than EndpointsChangeTracker.)
    • In this case there would have been a bug even in the pre-MinimizeIPTablesRestore behavior, but it's possible we just never noticed it, since any later change would pick up the previously-ignored Service/EndpointSlice change as well, so the effect would only be that the update was slightly delayed, not missed entirely.
    • If there was a missing mutex lock somewhere then there could be a data race that causes an update to be missed?
  3. If the set of "usable" endpoints for a Service (i.e., one of the return values of CategorizeEndpoints) changes without the Service itself or any of its EndpointSlices having changed (in a relevant way), then we might need to rewrite the service/endpoint chains but not do so.
    • e.g., if either (a) hasEndpoints changed from false to true, or (b) localEndpoints changed from empty to non-empty, without servicesChanged or endpointsChanged including the service, then we would end up with the set of rules you quoted above.
    • At first I was thinking that must be it, but after looking at the code some more, I don't think that can actually happen...
      • The return values of CategorizeEndpoints will change for a Cluster policy service if the result of canUseTopology() changes, but that can only change if either (a) the service changes (in which case it should be in servicesChanged), (b) at least one endpoint changes (in which case it should be in endpointsChanged), or (c) the node labels change (in which case OnNodeUpdate would force a full resync).
      • The return values of CategorizeEndpoints can change when a service flips between using Ready endpoints and using Serving-Terminating endpoints. e.g., if endpoint A is Local,Ready, and endpoint B is Local,Serving,Terminating, then the returned localEndpoints would be [A]. If endpoint A then terminates, then localEndpoints would change to [A, B] even though B didn't change. But that shouldn't actually cause any problems, because endpointsChanged tracks whether any endpoint for the service changed, and in this case A changed, so we should be ok.

I'm leaning toward (2), though (3) still worries me...
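For reference, here is a rough, self-contained sketch of the Ready vs Serving-Terminating fallback mentioned in point 3 (simplified, with made-up types; the real CategorizeEndpoints also handles traffic policies and topology):

package main

import "fmt"

type endpoint struct {
	addr        string
	ready       bool
	serving     bool
	terminating bool
}

// usableEndpoints sketches the fallback: prefer Ready endpoints, and only if
// none are Ready fall back to endpoints that are Serving while Terminating.
func usableEndpoints(all []endpoint) []string {
	var ready, servingTerminating []string
	for _, ep := range all {
		switch {
		case ep.ready:
			ready = append(ready, ep.addr)
		case ep.serving && ep.terminating:
			servingTerminating = append(servingTerminating, ep.addr)
		}
	}
	if len(ready) > 0 {
		return ready
	}
	return servingTerminating
}

func main() {
	a := endpoint{addr: "A", ready: true}
	b := endpoint{addr: "B", serving: true, terminating: true}
	fmt.Println(usableEndpoints([]endpoint{a, b})) // [A]

	// A terminates: the result now includes B even though B itself never changed.
	a = endpoint{addr: "A", serving: true, terminating: true}
	fmt.Println(usableEndpoints([]endpoint{a, b})) // [A B]
}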

@danwinship
Contributor

I'm leaning toward (2), though (3) still worries me...

also, @juliantaylor, @aojea, etc, having other people double-check my logic would be good...

@juliantaylor
Contributor Author

I can't really judge the logic much, but I have been able to get a kube-proxy v4 log with the situation occurring.
It is reasonably reproducible also in smaller clusters; the smallest one it was seen in had only 5 nodes and 133 services, though it is still very random and not easily triggered.

I would be curious if you can reproduce this; we have 16 clusters of various sizes and at least one missing endpoint in each of them, though they are all version 1.26 with the feature gate on and the following iptables kube-proxy config:

    iptables:
      masqueradeAll: false
      masqueradeBit: 14
      minSyncPeriod: 0s
      syncPeriod: 30s

For the logs, the missing endpoint in this case is 100.67.147.16:8443 of an externalTrafficPolicy Cluster service with 3 endpoints. The service has many ports, and multiple services point to the same pod (but this does not seem to be required for the issue; it is also seen on simpler services).

Maybe interesting: the service port name portName="yyy-helper/service-yyy:https" is printed twice sometimes; maybe two calls to endpointsMapFromEndpointInfo at the same time?
E.g. at these timestamps (first and last line); this is also where 100.67.147.16:8443 became ready, but the double print occurs earlier too.

I1024 11:41:10.117628       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="yyy-helper/service-yyy:https" endpoints=[100.64.156.127:8443 100.65.146.185:8443 100.67.147.16:8443]
I1024 11:41:10.117664       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="yyy-helper/service-yyy:https-clientcert" endpoints=[100.64.156.127:8445 100.65.146.185:8445 100.67.147.16:8445]
I1024 11:41:10.117720       1 config.go:262] "Calling handler.OnEndpointSliceUpdate"
I1024 11:41:10.117792       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="yyy-helper/service-yyy:http" endpoints=[100.64.156.127:8000 100.65.146.185:8000 100.67.147.16:8000]
I1024 11:41:10.117815       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="yyy-helper/service-yyy:https" endpoints=[100.64.156.127:8443 100.65.146.185:8443 100.67.147.16:8443]

The missing endpoint's pod transitions (add 2 hours to match the timezone of the kube-proxy logs):

    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2023-10-24T09:40:39Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2023-10-24T09:41:10Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2023-10-24T09:41:10Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2023-10-24T09:40:39Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - ...
      started: true
      state:
        running:
          startedAt: "2023-10-24T09:40:41Z"
    hostIP: 10.242.195.5
    phase: Running
    podIP: 100.67.147.16
    startTime: "2023-10-24T09:40:39Z"

full log from pod creation to ready
kube-proxy2.log

The iptables rules for the port:

-A KUBE-EXT-JOL5W3UOU7OAGK2C -m comment --comment "masquerade traffic for service-yyy-helper/service-yyy:https external destinations" -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "service-yyy-helper/service-yyy:https" -m tcp --dport 31129 -j KUBE-EXT-JOL5W3UOU7OAGK2C
-A KUBE-SEP-PTAGJMY4KK4H3NT2 -s 100.64.156.127/32 -m comment --comment "service-yyy-helper/service-yyy:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-PTAGJMY4KK4H3NT2 -p tcp -m comment --comment "service-yyy-helper/service-yyy:https" -m tcp -j DNAT --to-destination 100.64.156.127:8443
-A KUBE-SEP-SQ3HMH4RORFZEHLZ -s 100.65.146.185/32 -m comment --comment "service-yyy-helper/service-yyy:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-SQ3HMH4RORFZEHLZ -p tcp -m comment --comment "service-yyy-helper/service-yyy:https" -m tcp -j DNAT --to-destination 100.65.146.185:8443
-A KUBE-SERVICES -d 100.72.43.157/32 -p tcp -m comment --comment "service-yyy-helper/service-yyy:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-JOL5W3UOU7OAGK2C
-A KUBE-SERVICES -d 10.90.33.119/32 -p tcp -m comment --comment "service-yyy-helper/service-yyy:https loadbalancer IP" -m tcp --dport 443 -j KUBE-EXT-JOL5W3UOU7OAGK2C
-A KUBE-SVC-JOL5W3UOU7OAGK2C ! -s 100.64.0.0/13 -d 100.72.43.157/32 -p tcp -m comment --comment "service-yyy-helper/service-yyy:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SVC-JOL5W3UOU7OAGK2C -m comment --comment "service-yyy-helper/service-yyy:https -> 100.64.156.127:8443" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-PTAGJMY4KK4H3NT2
-A KUBE-SVC-JOL5W3UOU7OAGK2C -m comment --comment "service-yyy-helper/service-yyy:https -> 100.65.146.185:8443" -j KUBE-SEP-SQ3HMH4RORFZEHLZ

@aojea
Member

aojea commented Oct 24, 2023

Interesting; we know the pod is ready at 11:41:10 and it was scheduled at 11:40:39:

    - lastProbeTime: null
      lastTransitionTime: "2023-10-24T09:41:10Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2023-10-24T09:41:10Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2023-10-24T09:40:39Z"
      status: "True"

At 11:41:00.250555 there is an update, probably kubelet setting the podIPs?

I1024 11:41:00.250555       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="yyy-helper/service-yyy:https" endpoints=[100.64.156.127:8443 100.65.146.185:8443 100.65.146.206:8443 100.67.147.16:8443]
I1024 11:41:00.250594       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="yyy-helper/service-yyy:https-clientcert" endpoints=[100.64.156.127:8445 100.65.146.185:8445 100.65.146.206:8445 100.67.147.16:8445]
I1024 11:41:00.250726       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="yyy-helper/service-yyy:http" endpoints=[100.64.156.127:8000 100.65.146.185:8000 100.67.147.16:8000]
I1024 11:41:00.250754       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="yyy-helper/service-yyy:https" endpoints=[100.64.156.127:8443 100.65.146.185:8443 100.67.147.16:8443]
I1024 11:41:00.250772       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="yyy-helper/service-yyy:https-clientcert" endpoints=[100.64.156.127:8445 100.65.146.185:8445 100.67.147.16:8445]

Two syncs after that:

I1024 11:41:00.251027       1 proxier.go:849] "Syncing iptables rules"
I1024 11:41:00.349216       1 proxier.go:1529] "Reloading service iptables data" numServices=186 numEndpoints=406 numFilterChains=5 numFilterRules=3 numNATChains=19 numNATRules=373

and

I1024 11:41:00.395052       1 proxier.go:849] "Syncing iptables rules"
I1024 11:41:00.445478       1 iptables.go:358] running iptables-save [-t nat]
I1024 11:41:00.487788       1 proxier.go:1529] "Reloading service iptables data" numServices=186 numEndpoints=378 numFilterChains=5 numFilterRules=3 numNATChains=123 numNATRules=614

We missed 28 endpoints, and numNATRules increased.

This is the sync that actually flushed service-yyy-helper:

I1024 11:41:10.118470       1 proxier.go:849] "Syncing iptables rules"
I1024 11:41:10.170750       1 iptables.go:358] running iptables-save [-t nat]
I1024 11:41:10.212309       1 proxier.go:1529] "Reloading service iptables data" numServices=186 numEndpoints=378 numFilterChains=5 numFilterRules=3 numNATChains=19 numNATRules=375
I1024 11:41:10.212357       1 iptables.go:423] running iptables-restore [-w 5 -W 100000 --noflush --counters]
I1024 11:41:10.235796       1 proxier.go:1558] "Network programming" endpoint="yyy-helper/service-yyy-uipool" elapsed=0.235719074
I1024 11:41:10.235866       1 proxier.go:1558] "Network programming" endpoint="yyy-helper/service-yyy-dashboard" elapsed=0.235813667
I1024 11:41:10.235920       1 proxier.go:1558] "Network programming" endpoint="yyy-helper/service-yyy" elapsed=0.235872374

I'm worried that minSyncPeriod: 0s could be triggering some race; can you reproduce this issue with minSyncPeriod: 1s?

@juliantaylor
Contributor Author

minSyncPeriod: 0s seems to just mean the default min sync period of 1s; the v1alpha1 config we are using sets it to 1s if MinSyncPeriod==0. It also logs:

I1024 11:24:47.589597       1 proxier.go:258] "Using iptables mark for masquerade" ipFamily=IPv4 mark="0x00004000"
I1024 11:24:47.590257       1 proxier.go:303] "Iptables sync params" ipFamily=IPv4 minSyncPeriod="1s" syncPeriod="30s" burstSyncs=2
I1024 11:24:47.590296       1 proxier.go:313] "Iptables supports --random-fully" ipFamily=IPv4
I1024 11:24:47.590354       1 proxier.go:258] "Using iptables mark for masquerade" ipFamily=IPv6 mark="0x00004000"
I1024 11:24:47.590375       1 proxier.go:303] "Iptables sync params" ipFamily=IPv6 minSyncPeriod="1s" syncPeriod="30s" burstSyncs=2
I1024 11:24:47.590396       1 proxier.go:313] "Iptables supports --random-fully" ipFamily=IPv6
I1024 11:24:47.590544       1 server.go:656] "Version info" version="v1.26.9"
I1024 11:24:47.590555       1 server.go:658] "Golang settings" GOGC="" GOMAXPROCS="2" GOTRACEBACK=""
I1024 11:24:47.590562       1 oom_linux.go:65] attempting to set "/proc/self/oom_score_adj" to "-999"
I1024 11:24:47.595070       1 server.go:804] "GetConntrackMax: using conntrack-min"
I1024 11:24:47.595123       1 conntrack.go:52] "Setting nf_conntrack_max" nf_conntrack_max=5242880

@juliantaylor
Contributor Author

I have so far not managed to reproduce the issue on a 1.27.7 cluster, maybe there was some bugfix I missed.

@danwinship
Contributor

We missed 28 endpoints, and numNATRules increased.

numNATChains and numNATRules in that log message are the numbers of nat chains/rules we actually wrote, so this just reflects a difference in what got synced in that run; it doesn't necessarily imply we lost anything.

@danwinship
Contributor

I have so far not managed to reproduce the issue on a 1.27.7 cluster, maybe there was some bugfix I missed.

hm...

#114181 made us only add the initial jumps to KUBE- chains (from OUTPUT, PREROUTING, etc) on full syncs, but if anything that should make it more likely to fail...
#114757 included a fix to the EndpointSliceCache but it should have only affected the accuracy of metrics.
#115138 fixed a race condition but it was mostly theoretical, and only affected the first sync after kube-proxy startup anyway

Nothing else seems relevant...

@danwinship
Contributor

danwinship commented Oct 24, 2023

It's curious that there happens to be a failed partial sync in the log snippet you attached:

E1024 11:40:55.597686       1 proxier.go:1546] "Failed to execute iptables-restore" err=<
	exit status 2: iptables-restore v1.8.7 (nf_tables): Chain 'KUBE-SVC-UUAZGEQE42ATFAGR' does not exist
	Error occurred at line: 367
	Try `iptables-restore -h' or 'iptables-restore --help' for more information.
 >

That hash corresponds to the "monitoring/prometheus-backend-1:prometheus" service whose endpoints were updated just before that. There are no other references to this service in this log snippet, but we know that kube-proxy had already seen the v1.Service at some point, because it wouldn't have generated a service chain name if it had only seen the endpoints. (So the problem isn't "it got the EndpointSlice update but not the Service update".)

(Also, the fact that it's a service chain that's missing means that the problem isn't (just) a bug in EndpointsChangeTracker or CategorizeEndpoints, so now we're down to either (a) there are bugs in both ServiceChangeTracker and EndpointsChangeTracker, or (b) the bug isn't in the trackers, it's in proxier.go itself somewhere.)

After that fails, it appears to correctly run a full sync:

I1024 11:40:55.683233       1 proxier.go:1529] "Reloading service iptables data" numServices=186 numEndpoints=410 numFilterChains=5 numFilterRules=4 numNATChains=616 numNATRules=1744

(There's no explicit indication that this is a full sync, but those numNATChains/numNATRules values are much larger than anywhere else in the log.)

So between 11:40:55 and 11:41:10 it manages to lose sync again. :slightly_frowning_face:

@aojea
Member

aojea commented Oct 24, 2023

numNATChains and numNATRules in that log message are the numbers of nat chains/rules we actually wrote, so this just reflects a difference in what got synced in that run; it doesn't necessarily imply we lost anything.

I1024 11:41:00.349216 1 proxier.go:1529] "Reloading service iptables data" numServices=186 numEndpoints=406 numFilterChains=5 numFilterRules=3 numNATChains=19 numNATRules=373
I1024 11:41:00.487788 1 proxier.go:1529] "Reloading service iptables data" numServices=186 numEndpoints=378 numFilterChains=5 numFilterRules=3 numNATChains=123 numNATRules=614

But how can we get more nat rules with the same number of Services and fewer Endpoints?

@danwinship
Contributor

numServices and numEndpoints are the total number of Services and Endpoints. numNATRules is the number of updated nat rules, which is not directly proportional to numServices and numEndpoints. It's more proportional to the change in numServices and numEndpoints since the last sync.

IOW, numNATRules is larger in the second sync because it had to rewrite all of the KUBE-SVC-* chains that used to jump to the 28 endpoints that just got deleted.

@aojea
Member

aojea commented Oct 25, 2023

I have so far not managed to reproduce the issue on a 1.27.7 cluster, maybe there was some bugfix I missed.

Let's wait for feedback on a supported version, then.

@thockin
Member

thockin commented Oct 26, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 26, 2023
@juliantaylor
Contributor Author

Sorry for the delay. I have now also seen it on a cluster running version 1.27.7.
The end result looks the same as already posted for the 1.26 clusters: no errors or warnings in the proxy logs, but unfortunately the cluster where it was seen was not set to -v4 logging.
It appears to be harder to reproduce in 1.27, but we also only have a handful of smaller clusters running this version so far.

@juliantaylor
Contributor Author

I have deployed an instrumented kube-proxy to show the inconsistent internal state of kube-proxy without any iptables involvement.

I applied the following patch onto the release-1.28 branch (41092d6).

It tracks how many endpoints it sees from CategorizeEndpoints: when PendingChanges reports changes for a service, it records how many endpoints that service had; when it does not, it compares the current count against the recorded one and logs an ERROR if they differ.
It also removes the actual writing of iptables rules, as I don't believe the problem lies there.

diff --git a/pkg/proxy/iptables/proxier.go b/pkg/proxy/iptables/proxier.go
index ca781b44b5e..02b357caf88 100644
--- a/pkg/proxy/iptables/proxier.go
+++ b/pkg/proxy/iptables/proxier.go
@@ -160,6 +160,8 @@ type Proxier struct {
        serviceChanges   *proxy.ServiceChangeTracker
 
        mu           sync.Mutex // protects the following fields
+       prevCluster   map[string]int
+       prevLocal   map[string]int
        svcPortMap   proxy.ServicePortMap
        endpointsMap proxy.EndpointsMap
        nodeLabels   map[string]string
@@ -273,6 +275,8 @@ func NewProxier(ipFamily v1.IPFamily,
                serviceChanges:           proxy.NewServiceChangeTracker(newServiceInfo, ipFamily, recorder, nil),
                endpointsMap:             make(proxy.EndpointsMap),
                endpointsChanges:         proxy.NewEndpointChangeTracker(hostname, newEndpointInfo, ipFamily, recorder, nil),
+               prevCluster: make(map[string]int),
+               prevLocal: make(map[string]int),
                needFullSync:             true,
                syncPeriod:               syncPeriod,
                iptables:                 ipt,
@@ -769,6 +773,7 @@ func (proxier *Proxier) syncProxyRules() {
                return
        }
 
+
        // The value of proxier.needFullSync may change before the defer funcs run, so
        // we need to keep track of whether it was set at the *start* of the sync.
        tryPartialSync := !proxier.needFullSync
@@ -1202,6 +1207,25 @@ func (proxier *Proxier) syncProxyRules() {
                if tryPartialSync && !serviceChanged.Has(svcName.NamespacedName.String()) && !endpointsChanged.Has(svcName.NamespacedName.String()) {
                        natChains = skippedNatChains
                        natRules = skippedNatRules
+                       prevendpointscl, _ := proxier.prevCluster[svcName.NamespacedName.String()]
+                       prevendpointslo, _ := proxier.prevLocal[svcName.NamespacedName.String()]
+                       // klog.ErrorS(nil, "Skipped", "serviceName", svcName.NamespacedName.String(), "cluster endpoints", len(clusterEndpoints), "pcl", prevendpointscl, "local endpoints", len(localEndpoints), "plo", prevendpointslo)
+                       if prevendpointscl != len(clusterEndpoints) {
+                               klog.ErrorS(nil, "ERROR", "serviceName", svcName.NamespacedName.String(), "cluster endpoints", len(clusterEndpoints), "!=", prevendpointscl)
+                       }
+                       if prevendpointslo != len(localEndpoints) {
+                               klog.ErrorS(nil, "ERROR", "serviceName", svcName.NamespacedName.String(), "local endpoints", len(localEndpoints), "!=", prevendpointslo)
+                       }
+               } else {
+                       prevendpointscl, ok := proxier.prevCluster[svcName.NamespacedName.String()]
+                       prevendpointslo, _ := proxier.prevLocal[svcName.NamespacedName.String()]
+                       if !ok {
+                               prevendpointscl = -1
+                               prevendpointslo = -1
+                       }
+                       proxier.prevCluster[svcName.NamespacedName.String()] = len(clusterEndpoints)
+                       proxier.prevLocal[svcName.NamespacedName.String()] = len(localEndpoints)
+                       klog.ErrorS(nil, "Change", "serviceName", svcName.NamespacedName.String(), "cluster endpoints", len(clusterEndpoints), "pcl", prevendpointscl, "local endpoints", len(localEndpoints), "plo", prevendpointslo, "total", len(proxier.prevCluster))
                }
 
                // Set up internal traffic handling.
diff --git a/pkg/util/iptables/iptables.go b/pkg/util/iptables/iptables.go
index 0d8135e3297..4e382bfa319 100644
--- a/pkg/util/iptables/iptables.go
+++ b/pkg/util/iptables/iptables.go
@@ -421,16 +421,6 @@ func (runner *runner) restoreInternal(args []string, data []byte, flush FlushFla
        fullArgs := append(runner.restoreWaitFlag, args...)
        iptablesRestoreCmd := iptablesRestoreCommand(runner.protocol)
        klog.V(4).InfoS("Running", "command", iptablesRestoreCmd, "arguments", fullArgs)
-       cmd := runner.exec.Command(iptablesRestoreCmd, fullArgs...)
-       cmd.SetStdin(bytes.NewBuffer(data))
-       b, err := cmd.CombinedOutput()
-       if err != nil {
-               pErr, ok := parseRestoreError(string(b))
-               if ok {
-                       return pErr
-               }
-               return fmt.Errorf("%w: %s", err, b)
-       }
        return nil
 }

Running this in a small cluster and repeatedly deleting pods triggered the error after a couple of minutes.

Here the kyverno/kyverno-svc service has one endpoint replaced: it sees the endpoint changes (pcl=3 -> cluster endpoints=2, then a 2 -> 2 change), but afterwards clusterEndpoints contains 3 endpoints while PendingChanges() reports no change for the service, so it logs the ERROR.
The issue resolves itself on a later endpoint change when another pod gets replaced.

I1204 15:32:28.540114       1 proxier.go:798] "Syncing iptables rules"
E1204 15:32:28.543746       1 proxier.go:1229] "Change" serviceName="kyverno/kyverno-svc-metrics" cluster endpoints=2 pcl=3 local endpoints=0 plo=0 total=128
I1204 15:32:28.549370       1 proxier.go:1529] "Reloading service iptables data" numServices=213 numEndpoints=433 numFilterChains=5 numFilterRules=3 numNATChains=7 numNATRules=437
I1204 15:32:28.549554       1 proxier.go:787] "SyncProxyRules complete" elapsed="9.561408ms"
I1204 15:32:28.549619       1 proxier.go:798] "Syncing iptables rules"
E1204 15:32:28.551729       1 proxier.go:1229] "Change" serviceName="kyverno/kyverno-svc" cluster endpoints=2 pcl=3 local endpoints=0 plo=0 total=128
I1204 15:32:28.556484       1 proxier.go:1529] "Reloading service iptables data" numServices=213 numEndpoints=433 numFilterChains=5 numFilterRules=3 numNATChains=7 numNATRules=437
I1204 15:32:28.556675       1 proxier.go:787] "SyncProxyRules complete" elapsed="7.102925ms"
I1204 15:32:29.557676       1 proxier.go:798] "Syncing iptables rules"
E1204 15:32:29.558139       1 proxier.go:1229] "Change" serviceName="kyverno/kyverno-svc" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:32:29.561630       1 proxier.go:1229] "Change" serviceName="kyverno/kyverno-svc-metrics" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128
I1204 15:32:29.567774       1 proxier.go:1529] "Reloading service iptables data" numServices=213 numEndpoints=431 numFilterChains=5 numFilterRules=3 numNATChains=10 numNATRules=444
I1204 15:32:29.568080       1 proxier.go:787] "SyncProxyRules complete" elapsed="10.576178ms"
I1204 15:32:30.793481       1 proxier.go:798] "Syncing iptables rules"
E1204 15:32:30.794592       1 proxier.go:1229] "Change" serviceName="kyverno/kyverno-svc-metrics" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:32:30.795115       1 proxier.go:1229] "Change" serviceName="kyverno/kyverno-svc" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128
I1204 15:32:30.803951       1 proxier.go:1529] "Reloading service iptables data" numServices=213 numEndpoints=433 numFilterChains=5 numFilterRules=3 numNATChains=10 numNATRules=444
I1204 15:32:30.804208       1 proxier.go:787] "SyncProxyRules complete" elapsed="10.944511ms"
I1204 15:32:31.804911       1 proxier.go:798] "Syncing iptables rules"
I1204 15:32:31.815414       1 proxier.go:1529] "Reloading service iptables data" numServices=213 numEndpoints=433 numFilterChains=5 numFilterRules=3 numNATChains=4 numNATRules=430
I1204 15:32:31.815678       1 proxier.go:787] "SyncProxyRules complete" elapsed="10.831821ms"
I1204 15:32:59.233856       1 proxier.go:798] "Syncing iptables rules"
E1204 15:32:59.234621       1 proxier.go:1211] "ERROR" serviceName="kyverno/kyverno-svc" cluster endpoints=3 !==2
E1204 15:32:59.238182       1 proxier.go:1229] "Change" serviceName="kyverno/kyverno-svc-metrics" cluster endpoints=3 pcl=2 local endpoints=0 plo=0 total=128
I1204 15:32:59.244395       1 proxier.go:1529] "Reloading service iptables data" numServices=213 numEndpoints=433 numFilterChains=5 numFilterRules=3 numNATChains=8 numNATRules=440
I1204 15:32:59.244761       1 proxier.go:787] "SyncProxyRules complete" elapsed="11.143102ms"
I1204 15:32:59.244820       1 proxier.go:798] "Syncing iptables rules"
E1204 15:32:59.246195       1 proxier.go:1211] "ERROR" serviceName="kyverno/kyverno-svc" cluster endpoints=3 !==2
I1204 15:32:59.255245       1 proxier.go:1529] "Reloading service iptables data" numServices=213 numEndpoints=433 numFilterChains=5 numFilterRules=3 numNATChains=4 numNATRules=430
I1204 15:32:59.255417       1 proxier.go:787] "SyncProxyRules complete" elapsed="10.623778ms"
I1204 15:40:19.040518       1 proxier.go:798] "Syncing iptables rules"
E1204 15:40:19.042532       1 proxier.go:1229] "Change" serviceName="monitoring/data-plane-availability-metallb-ui-anycast" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:19.044354       1 proxier.go:1211] "ERROR" serviceName="kyverno/kyverno-svc" cluster endpoints=3 !==2
I1204 15:40:19.051940       1 proxier.go:1529] "Reloading service iptables data" numServices=213 numEndpoints=433 numFilterChains=5 numFilterRules=3 numNATChains=7 numNATRules=436
I1204 15:40:19.052232       1 proxier.go:787] "SyncProxyRules complete" elapsed="11.879354ms"
I1204 15:40:19.052387       1 proxier.go:798] "Syncing iptables rules"
E1204 15:40:19.053774       1 proxier.go:1229] "Change" serviceName="monitoring/data-plane-availability-metallb-int-anycast" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:19.054761       1 proxier.go:1211] "ERROR" serviceName="kyverno/kyverno-svc" cluster endpoints=3 !==2
E1204 15:40:19.055256       1 proxier.go:1229] "Change" serviceName="monitoring/data-plane-availability-metallb-ui" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:19.056460       1 proxier.go:1229] "Change" serviceName="monitoring/data-plane-availability-metallb-int" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
I1204 15:40:19.061533       1 proxier.go:1529] "Reloading service iptables data" numServices=213 numEndpoints=433 numFilterChains=5 numFilterRules=3 numNATChains=13 numNATRules=448
I1204 15:40:19.061781       1 proxier.go:787] "SyncProxyRules complete" elapsed="9.519217ms"
I1204 15:40:20.063693       1 proxier.go:798] "Syncing iptables rules"
E1204 15:40:20.064325       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=3 local endpoints=0 plo=0 total=128
E1204 15:40:20.064614       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.064890       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.065027       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.065179       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.065353       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=3 local endpoints=0 plo=0 total=128
E1204 15:40:20.065901       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.066076       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.066421       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=3 local endpoints=0 plo=0 total=128
E1204 15:40:20.067272       1 proxier.go:1211] "ERROR" serviceName="kyverno/kyverno-svc" cluster endpoints=3 !==2
E1204 15:40:20.067671       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=3 local endpoints=0 plo=0 total=128
E1204 15:40:20.068012       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.068596       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.068832       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=3 local endpoints=0 plo=0 total=128
E1204 15:40:20.069420       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.069869       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.070198       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.070356       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.070904       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.071150       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=2 local endpoints=0 plo=0 total=128
E1204 15:40:20.071761       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=0 pcl=1 local endpoints=0 plo=0 total=128
E1204 15:40:20.071871       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=1 local endpoints=0 plo=0 total=128
E1204 15:40:20.072413       1 proxier.go:1229] "Change" serviceName="monitoring/goldpinger" cluster endpoints=7 pcl=8 local endpoints=0 plo=0 total=128
E1204 15:40:20.072709       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=1 pcl=1 local endpoints=0 plo=0 total=128
E1204 15:40:20.072935       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=0 pcl=1 local endpoints=0 plo=0 total=128
E1204 15:40:20.073070       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128
I1204 15:40:20.079807       1 proxier.go:1529] "Reloading service iptables data" numServices=213 numEndpoints=433 numFilterChains=5 numFilterRules=5 numNATChains=73 numNATRules=581
I1204 15:40:20.080244       1 proxier.go:787] "SyncProxyRules complete" elapsed="18.136775ms"
I1204 15:40:21.081374       1 proxier.go:798] "Syncing iptables rules"
E1204 15:40:21.081720       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=1 local endpoints=0 plo=0 total=128
E1204 15:40:21.082278       1 proxier.go:1229] "Change" serviceName="otherservices" cluster endpoints=2 pcl=1 local endpoints=0 plo=0 total=128
E1204 15:40:21.082643       1 proxier.go:1229] "Change" serviceName="kyverno/kyverno-svc-metrics" cluster endpoints=2 pcl=3 local endpoints=0 plo=0 total=128
E1204 15:40:21.083149       1 proxier.go:1229] "Change" serviceName="kyverno/kyverno-svc" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128

I may very well have overlooked some case where the previous endpoint counts should have been stored, but if not, this shows that the problem is somewhere in the code involved in determining whether an endpoint has changed.

@danwinship
Contributor

I'm not going to be able to dig into this myself for the next few days, but if you could get similar logs at V(4), with the additional logging from #122048, then that might provide enough info to figure it out from the logs.

(I still can't see how prevendpointscl != len(clusterEndpoints) would happen without endpointsChanged.Has(svcName.NamespacedName.String()) being true...)

@juliantaylor
Contributor Author

I added more logging, including the readiness of endpoints (true/false) in endpointslicecache.go.

I'm pretty sure the issue is a missing synchronization between proxier.endpointsChanges.PendingChanges and endpointUpdateResult := proxier.endpointsMap.Update(proxier.endpointsChanges).
An endpoint can be updated between the two calls, so PendingChanges sees no change, but the rest of the code sees the updated endpoints.
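A minimal, self-contained sketch of that race (with a hypothetical tracker type, not the real kube-proxy code): PendingChanges() and Update() each take the tracker's lock separately, so an event handler can record a change in between, and that change ends up in the map but never in the changed set.

package main

import (
	"fmt"
	"sync"
)

// tracker stands in for the change trackers: it records which services have
// pending changes and the latest endpoint count per service.
type tracker struct {
	mu      sync.Mutex
	pending map[string]bool
	state   map[string]int
}

// EndpointSliceUpdate is what the informer event handler would call.
func (t *tracker) EndpointSliceUpdate(svc string, endpoints int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[svc] = true
	t.state[svc] = endpoints
}

// PendingChanges returns a snapshot of the changed services (first lock).
func (t *tracker) PendingChanges() map[string]bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := map[string]bool{}
	for k := range t.pending {
		out[k] = true
	}
	return out
}

// Update consumes all pending changes and returns the new state (second lock).
func (t *tracker) Update() map[string]int {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending = map[string]bool{}
	out := map[string]int{}
	for k, v := range t.state {
		out[k] = v
	}
	return out
}

func main() {
	t := &tracker{pending: map[string]bool{}, state: map[string]int{}}
	t.EndpointSliceUpdate("kyverno/kyverno-svc", 2)

	changed := t.PendingChanges() // the sync loop takes its snapshot

	// An event handler fires between PendingChanges() and Update():
	t.EndpointSliceUpdate("kyverno/kyverno-svc-metrics", 3)

	state := t.Update() // consumes BOTH changes, including the late one

	for svc, n := range state {
		if !changed[svc] {
			fmt.Printf("%s now has %d endpoints but is not in the changed set -> its chains are skipped\n", svc, n)
		}
	}
}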

Here are the annotated logs; the issue is also trivial to reproduce with a couple of minutes of patience.

# 2 service pod delete
I1205 12:21:40.836848       1 config.go:133] "Calling handler.OnEndpointSliceUpdate"
# both services
I1205 12:21:40.837137       1 endpoints.go:227] "EndpointSlice updated" service="kyverno/kyverno-svc-metrics" endpointSlice="kyverno/kyverno-svc-metrics-94nzn"
I1205 12:21:40.837280       1 bounded_frequency_runner.go:304] sync-runner: 106.448151ms since last run, possible in 893.551849ms, scheduled in 893.649795ms
# PendingChanges = 2
I1205 12:21:41.731827       1 proxier.go:797] "Syncing iptables rules (partial)" servicesChanged=0 endpointsChanged=2
# proxier.endpointsMap.Update(proxier.endpointsChanges)
# one endpoint not ready
I1205 12:21:41.731951       1 endpointslicecache.go:213] "EndpointSliceCache checkout"
I1205 12:21:41.732463       1 endpointslicecache.go:377] "Setting endpoints for service port name" portName="kyverno/kyverno-svc:https" endpoints=["100.64.155.7:9443 true","100.69.162.7:9443 true"]
I1205 12:21:41.732940       1 endpointslicecache.go:377] "Setting endpoints for service port name" portName="kyverno/kyverno-svc:https" endpoints=["100.64.155.7:9443 true","100.65.146.10:9443 false","100.69.162.7:9443 true"]
I1205 12:21:41.733363       1 endpointslicecache.go:377] "Setting endpoints for service port name" portName="kyverno/kyverno-svc-metrics:metrics-port" endpoints=["100.64.155.7:8000 true","100.69.162.7:8000 true"]
I1205 12:21:41.733801       1 endpointslicecache.go:377] "Setting endpoints for service port name" portName="kyverno/kyverno-svc-metrics:metrics-port" endpoints=["100.64.155.7:8000 true","100.65.146.10:8000 false","100.69.162.7:8000 true"]
I1205 12:21:41.733930       1 endpointslicecache.go:238] "EndpointSliceCache checkout done" changes=2
# correct apply, 2 ready endpoints, 1 unready from Categorize
E1205 12:21:41.763098       1 proxier.go:1231] "Change" serviceName="kyverno/kyverno-svc" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128
E1205 12:21:41.766783       1 proxier.go:1231] "Change" serviceName="kyverno/kyverno-svc-metrics" cluster endpoints=2 pcl=2 local endpoints=0 plo=0 total=128
I1205 12:21:41.771917       1 iptables.go:358] "Running" command="iptables-save" arguments=["-t","nat"]
I1205 12:21:41.784754       1 proxier.go:1531] "Reloading service iptables data" numServices=213 numEndpoints=431 numFilterChains=5 numFilterRules=3 numNATChains=10 numNATRules=444
I1205 12:21:41.784924       1 iptables.go:423] "Running" command="iptables-restore" arguments=["-w","5","-W","100000","--noflush","--counters"]
I1205 12:21:41.786321       1 cleanup.go:63] "Deleting conntrack stale entries for services" IPs=[]
I1205 12:21:41.786453       1 cleanup.go:69] "Deleting conntrack stale entries for services" nodePorts=[]
I1205 12:21:41.786561       1 proxier.go:790] "SyncProxyRules complete" elapsed="54.904159ms"
# done

# endpoint ready event
I1205 12:21:41.786630       1 bounded_frequency_runner.go:296] sync-runner: ran, next possible in 1s, periodic in 1h0m0s
I1205 12:22:04.018903       1 config.go:133] "Calling handler.OnEndpointSliceUpdate"
# only one service
I1205 12:22:04.019390       1 endpoints.go:227] "EndpointSlice updated" service="kyverno/kyverno-svc" endpointSlice="kyverno/kyverno-svc-cdlbr"
I1205 12:22:04.019518       1 config.go:133] "Calling handler.OnEndpointSliceUpdate"
# only one endpoint change (kyverno-svc not -metrics)
I1205 12:22:04.019783       1 proxier.go:797] "Syncing iptables rules (partial)" servicesChanged=0 endpointsChanged=1
# proxier.endpointsMap.Update(proxier.endpointsChanges)
# all 3 endpoints ready, updates BOTH services
I1205 12:22:04.019906       1 endpointslicecache.go:213] "EndpointSliceCache checkout"
# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 2nd service updates AFTER Pending -> RACE HIT >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I1205 12:22:04.019907       1 endpoints.go:227] "EndpointSlice updated" service="kyverno/kyverno-svc-metrics" endpointSlice="kyverno/kyverno-svc-metrics-94nzn"
I1205 12:22:04.020585       1 endpointslicecache.go:377] "Setting endpoints for service port name" portName="kyverno/kyverno-svc:https" endpoints=["100.64.155.7:9443 true","100.65.146.10:9443 false","100.69.162.7:9443 true"]
I1205 12:22:04.021123       1 endpointslicecache.go:377] "Setting endpoints for service port name" portName="kyverno/kyverno-svc:https" endpoints=["100.64.155.7:9443 true","100.65.146.10:9443 true","100.69.162.7:9443 true"]
I1205 12:22:04.021639       1 endpointslicecache.go:377] "Setting endpoints for service port name" portName="kyverno/kyverno-svc-metrics:metrics-port" endpoints=["100.64.155.7:8000 true","100.65.146.10:8000 false","100.69.162.7:8000 true"]
I1205 12:22:04.022097       1 endpointslicecache.go:377] "Setting endpoints for service port name" portName="kyverno/kyverno-svc-metrics:metrics-port" endpoints=["100.64.155.7:8000 true","100.65.146.10:8000 true","100.69.162.7:8000 true"]
I1205 12:22:04.022234       1 endpointslicecache.go:238] "EndpointSliceCache checkout done" changes=2
E1205 12:22:04.056508       1 proxier.go:1231] "Change" serviceName="kyverno/kyverno-svc" cluster endpoints=3 pcl=2 local endpoints=0 plo=0 total=128
# skipped as only one service was seen above
E1205 12:22:04.059969       1 proxier.go:1216] "ERROR" serviceName="kyverno/kyverno-svc-metrics" cluster endpoints=3 !==2


@aojea
Member

aojea commented Dec 5, 2023

Im pretty sure the issue is a missing synchronization between proxier.endpointsChanges.PendingChanges and endpointUpdateResult := proxier.endpointsMap.Update(proxier.endpointsChanges).

yep

if tryPartialSync {
serviceChanged = proxier.serviceChanges.PendingChanges()
endpointsChanged = proxier.endpointsChanges.PendingChanges()
}
serviceUpdateResult := proxier.svcPortMap.Update(proxier.serviceChanges)
endpointUpdateResult := proxier.endpointsMap.Update(proxier.endpointsChanges)

func (proxier *Proxier) OnEndpointSliceAdd(endpointSlice *discovery.EndpointSlice) {
if proxier.endpointsChanges.EndpointSliceUpdate(endpointSlice, false) && proxier.isInitialized() {
proxier.Sync()
}
}
// OnEndpointSliceUpdate is called whenever modification of an existing endpoint
// slice object is observed.
func (proxier *Proxier) OnEndpointSliceUpdate(_, endpointSlice *discovery.EndpointSlice) {
if proxier.endpointsChanges.EndpointSliceUpdate(endpointSlice, false) && proxier.isInitialized() {
proxier.Sync()
}
}
// OnEndpointSliceDelete is called whenever deletion of an existing endpoint slice
// object is observed.
func (proxier *Proxier) OnEndpointSliceDelete(endpointSlice *discovery.EndpointSlice) {
if proxier.endpointsChanges.EndpointSliceUpdate(endpointSlice, true) && proxier.isInitialized() {
proxier.Sync()
}
}

@danwinship
Contributor

what??? I could swear I checked that there was a lock there 🤦‍♂️

Adding locks to those functions shouldn't cause any deadlocks, so you could test adding

proxier.mu.Lock()
defer proxier.mu.Unlock()

to the start of OnServiceAdd, OnServiceUpdate, OnServiceDelete, OnEndpointSliceAdd, OnEndpointSliceUpdate, and OnEndpointSliceDelete and see if that makes the race go away.
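For example, the update handler would become something like this (a sketch of the suggested experiment, mirroring the handler quoted above; not a final fix):

// Sketch only: take the proxier mutex in the event handler so a change cannot
// be recorded between PendingChanges() and Update() while syncProxyRules holds
// the same lock.
func (proxier *Proxier) OnEndpointSliceUpdate(_, endpointSlice *discovery.EndpointSlice) {
	proxier.mu.Lock()
	defer proxier.mu.Unlock()
	if proxier.endpointsChanges.EndpointSliceUpdate(endpointSlice, false) && proxier.isInitialized() {
		proxier.Sync()
	}
}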

I think the right fix though is probably to remove the PendingChanges functions and just add Changed maps to UpdateServiceMapResult and UpdateEndpointsMapResult so that syncProxyRules only needs to make one call to each change tracker.
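A rough sketch of that direction (hypothetical internals; the items field and apply helper are made up for illustration, the real tracker looks different):

// Sketch: fold the changed-services snapshot into Update() so syncProxyRules
// learns "what is new" and "who changed" from a single call under one lock,
// leaving no window for an event handler in between.
type UpdateEndpointsMapResult struct {
	// ...existing fields...
	ChangedServices sets.Set[string] // services whose endpoints changed in this update
}

func (em EndpointsMap) Update(changes *EndpointsChangeTracker) UpdateEndpointsMapResult {
	changes.lock.Lock()
	defer changes.lock.Unlock()

	result := UpdateEndpointsMapResult{ChangedServices: sets.New[string]()}
	for svc, change := range changes.items { // hypothetical map of pending changes
		em.apply(change)                     // hypothetical: apply the change to the map
		result.ChangedServices.Insert(svc.String())
	}
	changes.items = nil // consume all pending changes atomically
	return result
}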
