Kubelet gets "Timeout: Too large resource version" error from the API server after network outage #91073
Comments
/sig api-machinery
It seems that after some more time, kubelet recovers. Sometimes it takes one hour, sometimes two. Recovery is successful when I receive logs similar to these:
What is the reason it takes so long to recover?
The root cause is #82428; deduplicating against that issue.
Debugging and tracing the code, I am not really sure that this is the same issue as the one mentioned above. I've made a very crude patch on my kubelet/client-go:
I am pretty sure that this is not the best solution, but it seems that it forces a full list every time ListAndWatch is called.
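For illustration, the idea behind such a patch can be sketched as a wrapper around the reflector's ListerWatcher that drops the resourceVersion before every list, so the apiserver serves a fresh list instead of trying to answer "at least resourceVersion N" from its watch cache. This is a sketch of the concept, not the actual patch; the wrapper type and the `NewReflector` wiring below are assumptions.

```go
package relist

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
)

// forceFullList wraps a cache.ListerWatcher and clears ResourceVersion on
// every List call, so each relist is served fresh rather than relative to
// the last resourceVersion the reflector observed (illustrative only).
type forceFullList struct {
	cache.ListerWatcher
}

func (f forceFullList) List(options metav1.ListOptions) (runtime.Object, error) {
	options.ResourceVersion = "" // empty means "latest available", not "at least RV N"
	return f.ListerWatcher.List(options)
}

func (f forceFullList) Watch(options metav1.ListOptions) (watch.Interface, error) {
	return f.ListerWatcher.Watch(options) // watches are passed through unchanged
}

// newRelistingReflector shows where the wrapper would be plugged in.
func newRelistingReflector(lw cache.ListerWatcher, obj runtime.Object, store cache.Store) *cache.Reflector {
	return cache.NewReflector(forceFullList{lw}, obj, store, 10*time.Minute)
}
```

Forcing a relist this way is heavy-handed (every relist becomes a full read of the latest data), which is presumably why it is described above as a stopgap rather than a proper fix.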
You're correct, sorry about that... I misread the timeout message. The error it is getting is coming from the server, which means kubelet -> apiserver connectivity is fine. This message comes from the watch cache, which means the watch cache in the API server is behind where the kubelet's informer was, which is pretty surprising. A few questions:
cc @jpbetz does this ring any bells?
cc @wojtek-t for eyes on watch cache
I don't think I've hit this problem specifically. Since this could happen due to an etcd partition or due to a partition between an apiserver and etcd, the answers to the questions asked in #91073 (comment) seem like the appropriate next things to figure out.
Yeah, I agree with Joe. It seems like a network partition issue affecting the apiserver the node was connected to.
I will be able to answer the questions in detail later, but in short: there are 3 masters set up with kubeadm, and haproxy load-balances the queries to the apiservers. The only thing needed to reproduce the issue is to disconnect the node from the network for a few minutes. Masters and etcd members are not touched, and in any case the cluster seems to be healthy.
Today it occurred on a different cluster. The setup is similar: stacked etcd on the master nodes, with haproxy load-balancing traffic between them. I was just playing with the haproxy servers and the HA setup, and after a few restarts of haproxy, one of the nodes entered this state. So during a haproxy restart, all kubelets' connections to the apiservers are terminated, but, rarely, one of them enters this unhealthy state.
Multi apiserver cluster
Stacked etcd members are present.
I think in my setup it is not relevant, as all the kubelets talk to the apiserver through the load-balancer, which picks one of the available apiservers. I am doing rolling restarts of the nodes, and there was at least one occurrence where even one of the master nodes got stuck in this state. How can I help debug/fix this?
How can we proceed with this issue?
I got the same problem in an old cluster (up for 3+ years): after updating to kube 1.18.3, one node's kubelet started logging those errors (~10 errors/minute, with various resources).
I've now just restarted kubelet and it seems to be working again. Update:
I think this problem is getting serious. I am not 100% sure, but today, during node restarts (including masters), coredns was migrated multiple times, and unfortunately the kube-proxy pods did not pick up the new situation, so DNS resolution was not working inside the cluster. After detecting this, a simple restart of all kube-proxy pods resolved the problem. There were no network issues during this, so I suspect we've hit the same bug, now in kube-proxy.
I am trying to dig into this. I've now written a little Go program which uses the same reflector that kubelet uses. It just starts a watch for CSIDrivers, and I've made client-go print the last resourceVersion. Starting the program multiple times (i.e. connecting to different masters) produces the following:
Right now I don't know exactly what URL is being fetched and what arguments are passed in; I still have to figure that out. But it now seems that different masters return different resourceVersions. Etcd does not report any problems/issues.
Any ideas, suggestions where to look further?
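For reference, a minimal probe in the spirit of the program described above might look like the sketch below. It is not the author's program (which drove the reflector directly); it simply lists CSIDrivers against whatever apiserver the given kubeconfig points to and prints the resourceVersion the server answered with. The flag name and out-of-cluster kubeconfig handling are assumptions.

```go
package main

import (
	"context"
	"flag"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Point this at a kubeconfig whose server is one specific apiserver
	// (or the haproxy frontend) to compare what different endpoints return.
	kubeconfig := flag.String("kubeconfig", "", "path to kubeconfig")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// ResourceVersion is left empty, so the apiserver answers with the newest
	// data it has; the returned list carries the resourceVersion it was served at.
	list, err := client.StorageV1().CSIDrivers().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s returned %d CSIDrivers at resourceVersion %s\n",
		cfg.Host, len(list.Items), list.ResourceVersion)
}
```

Running it against each master in turn, and then through the load-balancer, is one way to compare the resourceVersions the individual apiservers report.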
Today I re-initialized 2 of my 3 master nodes to make sure the etcd replicas are not corrupted. Just after the nodes had been joined, the same behavior could be observed: connecting to different masters resulted in different resourceVersions.
Now it can be seen that querying different apiservers again returns different resourceVersions:
192.168.8.60:16443 is a haproxy load-balancing traffic between the 3 masters.
@wojtek-t @jpbetz please let me know where to go from here. My cluster runs on Pi boards, which are perhaps not the fastest ones, but if this issue depends on slow hardware, then it can pop up on fast hardware as well. As I wrote, today I reinitialized 2 of my 3 masters to make sure no partitioning occurred. Also, please have a look at my comments. Or, if you have a clue that my setup is broken, please let me know!
This has been fixed/mitigated at head. Going to backport to 1.18 now.
This has been mitigated in 1.18 and at head. The more proper fix is proposed in kubernetes/enhancements#1878. Closing this one.
This vendors a later version of prometheus' golang client (0.8.0 -> 0.9.4) to allow `go mod tidy` to work properly. It also updates the k8s libraries from 0.18.6 to 0.18.8 to avoid hitting kubernetes/kubernetes#91073
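For anyone making a similar bump, the relevant require lines in go.mod would look roughly like this; the module path and go directive are placeholders, and only the dependency versions come from the referenced change above:

```
module example.com/placeholder

go 1.14

require (
	github.com/prometheus/client_golang v0.9.4
	k8s.io/api v0.18.8
	k8s.io/apimachinery v0.18.8
	k8s.io/client-go v0.18.8
)
```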
Hello, as I saw, I have the same issue in version 1.19.3:
Our setup is 3 etcd members, 2 apiservers bound to those 3 etcd members, and a haproxy in front of the apiservers.
If that happens once, that can definitely happen (and that's fine). The bug was that the components were stuck with those errors.
I get the same error on 1.18.0. Restarting docker and kubelet makes it OK again. So, should I update from 1.18.0 to 1.18.6+?
What happened:
I have disconnected a node from the network for a few minutes. After reconnecting, I keep receiving error messages like the following from kubelet on the node, even after 15 minutes in the reconnected state:
What you expected to happen:
I expect that after the network recovers, kubelet reconnects to the API server as before, and that such timeouts do not occur after recovery.
How to reproduce it (as minimally and precisely as possible):
Have a node connected to the cluster, disconnect it from the network for 3-4 minutes, then reconnect it and observe kubelet's logs.
Anything else we need to know?:
I have strict TCP keepalive settings in place on the master and worker nodes, but this should not be the cause.
Restarting kubelet solves the issue, the error messages disappear.
Environment:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration: bare metal
- OS (e.g: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
- Install tools: kubeadm
- Network plugin and version: kube-router