Kubelet CNI nsenter failure #42735
Could someone from @kubernetes/sig-node-bugs and @kubernetes/sig-network-bugs please help with triage? Thanks!
This is not a 1.6 blocker. I have seen those nsenter errors before; they come from a small race window between pod status and CNI execution. The error logging is a red herring. Removing this from the 1.6 milestone.
This is also happening in v1.6.1 using the kubenet network plugin:
Are there other symptoms besides the log, like loss of pod IP or some such? IIUC, pod deletion and status checks run in separate goroutines. If a pod sandbox has just been stopped and a status check happens right after that, the status check may fail with this log.
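To make that race concrete, here is a minimal, self-contained Go sketch (not actual kubelet code; the sandbox type and its methods are purely illustrative) of a status check losing the race against a concurrent teardown and producing a benign error:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// sandbox stands in for a pod sandbox whose teardown and status check
// run in separate goroutines, as described above.
type sandbox struct {
	mu      sync.Mutex
	stopped bool
}

// stop simulates the pod-deletion goroutine tearing the sandbox down.
func (s *sandbox) stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stopped = true
}

// status simulates the status-check goroutine inspecting the sandbox;
// it fails if the sandbox was already stopped.
func (s *sandbox) status() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stopped {
		return errors.New("cannot get netns: sandbox already stopped")
	}
	return nil
}

func main() {
	s := &sandbox{}
	go s.stop() // deletion goroutine wins the race
	time.Sleep(time.Millisecond)
	if err := s.status(); err != nil {
		// This is the noisy but harmless error path the comment above
		// calls a red herring.
		fmt.Println("status check failed:", err)
	}
}
```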
@freehan I see the Pods' IPs still being available. I can run The problem I'm encountering is that I get 503s when accessing HTTP services through the apiserver's proxy endpoints.
Do you see this log line a lot, or only during rolling updates?
I see this line a lot, regularly. I'm not even using rolling updates; I just have plain ReplicaSets / ReplicationControllers. And since there's this kubenet failure, pods are being restarted by the RS/RC.
@aespinosa Sounds like your problem is different from the issue here. We need to know more about your setup; we need the output of the following command: Just to confirm, the pods are able to start, and later they get recreated? The Pod IP changed, right?
kubelet:
Distro:
No, the kubelet did not restart. After a while, the pods are now running and able to start. ifconfig:
Could you elaborate? What is the sequence of symptoms and operations?
I tried to delete the pods and have them be resurrected by the ReplicaSet. The pod was created and allocated a Pod IP successfully. However, I still have the problem that the proxy endpoints at the master cannot be served. I'm using the AWS cloud provider and I don't see my route tables being updated either.
Are you still able to reproduce the problem related to pod IP? I am not familiar with the AWS cloud provider. cc @justinsb
Thanks for the help @freehan. I figured out my problem from the controller-manager's logs. It was saying it can't update route tables because of duplicate matching route tables when querying AWS.
@freehan So -- ignoring the sidebar issue @aespinosa had, what's the next step here? Is it to fix the small race window between pod status and CNI execution? (I'm not in a rush, just clarifying what to expect.)
@DreadPirateShawn We need the kubelet log to confirm it as a race between the kubelet pod status sync and CNI network plugin execution. cc: @yujuhong
I am a bit lost in all these discussions. What exactly is the problem (besides the error message in the log) that we are dealing with?
I hit (possibly) the same issue. With a cluster built from HEAD, kube-system pods that are not in the host network namespace fail to come up.
The kubelet logs have the following warning messages, with the infra container being restarted in a loop:
Just bumped the image versions to 2.1.5 in addons/networking.projectcalico.org/k8s-1.6.yaml manually and it started to work correctly. I see that Calico has already been upgraded to 2.1.5 in master, so we just need to wait for a new release.
#43879 may help this, though it has been reverted and needs some fixups.
Update?
I've seen the same errors in my cluster after turning on RBAC. The reason is that the default calico.yaml (http://docs.projectcalico.org/v2.2/getting-started/kubernetes/installation/hosted/calico.yaml) lacks cluster roles and bindings. After applying https://github.com/projectcalico/calico/blob/master/master/getting-started/kubernetes/installation/rbac.yaml everything is OK.
I am not sure this issue is related to RBAC, since it also occurs when not using Calico but CNI networking via flannel. I found out that the
@dchen1107 move to next milestone. Reading through the issue, there seem to be a couple of different intersecting issues going on here; it seems like it needs some more investigation.
This feels like an umbrella issue at this point. It keeps redefining itself.
I'm also getting this every now and then, and when it starts happening, the only workaround seems to be tearing down the cluster and bringing it back up. At least in my case it seems to be related to resource limits. I'm trying to run a pod both with and without a resource limit; every time I specify a memory limit, the pod is stuck at
Here are some logs from the kubelet trying to run the resource-limited pods. I'm running with a kubeadm install + flannel. EDIT: kubeadm, kubectl version 1.6.4
FWIW, we investigated @vishh's cluster in #42735 (comment), and found that it was not relevant to this thread, but forgot to come back and update.
Automatic merge from submit-queue (batch tested with PRs 46441, 43987, 46921, 46823, 47276)

kubelet/network: report but tolerate errors returned from GetNetNS() v2

Runtimes should never return "" and nil errors, since network plugin drivers need to treat netns differently in different cases. So return errors when we can't get the netns, and fix up the plugins to do the right thing. Namely, we don't need a NetNS on pod network teardown. We do need a netns for pod Status checks and for network setup.

V2: don't return errors from getIP(), since they will block pod status :( Just log them. But even so, this still fixes the original problem by ensuring we don't log errors when the network isn't ready.

@freehan @yujuhong Fixes: #42735 Fixes: #44307
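As an illustration of the pattern the merged PR describes, here is a hedged, self-contained Go sketch; the names getNetNS, teardownPod, and getIP are illustrative stand-ins, not the real kubelet/network API. The idea is that the netns lookup reports an error instead of ("", nil), teardown tolerates a missing netns, and the IP lookup logs failures rather than returning them so pod status is not blocked:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

var errNoNetNS = errors.New("network namespace not found")

// getNetNS returns an error instead of ("", nil) when the netns is
// unavailable, so each caller can decide how much it cares.
func getNetNS(ready bool) (string, error) {
	if !ready {
		return "", errNoNetNS
	}
	return "/proc/1234/ns/net", nil
}

// teardownPod tolerates a missing netns: teardown does not need one.
func teardownPod() error {
	if _, err := getNetNS(false); err != nil {
		log.Printf("ignoring missing netns during teardown: %v", err)
	}
	return nil
}

// getIP logs lookup failures instead of returning them, so a transient
// netns problem does not block the pod status sync.
func getIP(ready bool) string {
	ns, err := getNetNS(ready)
	if err != nil {
		log.Printf("could not read pod IP: %v", err)
		return ""
	}
	return "10.0.0.2 (from " + ns + ")"
}

func main() {
	_ = teardownPod()
	fmt.Println("ip:", getIP(true))
}
```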
@matchstick Yeah, the umbrella drift was unfortunate. In the original reply by @dchen1107 he said the original ticket was due to "a small race window between pod status and CNI execution." Is there a ticket / action item for that particular issue, which I originally reported? (Or am I misunderstanding, and the original issue is inseparable from the other issues people added to the ticket?)
Kubernetes version (use kubectl version):
Environment:
uname -a: 3.13.0-55-generic
What happened:
During an otherwise-normal rollingupdate for a replication controller, we see this error in the logs:
What you expected to happen:
Expected that successful rollingupdate wouldn't generate error-level logs without an error-level problem -- trying to determine what the error-level problem is.
How to reproduce it (as minimally and precisely as possible):
During our prod upgrade of various services, this occurred during 3 out of 70 rollingupdates.
Anything else we need to know:
Perhaps this is the same general issue as #25281? But I couldn't find any references to the "Unexpected command output nsenter" variation seen above, thus filing separately.