Unstable Kubernetes v1.6.6 cluster created with Kops 1.6.2 #2928

Closed
itskingori opened this Issue Jul 13, 2017 · 15 comments

Comments


itskingori commented Jul 13, 2017

Version of kops:

$ kops version
Version 1.6.2

Version of kubernetes:

$ kubectl version | grep "Server"
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

Problem

I have used kops+kubernetes since 1.4.7 without issues, but I'm now struggling with an unstable cluster whose behaviour simply does not make sense.

The first issue

I'm experiencing cases where a node is terminated and replaced. Sometimes it's a master, sometimes a minion (but mostly a minion). We have no fix for this, nor do we know why the nodes are being rotated.

If a master is rotated we get a lot of pods in Unknown status, and everything just goes berserk across the cluster.

These issues may be related:

The second issue

I'm experiencing cases where pods get stuck in a transitional state. By that I mean that if a pod was Terminating, it stays stuck at that ... and if a pod was ContainerCreating, it stays stuck at that.

$ kubectl get pods -o wide --namespace=$ENVIRONMENT --no-headers
grafana-3323403255-kzg2w                      1/2       CrashLoopBackOff    8         1h
grafana-mysql-0                               0/1       ContainerCreating   0         3m
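
When a pod is wedged in one of these transitional states, one common workaround (a hedged sketch, not part of the original report; `<pod-name>` is a placeholder) is to force-delete the pod object so its controller can recreate it:

```
$ kubectl delete pod <pod-name> --namespace=$ENVIRONMENT --grace-period=0 --force
```

Note this only removes the API object; it does not clean up the dead container on the node.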

I noticed that pods stuck in this way have dead containers. When we have dead containers, /var/log/syslog is full of these:

Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: W0708 09:22:04.991325   13520 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "grafana-2597417898-prpwf_sandbox": Cannot find the network namespace, skipping pod network status for container {"docker" "4a2b4b51b0d64a64e8ffaaac53110c1b4ba019f37b755f09e67acb069d3e865f"}
Jul  8 09:22:04 ip-10-83-59-150 dockerd[1358]: time="2017-07-08T09:22:04.992595490Z" level=error msg="Handler for GET /v1.24/containers/331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f/json returned error: open /var/lib/docker/overlay/69353e3da8b9d11a16ee77343d8aa0208ca33bb5d66b300bfb9ea9e997e6d1ea/lower-id: no such file or directory"
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992798   13520 remote_runtime.go:273] ContainerStatus "331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f" from runtime service failed: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992827   13520 kuberuntime_container.go:385] ContainerStatus for 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f error: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992840   13520 kuberuntime_manager.go:858] getPodContainerStatuses for pod "grafana-2597417898-prpwf_sandbox(fe4927e6-6345-11e7-8ca3-0e5eca502a9e)" failed: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: E0708 09:22:04.992858   13520 generic.go:239] PLEG: Ignoring events for pod grafana-2597417898-prpwf/sandbox: rpc error: code = 2 desc = Error: No such container: 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
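
All of the kubelet errors above reference the same 64-hex-character container ID. As a small sketch (pure text processing, run here against a sample line copied from the excerpt above, so no live cluster is assumed), those IDs can be pulled out for cross-checking against `docker ps -a`:

```shell
# Extract the 64-hex-digit Docker container ID from a kubelet syslog line.
# The sample line is copied from the log excerpt above.
line='kubelet[13520]: E0708 09:22:04.992798   13520 remote_runtime.go:273] ContainerStatus "331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f" from runtime service failed'

echo "$line" | grep -oE '[0-9a-f]{64}'
# → 331672f83b8e30143cf1035404a418414b8ee36d082ce50aeff329f77785df3f
```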

If I clean the dead containers out with the command below, the pods are able to proceed (and /var/log/syslog is now devoid of the aforementioned errors).

$ docker rm $(docker ps -a | grep "Dead" | awk '{print $1}')

To get by I've created a cronjob to do this for me every minute ...

root@ip-xx-xx-xxx-xxx:/# cat ./root/scripts/docker-cleanup.sh
#!/bin/bash
docker rm $(docker ps -a | grep "Dead" | awk '{print $1}') &>/dev/null
true

$ crontab -l
*/1 * * * * /root/scripts/docker-cleanup.sh
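
For what it's worth, the grep/awk pipeline in the script is a plain-text scrape of `docker ps -a` output, and would also match containers whose image or name happens to contain "Dead". A sketch of what it extracts, run against a captured sample (on Docker ≥ 1.12, `docker ps -aq --filter status=dead | xargs -r docker rm` would be a less fragile alternative):

```shell
# Captured sample of `docker ps -a` output (IDs truncated for readability).
sample='CONTAINER ID   IMAGE         COMMAND     STATUS
331672f83b8e   grafana:4.0   "/run.sh"   Dead
4a2b4b51b0d6   mysql:5.7     "mysqld"    Up 2 hours'

# The same extraction the cron script performs: first column of "Dead" rows.
echo "$sample" | grep "Dead" | awk '{print $1}'
# → 331672f83b8e
```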

These issues may be related:

Way Forward

This is happening very often ... and I'm willing to help get to the bottom of it. I'm unsure what I need to provide for more context, so just let me know what to check and which logs to provide.


itskingori commented Jul 14, 2017

@bboreham I'm wondering if you could weigh in on the second issue. I'm inclined to think that weave is somehow involved here because of the errors we're getting in syslog from kubelet i.e. CNI errors.

And before you ask, we're using weave 1.9.4.


kaazoo commented Jul 14, 2017

I'm using kops 1.6.2 and upgraded a cluster from k8s 1.6.2 to 1.6.7. The cluster is using Calico.
No issues so far.


bboreham commented Jul 14, 2017

I'm inclined to think that weave is somehow involved here because of the errors we're getting in syslog from kubelet i.e. CNI errors.

I see exactly one error mentioning CNI:

Jul  8 09:22:04 ip-10-83-59-150 kubelet[13520]: W0708 09:22:04.991325   13520 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "grafana-2597417898-prpwf_sandbox": Cannot find the network namespace, skipping pod network status for container {"docker" "4a2b4b51b0d64a64e8ffaaac53110c1b4ba019f37b755f09e67acb069d3e865f"}

This is a message from kubelet basically saying the container process was dead when it tried to check on it. And all the other messages are along the same lines. Looks to me like your Docker is very unhappy, but I have no idea why.

If there are other messages you wanted me to look at please clarify.


itskingori commented Jul 18, 2017

This is no longer an issue for me ... I believe the steps I've taken in #2982 (comment) have alleviated the problem.

@itskingori itskingori closed this Jul 18, 2017


itskingori commented Jul 25, 2017

This, unfortunately, is still an issue. 😰

Re-opening.

@itskingori itskingori reopened this Jul 25, 2017


chrislovecnm commented Jul 25, 2017

Did the changes with kubelet help at all?


itskingori commented Jul 26, 2017

@chrislovecnm to clarify:

  1. Giving headroom to resources via requests and limits made a very big difference in cluster stability.
  2. I have not yet applied the flags ... I created headroom by over-provisioning requests so that the cluster always has excess CPU and memory. I still intend to set them; I've just been distracted by the current instability issues.
  3. Every now and then I lose a node because it fails instance checks (on AWS side) and the ASG replaces it.
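
For context, the headroom in point 1 comes down to container-level requests and limits. A minimal illustrative sketch (values are made up, not taken from this cluster), with requests padded well above actual usage so the scheduler leaves slack on every node:

```yaml
# Illustrative values only; not taken from the cluster above.
resources:
  requests:
    cpu: 500m        # what the scheduler reserves; padded for headroom
    memory: 512Mi
  limits:
    cpu: 1           # hard ceiling enforced at runtime
    memory: 1Gi
```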

To summarise ... these are the issues I'm trying to solve (I don't know whether they are related to my cluster's instability):

  1. kubernetes/kubernetes#45626 - k8s reports pod as "Terminated: Error" with "Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container"
  2. moby/moby#5618 (comment) - kernel crash after "unregister_netdevice: waiting for lo to become free. Usage count = 3" · Issue #5618 · moby/moby
  3. Random node termination (as described above).

At the moment ... I'm investigating a kernel panic and trying to set up kernel dumps using kdump.


itskingori commented Jul 26, 2017

@chrislovecnm another reason I haven't used the flags is because I'm still on kops 1.6.2 and those features will probably be in kops 1.7.x.


itskingori commented Jul 28, 2017

@chrislovecnm reporting back ... using an AMI with kernel 4.4.78 has solved no. 2 and no. 3 listed in #2928 (comment).

I changed the AMI on the sandbox cluster early yesterday morning. You can see I used to get about 1 termination per day on average; no more node terminations 👇

[screenshot taken 2017-07-28 16:03:49]

And the same with the "unregister_netdevice: waiting for lo to become free" issue ... 👇

[screenshot taken 2017-07-28 16:07:44]


chrislovecnm commented Jul 28, 2017

Is headroom still not fixed?


itskingori commented Jul 28, 2017

@chrislovecnm I'll be doing that next week. Have been busy investigating/addressing cluster instability issues. Will keep you posted.


3h4x commented Oct 13, 2017

I have experienced a similar issue:
kops 1.6.2
k8s 1.6.6

We don't yet know why the containers were dead. What should we look at? Our cluster has plenty of unused CPU and memory.


chrislovecnm commented Oct 13, 2017

You need to be using kops 1.7.1 with k8s 1.6.6. Please report back on how it works!


3h4x commented Oct 14, 2017

@chrislovecnm Thanks for the tip. Is it somewhere in the kops docs?


itskingori commented Oct 14, 2017

Closing this because a solution was found, even though it's not quite explainable.

@itskingori itskingori closed this Oct 14, 2017
