
# Kubernetes Troubleshooting Guide

## Troubleshoot Cluster

  • Steps to identify the issue

    • Look at the control plane logs:

      • /var/log/kube-apiserver.log - API server, responsible for serving the API
      • /var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions
      • /var/log/kube-controller-manager.log - Controller manager, responsible for managing replication controllers
      • /var/log/kubelet.log - Kubelet, responsible for running containers on the node
      • /var/log/kube-proxy.log - Kube-proxy, responsible for service load balancing
  • Root causes:

    • VM(s) shutdown
    • Network partition within cluster, or between cluster and users
    • Crashes in Kubernetes software
    • Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
    • Operator error, for example misconfigured Kubernetes software or application software
  • Some solutions

    • Restart the kubelet
    • Verify service logs using journalctl (see the example after the scenarios below)
  • Specific scenarios:

    • Apiserver VM shutdown or apiserver crashing

      Results

      unable to stop, update, or start new pods, services, or replication controllers

      existing pods and services should continue to work normally, unless they depend on the Kubernetes API

    • Apiserver backing storage lost

      Results

      apiserver should fail to come up

      kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying

      manual recovery or recreation of apiserver state necessary before apiserver is restarted

    • Supporting services (node controller, replication controller manager, scheduler, etc.) VM shutdown or crashes

      Results

      currently these are colocated with the apiserver, so their unavailability has consequences similar to an apiserver outage

      in the future they may be replicated and not co-located; they do not have their own persistent state

    • Individual node (VM or physical machine) shuts down

      Results

      pods on that Node stop running

    • Network partition

      Results

      partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down (assuming the master VM ends up in partition A)

    • Kubelet software fault

      Results

      crashing kubelet cannot start new pods on the node

      the kubelet may or may not delete the pods

      the node is marked unhealthy; replication controllers start new pods elsewhere

    • Cluster operator error

      Results

      loss of pods, services, etc

      loss of apiserver backing store

      users unable to read API etc.
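
A minimal sketch of the inspection and recovery commands mentioned above, assuming a systemd-managed kubelet and kubectl access to the cluster; the file paths apply only where components log to files rather than to journald:

```bash
# Control plane health at a glance
kubectl get nodes
kubectl -n kube-system get pods

# Component logs: either the log files listed above, or journald on systemd hosts
sudo tail -n 100 /var/log/kube-apiserver.log
sudo journalctl -u kubelet --since "1 hour ago"

# Restart the kubelet after fixing its configuration
sudo systemctl restart kubelet
sudo systemctl status kubelet
```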

## Troubleshoot Pods

  • Pod status Pending

    If a Pod is stuck in Pending, it means that it cannot be scheduled onto a node. Generally this is because there are insufficient resources of one type or another that prevent scheduling. Look at the output of the kubectl describe ... command; there should be messages from the scheduler about why it cannot schedule your pod. Reasons include:

    You don't have enough resources: You may have exhausted the supply of CPU or memory in your cluster; in this case you need to delete Pods, adjust resource requests, or add new nodes to your cluster. See the Compute Resources document for more information.

    You are using hostPort: When you bind a Pod to a hostPort, there are a limited number of places that Pod can be scheduled. In most cases, hostPort is unnecessary; try using a Service object to expose your Pod. If you do require hostPort, then you can only schedule as many Pods as there are nodes in your Kubernetes cluster.

    Verify the kube-scheduler pod (Pod stuck in Pending) and the kube-controller-manager pod (Pod not scaling). A diagnosis sketch follows this list.

  • Pod status Waiting

    If a Pod is stuck in the Waiting state, then it has been scheduled to a worker node, but it can't run on that machine. Again, the information from kubectl describe ... should be informative. The most common cause of Waiting pods is a failure to pull the image. There are three things to check:

    • Make sure that you have the name of the image correct
    • Verify that you have pushed the image to the repository
    • Run a manual docker pull on your machine to see whether the image can be pulled (see the image-pull sketch after this list)

  • Pod status Running

    • Use logs - kubectl logs pod-name [-c container-name]
    • Use exec - kubectl exec -it pod-name -- command
    • Use debug - kubectl debug -it ephemeral-demo --image=busybox --target=ephemeral-demo
    • Use debug copy - kubectl debug myapp -it --image=ubuntu --share-processes --copy-to=myapp-debug
    • Use describe - kubectl describe pod pod-name
    • Use describe rc - kubectl describe rc rc-name
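
A diagnosis sketch for a Pending pod, assuming kubectl access; the pod name mypod is a placeholder:

```bash
# Scheduler events explain why the pod cannot be placed
kubectl describe pod mypod
kubectl get events --field-selector involvedObject.name=mypod

# Check how much CPU/memory is already requested on each node
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check the control plane pods responsible for scheduling and scaling
kubectl -n kube-system get pods | grep -E "kube-scheduler|kube-controller"
```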
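
An image-pull sketch for a Waiting pod; the pod and image names are placeholders, and docker/crictl availability depends on the node's container runtime:

```bash
# Look for ErrImagePull / ImagePullBackOff in the pod's events
kubectl describe pod mypod

# Try pulling the exact image reference manually from a node
docker pull myregistry.example.com/myapp:1.0
# or, on a containerd node:
sudo crictl pull myregistry.example.com/myapp:1.0
```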

## Troubleshoot Network

  • Troubleshoot cluster networking

    • Verify network plugin is installed
    • Verify network policies [kubectl get netpol]
    • Verify kube-dns is working fine [kubectl -n kube-system get ep kube-dns, kubectl -n kube-system describe svc kube-dns]
  • Troubleshoot Services

    • Verify endpoints - kubectl get endpoints ${SERVICE_NAME} [the number of endpoints should match the number of ready replicas] (see the sketch after this list)
    • Verify service pod-selectors labels matches the labels of Pods/Deployments
    • Verify service's port matches with Pod/Container Port
    • Verify the service resolves via DNS lookup [service-name.default.svc.cluster.local]
    • Verify name resolution in /etc/resolv.conf
    • Verify if underlying pods are working
    • Verify if kube-proxy is working [ps auxw | grep kube-proxy] [Get logs from /var/log/kube-proxy.log]
    • Verify kubelet [ps auxw | grep kubelet]
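
A service-debugging sketch covering the checks above, assuming kubectl access; the service name my-service, namespace default, and label app=my-app are placeholders:

```bash
# Endpoints should list one address per ready, label-matching pod
kubectl get endpoints my-service
kubectl get svc my-service -o wide        # compare the selector and ports with the pods below
kubectl get pods -l app=my-app -o wide

# DNS resolution from inside the cluster
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup my-service.default.svc.cluster.local

# kube-dns/CoreDNS and kube-proxy health
kubectl -n kube-system get pods -l k8s-app=kube-dns
ps auxw | grep kube-proxy
```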