Explore K8s Pod restarts, failures, anomaly detections, handlers/hooks & remedial actions in Kubernetes #465

Closed
AmitKumarDas opened this issue Oct 3, 2017 · 4 comments


AmitKumarDas commented Oct 3, 2017

Is this a BUG REPORT or FEATURE REQUEST?

FEATURE REQUEST

Why this feature request?

This should help us organize our thoughts into actionable tasks (issues that can be implemented & closed in a couple of days, or a week at most) w.r.t. OpenEBS volume High Availability. One can expect a lot of actionable tasks that will go on to form the HA implementation of OpenEBS.

What should this feature request do?

  • Explore the various mechanisms of Pod & other K8s objects w.r.t this feature request
  • Explore the various K8s components involved w.r.t this feature request
  • Explore the various config options in these K8s components w.r.t this feature request
  • Explore if custom failure detection logic/components can fit into the existing K8s components
  • Explore the communication protocols used between the components w.r.t this feature request
  • Explore the endpoints of these components w.r.t this feature request
  • Build a workflow diagram that explains the life-cycle w.r.t this feature request.

AmitKumarDas commented Oct 3, 2017

Here are the summarized details w.r.t Pod Restarts, Failures, Anomaly Detection, & Remedial Actions

Liveness Probes:

  • Kubelet uses liveness probes to know when to restart a Container
  • Kubelet kills the Container & restarts it
  • Kubelet calls the probe handlers implemented by Containers
  • Liveness probes are executed by the kubelet, so all requests are made in the kubelet network namespace.
  • Types of Probe Handlers:
    • ExecAction, HTTPGetAction, TCPSocketAction
  • Results of Probe Handlers:
    • Success, Failure, Unknown
  • If the Container's process crashes or exits on its own when unhealthy, a liveness probe is not strictly necessary; the kubelet automatically performs the corrective action in accordance with the Pod's restartPolicy
  • If you want the Container to be killed & restarted if a probe fails, then specify a liveness probe, & specify a restartPolicy of Always or OnFailure
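A minimal sketch of a liveness probe using the HTTPGetAction handler; the name, image, path, and port below are illustrative assumptions, not from this issue:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo            # hypothetical name
spec:
  restartPolicy: Always          # so a failed probe leads to kill & restart
  containers:
  - name: app
    image: example/app:1.0       # assumed image
    livenessProbe:
      httpGet:                   # HTTPGetAction; ExecAction & TCPSocketAction also exist
        path: /healthz           # assumed health endpoint
        port: 8080
      initialDelaySeconds: 15    # give the container time to start before probing
      periodSeconds: 10          # kubelet probes every 10 seconds
      failureThreshold: 3        # kill & restart after 3 consecutive failures
```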

Readiness Probes:

  • Kubelet uses readiness probes to know when a Container is ready to start accepting traffic
  • Failure in readiness probe triggers the endpoints controller to remove the Pod IP address from the endpoints of all Services that match the Pod.
  • Note that if you just want to be able to drain requests when the Pod is deleted, you do not necessarily need a readiness probe; on deletion, the Pod automatically puts itself into an unready state regardless of whether the readiness probe exists. The Pod remains in the unready state while it waits for the Containers in the Pod to stop.
  • Use-Case: It can act as a signal to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
  • Use-Case: If you want your Container to be able to take itself down for maintenance, you can specify a readiness probe that checks a specific readiness endpoint. This endpoint should be different from the liveness endpoint.
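A hedged sketch of the matching readiness probe; this snippet would sit alongside the livenessProbe in the container spec of the sketch above, and the endpoint is deliberately distinct from the liveness endpoint (both paths are assumptions):

```yaml
    readinessProbe:
      httpGet:
        path: /ready             # assumed readiness endpoint, distinct from /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5           # while this fails, the endpoints controller removes the Pod IP
```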

PostStart Container Hook:

  • This hook executes immediately after a container is created
  • There is no guarantee that the hook will execute before the container ENTRYPOINT
  • Q - Is this invoked during a container restart?

PreStop Container Hook:

  • This hook is called immediately before a container is terminated.
  • It is blocking, meaning it is synchronous, so it must complete before the call to delete the container can be sent.
  • Q - Is this PreStop hook invoked during a container restart?
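A sketch showing both hooks on one container; the name, image, and commands are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hooks-demo               # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:1.0       # assumed image
    lifecycle:
      postStart:
        exec:                    # runs right after creation; no ordering guarantee w.r.t. ENTRYPOINT
          command: ["/bin/sh", "-c", "echo started >> /tmp/hooks.log"]
      preStop:
        exec:                    # blocking: must complete before the container delete is sent
          command: ["/bin/sh", "-c", "sleep 5"]   # e.g. let in-flight requests drain
```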

Pod RestartPolicy

  • Pod with single container exits with a Success
    • Log completion event
    • restartPolicy = Always
      • Container is restarted & Pod phase stays Running
    • restartPolicy = OnFailure
      • Pod phase becomes Succeeded
    • restartPolicy = Never
      • Pod phase becomes Succeeded
  • Pod with single container exits with a Failure
    • Log failure event
    • restartPolicy = Always
      • Container is restarted & Pod phase stays Running
    • restartPolicy = OnFailure
      • Container is restarted & Pod phase stays Running
    • restartPolicy = Never
      • Pod phase becomes Failed
  • Pod with single container runs out of memory
    • Log OOM event
    • restartPolicy = Always
      • Container is restarted & Pod phase stays Running
    • restartPolicy = OnFailure
      • Container is restarted & Pod phase stays Running
    • restartPolicy = Never
      • Log failure event. Pod phase becomes Failed
  • Pod is running & a DISK dies
    • Log appropriate event
    • Pod phase becomes Failed
    • If running under a controller, Pod is recreated elsewhere
  • Pod is running & its node is segmented out
    • Node controller waits for timeout
    • Node controller sets Pod phase to Failed
    • If running under a controller, Pod is recreated elsewhere
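A small sketch that exercises the OnFailure rows above; the name, image, and command are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restart-demo             # hypothetical name
spec:
  restartPolicy: OnFailure       # restarted on failure; phase becomes Succeeded on a clean exit
  containers:
  - name: task
    image: busybox
    command: ["sh", "-c", "exit 1"]   # non-zero exit, so the kubelet restarts it with back-off
```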

Init Containers

  • They are run before the app containers are started
  • There can be multiple init containers per pod that are run in a pre-determined sequence
  • Useful w.r.t start-up related code
  • Useful w.r.t security settings
  • Useful to include utilities e.g. sed, awk, etc
  • Useful w.r.t setting up config files
  • Used in StatefulSets
  • Use-Case: frontend app waits for backend app & backend app waits for DB app
  • NOTE - Check the various resource quotas given to a Pod
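A sketch of the frontend-waits-for-backend use-case; the Service name "backend" and the images are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-demo                # hypothetical name
spec:
  initContainers:                # run to completion, in declaration order, before app containers
  - name: wait-for-backend
    image: busybox               # ships the usual utilities (sh, nslookup, sed, awk)
    command: ["sh", "-c", "until nslookup backend; do echo waiting; sleep 2; done"]
  containers:
  - name: frontend
    image: example/frontend:1.0  # assumed image
```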

Involuntary Disruptions to a Pod

  • Hardware failure of the physical machine backing the node
  • Cloud Provider or Hypervisor failures makes VM disappear
  • Kernel panic
  • Node disappears from the cluster due to cluster network partition
  • Eviction of pod due to node being out of resources

Voluntary Disruptions to a Pod (may be human triggered or automated via some intelligent tool)

  • Deleting the Deployment
  • Updating the Deployment's Pod template, causing a restart
  • Directly deleting a Pod
  • Draining a Node for repair/upgrade/scale down
  • Removing a Pod from a node to permit something else to fit on that node

Disruption Budgets

  • Kubernetes offers features to help run highly available applications even in the face of frequent voluntary disruptions.
  • This set of features is called Disruption Budgets.
  • PodDisruptionBudget (PDB) object can be created for an app
  • PDB can limit the no. of pods of a replicated app that are down simultaneously from voluntary disruptions
  • NOTE: Tools should respect PDB by calling Eviction API instead of directly deleting Pods.
    • e.g. kubectl drain
    • Kubernetes-on-GCE cluster upgrade script
  • PDB helps in separation of Cluster Owner Role & Application Owner Role
  • Use-Case: A quorum-based app would like to ensure the no. of running replicas is above certain value
  • Use-Case: A web front end would like to ensure no. of replicas serving the load never falls below a certain percentage of total replicas
  • Use-Case: A highly available stateless app with 90% uptime
    • use PDB with minAvailable 90%
  • Use-Case: A highly available single instance stateful app
    • Do not use PDB & tolerate occasional downtime
    • or Set PDB with maxUnavailable=0
      • NOTE: This will block Node drain. Need to remove the PDB when Node drain is required.
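A sketch of the minAvailable 90% use-case; policy/v1beta1 was the PDB API group current when this issue was filed, and the name and labels are assumptions:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                  # hypothetical name
spec:
  minAvailable: "90%"            # voluntary disruptions may not drop availability below 90%
  selector:
    matchLabels:
      app: web-frontend          # assumed label on the replicated app's Pods
```

Tools such as kubectl drain then go through the Eviction API, which refuses evictions that would violate this budget.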

Eviction Policy

  • Kubelet can proactively monitor for & guard against total starvation of a compute resource
  • In such cases, the kubelet can proactively fail one or more pods in order to reclaim the starved resource

Eviction Signals

  • Kubelet triggers eviction decisions based on eviction signals
  • Eviction Signals:
    • memory.available
    • nodefs.available
    • nodefs.inodesFree
    • imagefs.available
    • imagefs.inodesFree
  • Deriving the value of memory.available
    • its value is derived from cgroups instead of free -m
    • Why? Because free does not work inside a container

Eviction Threshold

  • Evictions are categorized as Soft Eviction or Hard Eviction based on whether a grace period is associated with the threshold
  • For a soft eviction, the kubelet uses the lesser of pod.Spec.TerminationGracePeriodSeconds and the max allowed grace period (--eviction-max-pod-grace-period)
  • For a hard eviction (no grace period specified), the kubelet kills pods immediately with no graceful termination
  • Kubelet evaluates eviction thresholds at its monitoring interval, i.e. housekeeping-interval
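A hedged sketch of the kubelet flags involved; the flag names are real kubelet flags, but the values are illustrative only (quotes keep the shell from treating < as redirection):

```sh
kubelet \
  --eviction-soft='memory.available<500Mi' \
  --eviction-soft-grace-period='memory.available=1m' \
  --eviction-max-pod-grace-period=60 \
  --eviction-hard='memory.available<100Mi,nodefs.available<10%'
```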

Node Conditions & Eviction Signals

  • MemoryPressure as a Node Condition
    • memory.available is the Eviction Signal
  • DiskPressure as a Node Condition
    • nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree are the Eviction Signals
  • Kubelet reports node status updates at a frequency specified by
    • --node-status-update-frequency which defaults to 10s

Oscillating Node Conditions

  • What is it?
    • Frequent oscillation above & below a soft eviction threshold, without ever exceeding its associated grace period
  • This will cause poor scheduling decisions as a consequence
  • Protection against this oscillation can be done by the following flag:
    • --eviction-pressure-transition-period
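Sketched with an illustrative value:

```sh
# a node condition must stay clear of the threshold for this long before it transitions back
kubelet --eviction-pressure-transition-period=5m
```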

Node Allocatable

  • Allocatable on a Kubernetes node is defined as the amount of compute resources that are available for pods.
  • The scheduler does not over-subscribe Allocatable.
  • CPU, memory and ephemeral-storage are supported as of now.
  • Read more on cgroup flags & cgroup driver for the details/implementation
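Roughly, per the node allocatable design (a simplification):

```
Allocatable = Node Capacity - kube-reserved - system-reserved - hard eviction thresholds
```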

Kube Reserved

  • kube-reserved is meant to capture resource reservation for kubernetes system daemons like:
    • kubelet,
    • container runtime,
    • node problem detector, etc.
  • It is not meant to reserve resources for system daemons that are run as pods.
  • kube-reserved is typically a function of pod density on the nodes.
  • Refer to general guidelines on kube-reserved & system-reserved
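A sketch of the flags involved; the values are illustrative assumptions, not recommendations:

```sh
kubelet \
  --kube-reserved=cpu=500m,memory=1Gi \
  --system-reserved=cpu=500m,memory=500Mi \
  --enforce-node-allocatable=pods   # enforce Allocatable on pods via cgroups
```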


Useful Notes

  • restartPolicy only refers to restarts of the Containers by the Kubelet on the same node

  • Failed Containers that are restarted by the kubelet are restarted with an exponential back-off delay (10s, 20s, 40s …) capped at five minutes; the back-off is reset after ten minutes of successful execution

  • If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy for setting the phase of all Pods on the lost node to Failed.

  • Pods with a phase of Succeeded or Failed for more than some duration (determined by the master) will expire and be automatically destroyed.

  • A user-controlled Pod delete can be issued with a grace period

    • If the grace period expires before the PreStop hook completes, the kubelet grants a small extended grace period
    • The default grace period for delete is 30 seconds
  • kubelet supports only two filesystem partitions:

    • The nodefs filesystem that kubelet uses for volumes, daemon logs, etc.
    • The imagefs filesystem that container runtimes use for storing images and container writable layers.
  • kubelet auto-discovers these filesystems using cAdvisor.

AmitKumarDas commented

Since this issue acts as a reference point, it will be updated as & when required even though it is marked as closed.

loren-osborn commented
Just some updated links for anyone arriving here late. The tickets above have new numbers:

Implementation Specific Tasks:

Verification Specific Tasks:
