Proposal: Forgiveness #1574

Open
bgrant0607 opened this Issue Oct 3, 2014 · 22 comments
@bgrant0607
Member

Forked from #598 -- see that issue for more background. We don't need to implement this now, but recording the idea for posterity.

If we need more control over pod durability (e.g., to configure the node controller behavior -- #1366), rather than a single pin/persistence bit, I suggest we implement forgiveness: a list of (event type, optional duration, optional rate) tuples describing the disruption events (e.g., host unreachability) the pod will tolerate. We could support an "any" event type and infinite duration for pods that want to be pinned regardless of what happens.

This approach would generalize nicely for cases where, for example, applications wanted to endure reboots but give up in the case of extended outages or in the case that the disk goes bad. We're also going to want to use a similar spec for availability requirements / failure tolerances of sets of pods.

@smarterclayton
Contributor

Is the decision to unbind a pod from a host (delete it) a decision a node controller makes (delete all pods on node, then delete node)? Or is it reactive by the system from the node being deleted (node deleted, therefore some process needs to clean up pods)? The former seems like annotations and a custom node controller would solve (delete all pods immediately that have forgiveness true). The latter seems like it would have to be supported by the system itself.

This is probably a separate issue but I don't know where it is - should a node have a concept of being unavailable for scheduling?

@bgrant0607
Member

@smarterclayton Yes, a node should have a concept of being unavailable for scheduling, and the system (a node controller) should also be responsible for removing pods from non-functional nodes -- see #1366. This separates concerns of pod controlling vs. node controlling (for the most part). Both control systems will get more and more sophisticated over time.

@bgrant0607
Member

Sketch of forgiveness from #3058:

// goes in Pod
    ForgivenessPolicy *ForgivenessPolicy `json:"forgivenessPolicy,omitempty"`

type ForgivenessPolicy struct {
    Thresholds []ForgivenessThreshold `json:"thresholds"`
}

type ForgivenEvent string

const (
    ForgivenDead            ForgivenEvent = "Dead"
    ForgivenNotReady        ForgivenEvent = "NotReady"
    ForgivenDeadNode        ForgivenEvent = "DeadNode"
    ForgivenUnreachableNode ForgivenEvent = "UnreachableNode"
)

type ForgivenessThreshold struct {
    Event             ForgivenEvent `json:"event"`
    ContiguousSeconds int           `json:"contiguousSeconds,omitempty"`
    Rate              int           `json:"rate,omitempty"`
    RatePeriodSeconds int           `json:"ratePeriodSeconds,omitempty"`
}
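
For illustration, here is a minimal sketch of how a pod might populate this policy. The types are copied from the sketch above (JSON tags dropped for brevity); the specific threshold values are hypothetical:

```go
package main

import "fmt"

// Types from the sketch above, without JSON tags.
type ForgivenEvent string

const (
	ForgivenDead            ForgivenEvent = "Dead"
	ForgivenNotReady        ForgivenEvent = "NotReady"
	ForgivenDeadNode        ForgivenEvent = "DeadNode"
	ForgivenUnreachableNode ForgivenEvent = "UnreachableNode"
)

type ForgivenessThreshold struct {
	Event             ForgivenEvent
	ContiguousSeconds int
	Rate              int
	RatePeriodSeconds int
}

type ForgivenessPolicy struct {
	Thresholds []ForgivenessThreshold
}

func main() {
	// Tolerate up to 5 minutes of contiguous node unreachability, and up
	// to 3 NotReady episodes per hour; any disruption beyond these
	// thresholds means the pod may be deleted and rescheduled.
	policy := ForgivenessPolicy{
		Thresholds: []ForgivenessThreshold{
			{Event: ForgivenUnreachableNode, ContiguousSeconds: 300},
			{Event: ForgivenNotReady, Rate: 3, RatePeriodSeconds: 3600},
		},
	}
	fmt.Println(len(policy.Thresholds))
}
```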
@smarterclayton
Contributor

OpenShift implemented a "manage-node" command which models an evacuate action. Currently it's up to the admin to decide whether the pods will be moved, but a forgiveness value on the pod could be the input to that command (just like the automatic evacuation implemented in the node controller). A node holding a pod with a long forgiveness policy is also less transient, so there is potential value in making scheduling decisions based on forgiveness (where increasing the forgiveness of a given node has a cost, but scheduling a pod with lower forgiveness is free). That's probably more a model for traditional operations teams, and is certainly not a high priority.

@smarterclayton
Contributor

Do we want to simplify forgiveness to try to get it for 1.1? As a general policy value it's useful to have in the system.

@davidopp
Member

I don't know if it would be high enough priority to commit to for 1.1, but I could imagine the first step being just a bool (the pod is treated normally, i.e. deleted by NodeController after the standard timeout, vs. the pod stays bound to the node forever unless manually deleted or the node is deleted).

@smarterclayton
Contributor

How about a duration?

We want to dramatically reduce the failure detection time for certain workloads on the cluster, which implies some way of distinguishing pods on an unhealthy node that should be aggressively considered down (so that the replication controller gets the earliest chance to proceed).

Maybe there's a more tactical way to manage that, but this does seem like something that would let admins start to manage SLAs more effectively and that fits into the long-term pattern.

@davidopp
Member

That seems reasonable to me; I was just suggesting the simplest thing we could do. What you suggested would also make it a subset of what @bgrant0607 suggested in his comment above (#1574 (comment)) -- you'd just have a ContiguousSeconds and basically nothing else. (As opposed to the bool, which doesn't really fit into what he suggested.)

Not sure about prioritization for 1.1 though.

@smarterclayton
Contributor

If we added the value but did not leverage it in the node controller (or made it the simplest possible expression, such as changing the evacuate behavior to use the minimum forgiveness), would we be in a bad spot?

As far as prioritization goes, it's something we'd be interested in implementing, so it's really a question of API design and scope of change.

@bgrant0607
Member

cc @mml

@bgrant0607
Member

See also discussion of tolerations: #17190

@smarterclayton
Contributor

We're seeing the need for forgiveness as a mitigation for solving data gravity in the short term. If it coincidentally makes evacuations more effective all the better.

I assume the node controller changes are the trickiest part, because we want to make pod deletion choices based on forgiveness, not the blanket node timeout. We would need to know which pod on a node has the minimum forgiveness, then tie each deletion to the progression of status changes.

@mml
Contributor
mml commented Dec 7, 2015

@bgrant0607 Reading this and #17190 back to back, it seems cleanest to consider all kinds of unavailability, as well as things like dedication, as a single concept: taints. An unreachable node has a taint of one kind, one that's marked unschedulable has a different kind of taint. There would be a taint for a failed disk and also one that implies "this machine is dedicated", and some kind of administrative control would be required to say who can ask for which kinds of tolerations.

This is pretty clean, but how do taints differ from labels? Can we reduce the number of concepts by 1 more?

@bgrant0607
Member

@mml I agree we should unify unavailability using taints and tolerations. Forgiveness and tolerations were distinct concepts in Omega for reasons that don't apply to Kubernetes.

As for unifying attributes (labels) and taints, that would require more thought. One could imagine NodeConstraints, NodePreferences (see #18261), and NodeTolerations having similar structures and concepts. However, we want the default behavior for taints and ordinary labels to be different, which is why we developed the concept of taints originally, automatically setting node selectors to achieve taint-like behavior has some problems (discussed in #17190), and I can't think of a reason why we'd want taints to apply to API resources other than Nodes (and maybe PersistentVolumes). Taints vs. other approaches should be discussed on #18263.
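
For reference, the basic taint/toleration relationship being discussed can be sketched as a simple match predicate (simplified hypothetical types; the real API has more fields):

```go
package main

import "fmt"

// Taint marks a node as having some problem or special property.
type Taint struct {
	Key    string
	Value  string
	Effect string // e.g. "NoSchedule"
}

// Toleration lets a pod opt in to running despite a matching taint.
type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"
	Value    string
	Effect   string
}

// tolerates reports whether a toleration matches a taint: key and effect
// must match, and for the "Equal" operator the values must match too.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Key != taint.Key || tol.Effect != taint.Effect {
		return false
	}
	return tol.Operator == "Exists" || tol.Value == taint.Value
}

func main() {
	unreachable := Taint{Key: "node.unreachable", Effect: "NoSchedule"}
	pin := Toleration{Key: "node.unreachable", Operator: "Exists", Effect: "NoSchedule"}
	fmt.Println(tolerates(pin, unreachable)) // true
}
```

This is also where taints differ from plain labels: an untolerated taint changes behavior by default, whereas a label is inert unless a selector references it.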

@davidopp
Member
davidopp commented Jul 7, 2016

Updating with the current status and plan.

Taints and Tolerations have been implemented as an alpha feature (see #25320 for details on what remains to be done to complete it).

@kevin-wangzefeng, @Dingshujie, and I came up with the following short-term plan:

  • add a ForgivenessSeconds pointer field to Toleration. Unset means tolerate forever. Valid values when set are > 0, meaning forgive for that many seconds.
  • refactor the set of TaintEffect to be NoSchedule, PreferNoSchedule, NoScheduleNoAdmit, and NoExecute. This allows you to specify all the same things you can in the current scheme, but separating NoExecute makes it clearer how a pod that requests forgiveness can choose whether or not it also wants NoSchedule behavior. Of course this assumes that nodes publish both a NoSchedule taint and a NoExecute taint when they want the default behavior to be "no schedule and evict". NoExecute is the only effect for which a toleration may specify forgiveness.
  • until we get a chance to have NodeController write taints for machine problems (which is the long-term plan), we will put a small hack in NodeController that causes it to not evict a pod that specifies a special pre-defined NoExecute toleration (something like {Key: NodeDown, Operator: Exists, Effect: NoExecute, ForgivenessSeconds: <x>})

@Dingshujie asked some good questions about what should the PodStatus and ContainerStatus look like for a pod/container whose node is down and the pod/container is in its forgiveness period. Current proposal is:

  • podPhase stays as "Running"
  • podCondition goes to {type=Ready, status=false}. We need to think of an appropriate Reason and Message.
  • containerStatus: set Ready=false. Don't touch ContainerStatus.ContainerState (just leave it as running)
  • for pod with initContainers, keep the podCondition {type=Initialized} and InitContainerStatuses as they were
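
The status handling proposed above amounts to flipping only the Ready condition. A minimal sketch with hypothetical stand-in types (not the real API structs), where the Reason value is a placeholder since the proposal leaves it open:

```go
package main

import "fmt"

// Minimal stand-ins for the status types under discussion.
type PodCondition struct {
	Type   string
	Status string
	Reason string
}

type PodStatus struct {
	Phase      string
	Conditions []PodCondition
}

// markNodeDown applies the proposal: phase stays Running, the Ready
// condition flips to False, and everything else (Initialized condition,
// container states) is left untouched.
func markNodeDown(s *PodStatus) {
	for i := range s.Conditions {
		if s.Conditions[i].Type == "Ready" {
			s.Conditions[i].Status = "False"
			s.Conditions[i].Reason = "NodeDown" // placeholder; real Reason TBD
		}
	}
}

func main() {
	status := PodStatus{
		Phase: "Running",
		Conditions: []PodCondition{
			{Type: "Initialized", Status: "True"},
			{Type: "Ready", Status: "True"},
		},
	}
	markNodeDown(&status)
	fmt.Println(status.Phase, status.Conditions[1].Status) // Running False
}
```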

cc/ @yujuhong

@yujuhong
Contributor
yujuhong commented Jul 7, 2016


SGTM.

@Dingshujie Dingshujie was assigned by davidopp Jul 7, 2016
@erictune
Member

Is there any doc that has a couple of examples of concrete real world applications and what forgiveness values we recommend for Taints and Tolerations?

@smarterclayton
Contributor

We have a few apps running that are "must never be evicted", where the user has chosen to turn off node eviction today. I have fewer examples of >30 min, except a few production DBs that would prefer to be moved only if absolutely necessary, because restore takes >30 min.


@davidopp
Member
davidopp commented Oct 29, 2016 edited

Yeah I think there are three categories

  • want to be evicted sooner than the default 5 minutes -- we have an example of this from a GKE customer running a session-based application where every container handles a set of users, and if the node becomes unavailable that set of users can't be served until a replacement container is created on a new node (they requested 1 minute timeout)
  • want to be evicted after more than 5 minutes but less than infinite -- I think Clayton answered this one; unreplicated apps that take a long time to rebuild their state if restarted on a different node
  • never want to be evicted -- anything that runs as a DaemonSet today falls into this category (we can get rid of the hacky code that handles DaemonSet specially with respect to evictions, and just give them infinite tolerations); sounds like Clayton has other examples (I would think anything that is imported from a legacy VM world where the app can't handle "moving" once started without operator intervention and can't be rewritten to be a PersistentSet would fall into this category)
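
As an illustration, the three categories might map onto NoExecute tolerations like this (hypothetical values, using the ForgivenessSeconds field proposed earlier in the thread):

```go
package main

import "fmt"

// Hypothetical toleration shape from the plan earlier in the thread.
type Toleration struct {
	Key                string
	Operator           string
	Effect             string
	ForgivenessSeconds *int64 // nil means tolerate forever
}

// seconds is a small helper to take the address of a literal.
func seconds(n int64) *int64 { return &n }

func main() {
	// 1. Faster than the default 5 minutes (the session-based app): 60s.
	fast := Toleration{Key: "NodeDown", Operator: "Exists", Effect: "NoExecute", ForgivenessSeconds: seconds(60)}
	// 2. Slower than the default (an unreplicated DB with slow restore): 30 min.
	slow := Toleration{Key: "NodeDown", Operator: "Exists", Effect: "NoExecute", ForgivenessSeconds: seconds(1800)}
	// 3. Never evicted (e.g. a DaemonSet pod): ForgivenessSeconds left unset.
	never := Toleration{Key: "NodeDown", Operator: "Exists", Effect: "NoExecute"}
	fmt.Println(*fast.ForgivenessSeconds, *slow.ForgivenessSeconds, never.ForgivenessSeconds == nil)
}
```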
@smarterclayton
Contributor

Anything talking to legacy network-attached storage (iSCSI, Fibre Channel, NFS over SAN) will probably want to be using a stateful set + infinite forgiveness for quite a while, until we implement volume locking and have a fencer in place. That covers most mission-critical data stores I'm aware of -- Oracle, SAP, etc. Some of those will be "nodes as pets", where there is extrinsic HA applied to the node and eviction would be impractical.

In terms of general "not being surprised by the platform", there is a class of existing workloads that should have very high forgiveness until they have time to verify they can be moved.

