
kubectl drain doesn't delete pods created by PetSet #33727

Closed
kzwang opened this issue Sep 29, 2016 · 27 comments
Labels
area/kubectl area/stateful-apps kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@kzwang
Contributor

kzwang commented Sep 29, 2016

Kubernetes version (use kubectl version):
1.4.0

What happened:
When using kubectl drain on a node, it shows the error Unknown controller kind "PetSet".

What you expected to happen:
kubectl drain should remove all pods on that node created by a PetSet.

@pwittrock pwittrock added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/bug Categorizes issue or PR as related to a bug. labels Sep 29, 2016
@0xmichalis
Contributor

cc: @smarterclayton @bprashanth

@mengqiy
Member

mengqiy commented Oct 3, 2016

I can reproduce it.

@bprashanth
Contributor

bprashanth commented Oct 3, 2016

Drain is just not implemented for petset; you can cordon the node (kubectl cordon) and delete the pets on it yourself. Simply draining the node is risky because you might end up deleting all your quorum members at once, for example. Pets require extra care; if you're running something in a petset that doesn't require such care, you can probably get by with a replica set.

To implement drain on petset with reduced risk:

  1. Only 1 pet must be leaving the cluster at any time
  2. No pets should join while a pet is leaving

The first point means kubectl needs to delete a pet, wait for it to completely finish its termination grace period, and only then delete the next one. The second point means we need to prevent the petset controller from creating pets on other nodes while this is happening.

The easiest way to do this right now is:

  1. Pick the pets to delete, say pet-(3,4)
  2. Update this annotation on both of them: http://kubernetes.io/docs/user-guide/petset/#troubleshooting (kind of a hack, but the point is we need to freeze the petset controller).
  3. Delete pet-3, wait for it to disappear
  4. Petset controller is blocked on pet-4.initialized=false
  5. Delete pet-4, wait for it to disappear
  6. Petset controller is blocked on pet-4.initialized=false
  7. pet-4 disappears from apiserver, PetSet controller creates pet-3
  8. pet-3 becomes ready, PetSet controller creates pet-4

Having a human in the loop obviously helps ordering because the human can make sure we don't keep deleting the master.
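
A rough shell sketch of that manual procedure, using the pet names from the steps above and a hypothetical node name; it assumes the pause annotation is the pod.alpha.kubernetes.io/initialized debug hook from the troubleshooting doc linked in step 2:

```sh
# Keep new pods off the node we are emptying.
kubectl cordon node-1

# Freeze the PetSet controller by marking the affected pets uninitialized
# (the debug-hook annotation from the troubleshooting doc).
kubectl annotate pod pet-3 pet-4 pod.alpha.kubernetes.io/initialized="false" --overwrite

# Delete one pet at a time, waiting for each to fully disappear.
kubectl delete pod pet-3
while kubectl get pod pet-3 >/dev/null 2>&1; do sleep 5; done

kubectl delete pod pet-4
while kubectl get pod pet-4 >/dev/null 2>&1; do sleep 5; done

# Once both annotated pets are gone, nothing blocks the controller any more:
# it recreates pet-3, waits for it to become ready, then recreates pet-4.
```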

@bprashanth bprashanth added kind/feature Categorizes issue or PR as related to a new feature. area/stateful-apps and removed kind/bug Categorizes issue or PR as related to a bug. labels Oct 3, 2016
@mengqiy
Member

mengqiy commented Oct 3, 2016

>   1. Pick the pets to delete, say pet-(3,4)
>   2. Update this annotation on both of them: http://kubernetes.io/docs/user-guide/petset/#troubleshooting (kind of a hack, but the point is we need to freeze the petset controller).
>   3. Delete pet-3, wait for it to disappear
>   4. Petset controller is blocked on pet-4.initialized=false
>   5. Delete pet-4, wait for it to disappear
>   6. Petset controller is blocked on pet-4.initialized=false
>   7. pet-4 disappears from apiserver, PetSet controller creates pet-3
>   8. pet-3 becomes ready, PetSet controller creates pet-4

@bprashanth If we need to delete n pods on a node, there will be at most n pods unavailable during some period of time. Is that OK?

@bprashanth
Contributor

bprashanth commented Oct 3, 2016

It's preferable to have at most 1 pod unavailable at a time.

That pod will have a procedure to leave the cluster, e.g. nodetool decom. Most docs will be written in a way that describes this process for a single node (e.g. https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_replace_live_node.html). You will also probably find docs on how to recover the cluster, should other nodes end up dying/coming up while this happens.

The goal here is not to invoke the manual healing process, but to stick as close as possible to the documented remove-a-single-node, re-add-a-single-node procedure.

@kzwang
Contributor Author

kzwang commented Oct 3, 2016

@bprashanth I can understand that deleting pets is more complicated than deleting other pods. But currently, the getController() function returns an error, so kubectl exits with an error without deleting any pods. I think the command should at least delete all other pods, and maybe have a flag to force-delete pets (like the current --force flag for force-deleting pods not managed by a ReplicationController, ReplicaSet, etc.)?

@smarterclayton
Contributor

I don't want petset pods to be special. I think we may need to use a disruption budget if you want to guarantee petsets don't take downtime during a drain (other than a manual drain impl). The disruption budget should be responsible for this.

@bprashanth
Contributor

Yes, my proposal was for the case where you want to drive the drain from kubectl.

@smarterclayton
Contributor

Should kubectl look at disruption budgets too?

@bprashanth
Contributor

Don't see why not. I think @pwittrock and @ymqytw are working on getting kubectl to respect disruption budgets in general. This might mean the petset controller needs to create a default budget.

@foxish
Contributor

foxish commented Oct 12, 2016

@bprashanth Could the preStop hook be used to ensure that pods terminate in a safe manner, such as executing nodetool decommission or the equivalent for some workloads?
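
For reference, a hook along those lines would sit in the pet's pod template. Here is a minimal, hypothetical sketch; the image, grace period, and nodetool decommission command are just the Cassandra example from this thread, not something the PetSet controller provides:

```sh
# Sketch only: in practice the lifecycle block lives in the PetSet/StatefulSet
# pod template rather than in a standalone Pod.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cassandra-0
spec:
  terminationGracePeriodSeconds: 1800   # long enough for decommission to finish
  containers:
  - name: cassandra
    image: cassandra:3.9
    lifecycle:
      preStop:
        exec:
          # Run the workload's own "leave the cluster" step before the
          # container is stopped; it runs within the deletion grace period.
          command: ["/bin/sh", "-c", "nodetool decommission"]
EOF
```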

@bprashanth
Contributor

preStop is taken out of the deletion grace period, so yeah, you can in theory use either of those, and preferably the more standard one (deletion grace). Some downsides of only having one "tear down" event were discussed here: #28706 (comment)

However, this alone doesn't solve the problem that more than 1 pet shouldn't decommission simultaneously.

@foxish
Contributor

foxish commented Oct 21, 2016

We may not need to "pause" the petset controller at all. Expanding on @bprashanth's earlier points, the progression would be as follows:

Only 1 pet must be leaving the cluster at any time

  1. Cordon a node and start the drain procedure.
  2. Find pets to delete, say pet-3, pet-4 running on that node. (Find the PodDisruptionBudget, which here allows, say, 1 disruption.)
  3. Delete either pet, say pet-3
  4. Wait for pet-3 to become running & ready on a different node.
  5. Delete next pet, and so on.
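
A shell sketch of that loop (the pod and node names are the hypothetical ones from the steps above; readiness is read from the pod's Ready condition):

```sh
kubectl cordon node-1

# One pet at a time: delete it, wait for the old pod object to vanish, then
# wait for the controller's replacement to report Ready on its new node.
for pet in pet-3 pet-4; do
  kubectl delete pod "$pet"
  while kubectl get pod "$pet" >/dev/null 2>&1; do sleep 5; done
  until [ "$(kubectl get pod "$pet" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)" = "True" ]; do
    sleep 5
  done
done
```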

No pets should join while a pet is leaving

This may happen when:

  • a pet is being created by the PetSetController (maybe due to a scale operation) at the same time as we start draining a node, or execute a kubectl delete command.
    Some distributed systems like C* have issues when workers leave and join at the same time. We are not preventing people from shooting themselves in the foot for 1.5 and instead going with documenting the caveats of draining a node running pets.

The proposed way of doing this for 1.5 as per discussion is as follows:

@chrislovecnm
Contributor

So in master, PetSets are now StatefulSets.

We need this in 1.5 as StatefulSets and in 1.4.x as PetSets. Not getting it backported is a showstopper.

No idea how; @foxish, ideas?

@bprashanth
Contributor

bprashanth commented Oct 29, 2016

Backporting API changes will break people in a minor release; generally a recipe for disaster.

@chrislovecnm
Contributor

@bprashanth that is amazing news ... Not. Does drain work via the API?

@foxish
Contributor

foxish commented Nov 1, 2016

> The proposed way of doing this for 1.5 as per discussion is as follows:

Looks like we're still doing the first, but the second regarding PDBs and defaults needs further discussion.
I think we're going to go with special-casing StatefulSets in client-side code for 1.5 to ensure that one pod is evicted at a time and recreated before drain proceeds.
@erictune, does that sound ok?

@foxish
Contributor

foxish commented Nov 1, 2016

Or are we changing the examples/docs to encourage including an explicit PDB, and not special-casing StatefulSets?

@foxish
Contributor

foxish commented Nov 1, 2016

Based on an offline discussion, we are not going to change client-side code to handle petsets specially, and will instead advise people to set up a PDB if they want special behavior. For 1.5, the eviction behavior will be the same for all pods, including those which are part of a statefulset. See #35318 (comment).

@ymqytw @smarterclayton

@smarterclayton
Contributor

SGTM. Special-casing the client is a dangerous precedent and isn't our long-term goal.

@chrislovecnm
Contributor

How are upgrades of k8s going to work with stateful sets running something like ZooKeeper?

@chrislovecnm
Contributor

Let me put some constraints around this. StatefulSets have to automatically evict on a drain. Otherwise, how are upgrades at scale going to be automated?

@foxish
Contributor

foxish commented Nov 2, 2016

@chrislovecnm The plan is to evict them using the eviction subresource. If there is one petset pod per node, then there is no issue when doing node drains. If one has a specific requirement where only N petset pods can be down at any given time, the right way for now would be to create a PodDisruptionBudget to reflect that. The eviction subresource respects the PDB. We plan on updating the documentation so that folks building production applications can create an explicit PDB for now.
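
For example, a minimal PodDisruptionBudget along those lines (the names and counts are hypothetical; with 3 replicas and minAvailable: 2, the eviction API will let at most one of these pods be down at a time):

```sh
kubectl create -f - <<'EOF'
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # allow at most one of the 3 "web" pods to be disrupted
  selector:
    matchLabels:
      app: web           # must match the labels on the statefulset's pods
EOF
```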

@foxish
Contributor

foxish commented Nov 2, 2016

s/petset/statefulset/g

@chrislovecnm
Contributor

SGTM - the feature formerly known as petset

@chrislovecnm
Contributor

@foxish this is specific to 1.5 though?

@foxish
Contributor

foxish commented Nov 2, 2016

Yes, this is specific to beta. We don't expect that the PDB/eviction mechanism for carrying out node drains will change even in GA, but we may have defaults, or some other way of specifying them, which will likely be proposed/discussed after 1.5.

k8s-github-robot pushed a commit that referenced this issue Nov 8, 2016
Automatic merge from submit-queue

Fix kubectl drain for statefulset

Support deleting pets for `kubectl drain`. 
Use evict to delete pods.

Fixes: #33727

```release-note
Adds support for StatefulSets in kubectl drain.
Switches to use the eviction sub-resource instead of deletion in kubectl drain, if server supports.
```

@foxish @caesarxuchao
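
For anyone following along, the eviction sub-resource is just a POST of an Eviction object against the pod; the API server refuses it (HTTP 429) when the eviction would violate a PodDisruptionBudget. A minimal sketch via kubectl proxy, with a hypothetical namespace and pod name:

```sh
# Proxy the API server to localhost in the background.
kubectl proxy --port=8001 &

# Ask the server to evict web-0. This is what kubectl drain now does instead
# of a plain delete, so PodDisruptionBudgets are respected.
curl -s -H "Content-Type: application/json" -X POST \
  http://localhost:8001/api/v1/namespaces/default/pods/web-0/eviction \
  -d '{
    "apiVersion": "policy/v1beta1",
    "kind": "Eviction",
    "metadata": {"name": "web-0", "namespace": "default"}
  }'
```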