
kubectl drain doesn't delete pods created by PetSet #33727

Closed
kzwang opened this issue Sep 29, 2016 · 27 comments
Labels
area/kubectl area/stateful-apps kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@kzwang
Contributor

kzwang commented Sep 29, 2016

Kubernetes version (use kubectl version):
1.4.0

What happened:
When using kubectl drain on a node, it shows the error Unknown controller kind "PetSet".

What you expected to happen:
kubectl drain should remove all pods on that node created by a PetSet.

@pwittrock pwittrock added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/bug Categorizes issue or PR as related to a bug. labels Sep 29, 2016
@0xmichalis
Contributor

cc: @smarterclayton @bprashanth

@mengqiy
Member

mengqiy commented Oct 3, 2016

I can reproduce it.

@bprashanth
Contributor

bprashanth commented Oct 3, 2016

Drain is just not implemented for petset; you can cordon the node (kubectl cordon) and delete the pets on it yourself. Simply draining the node is risky because you might end up deleting all your quorum members at once, for example. Pets require extra care; if you're running something in a petset that doesn't require such care, you can probably get by with a replica set.

To implement drain on petset with reduced risk:

  1. Only 1 pet must be leaving the cluster at any time
  2. No pets should join while a pet is leaving

The first point means kubectl needs to delete a pet, wait for it to completely finish its termination grace period, and only then delete the next one. The second point means we need to prevent the petset controller from creating pets on other nodes while this is happening.

The easiest way to do this right now is:

  1. Pick the pets to delete, say pet-(3,4)
  2. Update this annotation on both of them: http://kubernetes.io/docs/user-guide/petset/#troubleshooting (kind of a hack, but the point is we need to freeze the petset controller).
  3. Delete pet-3, wait for it to disappear
  4. Petset controller is blocked on pet-4.initialized=false
  5. Delete pet-4, wait for it to disappear
  6. Petset controller is blocked on pet-4.initialized=false
  7. pet-4 disappears from apiserver, PetSet controller creates pet-3
  8. pet-3 becomes ready, PetSet controller creates pet-4

Having a human in the loop obviously helps ordering because the human can make sure we don't keep deleting the master.
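
A rough shell sketch of that manual procedure, using the pet names from the steps above and a hypothetical node name; it assumes the pause annotation is the pod.alpha.kubernetes.io/initialized debug hook from the troubleshooting doc linked in step 2:

```sh
# Keep new pods off the node we are emptying.
kubectl cordon node-1

# Freeze the PetSet controller by marking the affected pets uninitialized
# (the debug-hook annotation from the troubleshooting doc).
kubectl annotate pod pet-3 pet-4 pod.alpha.kubernetes.io/initialized="false" --overwrite

# Delete one pet at a time, waiting for each to fully disappear.
kubectl delete pod pet-3
while kubectl get pod pet-3 >/dev/null 2>&1; do sleep 5; done

kubectl delete pod pet-4
while kubectl get pod pet-4 >/dev/null 2>&1; do sleep 5; done

# Once both annotated pets are gone, nothing blocks the controller any more:
# it recreates pet-3, waits for it to become ready, then recreates pet-4.
```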

@bprashanth bprashanth added kind/feature Categorizes issue or PR as related to a new feature. area/stateful-apps and removed kind/bug Categorizes issue or PR as related to a bug. labels Oct 3, 2016
@mengqiy
Member

mengqiy commented Oct 3, 2016

>   1. Pick the pets to delete, say pet-(3,4)
>   2. Update this annotation on both of them: http://kubernetes.io/docs/user-guide/petset/#troubleshooting (kind of a hack, but the point is we need to freeze the petset controller).
>   3. Delete pet-3, wait for it to disappear
>   4. Petset controller is blocked on pet-4.initialized=false
>   5. Delete pet-4, wait for it to disappear
>   6. Petset controller is blocked on pet-4.initialized=false
>   7. pet-4 disappears from apiserver, PetSet controller creates pet-3
>   8. pet-3 becomes ready, PetSet controller creates pet-4

@bprashanth If we need to delete n pods on a node, there will be at most n pods unavailable during some period of time. Is that OK?

@bprashanth
Contributor

bprashanth commented Oct 3, 2016

It's preferable to have at most 1 pod unavailable at a time.

That pod will have a procedure to leave the cluster, e.g. nodetool decom. Most docs will be written in a way that describes this process for a single node (e.g. https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_replace_live_node.html). You will also probably find docs on how to recover the cluster, should other nodes end up dying/coming up while this happens.

The goal here is not to invoke the manual healing process, but to stick as close as possible to the documented remove-a-single-node, re-add-a-single-node procedure.

@kzwang
Contributor Author

kzwang commented Oct 3, 2016

@bprashanth I can understand that deleting pets is more complicated than deleting other pods. But currently, the getController() function returns an error, so kubectl exits with an error without deleting any pods. I think the command should at least delete all other pods, and maybe have a flag to force-delete pets (like the current --force flag for force-deleting pods not managed by a ReplicationController, ReplicaSet, etc.)?

@smarterclayton
Contributor

I don't want petset pods to be special. I think we may need to use a disruption budget if you want to guarantee petsets don't take downtime during a drain (other than a manual drain impl). The disruption budget should be responsible for this.

@bprashanth
Contributor

Yes, my proposal was for the case where you want to drive the drain from kubectl.

@smarterclayton
Contributor

Should kubectl look at disruption budgets too?

@bprashanth
Contributor

Don't see why not. I think @pwittrock and @ymqytw are working on getting kubectl to respect disruption budgets in general. This might mean the petset controller needs to create a default budget.

@foxish
Contributor

foxish commented Oct 12, 2016

@bprashanth Could the preStop hook be used to ensure that pods terminate in a safe manner, such as executing nodetool decommission or the equivalent for some workloads?
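
For reference, a hook along those lines would sit in the pet's pod template. Here is a minimal, hypothetical sketch; the image, grace period, and nodetool decommission command are just the Cassandra example from this thread, not something the PetSet controller provides:

```sh
# Sketch only: in practice the lifecycle block lives in the PetSet/StatefulSet
# pod template rather than in a standalone Pod.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cassandra-0
spec:
  terminationGracePeriodSeconds: 1800   # long enough for decommission to finish
  containers:
  - name: cassandra
    image: cassandra:3.9
    lifecycle:
      preStop:
        exec:
          # Run the workload's own "leave the cluster" step before the
          # container is stopped; it runs within the deletion grace period.
          command: ["/bin/sh", "-c", "nodetool decommission"]
EOF
```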

@bprashanth
Contributor

preStop is taken out of the deletion grace period, so yeah, you can in theory use either of those, and preferably the more standard one (deletion grace). Some downsides of only having one "tear down" event were discussed here: #28706 (comment)

However, this alone doesn't solve the problem that more than 1 pet shouldn't decommission simultaneously.

@foxish
Contributor

foxish commented Oct 21, 2016

We may not need to "pause" the petset controller at all. Expanding on @bprashanth's earlier points, the progression would be as follows:

Only 1 pet must be leaving the cluster at any time

  1. Cordon a node and start the drain procedure.
  2. Find pets to delete, say pet-3, pet-4 running on that node. (Find the PodDisruptionBudget, which here allows, say, 1 disruption.)
  3. Delete either pet, say pet-3
  4. Wait for pet-3 to become running & ready on a different node.
  5. Delete next pet, and so on.
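
A shell sketch of that loop (the pod and node names are the hypothetical ones from the steps above; readiness is read from the pod's Ready condition):

```sh
kubectl cordon node-1

# One pet at a time: delete it, wait for the old pod object to vanish, then
# wait for the controller's replacement to report Ready on its new node.
for pet in pet-3 pet-4; do
  kubectl delete pod "$pet"
  while kubectl get pod "$pet" >/dev/null 2>&1; do sleep 5; done
  until [ "$(kubectl get pod "$pet" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)" = "True" ]; do
    sleep 5
  done
done
```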

No pets should join while a pet is leaving

This may happen when:

  • a pet is being created by the PetSetController (maybe due to a scale operation) at the same time as we start draining a node, or execute a kubectl delete command.
    Some distributed systems like C* have issues when workers leave and join at the same time. We are not preventing people from shooting themselves in the foot for 1.5 and instead going with documenting the caveats of draining a node running pets.

The proposed way of doing this for 1.5 as per discussion is as follows:

@chrislovecnm
Contributor

So in master, PetSets are now StatefulSets.

We need this in 1.5 as StatefulSets and in 1.4.x as PetSets. Not getting it backported is a showstopper.

No idea how; @foxish, ideas?

@bprashanth
Contributor

bprashanth commented Oct 29, 2016

Backporting API changes will break people in a minor release; generally a recipe for disaster.

@chrislovecnm
Contributor

@bprashanth that is amazing news ... Not. Does drain work via the API?

@foxish
Contributor

foxish commented Nov 1, 2016

> The proposed way of doing this for 1.5 as per discussion is as follows:

Looks like we're still doing the first, but the second regarding PDBs and defaults needs further discussion.
I think we're going to go with special-casing StatefulSets in client-side code for 1.5 to ensure that one pod is evicted at a time and recreated before drain proceeds.
@erictune, does that sound ok?

@foxish
Contributor

foxish commented Nov 1, 2016

Or are we changing the examples/docs to encourage including an explicit PDB, and not special-casing StatefulSets?

@foxish
Contributor

foxish commented Nov 1, 2016

Based on an offline discussion, we are not going to change client-side code to handle petsets specially, and will instead advise people to set up a PDB if they want special behavior. For 1.5, the eviction behavior will be the same for all pods, including those which are part of a statefulset. See #35318 (comment).

@ymqytw @smarterclayton

@smarterclayton
Contributor

SGTM. Special-casing the client is a dangerous precedent and isn't our long-term goal.

@chrislovecnm
Contributor

How are upgrades of k8s going to work with stateful sets running something like ZooKeeper?

@chrislovecnm
Contributor

Let me put some constraints around this. StatefulSets have to automatically evict on a drain. Otherwise, how are upgrades at scale going to be automated?

@foxish
Contributor

foxish commented Nov 2, 2016

@chrislovecnm The plan is to evict them using the eviction subresource. If there is one petset pod per node, then there is no issue when doing node drains. If one has a specific requirement where only N petset pods can be down at any given time, the right way for now would be to create a PodDisruptionBudget to reflect that. The eviction subresource respects the PDB. We plan on updating the documentation so that folks building production applications can create an explicit PDB for now.
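
For example, a minimal PodDisruptionBudget along those lines (the names and counts are hypothetical; with 3 replicas and minAvailable: 2, the eviction API will let at most one of these pods be down at a time):

```sh
kubectl create -f - <<'EOF'
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # allow at most one of the 3 "web" pods to be disrupted
  selector:
    matchLabels:
      app: web           # must match the labels on the statefulset's pods
EOF
```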

@foxish
Contributor

foxish commented Nov 2, 2016

s/petset/statefulset/g

@chrislovecnm
Contributor

SGTM - the feature formerly known as petset

@chrislovecnm
Contributor

@foxish this is specific to 1.5 though?

@foxish
Contributor

foxish commented Nov 2, 2016

Yes, this is specific to beta. We don't expect that the PDB/eviction mechanism for carrying out node drains will change even in GA, but we may have defaults, or some other way of specifying them, which will likely be proposed/discussed after 1.5.

k8s-github-robot pushed a commit that referenced this issue Nov 8, 2016
Automatic merge from submit-queue

Fix kubectl drain for statefulset

Support deleting pets for `kubectl drain`. 
Use evict to delete pods.

Fixes: #33727

```release-note
Adds support for StatefulSets in kubectl drain.
Switches to use the eviction sub-resource instead of deletion in kubectl drain, if server supports.
```

@foxish @caesarxuchao
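
For anyone following along, the eviction sub-resource is just a POST of an Eviction object against the pod; the API server refuses it (HTTP 429) when the eviction would violate a PodDisruptionBudget. A minimal sketch via kubectl proxy, with a hypothetical namespace and pod name:

```sh
# Proxy the API server to localhost in the background.
kubectl proxy --port=8001 &

# Ask the server to evict web-0. This is what kubectl drain now does instead
# of a plain delete, so PodDisruptionBudgets are respected.
curl -s -H "Content-Type: application/json" -X POST \
  http://localhost:8001/api/v1/namespaces/default/pods/web-0/eviction \
  -d '{
    "apiVersion": "policy/v1beta1",
    "kind": "Eviction",
    "metadata": {"name": "web-0", "namespace": "default"}
  }'
```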