
kubectl: wait for all errors and successes on podEviction #64896

Merged
merged 1 commit into kubernetes:master from rphillips/fixes/kubectl_eviction on Jul 4, 2018

Conversation

@rphillips
Member

rphillips commented Jun 7, 2018

What this PR does / why we need it: This fixes kubectl drain to wait until all errors and successes are processed, instead of returning on the first error. It also tweaks the cleanup behavior to check whether a pod is already terminating and, if so, to not reissue the delete, which previously caused an error to be thrown. This fix allows kubectl drain to complete successfully when a node is draining.
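A minimal sketch of the "already terminating" check described above, written against the typed client-go Pod object; the isTerminating helper is illustrative and not the actual kubectl source:

```go
package drainsketch

import corev1 "k8s.io/api/core/v1"

// isTerminating reports whether the API server has already started deleting
// the pod (its deletion timestamp is set). In that case the cleanup can skip
// issuing another delete, which is what previously surfaced as an error.
func isTerminating(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}
```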

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:
/cc @sjenning

Release note:

NONE

Reproduction steps

sleep.yml

apiVersion: v1
kind: Pod
metadata:
  name: bash
  namespace: testing # the pod must live in the namespace deleted below for the repro
spec:
  containers:
  - name: bash
    image: bash
    resources:
      limits:
        cpu: 500m
        memory: 500Mi
    command:
    - bash
    - -c
    - "nothing() { sleep 1; } ; trap nothing 15 ; while true; do echo \"hello\"; sleep 10; done"
  terminationGracePeriodSeconds: 3000
  restartPolicy: Never
$ kubectl create ns testing
$ kubectl create -f sleep.yml
$ kubectl delete ns testing
$ kubectl drain 127.0.0.1 --force
@rphillips
Member

rphillips commented Jun 8, 2018

/retest

@rphillips
Member

rphillips commented Jun 12, 2018

/assign @apelisse

@@ -613,16 +620,21 @@ func (o *DrainOptions) evictPods(pods []corev1.Pod, policyGroupVersion string, g
	for {
		select {
		case err := <-errCh:
			return err

@frobware

frobware Jun 14, 2018
Contributor

Do we not now lose the underlying reason? Looking below I see we just do "Drain did not complete...".

@rphillips

rphillips Jun 14, 2018
Member

Since there can be N different errors, they get glogged and counted. A summary error is returned on line 635. Do you think we should capture all the error messages?

This comment has been minimized.

@liggitt

liggitt Jun 26, 2018

Member

outputting via glog like this rather than actually returning errors means DrainOptions can't easily be used programmatically or composed into larger commands

@liggitt

liggitt Jun 26, 2018

Member

outputting via glog like this rather than actually returning errors means DrainOptions can't easily be used programmatically or composed into larger commands

@liggitt

liggitt Jun 26, 2018
Member

print to o.ErrOut if you're going to print here, and I would probably accumulate and return the actual errors

@sjenning
Contributor

sjenning commented Jun 26, 2018

/cc @kubernetes/sig-cli-maintainers

This fix will allow kubectl drain to complete successfully when a node is draining.

The particular issue we hit is if the node has a pod in a terminating namespace, the drain will immediately fail with a "can't modify resource in a terminating namespace" error and fail to remove the remaining pods.

With this PR, the drain still fails, but not before trying to remove every pod on the node. This is a better situation in that all pods will be in a terminating state after the first drain.
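To make the behavior change concrete, here is a simplified fan-out/fan-in sketch of the pattern being discussed; evictOne is a hypothetical stand-in for the per-pod eviction call, and this is not the actual DrainOptions implementation:

```go
package drainsketch

import (
	corev1 "k8s.io/api/core/v1"
	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// evictAll starts one goroutine per pod and waits for every result, so a
// single failure (for example, a pod in a terminating namespace) no longer
// aborts the eviction of the remaining pods.
func evictAll(pods []corev1.Pod, evictOne func(corev1.Pod) error) error {
	errCh := make(chan error, len(pods))
	for _, pod := range pods {
		go func(p corev1.Pod) { errCh <- evictOne(p) }(pod)
	}
	var errs []error
	for range pods { // one result per pod, success or failure
		if err := <-errCh; err != nil {
			errs = append(errs, err)
		}
	}
	return utilerrors.NewAggregate(errs) // nil when no evictions failed
}
```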

@sjenning
Contributor

sjenning commented Jun 26, 2018

using wider cc that I just found out about
/cc @kubernetes/sig-cli-pr-reviews

@k8s-ci-robot k8s-ci-robot added sig/cli size/M and removed size/S labels Jun 26, 2018

@rphillips
Member

rphillips commented Jun 26, 2018

@liggitt I refactored this PR to collect all the errors and make it more reusable.

@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Jun 26, 2018

// derived from https://github.com/golang/appengine/blob/master/errors.go
// MultiError is returned by batch operations.
type MultiError []error

@adohe

adohe Jun 27, 2018
Member

we already have AggregateError, what's the difference? seems no need to define multi error here.

@rphillips

rphillips Jun 27, 2018
Member

I didn't know about Aggregate. I updated the PR to use the Aggregate API. Thanks!
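For reference, a small self-contained example of the Aggregate API from k8s.io/apimachinery/pkg/util/errors that the PR was updated to use; the error messages are made up for illustration:

```go
package main

import (
	"errors"
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

func main() {
	errs := []error{
		errors.New("pod-a: cannot evict"),
		nil, // nil entries are filtered out
		errors.New("pod-b: eviction timed out"),
	}

	// NewAggregate folds the collected errors into a single error value.
	agg := utilerrors.NewAggregate(errs)
	fmt.Println(agg)

	// With no non-nil errors, NewAggregate returns nil, so the result can be
	// returned directly as a function's error value.
	fmt.Println(utilerrors.NewAggregate(nil) == nil) // true
}
```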

@k8s-ci-robot k8s-ci-robot added size/M and removed size/L labels Jun 27, 2018

@rphillips
Member

rphillips commented Jun 27, 2018

/test pull-kubernetes-integration

@rphillips
Member

rphillips commented Jun 27, 2018

/test pull-kubernetes-e2e-gce

@rphillips
Member

rphillips commented Jun 29, 2018

@sjenning @liggitt ready for review

@sjenning
Contributor

sjenning commented Jul 3, 2018

thanks!
/lgtm

@@ -571,39 +572,41 @@ func (o *DrainOptions) deleteOrEvictPods(pods []corev1.Pod) error {
}
}
// evictPods return

@apelisse

apelisse Jul 3, 2018
Member

? :-)

}
case <-globalTimeoutCh:
return fmt.Errorf("Drain did not complete within %v", globalTimeout)
return utilerrors.NewAggregate(errors)

@apelisse

apelisse Jul 3, 2018
Member

You're losing the timeout error information here, which means if it times out before the first error is reported, that's going to become a success. Maybe consider just returning timeout error in that case (as it was done before)?

@rphillips

rphillips Jul 3, 2018
Member

good catch on both. fixed.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jul 3, 2018

}
case <-globalTimeoutCh:
return fmt.Errorf("Drain did not complete within %v", globalTimeout)
}
if doneCount == numPods {

@apelisse

apelisse Jul 3, 2018
Member

This condition could have been part of the "for" statement: for doneCount < numPods { ... }.
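Putting both review comments together, the loop reads roughly as below; this sketch reuses the names visible in the diff (errCh, numPods, globalTimeout) but simplifies everything around them and is not the merged code verbatim:

```go
package drainsketch

import (
	"fmt"
	"time"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// waitForEvictions collects one result per pod; the completion check lives in
// the for condition, and the global timeout is surfaced as an error rather
// than being dropped.
func waitForEvictions(errCh <-chan error, numPods int, globalTimeout time.Duration) error {
	globalTimeoutCh := time.After(globalTimeout)
	var errs []error
	for doneCount := 0; doneCount < numPods; {
		select {
		case err := <-errCh:
			doneCount++
			if err != nil {
				errs = append(errs, err)
			}
		case <-globalTimeoutCh:
			return fmt.Errorf("drain did not complete within %v", globalTimeout)
		}
	}
	return utilerrors.NewAggregate(errs)
}
```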

@apelisse
Member

apelisse commented Jul 3, 2018

Thanks for fixing quickly! Feel free to fix the "while" loop (now or later).
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jul 3, 2018

@rphillips
Member

rphillips commented Jul 3, 2018

@apelisse I fixed the for loop, this will need one more approval. Thank you!

@apelisse
Member

apelisse commented Jul 3, 2018

It'd be great if I could leave all my comments once and for all :-). How hard would it be to add tests?

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jul 3, 2018

@k8s-ci-robot
Contributor

k8s-ci-robot commented Jul 3, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: apelisse, rphillips, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot
Contributor

k8s-merge-robot commented Jul 4, 2018

Automatic merge from submit-queue (batch tested with PRs 65776, 64896). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit e3fa913 into kubernetes:master Jul 4, 2018

17 checks passed

Submit Queue: Queued to run github e2e tests a second time.
cla/linuxfoundation: rphillips authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-local-e2e-containerized: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.

yue9944882 added a commit to yue9944882/kubernetes that referenced this pull request Jul 6, 2018

Merge pull request kubernetes#64896 from rphillips/fixes/kubectl_eviction