
kubectl: wait for all errors and successes on podEviction #64896

Merged

Conversation

rphillips (Member)

What this PR does / why we need it: This fixes `kubectl drain` to wait until all errors and successes are processed, instead of returning on the first error. It also tweaks the cleanup behavior to check whether a pod is already terminating and, if so, to skip reissuing the delete, which previously caused an error to be thrown. Together these changes allow `kubectl drain` to complete successfully while a node is draining.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:
/cc @sjenning

Release note:

NONE

Reproduction steps

sleep.yml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bash
  namespace: testing  # the pod must live in the namespace deleted below for the repro to trigger
spec:
  containers:
  - name: bash
    image: bash
    resources:
      limits:
        cpu: 500m
        memory: 500Mi
    command:
    - bash
    - -c
    - "nothing() { sleep 1; } ; trap nothing 15 ; while true; do echo \"hello\"; sleep 10; done"
  terminationGracePeriodSeconds: 3000
  restartPolicy: Never
```

```
$ kubectl create ns testing
$ kubectl create -f sleep.yml
$ kubectl delete ns testing
$ kubectl drain 127.0.0.1 --force
```

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Jun 7, 2018
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 7, 2018
@rphillips (Member Author)

/retest

@rphillips (Member Author)

/assign @apelisse

```
@@ -613,16 +620,21 @@ func (o *DrainOptions) evictPods(pods []corev1.Pod, policyGroupVersion string, g
for {
select {
case err := <-errCh:
return err
```
Contributor:

Do we not now lose the underlying reason? Looking below I see we just do "Drain did not complete...".

Member Author:

Since there can be N different errors, they get glogged and counted. A summary error is returned on line 635. Do you think we should capture all the error messages?

Member:

outputting via glog like this rather than actually returning errors means DrainOptions can't easily be used programmatically or composed into larger commands

Member:

print to o.ErrOut if you're going to print here, and I would probably accumulate and return the actual errors
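The suggestion to accumulate and return the actual errors can be sketched roughly as follows. This is a simplified, self-contained stand-in for the PR's eviction loop, not its actual code: `evictPod` and the pod names are hypothetical, and the standard library's `errors.Join` (Go 1.20+) substitutes for apimachinery's `utilerrors.NewAggregate`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// evictPod is a hypothetical stand-in for the real per-pod eviction call.
func evictPod(name string) error {
	if strings.HasPrefix(name, "bad-") {
		return fmt.Errorf("cannot evict %s: namespace is terminating", name)
	}
	return nil
}

// evictAll waits for every pod's result, success or failure, and returns
// all failures together instead of bailing on the first one.
func evictAll(pods []string) error {
	errCh := make(chan error, len(pods))
	for _, p := range pods {
		go func(name string) { errCh <- evictPod(name) }(p)
	}
	var errs []error
	for range pods { // collect one result per pod
		if err := <-errCh; err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // stdlib stand-in for utilerrors.NewAggregate
}

func main() {
	fmt.Println(evictAll([]string{"ok-1", "bad-2", "bad-3"}))
}
```

Returning a real error value (rather than logging) keeps the function usable programmatically, which was the point of the review comment.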

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2018
@sjenning (Contributor)

sjenning commented Jun 26, 2018

/cc @kubernetes/sig-cli-maintainers

> This fix will allow `kubectl drain` to complete successfully when a node is draining.

The particular issue we hit is if the node has a pod in a terminating namespace, the drain will immediately fail with a "can't modify resource in a terminating namespace" error and fail to remove the remaining pods.

With this PR, the drain still fails, but not before trying to remove every pod on the node. This is a better situation in that all pods will be in a terminating state after the first drain.

@sjenning (Contributor)

using wider cc that I just found out about
/cc @kubernetes/sig-cli-pr-reviews

@k8s-ci-robot k8s-ci-robot added the sig/cli Categorizes an issue or PR as relevant to SIG CLI. label Jun 26, 2018
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 26, 2018
@rphillips (Member Author)

@liggitt I refactored this PR to collect all the errors and make it more reusable.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 26, 2018
@rphillips rphillips force-pushed the fixes/kubectl_eviction branch 2 times, most recently from 823bfb6 to a0efb63, on June 26, 2018 at 22:32
```go
// derived from https://github.com/golang/appengine/blob/master/errors.go

// MultiError is returned by batch operations.
type MultiError []error
```

Member:

we already have AggregateError, what's the difference? seems no need to define multi error here.

Member Author:

I didn't know about Aggregate. I updated the PR to use the Aggregate API. Thanks!

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 27, 2018
@rphillips (Member Author)

/test pull-kubernetes-integration

@rphillips (Member Author)

/test pull-kubernetes-e2e-gce

@rphillips (Member Author)

@sjenning @liggitt ready for review

@rphillips rphillips force-pushed the fixes/kubectl_eviction branch 2 times, most recently from 45495c9 to 9421527, on July 3, 2018 at 17:57
@sjenning (Contributor)

sjenning commented Jul 3, 2018

thanks!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 3, 2018
```
@@ -571,39 +572,41 @@ func (o *DrainOptions) deleteOrEvictPods(pods []corev1.Pod) error {
}
}

// evictPods return
```
Member:

? :-)

```
}
case <-globalTimeoutCh:
return fmt.Errorf("Drain did not complete within %v", globalTimeout)
return utilerrors.NewAggregate(errors)
```
Member:

You're losing the timeout error information here, which means if it times out before the first error is reported, that's going to become a success. Maybe consider just returning timeout error in that case (as it was done before)?

Member Author:

good catch on both. fixed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 3, 2018
```
}
case <-globalTimeoutCh:
return fmt.Errorf("Drain did not complete within %v", globalTimeout)
}
if doneCount == numPods {
```
Member:

This condition could have been part of the "for" statement: for doneCount < numPods { ... }.
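Both review points, preserving the timeout error and folding the done check into the `for` condition, can be sketched together. This is a simplified stand-in for the PR's `evictPods` wait loop, not its actual code; the channel types and `errors.Join` (Go 1.20+, standing in for `utilerrors.NewAggregate`) are my own simplifications.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForEvictions loops until every pod has reported a result, with the
// done check in the for condition. On global timeout it returns a timeout
// error together with anything collected so far, so a timeout with zero
// prior errors is never mistaken for success.
func waitForEvictions(errCh <-chan error, numPods int, globalTimeoutCh <-chan struct{}) error {
	var errs []error
	for doneCount := 0; doneCount < numPods; {
		select {
		case err := <-errCh:
			doneCount++
			if err != nil {
				errs = append(errs, err)
			}
		case <-globalTimeoutCh:
			errs = append(errs, errors.New("drain did not complete within the global timeout"))
			return errors.Join(errs...)
		}
	}
	return errors.Join(errs...)
}

func main() {
	errCh := make(chan error, 2)
	errCh <- nil
	errCh <- errors.New(`evicting pod "bash": namespace is terminating`)
	timeoutCh := make(chan struct{})
	go func() { time.Sleep(5 * time.Second); close(timeoutCh) }()
	fmt.Println(waitForEvictions(errCh, 2, timeoutCh))
}
```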

@apelisse (Member)

apelisse commented Jul 3, 2018

Thanks for fixing quickly! Feel free to fix the "while" loop (now or later).
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 3, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 3, 2018
@rphillips (Member Author)

@apelisse I fixed the for loop, this will need one more approval. Thank you!

@apelisse (Member)

apelisse commented Jul 3, 2018

It'd be great if I could leave all my comments once and for all :-). How hard would it be to add tests?

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 3, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: apelisse, rphillips, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot (Contributor)

Automatic merge from submit-queue (batch tested with PRs 65776, 64896). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit e3fa913 into kubernetes:master Jul 4, 2018
yue9944882 pushed a commit to yue9944882/kubernetes that referenced this pull request Jul 6, 2018
kubectl: wait for all errors and successes on podEviction