API-initiated eviction: handle deleteOptions correctly #116554

Merged

Conversation

atiratree
Member

@atiratree atiratree commented Mar 13, 2023

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

When adding a DisruptionTarget condition to a pod that will be deleted by an eviction:

  • handle ResourceVersion and Preconditions correctly
  • handle DryRun option correctly

Which issue(s) this PR fixes:

Fixes #116552

Special notes for your reviewer:

more details in the issue

Does this PR introduce a user-facing change?

Fixed two regressions introduced by the PodDisruptionConditions feature (on by default in 1.26):
* pod eviction API calls returned spurious precondition errors and required a second evict API call to succeed
* dry-run eviction API calls persisted a DisruptionTarget condition into the pod being evicted
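
The two fixes can be illustrated with a minimal, self-contained sketch. This is not the code in pkg/registry/core/pod/storage/eviction.go: `Pod`, `DeleteOptions`, `Store`, and `Evict` are hypothetical stand-ins for a toy in-memory model. It assumes that writing the condition bumps the pod's resource version, so the delete precondition has to be moved to the new version on a copy of the options, and that a dry run must skip the write entirely.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Pod is a toy stand-in for a stored pod (hypothetical, not the Kubernetes type).
type Pod struct {
	ResourceVersion string
	Conditions      []string
}

// DeleteOptions models only the two options discussed in this PR.
type DeleteOptions struct {
	DryRun                      bool
	PreconditionResourceVersion *string // nil means "no precondition"
}

var errConflict = errors.New("precondition failed: resourceVersion mismatch")

// Store is a one-pod in-memory store whose resource version grows on every write.
type Store struct {
	pod     *Pod
	version int
}

func NewStore() *Store {
	return &Store{pod: &Pod{ResourceVersion: "1"}, version: 1}
}

func (s *Store) checkPrecondition(rv *string) error {
	if rv != nil && *rv != s.pod.ResourceVersion {
		return errConflict
	}
	return nil
}

// AddCondition persists a condition and returns the new resource version.
func (s *Store) AddCondition(cond string, expectRV *string) (string, error) {
	if err := s.checkPrecondition(expectRV); err != nil {
		return "", err
	}
	s.version++
	s.pod.Conditions = append(s.pod.Conditions, cond)
	s.pod.ResourceVersion = strconv.Itoa(s.version)
	return s.pod.ResourceVersion, nil
}

// Delete validates the precondition and removes the pod unless this is a dry run.
func (s *Store) Delete(opts DeleteOptions) error {
	if s.pod == nil {
		return errors.New("pod not found")
	}
	if err := s.checkPrecondition(opts.PreconditionResourceVersion); err != nil {
		return err
	}
	if !opts.DryRun {
		s.pod = nil
	}
	return nil
}

// Evict adds the condition and then deletes the pod.
func Evict(s *Store, opts DeleteOptions) error {
	optsCopy := opts // work on a copy so the caller's options are never changed
	if !opts.DryRun {
		// A dry run must not persist anything, so the write happens only here.
		newRV, err := s.AddCondition("DisruptionTarget", opts.PreconditionResourceVersion)
		if err != nil {
			return err
		}
		if opts.PreconditionResourceVersion != nil {
			// Our own update bumped the version; without this step the delete
			// below would fail with a spurious conflict.
			optsCopy.PreconditionResourceVersion = &newRV
		}
	}
	return s.Delete(optsCopy)
}

func main() {
	rv := "1"
	s := NewStore()
	fmt.Println("evict error:", Evict(s, DeleteOptions{PreconditionResourceVersion: &rv}))

	d := NewStore()
	err := Evict(d, DeleteOptions{DryRun: true})
	fmt.Println("dry-run error:", err, "conditions persisted:", len(d.pod.Conditions))
}
```

In this model, dropping the precondition bump makes the first eviction fail with a conflict even though the caller supplied the correct resource version, which matches the "second call succeeds" symptom described above.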

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/kubernetes/enhancements/issues/3329

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 13, 2023
@atiratree
Member Author

/triage accepted
/priority important-soon

This is a regression, since kubectl drain now needs more eviction requests to drain a node. It could also affect other tools that perform evictions.

@liggitt @soltysh can you please review?

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 13, 2023
@atiratree
Member Author

cc @ravisantoshgudimetla

@liggitt
Member

liggitt commented Mar 13, 2023

This is a regression, since kubectl drain now needs more eviction requests to drain a node. It could also affect other tools that perform evictions.

what PR regressed this?

@atiratree
Member Author

It was added as alpha in #110959, then promoted to beta and enabled by default in #113360.

@atiratree atiratree force-pushed the eviction-resource-version-fix branch from a7e51d6 to 4ca6333 Compare March 13, 2023 23:37
 }
-_, _, err := r.store.Delete(ctx, name, rest.ValidateAllObjectFunc, options)
+_, _, err = r.store.Delete(ctx, name, rest.ValidateAllObjectFunc, deleteOptionsCopy)
Member Author

One thing that could be problematic is if somebody updates the pod after our update but before we try to delete it. In that case we should roll back the condition. Do we care about such a race? Should I implement the rollback? @liggitt

Member

I don't know... I thought we had a controller reconciling bogus conditions

Contributor

We have a disruption controller which cleans up stale DisruptionTarget conditions, see:

func (dc *DisruptionController) syncStalePodDisruption(ctx context.Context, key string) error {

By stale we mean conditions that are 2 minutes old with no deletion in the meantime.

Member Author

Cool. Since this is an unlikely race, I think it should be enough to leave it to the controller.

@mimowo
Contributor

mimowo commented Mar 14, 2023

@atiratree thank you for working on the fix. I think we should cherry-pick it to 1.26.

@atiratree atiratree force-pushed the eviction-resource-version-fix branch from 4ca6333 to 2e5222c Compare March 14, 2023 12:15
@atiratree
Member Author

+1 for the cherry-pick

@bobbypage
Member

bobbypage commented Mar 17, 2023

Thanks for the link. Is this a kubelet issue with kubemark?

It seems like the test that failed is cmd/kubemark/app TestHollowNode/kubelet (3.00s) and possibly related to kubemark, which is a new test added as part of #116645

@bobbypage
Member

Filed #116696 for the race

@liggitt
Member

liggitt commented Mar 17, 2023

kubemark

I think it's kubemark or the kubemark unit test

}

eviction := newV1Eviction(ns.Name, updatedPod.Name, deleteOption)
err = clientSet.CoreV1().RESTClient().Post().
Member

can use noRetriesRESTClient here as well and call the normal clientset Evict method

Member Author

updated, thanks!

@atiratree atiratree force-pushed the eviction-resource-version-fix branch from fbdd4e3 to bda1607 Compare March 17, 2023 14:37
Member

@alculquicondor alculquicondor left a comment

not an apiserver expert, but:
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 17, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: ccfc0a2e2ad90f7a66fcea47e11ddaa2ef8a40c7

@alculquicondor
Member

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Mar 17, 2023
when adding a DisruptionTarget condition into a pod that will be deleted

- handle ResourceVersion and Preconditions correctly
- handle DryRun option correctly

Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
@atiratree atiratree force-pushed the eviction-resource-version-fix branch from bda1607 to 51c0e23 Compare March 17, 2023 21:28
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 17, 2023
@liggitt
Member

liggitt commented Mar 17, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 17, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0a0e050fc81b12e21b5c9fe7efeb0f55dc36e63e

@liggitt liggitt moved this from Changes requested to API review completed, 1.27 in API Reviews Mar 17, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: atiratree, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2023
@liggitt
Member

liggitt commented Mar 17, 2023

once this passes tests, go ahead and open a cherry-pick to release-1.26

@atiratree
Member Author

flaky test Test_Run_OneVolumeDetachFailNodeWithReadWriteOnce
/test pull-kubernetes-unit

@k8s-ci-robot k8s-ci-robot merged commit fe91bc2 into kubernetes:master Mar 17, 2023
13 checks passed
@atiratree
Member Author

opened a cherry-pick #116750

k8s-ci-robot added a commit that referenced this pull request Mar 30, 2023
…16554-upstream-release-1.26

Automated cherry pick of #116554: API-initiated eviction: handle deleteOptions correctly
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: API review completed, 1.27
Archived in project
Development

Successfully merging this pull request may close these issues.

API-initiated eviction does not work on first try in certain scenarios, but does work on second try
6 participants