fix(apf): not reset apf when panic #111850

Merged
merged 2 commits into from
Aug 24, 2022

Conversation

leileiwan
Contributor

@leileiwan leileiwan commented Aug 15, 2022

issue: #111852

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Aug 15, 2022
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.25 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.25.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Mon Aug 15 01:36:41 UTC 2022.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 15, 2022
@k8s-ci-robot
Contributor

Hi @leileiwan. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 15, 2022
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 15, 2022
@fedebongio
Contributor

/assign @MikeSpreitzer @tkashem
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 16, 2022
@@ -172,7 +172,7 @@ func (cfgCtlr *configController) Handle(ctx context.Context, requestDigest Reque
 	defer func() {
 		klog.V(7).Infof("Handle(%#+v) => fsName=%q, distMethod=%#+v, plName=%q, isExempt=%v, queued=%v, Finish() => panicking=%v idle=%v",
 			requestDigest, fs.Name, fs.Spec.DistinguisherMethod, pl.Name, isExempt, queued, panicking, idle)
-		if idle {
+		if idle && !panicking {
Member

We explicitly considered and rejected this behavior, see #97206 (comment) .

#111852 reports multiple mysteries, the problem likely lies in one or more of those.
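
For readers following the hunk above, here is a minimal sketch of why the deferred check can see `idle=true` after a panic. It is not the apiserver code: `handle`, `outcome`, and the `finish` callback are hypothetical names, and the initial values of the two flags are an assumption inferred from the log line in the diff. The `recover()` call exists only so the demo can print what the deferred function observed; the real `Handle` lets the panic propagate.

```go
package main

import "fmt"

// outcome records what the deferred function observed.
type outcome struct {
	idle      bool
	panicking bool
}

// handle is a hypothetical stand-in for a Handle-like method. The finish
// callback plays the role of Finish: it either returns whether the priority
// level went idle, or panics.
func handle(finish func() bool) (seen outcome) {
	// Assumption: both flags start out true, so a panic leaves them true.
	idle, panicking := true, true
	defer func() {
		seen = outcome{idle: idle, panicking: panicking}
		// Demo only: swallow the panic so main can keep running.
		recover()
	}()
	idle = finish() // never assigned if finish panics
	panicking = false
	return
}

func main() {
	normal := handle(func() bool { return false })
	crashed := handle(func() bool { panic("request handler failed") })
	fmt.Printf("normal:  %+v\n", normal)
	fmt.Printf("crashed: %+v\n", crashed)
}
```

On the normal path the deferred function sees `idle=false, panicking=false`; when `finish` panics, both flags keep their initial `true` values, which is the only case in which the reaping branch would be taken under these assumptions.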

@MikeSpreitzer
Member

MikeSpreitzer commented Aug 17, 2022

OK, here's the real bug. maybeReap assumes that every exempt priority level is useless. That is wrong: processOldPLsLocked only retains an exempt priority level if there is reason to keep it. So maybeReap is making exactly the opposite of the correct decision, for exempt priority levels.

The problem is normally masked because Finish always returns idle=false; the problem is exposed when a panic rips past Finish returning.

The right fix is to make maybeReap log and immediately return when considering an exempt priority level.

This does not explain why a syncOnce invocation took 12 minutes; that is a separate mystery.
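
A rough sketch of the proposed fix, for illustration only. The types below (`controller`, `plState`, `queueSet`) are simplified, hypothetical stand-ins for the controller's state, and the "useless" test for limited priority levels is an assumption rather than a copy of the real logic. The part taken from the comment above is the guard: an exempt priority level, recognized by `queues == nil`, is logged and skipped.

```go
package main

import "fmt"

// queueSet is a hypothetical placeholder for a limited level's queues.
type queueSet struct{ idle bool }

// plState is a simplified stand-in for per-priority-level state.
type plState struct {
	queues     *queueSet // nil means the priority level is exempt
	quiescing  bool
	numPending int
}

type controller struct {
	states map[string]*plState
}

// maybeReap reports whether the named priority level would be reaped.
func (c *controller) maybeReap(plName string) bool {
	st := c.states[plName]
	if st == nil {
		fmt.Printf("plName=%s, plState==nil\n", plName)
		return false
	}
	if st.queues == nil {
		// Exempt levels are retained only when there is reason to keep
		// them, so this path must never reap them.
		fmt.Printf("plName=%s is exempt, not reaping\n", plName)
		return false
	}
	// Past the guard, st.queues is known to be non-nil, so no further
	// nil check is needed. The test below is an assumed simplification.
	return st.quiescing && st.numPending == 0 && st.queues.idle
}

func main() {
	c := &controller{states: map[string]*plState{
		"exempt":    {},
		"catch-all": {queues: &queueSet{idle: true}},
		"retired":   {queues: &queueSet{idle: true}, quiescing: true},
	}}
	for _, name := range []string{"exempt", "catch-all", "retired", "missing"} {
		fmt.Printf("%s -> reap=%v\n", name, c.maybeReap(name))
	}
}
```

With the guard in place, a spurious `idle=true` on the exempt level can no longer trigger reaping, regardless of how the caller arrived at that value.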

@linux-foundation-easycla

linux-foundation-easycla bot commented Aug 18, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: leileiwan / name: wanlei (22b0be9, ac8fe8970158601955a0f1036a439ed4ad3e650b)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 18, 2022
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 18, 2022
@leileiwan
Contributor Author

leileiwan commented Aug 18, 2022

> OK, here's the real bug. maybeReap assumes that every exempt priority level is useless. That is wrong: processOldPLsLocked only retains an exempt priority level if there is reason to keep it. So maybeReap is making exactly the opposite of the correct decision, for exempt priority levels.
>
> The problem is normally masked because Finish always returns idle=false; the problem is exposed when a panic rips past Finish returning.
>
> The right fix is to make maybeReap log and immediately return when considering an exempt priority level.
>
> This does not explain why a syncOnce invocation took 12 minutes; that is a separate mystery.

OK, fixing maybeReap is better.
As for the syncOnce that took 12 minutes, I guess one or more locks are misbehaving, because the QPS on the frozen master node is very low and the priority levels have no queues. Normally, when we change the priority level objects, the APF reset succeeds almost immediately, but it does not when many panics are happening.

kubectl  get prioritylevelconfigurations
NAME              TYPE      ASSUREDCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT   AGE
catch-all         Limited   5                          <none>   <none>     <none>             51d
exempt            Exempt    <none>                     <none>   <none>     <none>             51d
global-default    Limited   80                         <none>   <none>     <none>             50d
workload-high     Limited   40                         <none>   <none>     <none>             50d
....

@@ -869,6 +869,10 @@ func (cfgCtlr *configController) maybeReap(plName string) {
 		klog.V(7).Infof("plName=%s, plState==nil", plName)
 		return
 	}
+	if plState.queues == nil {
+		klog.V(7).Infof("plName=%s, plState.quiescing=%v, plState.numPending=%d, plState.queues==nil", plName, plState.quiescing, plState.numPending)
Member

This log message should explicitly say that the priority level is exempt. BTW, for an exempt priority level it is certain that: it is not quiescing and it has no queues; there is no value in printing those things in a message that also explicitly says "exempt". The numPending is not fixed, but also not relevant, so it would be fine to not print that either.

	if plState.queues == nil {
		klog.V(7).Infof("plName=%s, plState.quiescing=%v, plState.numPending=%d, plState.queues==nil", plName, plState.quiescing, plState.numPending)
		return
	}
	if plState.queues != nil {
Member

No need for an if statement here, the condition is certainly true.

@MikeSpreitzer
Member

I am having trouble understanding the words in #111850 (comment) about the syncOnce invocation that took 12 minutes. Certainly there are locks acquired during that time. They are all intended to be held for only a brief amount of time; it is hard to understand why there would be a noticeable wait to acquire any of them.

Except for the locks in sample-and-watermark histograms. Normally those are held for only a short amount of time, but if it has been a long wall-clock time since the last update, then the next update can hold the lock for a long time. This was fixed for release 1.19 by #94146. Of course, I do not actually know if this is the explanation for that 12 minute syncOnce.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 19, 2022
Member

@MikeSpreitzer MikeSpreitzer left a comment

/lgtm
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 22, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: leileiwan, MikeSpreitzer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 22, 2022
@dims
Member

dims commented Aug 23, 2022

/release-note-none
/kind bug

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Aug 23, 2022
@dims dims removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 23, 2022
@k8s-ci-robot k8s-ci-robot merged commit 4f0cf1b into kubernetes:master Aug 24, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Aug 24, 2022
@MikeSpreitzer
Member

@leileiwan : see https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md for how to propagate this back to supported earlier releases.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

6 participants