kubelet: cgroups: be verbose about validation #108568

Merged

Conversation

stevekuznetsov
Contributor

@stevekuznetsov stevekuznetsov commented Mar 7, 2022

Previously, callers of Exists() could not tell why a cgroup did or did
not exist. At one call site in particular, the kubelet fails to start
entirely if cgroup validation does not succeed. In these cases we MUST
explain what went wrong and pass that information clearly to the
caller. Previously, some but not all of the reasons for a failed
validation were logged, and only at a low log level (high verbosity).
This led to poor UX.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>

/kind bug

NONE

/cc @sjenning @derekwaynecarr @smarterclayton

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Mar 7, 2022
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 7, 2022
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 7, 2022
@stevekuznetsov
Contributor Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. area/kubelet and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 7, 2022
@smarterclayton
Contributor

@rphillips @mrunalp this LGTM, but I'd like someone more familiar with the cgroup manager to take a look (either of you two, or others)

- if !cgroupManager.Exists(cgroupRoot) {
- 	return nil, fmt.Errorf("invalid configuration: cgroup-root %q doesn't exist", cgroupRoot)
+ if err := cgroupManager.Validate(cgroupRoot); err != nil {
+ 	return nil, fmt.Errorf("invalid configuration: %w", err)
Contributor Author

NB: this error is terminal to the kubelet process, so not explaining what made the configuration invalid is poor UX

@stevekuznetsov stevekuznetsov force-pushed the skuznets/verbose-error branch 2 times, most recently from 7cb3880 to 0e475f3 Compare March 7, 2022 16:34
@rphillips
Member

/assign @kolyshkin

@rphillips
Member

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 7, 2022
Contributor

@kolyshkin kolyshkin left a comment

Is it possible to not change the API? (I don't like having two methods with different names that do the same thing, where the only difference is the return value.) Say, raise the log level in Exists to a warning, and add logging in the places where it is missing?

If not, maybe it's better to change all instances of `if cm.Exists()` to `if cm.Validate() == nil`.

@kolyshkin
Contributor

Also, ideally I'd like this to go on top of #107149 since it incorporates some non-trivial changes.

@stevekuznetsov
Contributor Author

stevekuznetsov commented Mar 8, 2022

@kolyshkin no. I very strongly disagree with you. Logging critical errors that cause the process to exit only at a high verbosity level, and then asking the user to re-run their setup in order to see the error, should not be how we go about this. Also, if you notice, there were places where the logging did not expose the reason for Exists() not being true. If the method signature does not require that the error (or similar) be reported, this type of bug will no doubt happen again in the future. Having the compiler ensure that the method returns some reason is valuable. I am happy to refactor all uses of this method if you would like to see only one exposed.

@stevekuznetsov
Contributor Author

@kolyshkin also, while I understand the request re: the other PR, it looks very large and very old, and it's not clear on what timeline it would merge.

@stevekuznetsov stevekuznetsov force-pushed the skuznets/verbose-error branch 2 times, most recently from d02257c to 1296562 Compare March 9, 2022 14:56
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Mar 9, 2022
@stevekuznetsov
Contributor Author

@rphillips @kolyshkin updated to entirely remove the Exists() calls

@rphillips
Member

rphillips commented Mar 9, 2022

This is changing and removing an established API. Exists() should probably stay and Validate be added. Exists() can simply call Validate().

The internal code can call Validate() to propagate the errors.

@pacoxu pacoxu moved this from Triage to Needs Reviewer in SIG Node PR Triage Mar 10, 2022
@stevekuznetsov
Contributor Author

Gotcha. @rphillips that was the original factoring. Let me revert to that one...

Previously, callers of `Exists()` could not tell why a cgroup did or did
not exist. At one call site in particular, the `kubelet` fails to start
entirely if cgroup validation does not succeed. In these cases we MUST
explain what went wrong and pass that information clearly to the
caller. Previously, some but not all of the reasons for a failed
validation were logged, and only at a low log level (high verbosity).
This led to poor UX.

The original method was retained on the interface so as to make this
diff small.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@stevekuznetsov
Contributor Author

@rphillips reverted to the original state

@rphillips
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2022
@stevekuznetsov
Contributor Author

/assign @derekwaynecarr
for approval :)

@smarterclayton
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, stevekuznetsov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2022
@k8s-ci-robot k8s-ci-robot merged commit c227403 into kubernetes:master Mar 10, 2022
SIG Node PR Triage automation moved this from Needs Reviewer to Done Mar 10, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Mar 10, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

6 participants