Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent uptimeFunc from being called everytime CheckHealth is called #609

Merged
merged 1 commit into from
Aug 21, 2021

Conversation

mcshooter
Copy link
Contributor

@mcshooter mcshooter commented Aug 13, 2021

There is currently an issue for Windows nodes where NPD is causing the CPU to hit and persist at 100%. After investigating, it looks like there was a change that moved uptimeFunc to be called every time we checked the health through health checker. We shouldn't be calling uptimeFunc unless we know it has been determined that one of the services is unhealthy and there are no patterns to be checking. For Windows, because the query to get and calculate the uptime is pretty heavy, having it be called every 10s or so for each of the separate services would unnecessarily increase usage. This change will revert part of how we CheckHealth and move the loopBackTime logic inside of the logPatternHealthCheck function to only be done if there are log patterns that need to be checked.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 13, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @mcshooter. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 13, 2021
@mcshooter
Copy link
Contributor Author

/hold testing in progress

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 13, 2021
@mcshooter
Copy link
Contributor Author

/cc @smileusd

@k8s-ci-robot
Copy link
Contributor

@mcshooter: GitHub didn't allow me to request PR reviews from the following users: smileusd.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @smileusd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mcshooter
Copy link
Contributor Author

/cc @pjh

@k8s-ci-robot k8s-ci-robot requested a review from pjh August 13, 2021 17:17
@mcshooter
Copy link
Contributor Author

/cc @ibabou

@mcshooter
Copy link
Contributor Author

cc/ @Random-Liu

@jeremyje
Copy link
Contributor

/sig node
/sig windows
/priority important-soon
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/windows Categorizes an issue or PR as relevant to SIG Windows. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 13, 2021
pkg/healthchecker/health_checker.go Outdated Show resolved Hide resolved
pkg/healthchecker/health_checker.go Show resolved Hide resolved
Copy link

@pjh pjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed but don't have much to offer. At a quick glance it's challenging to read the code here (existing and modified) since I'm not sure what loopBackTime means.

But +1 to the goal of not calling the uptimeFunc as often to reduce CPU utilization. We'll still need to work on further changes to avoid starting numerous Powershell sessions on Windows.

@mcshooter
Copy link
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2021
@mcshooter
Copy link
Contributor Author

Was able to test locally on a kubernetes node
Screen Shot 2021-08-16 at 6 25 53 PM

@Random-Liu
Copy link
Member

@liyanhui1228

@Random-Liu
Copy link
Member

Random-Liu commented Aug 19, 2021

In this case, do we still intend to use log pattern or repair on windows?

If not, can we just declare it as not properly supported, and return error if users try to use log pattern or repair on windows?

In that way, we can no-op those functions for windows instead.

The current code makes me feel like we will never use it, because it has serious performance issue. In that case, I prefer we disable the feature and cleanup the corresponding code, instead of moving the code around to make the code never run just for us. Because others may still hit the issue if they try to use it.

@mcshooter
Copy link
Contributor Author

In this case, do we still intend to use log pattern or repair on windows?

If not, can we just declare it as not properly supported, and return error if users try to use log pattern or repair on windows?

In that way, we can no-op those functions for windows instead.

The current code makes me feel like we will never use it, because it has serious performance issue. In that case, I prefer we disable the feature and cleanup the corresponding code, instead of moving the code around to make the code never run just for us. Because others may still hit the issue if they try to use it.

We still do use the repair functionality. So it doesn't make sense to remove this as a windows functionality. The code will still run to repair and call uptime but only in the case when it expects it to.

@Random-Liu
Copy link
Member

Offline discussed, there is no good way to get the last start time when the service is down.

So the current fix makes sense to me

@Random-Liu
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 21, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ibabou, mcshooter, Random-Liu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2021
@k8s-ci-robot k8s-ci-robot merged commit 7a33650 into kubernetes:master Aug 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants