Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce pagination and filtering to improve ClusterProfiler memory usage #6511

Merged
merged 1 commit into from Oct 29, 2021

Conversation

knopt
Copy link
Contributor

@knopt knopt commented Oct 4, 2021

What this PR does / why we need it:

This MR improves (significantly reduces) memory usage of virt-api when executing ClusterProfiler dump request. Large scale cluster without this feature might experience virt-api OOMs when using ClusterProfiler.

Which issue(s) this PR fixes:
Fixes #6478

Special notes for your reviewer:

General idea behind this change is to use pagination to reduce number of pods processed at one time. Additionally, label selector is introduced for users convenience and further memory reduction.

Label selector can be used to for instance filter pods based on their component type. Additionally, field selector can be added, the question remains if it's necessary atm.

Release note:

Fixed virt-api significant memory usage when using Cluster Profiler with large KubeVirt deployments. (#6478, @knopt)

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. size/L needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 4, 2021
@kubevirt-bot
Copy link
Contributor

Hi @knopt. Thanks for your PR.

I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubevirt-bot kubevirt-bot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Oct 4, 2021
@knopt
Copy link
Contributor Author

knopt commented Oct 4, 2021

/assign @davidvossel
Could you please have a look?

@knopt knopt force-pushed the knopt-cluster-profiler-scale branch from b6185f3 to 6774d08 Compare October 4, 2021 15:56
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Oct 4, 2021
@knopt knopt force-pushed the knopt-cluster-profiler-scale branch 3 times, most recently from 581df90 to 75665fd Compare October 6, 2021 13:01
@davidvossel davidvossel self-requested a review October 7, 2021 14:41
Comment on lines 147 to 151
req := &v1.ClusterProfilerRequest{
PageSize: int64(pageSize),
LabelSelector: labelSelector,
}
counter := 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
req := &v1.ClusterProfilerRequest{
PageSize: int64(pageSize),
LabelSelector: labelSelector,
}
counter := 0
var (
req = &v1.ClusterProfilerRequest{
PageSize: int64(pageSize),
LabelSelector: labelSelector,
}
counter = 0
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if _, err := os.Stat(dir); err == nil {
oldResultsDstDir := fmt.Sprintf("%s-old-%s", dir, rand.String(4))
log.Printf("Moving already existing %q => %q\n", dir, oldResultsDstDir)
os.Rename(dir, oldResultsDstDir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
os.Rename(dir, oldResultsDstDir)
if err := os.Rename(dir, oldResultsDstDir); err != nil {
return err
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


for _, pod := range podList.Items {
if podIsReadyComponent(&pod) {
pods = append(pods, pod.DeepCopy())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need DeepCopy here? I think we could just simply use non-allocatable filtering and drop the DeepCopy, so before the for loop:

pods = podList.Items[:0]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

clientutil "kubevirt.io/client-go/util"
)

const (
maxClusterProfilerResultsPageSize = 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

20 sounds really low for the max, is there a reason for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #6478. Each additional entry adds ~9Mb of MEM usage to virt-api, so restricting max to 20 gives us at maximum ~200Mb usage.

This was just my judgement of how much additional memory is reasonable for this feature. Probably it would be great to have it configurable in a config, but I think it should go in a different MR if anyone needs it.
What's more, dumping cluster profiles is not performance related, so I would prefer safety (not killing virt-api with OOMs) over speed of downloading debugging data.

@kubevirt-bot
Copy link
Contributor

@VirrageS: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@davidvossel davidvossel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great, I just made a few comments.

The pagination logic in the cluster-profiler tool is way better than i expected. Functionally, you've found a way to reduce the data cached in virt-api at one time while still making it trivial for a user to retrieve all that data using the client tooling.


// +k8s:openapi-gen=true
type ClusterProfilerRequest struct {
LabelSelector string `json:"labelSelectors"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's an api warning because the variable name is LabelSelector and the json marker has an s on it.

also, does this need omiteempty as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added omitempty

@@ -221,6 +221,7 @@ API rule violation: names_match,kubevirt.io/client-go/api/v1,CDRomTarget,ReadOnl
API rule violation: names_match,kubevirt.io/client-go/api/v1,CPU,DedicatedCPUPlacement
API rule violation: names_match,kubevirt.io/client-go/api/v1,CloudInitConfigDriveSource,UserDataSecretRef
API rule violation: names_match,kubevirt.io/client-go/api/v1,CloudInitNoCloudSource,UserDataSecretRef
API rule violation: names_match,kubevirt.io/client-go/api/v1,ClusterProfilerRequest,LabelSelector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either change the variable to LabelSelectors, or change the json marker to labelSelector. The s character mis match in types.go is what causes this. We ideally don't want to add more known violations unless we have to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've changed labelSelector json tag

flag.IntVar(&pageSize, "page-size", defaultDumpPageSize, "Page size used for fetching profile results. Works only with dump command")

// NOTE: To profile specific kubevirt component (for example virt-api) use `kubevirt.io=virt-operator` label selector.
flag.StringVar(&labelSelector, "l", "", "Label selector for limiting pods to fetch the profiler results from. kubectl LIST label selector format expected")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does label only work for the dump command as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works only for the dump command. I've added the info to the flag description + validation

Comment on lines 142 to 177
func fetchAndSaveClusterProfilerResults(c kubecli.KubevirtClient, pageSize int, labelSelector, outputDir string) error {
if err := prepareDir(outputDir); err != nil {
return err
}

req := &v1.ClusterProfilerRequest{
PageSize: int64(pageSize),
LabelSelector: labelSelector,
}
counter := 0

for {
fmt.Printf("\rFetching in progress. Downloaded so far: %d ", counter)
result, err := c.ClusterProfiler().Dump(req)
if err != nil {
return err
}

if len(result.ComponentResults) == 0 {
break
}

err = writeResultsToDisk(outputDir, result)
if err != nil {
return err
}

counter += len(result.ComponentResults)
if result.Continue == "" {
break
}
req.Continue = result.Continue
}

log.Printf("\rSUCCESS: Dumped PProf %d results for KubeVirt control plane to [%s]\n", counter, outputDir)
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah interesting. So the user never sees the token for the pages, we just always download all the data all at once in a loop using this tool. I like that.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2021
@knopt knopt force-pushed the knopt-cluster-profiler-scale branch from 75665fd to 6e42b39 Compare October 14, 2021 13:08
@kubevirt-bot kubevirt-bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 14, 2021
Copy link
Member

@davidvossel davidvossel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 15, 2021
@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2021
@fgimenez
Copy link
Contributor

Hi @knopt thanks a lot for your contribution! We require commits to be signed, you can read more about it here https://github.com/kubevirt/kubevirt/blob/main/CONTRIBUTING.md#contributor-compliance-with-developer-certificate-of-origin-dco Let us know if you need any help :)

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 25, 2021
@knopt knopt force-pushed the knopt-cluster-profiler-scale branch from 6e42b39 to 055886e Compare October 26, 2021 08:22
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed lgtm Indicates that a PR is ready to be merged. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Oct 26, 2021
@knopt knopt force-pushed the knopt-cluster-profiler-scale branch from 055886e to d750e4b Compare October 26, 2021 08:42
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2021
…usage

Signed-off-by: Tomasz Knopik <tomasz.knopik@gmail.com>
@knopt
Copy link
Contributor Author

knopt commented Oct 26, 2021

I've rebased the change + adjusted commit messages

Copy link
Member

@davidvossel davidvossel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@davidvossel
Copy link
Member

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 28, 2021
@fgimenez
Copy link
Contributor

/ok-to-test

@kubevirt-bot kubevirt-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 28, 2021
@kubevirt-commenter-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@kubevirt-commenter-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot kubevirt-bot merged commit e3dc958 into kubevirt:main Oct 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SIG-scale] ClusterProfiler virt-api memory usage scales lineary with number of kubevirt pods
6 participants