Extend test/e2e/scheduling/nvidia-gpus.go to track resource usage of #53541

Merged: 1 commit, Nov 14, 2017

Conversation

@jiayingz (Contributor) commented Oct 6, 2017

Extend test/e2e/scheduling/nvidia-gpus.go to track resource usage of installer and device plugin containers.
To support this, export certain functions and fields in framework/resource_usage_gatherer.go so that the gatherer can be used in any e2e test to track the resource usage of specified pods, with a configurable probe interval and duration.

What this PR does / why we need it:
We need to quantify the resource usage of the device plugin DaemonSet to make sure it can run reliably on nodes with GPUs.
We also want to measure GPU driver installer resource usage to track any unexpected resource consumption during driver installation.
For the latter part, see the related issue kubernetes/enhancements#368.
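
A rough sketch of the pattern this enables, assembled from the diff discussed below (only NewResourceUsageGatherer, WaitForControlledPods and StopAndSummarize appear in the snippets; the StartGatheringData call and the final logging step are assumptions about the exported API):

// Sketch only: gather usage for the pods of the driver-installer DaemonSet.
pods, err := framework.WaitForControlledPods(f.ClientSet, ds.Namespace, ds.Name, extensionsinternal.Kind("DaemonSet"))
framework.ExpectNoError(err, "getting pods controlled by the DaemonSet")

rsgather, err := framework.NewResourceUsageGatherer(f.ClientSet,
	framework.ResourceGathererOptions{false, false, 2 * time.Second, 2 * time.Second}, // 2s probe interval, 2s probe duration
	pods)
framework.ExpectNoError(err, "creating ResourceUsageGatherer")
go rsgather.StartGatheringData() // assumed exported start method; polls the listed pods until stopped

// ... run the GPU test workload ...

// Summarize at the 50th, 90th and 100th percentiles; no constraints yet.
summary, err := rsgather.StopAndSummarize([]int{50, 90, 100}, make(map[string]framework.ResourceConstraint))
framework.ExpectNoError(err, "stopping and summarizing resource usage")
framework.Logf("ResourceUsageSummary: %v", summary) // per-container CPU/memory, like the example below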

Example resource summary output:
Oct 6 12:35:07.289: INFO: Printing summary: ResourceUsageSummary
Oct 6 12:35:07.289: INFO: ResourceUsageSummary JSON
{
  "100": [
    {
      "Name": "nvidia-device-plugin-6kqxp/nvidia-device-plugin",
      "Cpu": 0.000507167,
      "Mem": 2134016
    },
    {
      "Name": "nvidia-device-plugin-6kqxp/nvidia-driver-installer",
      "Cpu": 1.915508718,
      "Mem": 663330816
    },
    {
      "Name": "nvidia-device-plugin-l28zc/nvidia-device-plugin",
      "Cpu": 0.000836256,
      "Mem": 2211840
    },
    {
      "Name": "nvidia-device-plugin-l28zc/nvidia-driver-installer",
      "Cpu": 1.916886293,
      "Mem": 691449856
    },
    {
      "Name": "nvidia-device-plugin-xb4vh/nvidia-device-plugin",
      "Cpu": 0.000515103,
      "Mem": 2265088
    },
    {
      "Name": "nvidia-device-plugin-xb4vh/nvidia-driver-installer",
      "Cpu": 1.909435982,
      "Mem": 832430080
    }
  ],
  "50": [
    {
    ...


@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 6, 2017
@jiayingz (Contributor, Author) commented Oct 6, 2017

/release-note-none
/sig node

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 6, 2017
@vishh vishh self-assigned this Oct 6, 2017
@jiayingz (Contributor, Author) commented:

/assign @mindprince

if err != nil {
Logf("Error while reading data from %v: %v", w.nodeName, err)
return
}
for k, v := range nodeUsage {
data[k] = v
Logf("Get container %v usage on node %v. CPUUsageInCores: %v, MemoryUsageInBytes: %v, MemoryWorkingSetInBytes: %v", k, w.nodeName, v.CPUUsageInCores, v.MemoryUsageInBytes, v.MemoryWorkingSetInBytes)
A reviewer (Member) commented on this snippet:
Was this for debugging or did you intentionally add it?

@jiayingz (author) replied:
I added this to log the installer's resource usage over time, mostly to see where resource usage peaks. I wondered whether I should remove it from the commit before sending out the PR, but felt it could provide useful information in the test log in the future.

@@ -171,20 +172,28 @@ func testNvidiaGPUsOnCOS(f *framework.Framework) {
podCreationFunc = makeCudaAdditionTestPod
}

// GPU drivers might have already been installed.
A reviewer (Member) commented on this line:
I remember having an offline discussion about this, but I don't remember if we resolved why we had this if block, since the driver installation scripts are okay with rerunning. Did you figure it out? Is it safe to remove this condition?

@jiayingz (author) replied:
I think this if check was mostly there to avoid re-running the installer when the test is run on a cluster multiple times. I didn't see any issue after removing it.

framework.Logf("Successfully created daemonset to install Nvidia drivers.")

pods, err := framework.WaitForControlledPods(f.ClientSet, ds.Namespace, ds.Name, extensionsinternal.Kind("DaemonSet"))
framework.ExpectNoError(err, "getting daemonset pods")
A reviewer (Member) commented on this snippet:
Can you make this more descriptive?

@jiayingz (author) replied:
done.

framework.ExpectNoError(err, "getting daemonset pods")
framework.Logf("Starting ResourceUsageGather for the created DaemonSet pods.")
rsgather, err := framework.NewResourceUsageGatherer(f.ClientSet, framework.ResourceGathererOptions{false, false, 2 * time.Second, 2 * time.Second}, pods)
framework.ExpectNoError(err, "creating ResourceUsageGather")
A reviewer (Member) commented on this snippet:
Same here, make this more descriptive.

@jiayingz (author) replied:
done.
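
For readability, here is a hedged decode of the positional ResourceGathererOptions literal above into named fields; the field names and their order are taken from the defaults snippet quoted further down this page, so treat the mapping as an assumption rather than the exact diff:

// Assumed mapping of the positional literal {false, false, 2*time.Second, 2*time.Second}.
opts := framework.ResourceGathererOptions{
	InKubemark:                  false,
	MasterOnly:                  false,
	ResourceDataGatheringPeriod: 2 * time.Second, // how often to probe
	ProbeDuration:               2 * time.Second, // measurement window for each probe (assumption)
}
rsgather, err := framework.NewResourceUsageGatherer(f.ClientSet, opts, pods)
framework.ExpectNoError(err, "creating ResourceUsageGatherer for the DaemonSet pods")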

framework.Logf("Stopping ResourceUsageGather")
constraints := make(map[string]framework.ResourceConstraint)
// For now, just gets summary. Can pass valid constraints in the future.
summary, err := rsgather.StopAndSummarize([]int{50, 90, 100}, constraints)
A reviewer (Member) commented on this snippet:
Can you link me to what the final output is supposed to look like? Would it be useful to provide a sample link in the code?

@jiayingz (author) replied:
I added the example summary output in the PR description.
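
Picking up the "can pass valid constraints in the future" comment above, a minimal sketch of what that might look like. The CPUConstraint/MemoryConstraint field names follow how ResourceConstraint is used elsewhere in the e2e framework, and the thresholds are made-up illustrations:

// Hypothetical: bound the device plugin container instead of only summarizing.
constraints := map[string]framework.ResourceConstraint{
	"nvidia-device-plugin": {
		CPUConstraint:    0.005,            // cores; illustrative threshold
		MemoryConstraint: 10 * 1024 * 1024, // bytes; illustrative threshold
	},
}
summary, err := rsgather.StopAndSummarize([]int{50, 90, 100}, constraints)
framework.ExpectNoError(err, "stopping and summarizing resource usage")
_ = summary // the gatherer compares measured usage against these constraints when summarizing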

@jiayingz (Contributor, Author) left a review:
Thanks a lot for the review! PTAL.

@ncdc ncdc removed their assignment Oct 12, 2017
@jiayingz (Contributor, Author) commented:

/retest

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2017
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 1, 2017
@jiayingz (Contributor, Author) commented Nov 1, 2017:

/retest

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 9, 2017
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2017
@vishh (Contributor) left a review:

/approve
/lgtm

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2017
@k8s-github-robot k8s-github-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 14, 2017
@rohitagarwal003 (Member) left a review:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 14, 2017
@k8s-github-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, mindprince, vishh

Associated issue: 368

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot commented:

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot commented:

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

InKubemark: ProviderIs("kubemark"),
MasterOnly: TestContext.GatherKubeSystemResourceUsageData == "master",
ResourceDataGatheringPeriod: 60 * time.Second,
ProbeDuration: 5 * time.Second,
A reviewer (Member) commented on this snippet:
The default was earlier 15s and this change seems to have caused a regression - #55818.

k8s-github-robot pushed a commit that referenced this pull request on Nov 18, 2017:

Automatic merge from submit-queue (batch tested with PRs 55233, 55927, 55903, 54867, 55940). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Control logs verbosity in resource gatherer

PR #53541 added some logging in the resource gatherer which is a bit too verbose for normal purposes.
As a result, we're seeing a lot of spam in our large cluster performance tests (e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/8046/build-log.txt)

This PR makes the verbosity of those logs controllable through an option. It's off by default, but it is turned on for the GPU test to preserve the behavior there.

/cc @jiayingz @mindprince
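
A rough sketch of that kind of gating, applied to the loop quoted earlier in this thread; the printVerboseLogs option name is illustrative, not necessarily what the follow-up PR uses:

for k, v := range nodeUsage {
	data[k] = v
	if w.printVerboseLogs { // illustrative option; off by default, enabled for the GPU test
		Logf("Get container %v usage on node %v. CPUUsageInCores: %v, MemoryUsageInBytes: %v, MemoryWorkingSetInBytes: %v",
			k, w.nodeName, v.CPUUsageInCores, v.MemoryUsageInBytes, v.MemoryWorkingSetInBytes)
	}
}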