Extend test/e2e/scheduling/nvidia-gpus.go to track resource usage of installer and device plugin containers #53541
Conversation
/release-note-none
/assign @mindprince
if err != nil {
	Logf("Error while reading data from %v: %v", w.nodeName, err)
	return
}
for k, v := range nodeUsage {
	data[k] = v
	Logf("Get container %v usage on node %v. CPUUsageInCores: %v, MemoryUsageInBytes: %v, MemoryWorkingSetInBytes: %v", k, w.nodeName, v.CPUUsageInCores, v.MemoryUsageInBytes, v.MemoryWorkingSetInBytes)
Was this for debugging or did you intentionally add it?
I added this to log the resource usage of the installer over time, mostly to see where resource usage peaks. I wondered whether I should remove it from the commit before sending out the PR, but felt it could provide useful information in the test log in the future.
@@ -171,20 +172,28 @@ func testNvidiaGPUsOnCOS(f *framework.Framework) {
	podCreationFunc = makeCudaAdditionTestPod
}

// GPU drivers might have already been installed.
I remember having an offline discussion about this, but I don't remember whether we resolved why we had this if block, since the driver installation scripts are okay with being rerun. Did you figure it out? Is it safe to remove this condition?
I think this if check was mostly there to avoid re-running the installer when we run the test on the same cluster multiple times. I didn't see any issues after removing it.
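For reference, the guard under discussion was of roughly this shape (a hypothetical sketch, not the exact original code; the helper names here are assumptions):

```go
// Hypothetical sketch of the guard being discussed: only start the driver
// installer when GPUs are not already visible on the schedulable nodes.
// Since the installer script tolerates reruns, dropping this check is safe.
if !areGPUsAvailableOnAllSchedulableNodes(f) {
	installDriver(f) // assumed helper that creates the installer DaemonSet
}
```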
test/e2e/scheduling/nvidia-gpus.go
Outdated
framework.Logf("Successfully created daemonset to install Nvidia drivers.")

pods, err := framework.WaitForControlledPods(f.ClientSet, ds.Namespace, ds.Name, extensionsinternal.Kind("DaemonSet"))
framework.ExpectNoError(err, "getting daemonset pods")
Can you make this more descriptive?
done.
test/e2e/scheduling/nvidia-gpus.go
Outdated
framework.ExpectNoError(err, "getting daemonset pods")
framework.Logf("Starting ResourceUsageGather for the created DaemonSet pods.")
rsgather, err := framework.NewResourceUsageGatherer(f.ClientSet, framework.ResourceGathererOptions{false, false, 2 * time.Second, 2 * time.Second}, pods)
framework.ExpectNoError(err, "creating ResourceUsageGather")
Same here, make this more descriptive.
done.
framework.Logf("Stopping ResourceUsageGather")
constraints := make(map[string]framework.ResourceConstraint)
// For now, just gets summary. Can pass valid constraints in the future.
summary, err := rsgather.StopAndSummarize([]int{50, 90, 100}, constraints)
Can you link me to what the final output is supposed to look like? Would it be useful to provide a sample link in the code?
I added the example summary output in the PR description.
Thanks a lot for the review! PTAL.
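Putting the reviewed snippets together, the overall flow in the test looks roughly like this (a sketch assembled from the excerpts above; the StartGatheringData and TestSummaries usage are assumptions about the surrounding framework code):

```go
// Wait for the installer DaemonSet pods, then track their resource usage.
pods, err := framework.WaitForControlledPods(f.ClientSet, ds.Namespace, ds.Name, extensionsinternal.Kind("DaemonSet"))
framework.ExpectNoError(err, "getting daemonset pods")

// Probe the pods every 2 seconds, reading 2 seconds of stats per probe.
rsgather, err := framework.NewResourceUsageGatherer(f.ClientSet,
	framework.ResourceGathererOptions{false, false, 2 * time.Second, 2 * time.Second}, pods)
framework.ExpectNoError(err, "creating ResourceUsageGather")
go rsgather.StartGatheringData() // assumed: gathering runs in the background

// ... run the CUDA test pods and wait for them to succeed ...

// Summarize the 50th/90th/100th percentile usage; no constraints enforced yet.
constraints := make(map[string]framework.ResourceConstraint)
summary, err := rsgather.StopAndSummarize([]int{50, 90, 100}, constraints)
framework.ExpectNoError(err, "getting resource usage summary")
f.TestSummaries = append(f.TestSummaries, summary) // assumed: printed at test end
```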
/retest
/retest
/approve
/lgtm
Extend test/e2e/scheduling/nvidia-gpus.go to track resource usage of installer and device plugin containers. To support this, exports certain functions and fields in framework/resource_usage_gatherer.go so that it can be used in any e2e test to track any specified pod resource usage with the specified probe interval and duration.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, mindprince, vishh
Associated issue: 368

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.
InKubemark: ProviderIs("kubemark"),
MasterOnly: TestContext.GatherKubeSystemResourceUsageData == "master",
ResourceDataGatheringPeriod: 60 * time.Second,
ProbeDuration: 5 * time.Second,
The default was previously 15s, and this change seems to have caused a regression: #55818.
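One way to make such a change stand out in review is a named-field literal instead of the positional one used at the call sites above (a sketch only, not the actual fix applied for #55818; the field names come from the snippet above):

```go
// Named fields make the probe settings explicit at the call site, so a
// change like 15s -> 5s is visible in review.
options := framework.ResourceGathererOptions{
	InKubemark:                  framework.ProviderIs("kubemark"),
	MasterOnly:                  framework.TestContext.GatherKubeSystemResourceUsageData == "master",
	ResourceDataGatheringPeriod: 60 * time.Second,
	ProbeDuration:               15 * time.Second, // the earlier default
}
```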
Control logs verbosity in resource gatherer

Automatic merge from submit-queue (batch tested with PRs 55233, 55927, 55903, 54867, 55940). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

PR #53541 added some logging in the resource gatherer which is a bit too verbose for normal purposes. As a result, we're seeing a lot of spam in our large cluster performance tests (e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/8046/build-log.txt). This PR makes the verbosity of those logs controllable through an option. It's off by default, but it is turned on for the GPU test to preserve behavior there.

/cc @jiayingz @mindprince
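That change can be pictured roughly as follows (a sketch under the assumption that the option is a boolean field on ResourceGathererOptions; the actual field name in the framework may differ):

```go
// Sketch: gate the per-probe container usage logs behind an opt-in flag,
// so large-cluster runs are not flooded. The field name is an assumption.
type ResourceGathererOptions struct {
	InKubemark                  bool
	MasterOnly                  bool
	ResourceDataGatheringPeriod time.Duration
	ProbeDuration               time.Duration
	PrintVerboseLogs            bool // off by default; the GPU test turns it on
}

// In the probe loop, the verbose log added by this PR becomes conditional:
if w.printVerboseLogs {
	Logf("Get container %v usage on node %v. CPUUsageInCores: %v, MemoryUsageInBytes: %v, MemoryWorkingSetInBytes: %v",
		k, w.nodeName, v.CPUUsageInCores, v.MemoryUsageInBytes, v.MemoryWorkingSetInBytes)
}
```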
Extend test/e2e/scheduling/nvidia-gpus.go to track resource usage of
installer and device plugin containers.
To support this, exports certain functions and fields in
framework/resource_usage_gatherer.go so that it can be used in any
e2e test to track any specified pod resource usage with the specified
probe interval and duration.
What this PR does / why we need it:
We need to quantify the resource usage of the device plugin DaemonSet to make sure it can run reliably on nodes with GPUs.
We also want to measure gpu driver installer resource usage to track any unexpected resource consumption during driver installation.
For the latter part, see the related issue kubernetes/enhancements#368.
Example resource summary output:
Oct 6 12:35:07.289: INFO: Printing summary: ResourceUsageSummary
Oct 6 12:35:07.289: INFO: ResourceUsageSummary JSON
{
"100": [
{
"Name": "nvidia-device-plugin-6kqxp/nvidia-device-plugin",
"Cpu": 0.000507167,
"Mem": 2134016
},
{
"Name": "nvidia-device-plugin-6kqxp/nvidia-driver-installer",
"Cpu": 1.915508718,
"Mem": 663330816
},
{
"Name": "nvidia-device-plugin-l28zc/nvidia-device-plugin",
"Cpu": 0.000836256,
"Mem": 2211840
},
{
"Name": "nvidia-device-plugin-l28zc/nvidia-driver-installer",
"Cpu": 1.916886293,
"Mem": 691449856
},
{
"Name": "nvidia-device-plugin-xb4vh/nvidia-device-plugin",
"Cpu": 0.000515103,
"Mem": 2265088
},
{
"Name": "nvidia-device-plugin-xb4vh/nvidia-driver-installer",
"Cpu": 1.909435982,
"Mem": 832430080
}
],
"50": [
{
...
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

Release note: