Extend test/e2e/scheduling/nvidia-gpus.go to track resource usage of installer and device plugin containers #53541
Conversation
/release-note-none
/assign @mindprince
if err != nil {
	Logf("Error while reading data from %v: %v", w.nodeName, err)
	return
}
for k, v := range nodeUsage {
	data[k] = v
	Logf("Get container %v usage on node %v. CPUUsageInCores: %v, MemoryUsageInBytes: %v, MemoryWorkingSetInBytes: %v", k, w.nodeName, v.CPUUsageInCores, v.MemoryUsageInBytes, v.MemoryWorkingSetInBytes)
Was this for debugging or did you intentionally add it?
I added this to log the resource usage of the installer over time, mostly to see where resource usage peaks. I wondered whether I should remove it from the commit before sending out the PR, but felt it could provide useful information in the test log in the future.
@@ -171,20 +172,28 @@ func testNvidiaGPUsOnCOS(f *framework.Framework) {
	podCreationFunc = makeCudaAdditionTestPod
}

// GPU drivers might have already been installed.
I remember having an offline discussion about this, but I don't remember whether we resolved why we had this if block, since the driver installation scripts are okay with being rerun. Did you figure it out? Is it safe to remove this condition?
I think this if check was mostly there to avoid re-running the installer when we run the test on the same cluster multiple times. I didn't see any issues after removing it.
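For reference, the guard under discussion was of roughly this shape (a hypothetical sketch, not the exact original code; the helper names here are assumptions):

```go
// Hypothetical sketch of the guard being discussed: only start the driver
// installer when GPUs are not already visible on the schedulable nodes.
// Since the installer script tolerates reruns, dropping this check is safe.
if !areGPUsAvailableOnAllSchedulableNodes(f) {
	installDriver(f) // assumed helper that creates the installer DaemonSet
}
```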
test/e2e/scheduling/nvidia-gpus.go
Outdated
framework.Logf("Successfully created daemonset to install Nvidia drivers.")

pods, err := framework.WaitForControlledPods(f.ClientSet, ds.Namespace, ds.Name, extensionsinternal.Kind("DaemonSet"))
framework.ExpectNoError(err, "getting daemonset pods")
Can you make this more descriptive?
done.
test/e2e/scheduling/nvidia-gpus.go
Outdated
framework.ExpectNoError(err, "getting daemonset pods")
framework.Logf("Starting ResourceUsageGather for the created DaemonSet pods.")
rsgather, err := framework.NewResourceUsageGatherer(f.ClientSet, framework.ResourceGathererOptions{false, false, 2 * time.Second, 2 * time.Second}, pods)
framework.ExpectNoError(err, "creating ResourceUsageGather")
Same here, make this more descriptive.
done.
framework.Logf("Stopping ResourceUsageGather")
constraints := make(map[string]framework.ResourceConstraint)
// For now, just gets summary. Can pass valid constraints in the future.
summary, err := rsgather.StopAndSummarize([]int{50, 90, 100}, constraints)
Can you link me to what the final output is supposed to look like? Would it be useful to provide a sample link in the code?
I added the example summary output in the PR description.
Thanks a lot for the review! PTAL.
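Putting the reviewed snippets together, the overall flow in the test looks roughly like this (a sketch assembled from the excerpts above; the StartGatheringData and TestSummaries usage are assumptions about the surrounding framework code):

```go
// Wait for the installer DaemonSet pods, then track their resource usage.
pods, err := framework.WaitForControlledPods(f.ClientSet, ds.Namespace, ds.Name, extensionsinternal.Kind("DaemonSet"))
framework.ExpectNoError(err, "getting daemonset pods")

// Probe the pods every 2 seconds, reading 2 seconds of stats per probe.
rsgather, err := framework.NewResourceUsageGatherer(f.ClientSet,
	framework.ResourceGathererOptions{false, false, 2 * time.Second, 2 * time.Second}, pods)
framework.ExpectNoError(err, "creating ResourceUsageGather")
go rsgather.StartGatheringData() // assumed: gathering runs in the background

// ... run the CUDA test pods and wait for them to succeed ...

// Summarize the 50th/90th/100th percentile usage; no constraints enforced yet.
constraints := make(map[string]framework.ResourceConstraint)
summary, err := rsgather.StopAndSummarize([]int{50, 90, 100}, constraints)
framework.ExpectNoError(err, "getting resource usage summary")
f.TestSummaries = append(f.TestSummaries, summary) // assumed: printed at test end
```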
/retest
/retest
/approve
/lgtm
Extend test/e2e/scheduling/nvidia-gpus.go to track resource usage of installer and device plugin containers. To support this, exports certain functions and fields in framework/resource_usage_gatherer.go so that it can be used in any e2e test to track any specified pod resource usage with the specified probe interval and duration.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, mindprince, vishh
Associated issue: 368

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.
InKubemark: ProviderIs("kubemark"),
MasterOnly: TestContext.GatherKubeSystemResourceUsageData == "master",
ResourceDataGatheringPeriod: 60 * time.Second,
ProbeDuration: 5 * time.Second,
The default was previously 15s, and this change seems to have caused a regression: #55818.
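One way to make such a change stand out in review is a named-field literal instead of the positional one used at the call sites above (a sketch only, not the actual fix applied for #55818; the field names come from the snippet above):

```go
// Named fields make the probe settings explicit at the call site, so a
// change like 15s -> 5s is visible in review.
options := framework.ResourceGathererOptions{
	InKubemark:                  framework.ProviderIs("kubemark"),
	MasterOnly:                  framework.TestContext.GatherKubeSystemResourceUsageData == "master",
	ResourceDataGatheringPeriod: 60 * time.Second,
	ProbeDuration:               15 * time.Second, // the earlier default
}
```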
Control logs verbosity in resource gatherer

Automatic merge from submit-queue (batch tested with PRs 55233, 55927, 55903, 54867, 55940). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

PR #53541 added some logging in the resource gatherer which is a bit too verbose for normal purposes. As a result, we're seeing a lot of spam in our large cluster performance tests (e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/8046/build-log.txt). This PR makes the verbosity of those logs controllable through an option. It's off by default, but it is turned on for the GPU test to preserve behavior there.

/cc @jiayingz @mindprince
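That change can be pictured roughly as follows (a sketch under the assumption that the option is a boolean field on ResourceGathererOptions; the actual field name in the framework may differ):

```go
// Sketch: gate the per-probe container usage logs behind an opt-in flag,
// so large-cluster runs are not flooded. The field name is an assumption.
type ResourceGathererOptions struct {
	InKubemark                  bool
	MasterOnly                  bool
	ResourceDataGatheringPeriod time.Duration
	ProbeDuration               time.Duration
	PrintVerboseLogs            bool // off by default; the GPU test turns it on
}

// In the probe loop, the verbose log added by this PR becomes conditional:
if w.printVerboseLogs {
	Logf("Get container %v usage on node %v. CPUUsageInCores: %v, MemoryUsageInBytes: %v, MemoryWorkingSetInBytes: %v",
		k, w.nodeName, v.CPUUsageInCores, v.MemoryUsageInBytes, v.MemoryWorkingSetInBytes)
}
```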
Extend test/e2e/scheduling/nvidia-gpus.go to track resource usage of
installer and device plugin containers.
To support this, exports certain functions and fields in
framework/resource_usage_gatherer.go so that it can be used in any
e2e test to track any specified pod resource usage with the specified
probe interval and duration.
What this PR does / why we need it:
We need to quantify the resource usage of the device plugin DaemonSet to make sure it can run reliably on nodes with GPUs.
We also want to measure gpu driver installer resource usage to track any unexpected resource consumption during driver installation.
For the latter part, see the related issue kubernetes/enhancements#368.
Example resource summary output:
Oct 6 12:35:07.289: INFO: Printing summary: ResourceUsageSummary
Oct 6 12:35:07.289: INFO: ResourceUsageSummary JSON
{
"100": [
{
"Name": "nvidia-device-plugin-6kqxp/nvidia-device-plugin",
"Cpu": 0.000507167,
"Mem": 2134016
},
{
"Name": "nvidia-device-plugin-6kqxp/nvidia-driver-installer",
"Cpu": 1.915508718,
"Mem": 663330816
},
{
"Name": "nvidia-device-plugin-l28zc/nvidia-device-plugin",
"Cpu": 0.000836256,
"Mem": 2211840
},
{
"Name": "nvidia-device-plugin-l28zc/nvidia-driver-installer",
"Cpu": 1.916886293,
"Mem": 691449856
},
{
"Name": "nvidia-device-plugin-xb4vh/nvidia-device-plugin",
"Cpu": 0.000515103,
"Mem": 2265088
},
{
"Name": "nvidia-device-plugin-xb4vh/nvidia-driver-installer",
"Cpu": 1.909435982,
"Mem": 832430080
}
],
"50": [
{
...
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

Release note: