
Support Allocation GPU utilization/efficiency through integration with Nvidia GPU Operator/DCGM. #944

Merged · 7 commits · Oct 8, 2021

Conversation

kaelanspatel (Contributor)

What does this PR change?

Changes the Allocation struct to feature two new fields, GPURequestAverage and GPUUsageAverage, the latter of which is supported through a Prometheus query for the Nvidia dcgm-exporter metric DCGM_FI_DEV_GPU_UTIL. Also adds Allocation.GPUEfficiency(), which calculates the ratio of GPU usage to requests, as with the CPU and RAM efficiencies.
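
For reference, here is a minimal sketch of the shape of these changes. Field and method names follow the description above; the surrounding Allocation fields and the exact edge-case handling in the real code may differ.

```go
// Sketch only: not the exact implementation in this PR.
type Allocation struct {
	// ... existing fields omitted ...
	GPUHours          float64
	GPURequestAverage float64 // average number of GPUs requested over the window
	GPUUsageAverage   float64 // average GPU usage, derived from DCGM_FI_DEV_GPU_UTIL
}

// GPUEfficiency returns the ratio of GPU usage to GPU requests, mirroring the
// existing CPU and RAM efficiency calculations. With no requests (or no DCGM
// metrics feeding usage), it reports 0.
func (a *Allocation) GPUEfficiency() float64 {
	if a == nil || a.GPURequestAverage <= 0.0 {
		return 0.0
	}
	return a.GPUUsageAverage / a.GPURequestAverage
}
```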

Notes

  • These features will only work if the Nvidia dcgm-exporter pod is running on the node. Otherwise, the value of GPUUsageAverage will be zero, which means that even a workload with a GPU allocated may report zero usage. Setup is not entirely trivial and is detailed here. (A sketch of the general query shape follows below.)
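
For illustration only, the underlying usage query is of this general shape: an average of DCGM_FI_DEV_GPU_UTIL over the window, grouped per container. The exact format string and the label names are assumptions here and depend on how dcgm-exporter is deployed.

```go
// Hypothetical query format, not necessarily the one added in this PR.
// DCGM_FI_DEV_GPU_UTIL reports whole-number percentages per GPU; label
// names vary with dcgm-exporter configuration.
const queryFmtGPUUsageAvg = `avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL{container!=""}[%s])) by (container, pod, namespace)`
```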

Does this PR rely on any other PRs?

How does this PR impact users? (This is the kind of thing that goes in release notes!)

Adds support for integrating Nvidia DCGM GPU metrics into Kubecost and calculating Allocation GPU efficiency.

Links to Issues or ZD tickets this PR addresses or fixes

How was this PR tested?

Updated allocation_test.go to include checks for the new Allocation fields, including efficiency. Additionally, created a GPU test load generator pod on a cluster and observed the metrics both in Prometheus and through the API.

Prometheus shows DCGM_FI_DEV_GPU_UTIL being autodetected and whitelisted:
[screenshot]

Allocation for load-generating pod shows consistent GPURequestAverage and GPUUsageAverage:
[screenshot]

@kaelanspatel added the "schema change" label (Change to ETL schema(s), requiring a rebuild) on Sep 23, 2021

@michaelmdresser (Contributor) left a comment:

A couple of comments. I'd also like to see a documentation update (allocation.md, probably) that highlights the new fields and explains why usage (and therefore efficiency) will be 0 when the NVIDIA metrics are not available.

Also, @mbolt35 is this the correct way of version-bumping the schema now? I haven't had to do it since we split the Asset and Allocation versions.

@@ -954,6 +960,33 @@ func applyGPUsAllocated(podMap map[podKey]*Pod, resGPUsRequested []*prom.QueryRe

hrs := pod.Allocations[container].Minutes() / 60.0
pod.Allocations[container].GPUHours = res.Values[0].Value * hrs
pod.Allocations[container].GPURequestAverage = res.Values[0].Value

@michaelmdresser (Contributor):

Is this an actual average? Should we be querying this more like CPU/RAM request averages?

@kaelanspatel (Contributor, Author):

It is. Can you expand on this? To my knowledge it is being queried and handled like RAM requests.

@michaelmdresser (Contributor):

This function does some funky stuff with queryFmtGPUsAllocated -- look at the first if statement in this function; I think you might run into trouble because requests is a separate array and can get overwritten by allocation.

However, I'm not super familiar with GPU. This might not be a problem at all, in which case 👍

Other resources (RAM, CPU, etc.) have separate functions for applying requests, usage, and averages. We don't need to adhere to that style if this is correct; it's just what tipped me off that this looked a little odd.

@kaelanspatel (Contributor, Author):

Ah, yes, I see. I believe this is consistent with what we do throughout, which is to override requests with limits (which should also be what is happening in any pod using a GPU regardless; you're just supposed to set limits).

There may be problems with this particular function that I am not aware of, but the value of "GPUs allocated per allocation" right now is back-calculated from GPUHours anyway, so really all this change does is make that a separate field in Allocation.
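
To illustrate the back-calculation point with a standalone sketch (hypothetical values, not the PR's code):

```go
package main

import "fmt"

func main() {
	// Hypothetical values: 2 GPUs requested over a 90-minute allocation.
	gpusRequested := 2.0
	minutes := 90.0

	hrs := minutes / 60.0
	gpuHours := gpusRequested * hrs  // what applyGPUsAllocated already computes
	backCalculated := gpuHours / hrs // recovers the per-window GPU count

	// Storing GPURequestAverage directly just makes this value an explicit field.
	fmt.Printf("requested=%.2f GPUHours=%.2f back-calculated=%.2f\n",
		gpusRequested, gpuHours, backCalculated)
}
```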

@michaelmdresser (Contributor):

SGTM

pod.AppendContainer(container)
}

pod.Allocations[container].GPUUsageAverage = res.Values[0].Value * 0.01

@michaelmdresser (Contributor):

Why is this multiplied by 0.01? A comment would be helpful.

@kaelanspatel (Contributor, Author):

I'll add one. For posterity, it's because (for whatever reason) the "percentage" expressed by this metric is in whole numbers rather than the decimal fractions we use for CPU/RAM, i.e., 100% is reported as 100, not 1.0.
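
As a sketch of the conversion being discussed (the helper name here is hypothetical):

```go
// toFraction converts DCGM_FI_DEV_GPU_UTIL's whole-number percentage
// (e.g. 75 for 75%) into the 0-1.0 fraction convention used for the CPU/RAM
// usage averages, which is why the raw value is multiplied by 0.01.
func toFraction(dcgmUtilPercent float64) float64 {
	return dcgmUtilPercent * 0.01 // 75 -> 0.75
}
```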

@michaelmdresser (Contributor):

Aha, of course. Excellent.
