Support Allocation GPU utilization/efficiency through integration with Nvidia GPU Operator/DCGM. #944
Conversation
Couple of comments. I'd also like to see a documentation update (allocation.md, probably) that highlights the new fields and why usage will be 0 (and efficiency 0) when the NVIDIA metrics are not available.
Also, @mbolt35 is this the correct way of version-bumping the schema now? I haven't had to do it since we split the Asset and Allocation versions.
@@ -954,6 +960,33 @@ func applyGPUsAllocated(podMap map[podKey]*Pod, resGPUsRequested []*prom.QueryRe

hrs := pod.Allocations[container].Minutes() / 60.0
pod.Allocations[container].GPUHours = res.Values[0].Value * hrs
pod.Allocations[container].GPURequestAverage = res.Values[0].Value
Is this an actual average? Should we be querying this more like CPU/RAM request averages?
It is. Can you expand on this? To my knowledge it is being queried and handled like RAM requests.
This function does some funky stuff with queryFmtGPUsAllocated -- look at the first if statement in this function; I think you might run into trouble because requests is a separate array and can get overwritten by allocation.
However, I'm not super familiar with GPU. This might not be a problem at all, in which case 👍
Other resources (RAM, CPU, etc.) have separate functions for applying requests, usage, and averages. We don't need to adhere to that style if this is correct; it's just what tipped me off that this looked a little odd.
Ah, yes I see. I believe this is consistent with what we do throughout, which is override requests with limits (which should also be what is happening in any pod using a GPU regardless; you're just supposed to have limits).
There may be problems with this particular function that I am not aware of, but the value of "GPUs allocated per allocation" right now is back-calculated from GPUHours anyway, so really all this change does is make that a separate field in Allocation.
SGTM
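(For readers following this thread: a rough sketch of the relationship being discussed. The quoted lines and field names come from the diff above; gpusFromHours is a hypothetical helper written only for illustration, not code from this PR.)

```go
package example

// Previously, the per-allocation GPU count was back-calculated from GPUHours
// and the allocation's length in hours (hypothetical helper for illustration).
func gpusFromHours(gpuHours, allocationHours float64) float64 {
	if allocationHours == 0 {
		return 0
	}
	return gpuHours / allocationHours
}

// With this change, the same value is also stored directly on the Allocation
// at query time, alongside GPUHours:
//
//	hrs := pod.Allocations[container].Minutes() / 60.0
//	pod.Allocations[container].GPUHours = res.Values[0].Value * hrs
//	pod.Allocations[container].GPURequestAverage = res.Values[0].Value
```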
pod.AppendContainer(container)
}

pod.Allocations[container].GPUUsageAverage = res.Values[0].Value * 0.01
Why is this multiplied by 0.01? A comment would be helpful.
I'll add one. For posterity, it's because (for whatever reason) the "percentage" expressed by this metric is in whole numbers, rather than the decimals we're using for CPU/RAM, i.e. 100% is 100, not 1.
Aha, of course. Excellent.
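(For anyone skimming later, a sketch of the kind of clarifying comment discussed here; the surrounding variable names are taken from the diff, and this is not necessarily the exact comment that landed.)

```go
// DCGM_FI_DEV_GPU_UTIL reports utilization as a whole-number percentage
// (100% is 100), whereas CPU/RAM usage averages are decimals (100% is 1.0).
// Multiply by 0.01 to convert to the decimal scale used elsewhere.
pod.Allocations[container].GPUUsageAverage = res.Values[0].Value * 0.01
```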
What does this PR change?
Changes the Allocation struct to feature new fields GPURequestAverage and GPUUsageAverage, the latter of which is supported through a Prometheus query for the Nvidia dcgm-exporter metric DCGM_FI_DEV_GPU_UTIL. Also adds Allocation.GPUEfficiency(), which calculates the ratio of GPU usage to requests, as with CPU and RAM efficiencies.
Notes
GPU usage metrics are only available when a dcgm-exporter pod is running on the node. Otherwise, the value of GPUUsageAverage will be zero. This means that even if you are utilizing a GPU by allocating it, you may have zero usage. Setup is not entirely trivial and is detailed here.
Does this PR rely on any other PRs?
How does this PR impact users? (This is the kind of thing that goes in release notes!)
Adds support for integrating Nvidia DCGM GPU metrics into Kubecost and calculating Allocation GPU efficiency.
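As a rough illustration of the efficiency calculation described above (a sketch based on the field and method names in this PR's description; the actual implementation may handle edge cases, such as a zero request, differently):

```go
package example

// Allocation is shown with only the fields relevant to GPU efficiency.
type Allocation struct {
	GPURequestAverage float64 // average number of GPUs requested over the window
	GPUUsageAverage   float64 // average GPU utilization over the window, as a decimal (100% is 1.0)
}

// GPUEfficiency returns the ratio of GPU usage to GPU requests, mirroring the
// existing CPU and RAM efficiency calculations.
func (a *Allocation) GPUEfficiency() float64 {
	if a.GPURequestAverage == 0 {
		// With no request (or no NVIDIA metrics available) there is nothing
		// to measure efficiency against, so report zero.
		return 0.0
	}
	return a.GPUUsageAverage / a.GPURequestAverage
}
```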
Links to Issues or ZD tickets this PR addresses or fixes
How was this PR tested?
Updated allocation_test.go to feature checks for new Allocation fields, including efficiency. Additionally, created a GPU test load generator pod on a cluster and observed metrics in Prometheus and through the API.
Prometheus shows DCGM_FI_DEV_GPU_UTIL being autodetected and whitelisted:
Allocation for the load-generating pod shows consistent GPURequestAverage and GPUUsageAverage: