
Support Allocation GPU utilization/efficiency through integration with Nvidia GPU Operator/DCGM. #944

Merged · 7 commits · Oct 8, 2021

Conversation

kaelanspatel (Contributor)

What does this PR change?

Changes the Allocation struct to feature two new fields, GPURequestAverage and GPUUsageAverage, the latter of which is supported through a Prometheus query for the Nvidia dcgm-exporter metric DCGM_FI_DEV_GPU_UTIL. Also adds Allocation.GPUEfficiency(), which calculates the ratio of GPU usage to requests, as with the CPU and RAM efficiencies.
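
For reference, here is a minimal sketch of the shape of these changes. Field and method names follow the description above; the surrounding Allocation fields and the exact edge-case handling in the real code may differ.

```go
// Sketch only: not the exact implementation in this PR.
type Allocation struct {
	// ... existing fields omitted ...
	GPUHours          float64
	GPURequestAverage float64 // average number of GPUs requested over the window
	GPUUsageAverage   float64 // average GPU usage, derived from DCGM_FI_DEV_GPU_UTIL
}

// GPUEfficiency returns the ratio of GPU usage to GPU requests, mirroring the
// existing CPU and RAM efficiency calculations. With no requests (or no DCGM
// metrics feeding usage), it reports 0.
func (a *Allocation) GPUEfficiency() float64 {
	if a == nil || a.GPURequestAverage <= 0.0 {
		return 0.0
	}
	return a.GPUUsageAverage / a.GPURequestAverage
}
```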

Notes

  • These features will only work if the Nvidia dcgm-exporter pod is running on the node. Otherwise, the value of GPUUsageAverage will be zero, which means that even a workload with a GPU allocated may report zero usage. Setup is not entirely trivial and is detailed here. (A sketch of the general query shape follows below.)
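
For illustration only, the underlying usage query is of this general shape: an average of DCGM_FI_DEV_GPU_UTIL over the window, grouped per container. The exact format string and the label names are assumptions here and depend on how dcgm-exporter is deployed.

```go
// Hypothetical query format, not necessarily the one added in this PR.
// DCGM_FI_DEV_GPU_UTIL reports whole-number percentages per GPU; label
// names vary with dcgm-exporter configuration.
const queryFmtGPUUsageAvg = `avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL{container!=""}[%s])) by (container, pod, namespace)`
```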

Does this PR rely on any other PRs?

How does this PR impact users? (This is the kind of thing that goes in release notes!)

Adds support for integrating Nvidia DCGM GPU metrics into Kubecost and calculating Allocation GPU efficiency.

Links to Issues or ZD tickets this PR addresses or fixes

How was this PR tested?

Updated allocation_test.go to include checks for the new Allocation fields, including efficiency. Additionally, created a GPU test load generator pod on a cluster and observed the metrics both in Prometheus and through the API.

Prometheus shows DCGM_FI_DEV_GPU_UTIL being autodetected and whitelisted:
[screenshot]

Allocation for load-generating pod shows consistent GPURequestAverage and GPUUsageAverage:
[screenshot]

@kaelanspatel added the "schema change" label (Change to ETL schema(s), requiring a rebuild) on Sep 23, 2021

@michaelmdresser (Contributor) left a comment:

A couple of comments. I'd also like to see a documentation update (allocation.md, probably) that highlights the new fields and explains why usage (and therefore efficiency) will be 0 when the NVIDIA metrics are not available.

Also, @mbolt35 is this the correct way of version-bumping the schema now? I haven't had to do it since we split the Asset and Allocation versions.

@@ -954,6 +960,33 @@ func applyGPUsAllocated(podMap map[podKey]*Pod, resGPUsRequested []*prom.QueryRe

hrs := pod.Allocations[container].Minutes() / 60.0
pod.Allocations[container].GPUHours = res.Values[0].Value * hrs
pod.Allocations[container].GPURequestAverage = res.Values[0].Value

@michaelmdresser (Contributor):

Is this an actual average? Should we be querying this more like CPU/RAM request averages?

@kaelanspatel (Contributor, Author):

It is. Can you expand on this? To my knowledge it is being queried and handled like RAM requests.

@michaelmdresser (Contributor):

This function does some funky stuff with queryFmtGPUsAllocated -- look at the first if statement in this function; I think you might run into trouble because requests is a separate array and can get overwritten by allocation.

However, I'm not super familiar with GPU. This might not be a problem at all, in which case 👍

Other resources (RAM, CPU, etc.) have separate functions for applying requests, usage, and averages. We don't need to adhere to that style if this is correct; it's just what tipped me off that this looked a little odd.

@kaelanspatel (Contributor, Author):

Ah, yes, I see. I believe this is consistent with what we do throughout, which is to override requests with limits (which should also be what is happening in any pod using a GPU regardless; you're just supposed to set limits).

There may be problems with this particular function that I am not aware of, but the value of "GPUs allocated per allocation" right now is back-calculated from GPUHours anyway, so really all this change does is make that a separate field in Allocation.
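
To illustrate the back-calculation point with a standalone sketch (hypothetical values, not the PR's code):

```go
package main

import "fmt"

func main() {
	// Hypothetical values: 2 GPUs requested over a 90-minute allocation.
	gpusRequested := 2.0
	minutes := 90.0

	hrs := minutes / 60.0
	gpuHours := gpusRequested * hrs  // what applyGPUsAllocated already computes
	backCalculated := gpuHours / hrs // recovers the per-window GPU count

	// Storing GPURequestAverage directly just makes this value an explicit field.
	fmt.Printf("requested=%.2f GPUHours=%.2f back-calculated=%.2f\n",
		gpusRequested, gpuHours, backCalculated)
}
```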

@michaelmdresser (Contributor):

SGTM

pod.AppendContainer(container)
}

pod.Allocations[container].GPUUsageAverage = res.Values[0].Value * 0.01

@michaelmdresser (Contributor):

Why is this multiplied by 0.01? A comment would be helpful.

@kaelanspatel (Contributor, Author):

I'll add one. For posterity, it's because (for whatever reason) the "percentage" expressed by this metric is in whole numbers rather than the decimal fractions we use for CPU/RAM, i.e., 100% is reported as 100, not 1.0.
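
As a sketch of the conversion being discussed (the helper name here is hypothetical):

```go
// toFraction converts DCGM_FI_DEV_GPU_UTIL's whole-number percentage
// (e.g. 75 for 75%) into the 0-1.0 fraction convention used for the CPU/RAM
// usage averages, which is why the raw value is multiplied by 0.01.
func toFraction(dcgmUtilPercent float64) float64 {
	return dcgmUtilPercent * 0.01 // 75 -> 0.75
}
```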

@michaelmdresser (Contributor):

Aha, of course. Excellent.
