Add "resource_name" to scaled_up_gpu_nodes_total and scaled_down_gpu_nodes_total metrics #5518

kawych · 2023-02-17T15:31:26Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add a resource_name label to scaled_up_gpu_nodes_total and scaled_down_gpu_nodes_total to differentiate between GPU types that are represented by different custom resources. Credit to @hbostan for the implementation.

Does this PR introduce a user-facing change?

Add breakdown by custom resource name to GPU-related metrics

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

towca · 2023-02-20T15:09:11Z

cluster-autoscaler/metrics/metrics.go

@@ -246,7 +246,7 @@ var (
 			Namespace: caNamespace,
 			Name:      "scaled_up_gpu_nodes_total",
 			Help:      "Number of GPU nodes added by CA, by GPU name.",
-		}, []string{"gpu_name"},
+		}, []string{"resource_name", "gpu_name"},


Since this is specifically the resource name of a GPU, and will only be set on a GPU scale-up, I'd prefix "resource_name" with "gpu" as well (here and everywhere in this PR). Otherwise the label name is quite ambiguous: CPU and memory are resources too, and they trigger most scale-ups, but this label won't be set for them.

towca · 2023-02-20T15:20:14Z

cluster-autoscaler/metrics/metrics.go

@@ -270,7 +270,7 @@ var (
 			Namespace: caNamespace,
 			Name:      "scaled_down_gpu_nodes_total",
 			Help:      "Number of GPU nodes removed by CA, by reason and GPU name.",
-		}, []string{"reason", "gpu_name"},
+		}, []string{"reason", "resource_name", "gpu_name"},


Do you know if it's safe to add a new label in the middle of the existing ones (e.g. I could imagine that some metric collector could treat earlier "gpu_name" values as "resource_name" after new 3-value metrics are emitted)?

kawych: It's safe. The metric should be exposed in the Prometheus format with label names, so metric collectors have no trouble identifying the right label.
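To illustrate why label position should not matter, here is a minimal sketch of a name-keyed text format in the style of the Prometheus exposition format. It uses only the Go standard library rather than the real Prometheus client; `formatSample` and `parseLabels` are hypothetical helpers, the label values are illustrative, and the metric namespace prefix is omitted.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSample renders one sample as name{label="value",...} value.
// Simplified assumption: label values need no escaping.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

// parseLabels reads the label set of a sample line into a map keyed by
// label name. Simplified assumption: values contain no commas or braces.
func parseLabels(line string) map[string]string {
	out := map[string]string{}
	start := strings.Index(line, "{")
	end := strings.LastIndex(line, "}")
	if start < 0 || end <= start {
		return out
	}
	for _, pair := range strings.Split(line[start+1:end], ",") {
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			continue
		}
		out[parts[0]] = strings.Trim(parts[1], `"`)
	}
	return out
}

func main() {
	// Sample shaped like the metric before this PR: one label.
	before := formatSample("scaled_up_gpu_nodes_total",
		map[string]string{"gpu_name": "nvidia-tesla-k80"}, 1)
	// Sample shaped like the metric after this PR: an extra label.
	after := formatSample("scaled_up_gpu_nodes_total",
		map[string]string{"resource_name": "nvidia.com/gpu", "gpu_name": "nvidia-tesla-k80"}, 1)

	// A consumer looks labels up by name, so the extra label cannot be
	// mistaken for gpu_name regardless of where it was declared.
	fmt.Println(parseLabels(before)["gpu_name"])
	fmt.Println(parseLabels(after)["gpu_name"])
}
```

Both lookups return the same gpu_name value, because each value travels with its label name rather than with a position.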

towca · 2023-02-20T15:26:09Z

cluster-autoscaler/utils/gpu/gpu.go

-
-		// no signs of GPU
-		return MetricsNoGPU
+func GetGpuTypeForMetrics(gpuConfig *cloudprovider.GpuConfig, availableGPUTypes map[string]struct{}, node *apiv1.Node, nodeGroup cloudprovider.NodeGroup) (string, string) {


The name of the function no longer reflects the returned values, and it's hard to figure out what's returned just from reading the signature. Maybe we could make the name more generic (GetGpuInfoForMetrics?), and name the return values instead?

towca · 2023-02-20T15:36:10Z

cluster-autoscaler/utils/gpu/gpu.go

-		return MetricsNoGPU
+func GetGpuTypeForMetrics(gpuConfig *cloudprovider.GpuConfig, availableGPUTypes map[string]struct{}, node *apiv1.Node, nodeGroup cloudprovider.NodeGroup) (string, string) {
+	// There is no sign of GPU
+	if gpuConfig == nil {


nit: It took me a bit to figure out that the PR doesn't change behavior for the existing GPU logic. There's still one difference: this function only looks at capacity, while GetNodeGpuConfig utilizes NodeHasGpu, which looks at allocatable. Capacity and allocatable should be in sync for GPUs, though, and allocatable is arguably more correct, so it looks fine to me. This function could really use a unit test, if you're up for it.

kawych: This is a little bit tricky. I started writing a test, but found out that the previous PR #5459 introduced a "hidden" import cycle (so far the cycle only occurs if you use the cloudprovider test package). If you're OK with it, I'd prefer to follow up in the next PR.

towca: The cycle will probably have to be solved sooner or later, but a follow-up SGTM.
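The capacity-versus-allocatable difference mentioned in this thread can be sketched as follows. `nodeStatus` and both helpers are hypothetical stand-ins using plain maps, not the Kubernetes API types, and the node in `main` is contrived so that the two checks disagree.

```go
package main

import "fmt"

// nodeStatus is a hypothetical stand-in for the resource maps in a node's status.
type nodeStatus struct {
	Capacity    map[string]int64
	Allocatable map[string]int64
}

// hasGPUInCapacity reports whether the resource appears in capacity.
func hasGPUInCapacity(s nodeStatus, resource string) bool {
	return s.Capacity[resource] > 0
}

// hasGPUInAllocatable reports whether the resource appears in allocatable.
func hasGPUInAllocatable(s nodeStatus, resource string) bool {
	return s.Allocatable[resource] > 0
}

func main() {
	const gpu = "nvidia.com/gpu" // illustrative resource name

	// Contrived node where the two maps are out of sync.
	node := nodeStatus{
		Capacity:    map[string]int64{gpu: 1},
		Allocatable: map[string]int64{},
	}
	fmt.Println(hasGPUInCapacity(node, gpu), hasGPUInAllocatable(node, gpu))
}
```

When the two maps agree, as the review expects for GPUs, both checks give the same answer and the choice between them has no effect on the metric labels.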

towca

/lgtm
/approve
/hold

I'd still like to see more "gpu" prefixes in some places since the names seem ambiguous without them. Feel free to unhold if you disagree.

cluster-autoscaler/core/scale_up.go

cluster-autoscaler/metrics/metrics.go

towca · 2023-02-21T17:07:38Z

cluster-autoscaler/utils/gpu/gpu.go

-		return MetricsNoGPU
+func GetGpuTypeForMetrics(gpuConfig *cloudprovider.GpuConfig, availableGPUTypes map[string]struct{}, node *apiv1.Node, nodeGroup cloudprovider.NodeGroup) (string, string) {
+	// There is no sign of GPU
+	if gpuConfig == nil {


The cycle will probably have to be solved sooner or later, but a follow-up SGTM.

k8s-ci-robot · 2023-02-21T17:11:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [towca]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

towca · 2023-02-22T13:11:36Z

/unhold
/lgtm

k8s-ci-robot requested review from BigDarkClown and feiskyer February 17, 2023 15:32

kawych marked this pull request as ready for review February 20, 2023 10:03

k8s-ci-robot removed the do-not-merge/work-in-progress label Feb 20, 2023

towca requested changes Feb 20, 2023

kawych force-pushed the metrics branch from 778e551 to 8f5f97f Compare February 21, 2023 10:19

towca reviewed Feb 21, 2023

k8s-ci-robot added the do-not-merge/hold label Feb 21, 2023

k8s-ci-robot assigned towca Feb 21, 2023

k8s-ci-robot added the lgtm and approved labels Feb 21, 2023

towca approved these changes Feb 21, 2023

Add "resource_name" to scaled_up_gpu_nodes_total and scaled_down_gpu_nodes_total metrics

2ea2fb6

* Added the new resource_name field to scaled_up/down_gpu_nodes_total, representing the resource name for the gpu.
* Changed metrics registrations to use GpuConfig

kawych force-pushed the metrics branch from 8f5f97f to 2ea2fb6 Compare February 22, 2023 10:10

k8s-ci-robot removed the lgtm label Feb 22, 2023

k8s-ci-robot added the lgtm label and removed the do-not-merge/hold label Feb 22, 2023

k8s-ci-robot merged commit c611acd into kubernetes:master Feb 22, 2023
