HPA: expose the metrics "reconciliations_total" and "reconciliation_duration_seconds" from HPA controller #116010

sanposhiho · 2023-02-23T13:33:28Z

What type of PR is this?

/kind feature
/sig autoscaling
/sig instrumentation

What this PR does / why we need it:

Implement and expose the metrics "reconciliations_total" and "reconciliation_duration_seconds" from the HPA controller.

reconciliations_total: Number of reconciliations performed by the HPA controller.
reconciliation_duration_seconds: The time (in seconds) that the HPA controller takes to perform one reconciliation.

These are must-haves to meet the requirements of the production readiness review for the container resource metrics.

Which issue(s) this PR fixes:

Related #115639
Related kubernetes/enhancements#1610

Special notes for your reviewer:

Does this PR introduce a user-facing change?

The HPA controller now exposes the following metrics from the kube-controller-manager.
- `reconciliations_total`: Number of reconciliations performed by the HPA controller.
- `reconciliation_duration_seconds`: The time (in seconds) that the HPA controller takes to perform one reconciliation.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

sanposhiho · 2023-02-23T13:38:19Z

pkg/controller/podautoscaler/controller_metrics/metrics.go

+			Subsystem:      HPAControllerSubsystem,
+			Name:           "reconciliation_duration_seconds",
+			Help:           "The time(seconds) that the HPA controller takes to reconcile once. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.",
+			Buckets:        metrics.ExponentialBuckets(0.001, 2, 15),


I'm not sure whether these buckets are suitable.
@pbarker I'm wondering if the GKE team has some data on this. How do you think the duration buckets should be defined?
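
To make the bucket question concrete, here is a minimal sketch of what `ExponentialBuckets(0.001, 2, 15)` yields, assuming the component-base helper follows the usual Prometheus semantics (`count` upper bounds starting at `start`, each multiplied by `factor`). The function `exponentialBuckets` below is a stand-in written for this illustration, not the library function itself.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds: start, start*factor, start*factor^2, ...
// It is a hypothetical stand-in for illustration, not the component-base helper.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	bound := start
	for i := 0; i < count; i++ {
		buckets = append(buckets, bound)
		bound *= factor
	}
	return buckets
}

func main() {
	// Print every upper bound so the covered range is easy to eyeball.
	for i, b := range exponentialBuckets(0.001, 2, 15) {
		fmt.Printf("bucket %2d: le=%gs\n", i, b)
	}
}
```

Under that assumption the bounds run from 1 ms up to roughly 16.4 s, so any reconciliation slower than that would only be counted in the implicit +Inf bucket.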

sanposhiho · 2023-02-23T13:41:14Z

/cc @pbetkier @mwielgus @gjtempleton

sanposhiho · 2023-02-23T13:45:20Z

For sig-instrumentation people: this is the first time the HPA controller exposes metrics of its own, so I may be missing something fundamental. Please take a look. 🙏

logicalhan · 2023-02-23T17:34:31Z

/assign
/triage accepted

pbetkier · 2023-02-24T08:23:23Z

/test pull-kubernetes-e2e-autoscaling-hpa-cpu
/test pull-kubernetes-e2e-autoscaling-hpa-cm

pbetkier

Thanks! I like the approach.

I've added a couple of comments.

pbetkier · 2023-02-24T09:19:25Z

pkg/controller/podautoscaler/controller_metrics/metric_recorder.go

@@ -0,0 +1,17 @@
+package metrics


Since the term "metrics" is also part of our domain, I would avoid using metrics as the package name here. How about monitoring or monitor? This way we avoid an ambiguous metrics import name in horizontal.go.

If you agree, then:

- The path could change into pkg/controller/podautoscaler/monitoring/monitor.go

- MetricsRecorder could change into ReconciliationMonitor (alternatively just Monitor)

- ObserveReconciliationResult could change into RecordResult (if Monitor above then RecordReconciliationResult)

pbetkier · 2023-02-24T09:21:36Z

pkg/controller/podautoscaler/controller_metrics/metrics.go

+	"k8s.io/component-base/metrics/legacyregistry"
+)
+
+type ReconciliationAction string


Since all recordings come from MetricsRecorder I would:

- move all public names like ReconciliationAction to the recorder file.

- lowercase all private names like HPAControllerSubsystem and the prometheus metric vars

pbetkier · 2023-02-24T09:28:10Z

pkg/controller/podautoscaler/horizontal.go

@@ -315,7 +327,8 @@ func (a *HorizontalController) computeReplicasForMetrics(ctx context.Context, hp
 		return 0, "", statuses, time.Time{}, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError)
 	}
 	setCondition(hpa, autoscalingv2.ScalingActive, v1.ConditionTrue, "ValidMetricFound", "the HPA was able to successfully calculate a replica count from %s", metric)
-	return replicas, metric, statuses, timestamp, nil
+
+	return replicas, metric, statuses, timestamp, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError)


We used to silence errors on scale-ups? Good that we're fixing this 👍

pbetkier · 2023-02-24T09:31:37Z

pkg/controller/podautoscaler/horizontal.go

+	// errSpec is used to determine if the error comes from the spec of HPA object in reconcileAutoscaler.
+	// All such errors should have this error as a root error so that the upstream function can understand they're spec errors.
+	// e.g., fmt.Errorf("invalid spec%w", errSpec)
+	errSpec error = errors.New("")


This should rather be imported somewhere up top, since we're referring to this variable in line 307, right?

pbetkier · 2023-02-24T09:34:58Z

pkg/controller/podautoscaler/horizontal.go

+			reconciliationError = metrics.ReconciliationErrorSpec
+		}
+
+		a.metricsRecorder.ObserveReconciliationResult(reconciliationAction, reconciliationError, time.Since(start).Seconds())


nit: ObserveReconciliationResult could accept time.Duration as the last argument and transform into seconds on its own. The fact that the prometheus histogram operates on seconds seems like an implementation detail.

pbetkier · 2023-02-24T09:39:02Z

pkg/controller/podautoscaler/horizontal.go

@@ -315,7 +327,8 @@ func (a *HorizontalController) computeReplicasForMetrics(ctx context.Context, hp
 		return 0, "", statuses, time.Time{}, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError)
 	}
 	setCondition(hpa, autoscalingv2.ScalingActive, v1.ConditionTrue, "ValidMetricFound", "the HPA was able to successfully calculate a replica count from %s", metric)
-	return replicas, metric, statuses, timestamp, nil
+
+	return replicas, metric, statuses, timestamp, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError)


The "first error" is a bit of a lie, because there could be a spec error first which is later overridden by an internal error.

Now that I think of it, perhaps overriding spec with internal is bringing more confusion than value. Perhaps it's good enough and simpler to always report the first error in the metrics, regardless of whether it's spec or internal.

> perhaps overriding spec with internal is bringing more confusion than value.

Yep, agreed. Will fix as you suggested.

In the current implementation, "unknown metric source type" is the only spec error we can have here. And actually such an invalid metric source will be validated in the validation of api-server, meaning no spec error can reach this point.
https://github.com/kubernetes/kubernetes/blob/04d52a56fd40a72ad9a01bc5b692600fb5ad0097/pkg/apis/autoscaling/validation/validation.go#L327
So we can ignore the spec error case here either way.

> actually such an invalid metric source will be validated in the validation of api-server

We may be able to say the same thing for all errSpec, though :)

pbetkier · 2023-02-24T09:40:58Z

pkg/controller/podautoscaler/controller_metrics/metrics.go

+		&metrics.CounterOpts{
+			Subsystem:      HPAControllerSubsystem,
+			Name:           "reconciliations_total",
+			Help:           "Number of reconciliation of HPA controller. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.",


Suggested change

-			Help: "Number of reconciliation of HPA controller. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.",
+			Help: "Number of reconciliations of HPA controller. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.",

pbetkier · 2023-02-24T09:43:00Z

pkg/controller/podautoscaler/controller_metrics/metrics.go

+	// ReconciliationErrorSpec means the HPA reconciliation has an error from the internal accedent.
+	ReconciliationErrorSpec ReconciliationError = "spec"
+	// ReconciliationErrorInternal means the HPA reconciliation has an error due to the spec of HPA object.
+	ReconciliationErrorInternal ReconciliationError = "internal"


Suggested change

-	// ReconciliationErrorSpec means the HPA reconciliation has an error from the internal accedent.
-	ReconciliationErrorSpec ReconciliationError = "spec"
-	// ReconciliationErrorInternal means the HPA reconciliation has an error due to the spec of HPA object.
-	ReconciliationErrorInternal ReconciliationError = "internal"
+	// ReconciliationErrorSpec means the HPA reconciliation has an error due to an invalid spec of HPA object.
+	ReconciliationErrorSpec ReconciliationError = "spec"
+	// ReconciliationErrorInternal means the HPA reconciliation has an error from an internal computation or communication with other component.
+	ReconciliationErrorInternal ReconciliationError = "internal"
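
As a rough illustration of how these typed label values could be used, here is a minimal sketch of a counter keyed by the (action, error) pair. Only `ReconciliationAction`, `ReconciliationError`, `ReconciliationErrorSpec`, `ReconciliationErrorInternal`, and the label strings appear in this PR; the other constant names, the `labelPair` type, and the map-based `reconciliationCounter` are assumptions standing in for a real labeled counter.

```go
package main

import "fmt"

type ReconciliationAction string
type ReconciliationError string

const (
	// The action constant names are hypothetical; the label strings come from the Help text.
	ActionScaleUp   ReconciliationAction = "scale_up"
	ActionScaleDown ReconciliationAction = "scale_down"
	ActionNone      ReconciliationAction = "none"

	ReconciliationErrorSpec     ReconciliationError = "spec"
	ReconciliationErrorInternal ReconciliationError = "internal"
	// ReconciliationErrorNone is an assumed name for the 'none' label value.
	ReconciliationErrorNone ReconciliationError = "none"
)

// labelPair identifies one time series of the counter.
type labelPair struct {
	action ReconciliationAction
	err    ReconciliationError
}

// reconciliationCounter is an in-memory stand-in for a counter with two labels.
type reconciliationCounter map[labelPair]int

// inc adds one to the series identified by the given label values.
func (c reconciliationCounter) inc(a ReconciliationAction, e ReconciliationError) {
	c[labelPair{a, e}]++
}

func main() {
	counter := reconciliationCounter{}
	counter.inc(ActionScaleUp, ReconciliationErrorNone)
	counter.inc(ActionNone, ReconciliationErrorInternal)
	counter.inc(ActionScaleUp, ReconciliationErrorNone)

	for labels, n := range counter {
		fmt.Printf("action=%s error=%s count=%d\n", labels.action, labels.err, n)
	}
}
```

Using distinct string types for the two labels means the compiler rejects an action value passed where an error value is expected.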

pbetkier · 2023-02-24T09:43:49Z

pkg/controller/podautoscaler/horizontal.go

-func (a *HorizontalController) reconcileAutoscaler(ctx context.Context, hpaShared *autoscalingv2.HorizontalPodAutoscaler, key string) error {
+var (
+	// errSpec is used to determine if the error comes from the spec of HPA object in reconcileAutoscaler.
+	// All such errors should have this error as a root error so that the upstream function can understand they're spec errors.


Suggested change

-	// All such errors should have this error as a root error so that the upstream function can understand they're spec errors.
+	// All such errors should have this error as a root error so that the upstream function can distinguish spec errors from internal errors.

sanposhiho · 2023-02-24T14:20:15Z

@pbetkier Thanks, I fixed some parts based on your suggestions 🙏

logicalhan

/lgtm
/approve
(from instrumentation)

k8s-ci-robot · 2023-02-24T17:04:37Z

LGTM label has been added.

Git tree hash: 7c98af0b144443a6671e76721a039b211a3070b8

sanposhiho · 2023-02-25T01:54:07Z

/retest

sanposhiho · 2023-03-14T00:34:36Z

/label tide/merge-method-squash

mwielgus

/lgtm
/approve

k8s-ci-robot · 2023-03-14T09:42:58Z

LGTM label has been added.

Git tree hash: 78752206b1f80beb947ec45f1547a4d08adc8963

mwielgus · 2023-03-14T10:22:39Z

Actually, please squash the commits. 10 is too many.

sanposhiho · 2023-03-14T14:15:26Z

@mwielgus squashed. Please review it again. 🙏

k8s-ci-robot · 2023-03-14T14:38:09Z

@sanposhiho: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-e2e-autoscaling-hpa-cm | ef4e92ab2e87e0e57e99e4636a2cdd56a8729986 | link | false | `/test pull-kubernetes-e2e-autoscaling-hpa-cm` |
| pull-kubernetes-e2e-autoscaling-hpa-cpu | ef4e92ab2e87e0e57e99e4636a2cdd56a8729986 | link | false | `/test pull-kubernetes-e2e-autoscaling-hpa-cpu` |

sanposhiho · 2023-03-14T14:38:27Z

/retest

mwielgus

/lgtm
/approve

k8s-ci-robot · 2023-03-14T14:43:51Z

LGTM label has been added.

Git tree hash: 78752206b1f80beb947ec45f1547a4d08adc8963

k8s-ci-robot · 2023-03-14T14:44:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: logicalhan, mwielgus, pbetkier, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/podautoscaler/OWNERS~~ [mwielgus]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from josephburnett and mwielgus February 23, 2023 13:34

sanposhiho mentioned this pull request Feb 23, 2023

expose some metrics from HPA controller #115639

Open

k8s-ci-robot requested review from gjtempleton and pbetkier February 23, 2023 13:41

k8s-ci-robot assigned logicalhan Feb 23, 2023

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 23, 2023

pbetkier reviewed Feb 24, 2023

logicalhan reviewed Feb 24, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 24, 2023

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 25, 2023

k8s-ci-robot requested a review from logicalhan February 25, 2023 10:38

k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Mar 14, 2023

sanposhiho force-pushed the hpa-metric-server branch from 8477d49 to ef4e92a Compare March 14, 2023 01:40

mwielgus approved these changes Mar 14, 2023

k8s-ci-robot assigned mwielgus Mar 14, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 14, 2023

mwielgus removed approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Mar 14, 2023

HPA: expose the metrics "reconciliations_total" and "reconciliation_d…

b110451

…uration_seconds" from HPA controller

sanposhiho force-pushed the hpa-metric-server branch from ef4e92a to b110451 Compare March 14, 2023 14:13

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 14, 2023

mwielgus approved these changes Mar 14, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2023

k8s-ci-robot merged commit b49b34c into kubernetes:master Mar 14, 2023

k8s-ci-robot added this to the v1.27 milestone Mar 14, 2023

sanposhiho mentioned this pull request Mar 15, 2023

feature(hpa): beta graduation for the container resource metrics #116046

Merged

rayowang pushed a commit to rayowang/kubernetes that referenced this pull request Feb 9, 2024

HPA: expose the metrics "reconciliations_total" and "reconciliation_d…

08bb499

…uration_seconds" from HPA controller (kubernetes#116010)
