New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPA: expose the metrics "reconciliations_total" and "reconciliation_duration_seconds" from HPA controller #116010
Conversation
Subsystem: HPAControllerSubsystem, | ||
Name: "reconciliation_duration_seconds", | ||
Help: "The time(seconds) that the HPA controller takes to reconcile once. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.", | ||
Buckets: metrics.ExponentialBuckets(0.001, 2, 15), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's suitable or not.
@pbarker I'm wondering if the GKE team has some data on it. How do you think the duration's bucket should be defined?
For sig-instrumentation people, it's first time for the HPA controller to expose the metrics from it and I may miss something fundamental. Please take a look. 🙏 |
/assign |
/test pull-kubernetes-e2e-autoscaling-hpa-cpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I like the approach.
I've added a couple of comments.
@@ -0,0 +1,17 @@ | |||
package metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the "metrics" term is also part of our domain I would avoid using metrics here. How about monitoring
ormonitor
? This way we avoid ambiguous metrics
imported name in horizontal.go
.
If you agree, then:
- The path could change into
pkg/controller/podautoscaler/monitoring/monitor.go
MetricsRecorder
could change intoReconciliationMonitor
(alternatively justMonitor
)ObserveReconciliationResult
could change intoRecordResult
(ifMonitor
above thenRecordReconciliationResult
)
"k8s.io/component-base/metrics/legacyregistry" | ||
) | ||
|
||
type ReconciliationAction string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all recordings come from MetricsRecorder
I would:
- move all public names like
ReconciliationAction
to recorder file. - lowercase all private names like
HPAControllerSubsystem
and prometheus metric vars
@@ -315,7 +327,8 @@ func (a *HorizontalController) computeReplicasForMetrics(ctx context.Context, hp | |||
return 0, "", statuses, time.Time{}, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError) | |||
} | |||
setCondition(hpa, autoscalingv2.ScalingActive, v1.ConditionTrue, "ValidMetricFound", "the HPA was able to successfully calculate a replica count from %s", metric) | |||
return replicas, metric, statuses, timestamp, nil | |||
|
|||
return replicas, metric, statuses, timestamp, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to silence errors on scale-ups? Good that we're fixing this 👍
// errSpec is used to determine if the error comes from the spec of HPA object in reconcileAutoscaler. | ||
// All such errors should have this error as a root error so that the upstream function can understand they're spec errors. | ||
// e.g., fmt.Errorf("invalid spec%w", errSpec) | ||
errSpec error = errors.New("") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should rather be imported somewhere up top, since we're referring to this variable in line 307, right?
reconciliationError = metrics.ReconciliationErrorSpec | ||
} | ||
|
||
a.metricsRecorder.ObserveReconciliationResult(reconciliationAction, reconciliationError, time.Since(start).Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ObserveReconciliationResult
could accept time.Duration
as the last argument and transform into seconds on its own. The fact that the prometheus histogram operates on seconds seems like an implementation detail.
@@ -315,7 +327,8 @@ func (a *HorizontalController) computeReplicasForMetrics(ctx context.Context, hp | |||
return 0, "", statuses, time.Time{}, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError) | |||
} | |||
setCondition(hpa, autoscalingv2.ScalingActive, v1.ConditionTrue, "ValidMetricFound", "the HPA was able to successfully calculate a replica count from %s", metric) | |||
return replicas, metric, statuses, timestamp, nil | |||
|
|||
return replicas, metric, statuses, timestamp, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "first error" is a bit of a lie, because there could be spec error first which is later overridden by internal error.
Now that I think of it perhaps overriding spec with internal is bringing more confusion than value. Perhaps it's good enough and simpler to report in metrics always the first error, regardless if it's spec or internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps overriding spec with internal is bringing more confusion than value.
Yep, agree. will fix as your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the current impl, "unknown metric source type" is only the spec error we can have here. And actually such a invalid metric source will be validated in the validation of api-server, meaning no spec error can reach here.
https://github.com/kubernetes/kubernetes/blob/04d52a56fd40a72ad9a01bc5b692600fb5ad0097/pkg/apis/autoscaling/validation/validation.go#L327
So, we can ignore spec error case here either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually such a invalid metric source will be validated in the validation of api-server
We may be able to say the same thing for all errSpec
tho :)
&metrics.CounterOpts{ | ||
Subsystem: HPAControllerSubsystem, | ||
Name: "reconciliations_total", | ||
Help: "Number of reconciliation of HPA controller. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Help: "Number of reconciliation of HPA controller. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.", | |
Help: "Number of reconciliations of HPA controller. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.", |
// ReconciliationErrorSpec means the HPA reconciliation has an error from the internal accedent. | ||
ReconciliationErrorSpec ReconciliationError = "spec" | ||
// ReconciliationErrorInternal means the HPA reconciliation has an error due to the spec of HPA object. | ||
ReconciliationErrorInternal ReconciliationError = "internal" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// ReconciliationErrorSpec means the HPA reconciliation has an error from the internal accedent. | |
ReconciliationErrorSpec ReconciliationError = "spec" | |
// ReconciliationErrorInternal means the HPA reconciliation has an error due to the spec of HPA object. | |
ReconciliationErrorInternal ReconciliationError = "internal" | |
// ReconciliationErrorSpec means the HPA reconciliation has an error due to an invalid spec of HPA object. | |
ReconciliationErrorSpec ReconciliationError = "spec" | |
// ReconciliationErrorInternal means the HPA reconciliation has an error from an internal computation or communication with other component. | |
ReconciliationErrorInternal ReconciliationError = "internal" |
func (a *HorizontalController) reconcileAutoscaler(ctx context.Context, hpaShared *autoscalingv2.HorizontalPodAutoscaler, key string) error { | ||
var ( | ||
// errSpec is used to determine if the error comes from the spec of HPA object in reconcileAutoscaler. | ||
// All such errors should have this error as a root error so that the upstream function can understand they're spec errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// All such errors should have this error as a root error so that the upstream function can understand they're spec errors. | |
// All such errors should have this error as a root error so that the upstream function can distinguish spec errors from internal errors. |
@pbetkier Thanks, I fixed some parts based on your suggestions 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
(from instrumentation)
LGTM label has been added. Git tree hash: 7c98af0b144443a6671e76721a039b211a3070b8
|
/retest |
/label tide/merge-method-squash |
8477d49
to
ef4e92a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
LGTM label has been added. Git tree hash: 78752206b1f80beb947ec45f1547a4d08adc8963
|
Actually, please squash the commits. 10 is too much. |
…uration_seconds" from HPA controller
ef4e92a
to
b110451
Compare
@mwielgus squashed. Please review it again. 🙏 |
@sanposhiho: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
LGTM label has been added. Git tree hash: 78752206b1f80beb947ec45f1547a4d08adc8963
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: logicalhan, mwielgus, pbetkier, sanposhiho The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…uration_seconds" from HPA controller (kubernetes#116010)
What type of PR is this?
/kind feature
/sig autoscaling
/sig instrumentation
What this PR does / why we need it:
implement/expose the metrics "reconciliations_total" and "reconciliation_duration_seconds" from HPA controller.
reconciliations_total
: Number of reconciliation of HPA controller.reconciliation_duration_seconds
: The time(seconds) that the HPA controller takes to reconcile once.These are must-to-have to meat the requirements from the production readiness review of the container resource metrics.
Which issue(s) this PR fixes:
Related #115639
Related kubernetes/enhancements#1610
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: