Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPA: expose the metrics "reconciliations_total" and "reconciliation_duration_seconds" from HPA controller #116010

Merged
merged 1 commit into from Mar 14, 2023

Conversation

sanposhiho
Copy link
Member

@sanposhiho sanposhiho commented Feb 23, 2023

What type of PR is this?

/kind feature
/sig autoscaling
/sig instrumentation

What this PR does / why we need it:

implement/expose the metrics "reconciliations_total" and "reconciliation_duration_seconds" from HPA controller.

  • reconciliations_total: Number of reconciliation of HPA controller.
  • reconciliation_duration_seconds: The time(seconds) that the HPA controller takes to reconcile once.

These are must-to-have to meat the requirements from the production readiness review of the container resource metrics.

Which issue(s) this PR fixes:

Related #115639
Related kubernetes/enhancements#1610

Special notes for your reviewer:

Does this PR introduce a user-facing change?

HPA controller starts to expose metrics from the kube-controller-manager.
- `reconciliations_total`: Number of reconciliation of HPA controller. 
- `reconciliation_duration_seconds`: The time(seconds) that the HPA controller takes to reconcile once.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Feb 23, 2023
Subsystem: HPAControllerSubsystem,
Name: "reconciliation_duration_seconds",
Help: "The time(seconds) that the HPA controller takes to reconcile once. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.",
Buckets: metrics.ExponentialBuckets(0.001, 2, 15),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's suitable or not.
@pbarker I'm wondering if the GKE team has some data on it. How do you think the duration's bucket should be defined?

@sanposhiho
Copy link
Member Author

/cc @pbetkier @mwielgus @gjtempleton

@sanposhiho
Copy link
Member Author

For sig-instrumentation people, it's first time for the HPA controller to expose the metrics from it and I may miss something fundamental. Please take a look. 🙏

@logicalhan
Copy link
Member

/assign
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 23, 2023
@pbetkier
Copy link
Contributor

/test pull-kubernetes-e2e-autoscaling-hpa-cpu
/test pull-kubernetes-e2e-autoscaling-hpa-cm

Copy link
Contributor

@pbetkier pbetkier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I like the approach.

I've added a couple of comments.

@@ -0,0 +1,17 @@
package metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the "metrics" term is also part of our domain I would avoid using metrics here. How about monitoring ormonitor? This way we avoid ambiguous metrics imported name in horizontal.go.

If you agree, then:

  • The path could change into pkg/controller/podautoscaler/monitoring/monitor.go
  • MetricsRecorder could change into ReconciliationMonitor (alternatively just Monitor)
  • ObserveReconciliationResult could change into RecordResult (if Monitor above then RecordReconciliationResult)

"k8s.io/component-base/metrics/legacyregistry"
)

type ReconciliationAction string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all recordings come from MetricsRecorder I would:

  • move all public names like ReconciliationAction to recorder file.
  • lowercase all private names like HPAControllerSubsystem and prometheus metric vars

@@ -315,7 +327,8 @@ func (a *HorizontalController) computeReplicasForMetrics(ctx context.Context, hp
return 0, "", statuses, time.Time{}, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError)
}
setCondition(hpa, autoscalingv2.ScalingActive, v1.ConditionTrue, "ValidMetricFound", "the HPA was able to successfully calculate a replica count from %s", metric)
return replicas, metric, statuses, timestamp, nil

return replicas, metric, statuses, timestamp, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to silence errors on scale-ups? Good that we're fixing this 👍

// errSpec is used to determine if the error comes from the spec of HPA object in reconcileAutoscaler.
// All such errors should have this error as a root error so that the upstream function can understand they're spec errors.
// e.g., fmt.Errorf("invalid spec%w", errSpec)
errSpec error = errors.New("")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should rather be imported somewhere up top, since we're referring to this variable in line 307, right?

reconciliationError = metrics.ReconciliationErrorSpec
}

a.metricsRecorder.ObserveReconciliationResult(reconciliationAction, reconciliationError, time.Since(start).Seconds())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ObserveReconciliationResult could accept time.Duration as the last argument and transform into seconds on its own. The fact that the prometheus histogram operates on seconds seems like an implementation detail.

@@ -315,7 +327,8 @@ func (a *HorizontalController) computeReplicasForMetrics(ctx context.Context, hp
return 0, "", statuses, time.Time{}, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError)
}
setCondition(hpa, autoscalingv2.ScalingActive, v1.ConditionTrue, "ValidMetricFound", "the HPA was able to successfully calculate a replica count from %s", metric)
return replicas, metric, statuses, timestamp, nil

return replicas, metric, statuses, timestamp, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "first error" is a bit of a lie, because there could be spec error first which is later overridden by internal error.

Now that I think of it perhaps overriding spec with internal is bringing more confusion than value. Perhaps it's good enough and simpler to report in metrics always the first error, regardless if it's spec or internal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps overriding spec with internal is bringing more confusion than value.

Yep, agree. will fix as your suggestion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the current impl, "unknown metric source type" is only the spec error we can have here. And actually such a invalid metric source will be validated in the validation of api-server, meaning no spec error can reach here.
https://github.com/kubernetes/kubernetes/blob/04d52a56fd40a72ad9a01bc5b692600fb5ad0097/pkg/apis/autoscaling/validation/validation.go#L327
So, we can ignore spec error case here either way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually such a invalid metric source will be validated in the validation of api-server

We may be able to say the same thing for all errSpec tho :)

&metrics.CounterOpts{
Subsystem: HPAControllerSubsystem,
Name: "reconciliations_total",
Help: "Number of reconciliation of HPA controller. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Help: "Number of reconciliation of HPA controller. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.",
Help: "Number of reconciliations of HPA controller. The label 'action' should be either 'scale_down', 'scale_up', or 'none. Also, the label 'error' should be either 'spec', 'internal', 'none'. Note that if both 'spec' and 'internal' happens during one reconciliation, it's counted as a 'internal' error.",

Comment on lines 38 to 41
// ReconciliationErrorSpec means the HPA reconciliation has an error from the internal accedent.
ReconciliationErrorSpec ReconciliationError = "spec"
// ReconciliationErrorInternal means the HPA reconciliation has an error due to the spec of HPA object.
ReconciliationErrorInternal ReconciliationError = "internal"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// ReconciliationErrorSpec means the HPA reconciliation has an error from the internal accedent.
ReconciliationErrorSpec ReconciliationError = "spec"
// ReconciliationErrorInternal means the HPA reconciliation has an error due to the spec of HPA object.
ReconciliationErrorInternal ReconciliationError = "internal"
// ReconciliationErrorSpec means the HPA reconciliation has an error due to an invalid spec of HPA object.
ReconciliationErrorSpec ReconciliationError = "spec"
// ReconciliationErrorInternal means the HPA reconciliation has an error from an internal computation or communication with other component.
ReconciliationErrorInternal ReconciliationError = "internal"

func (a *HorizontalController) reconcileAutoscaler(ctx context.Context, hpaShared *autoscalingv2.HorizontalPodAutoscaler, key string) error {
var (
// errSpec is used to determine if the error comes from the spec of HPA object in reconcileAutoscaler.
// All such errors should have this error as a root error so that the upstream function can understand they're spec errors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// All such errors should have this error as a root error so that the upstream function can understand they're spec errors.
// All such errors should have this error as a root error so that the upstream function can distinguish spec errors from internal errors.

@sanposhiho
Copy link
Member Author

@pbetkier Thanks, I fixed some parts based on your suggestions 🙏

Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
(from instrumentation)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 24, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 7c98af0b144443a6671e76721a039b211a3070b8

@sanposhiho
Copy link
Member Author

/retest

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 25, 2023
@sanposhiho
Copy link
Member Author

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Mar 14, 2023
Copy link
Contributor

@mwielgus mwielgus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 78752206b1f80beb947ec45f1547a4d08adc8963

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 14, 2023
@mwielgus mwielgus removed approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Mar 14, 2023
@mwielgus
Copy link
Contributor

Actually, please squash the commits. 10 is too much.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 14, 2023
@sanposhiho
Copy link
Member Author

@mwielgus squashed. Please review it again. 🙏

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Mar 14, 2023

@sanposhiho: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-e2e-autoscaling-hpa-cm ef4e92ab2e87e0e57e99e4636a2cdd56a8729986 link false /test pull-kubernetes-e2e-autoscaling-hpa-cm
pull-kubernetes-e2e-autoscaling-hpa-cpu ef4e92ab2e87e0e57e99e4636a2cdd56a8729986 link false /test pull-kubernetes-e2e-autoscaling-hpa-cpu

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sanposhiho
Copy link
Member Author

/retest

Copy link
Contributor

@mwielgus mwielgus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 78752206b1f80beb947ec45f1547a4d08adc8963

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: logicalhan, mwielgus, pbetkier, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit b49b34c into kubernetes:master Mar 14, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Mar 14, 2023
rayowang pushed a commit to rayowang/kubernetes that referenced this pull request Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants