KEP-1610: graduate ContainerResource to stable #4406

Merged (3 commits) on Feb 8, 2024
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-autoscaling/1610.yaml
@@ -1,3 +1,5 @@
kep-number: 1610
beta:
approver: "@johnbelamaric"
stable:
approver: "@johnbelamaric"
49 changes: 32 additions & 17 deletions keps/sig-autoscaling/1610-container-resource-autoscaling/README.md
@@ -733,7 +733,7 @@ You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->

No. But, the tests to confirm the behavior on switching the feature gate will be added by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/115467))
No. But, the tests to confirm the behavior on switching the feature gate will be added by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/123189))

### Rollout, Upgrade and Rollback Planning

@@ -767,12 +767,14 @@ What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

- The container resource metric takes much longer time compared to other metrics.
which can be monitored via the 1st metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
- Increase the overall performance of HPA controller
which can be monitored via the 2nd metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
- Many error occurrence on the container resource metrics
which can be monitored via the 3rd metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
- `reconciliation_duration_seconds`: The time(seconds) that the HPA controller takes to reconcile once.
- You should rollback if you see an increase in the overall performance of HPA controller
Review comment (Member):
either "decrease in overall performance" or "increase in overall reconciliation duration". Probably the latter. "increase in performance" sounds like "going faster"...

- `metric_computation_duration_seconds{metric_type=ContainerResource}`: The time (in seconds) that the HPA controller takes to calculate one metric.
- You should roll back if the container resource metric takes much longer to compute than other metrics.
- `reconciliations_total{error=internal}`: Number of internal errors in reconciliation of the HPA controller.
- You should roll back if you see many errors during reconciliation.
- `metric_computation_total{error=internal,metric_type=ContainerResource}`: Number of internal errors in the calculation of `type: ContainerResource`.
- You should roll back if you see many errors in the container resource metric computation.
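
As a rough illustration of how an operator might watch the rollback signals listed above, here is a minimal sketch of a Prometheus rule file. The `horizontal_pod_autoscaler_controller_` metric name prefix, the `_bucket` histogram suffix, the alert and group names, and every threshold and time window are assumptions for illustration only; none of them come from this KEP, so check them against the metrics your kube-controller-manager actually exposes.

```yaml
# Sketch only: metric name prefix, thresholds, and windows are assumptions.
groups:
  - name: hpa-container-resource-rollback-signals    # hypothetical group name
    rules:
      # Fires when the p99 computation time for ContainerResource metrics is
      # more than twice the p99 for pod-level Resource metrics.
      - alert: HPAContainerResourceComputationSlow   # hypothetical alert name
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(horizontal_pod_autoscaler_controller_metric_computation_duration_seconds_bucket{metric_type="ContainerResource"}[10m])))
            > 2 *
          histogram_quantile(0.99, sum by (le) (rate(horizontal_pod_autoscaler_controller_metric_computation_duration_seconds_bucket{metric_type="Resource"}[10m])))
        for: 15m
      # Fires when internal errors keep occurring while computing
      # ContainerResource metrics.
      - alert: HPAContainerResourceInternalErrors    # hypothetical alert name
        expr: |
          sum(rate(horizontal_pod_autoscaler_controller_metric_computation_total{error="internal",metric_type="ContainerResource"}[10m])) > 0
        for: 15m
```

Comparing against the `Resource` metric type rather than a fixed latency budget keeps the first rule meaningful across clusters of different sizes; a fixed threshold would work as well if a latency SLO is already defined.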

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -782,7 +784,6 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

Not yet.
But, as described in [Are there any tests for feature enablement/disablement?](#Are-there-any-tests-for-feature-enablement/disablement?), the tests to confirm the behavior on switching the feature gate will be added. ([issue](https://github.com/kubernetes/kubernetes/issues/115467))

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
@@ -826,7 +827,6 @@ Recall that end users cannot usually observe component logs or access metrics.

- [x] Events
- `SuccessfulRescale` event with `memory/cpu/etc resource utilization (percentage of request) above/below target`
- Note that we cannot know if this reason is due to the `Resource` metric or `ContainerResource` in the current implementation. We'll change this reason for `ContainerResource` to `memory/cpu/etc container resource utilization (percentage of request) above/below target` so that we can distinguish.
- [x] API .status
- When something is wrong with the container metrics, the `ScalingActive` condition will be false with the `FailedGetContainerResourceMetric` reason.
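
For orientation, the condition described above might look roughly like the following fragment of an HPA object's `.status`. This is a sketch assuming the `autoscaling/v2` status layout; the message text is deliberately elided because its exact wording is not given here.

```yaml
# Fragment of `kubectl get hpa <name> -o yaml` output (illustrative only).
status:
  conditions:
    - type: ScalingActive
      status: "False"
      reason: FailedGetContainerResourceMetric
      message: "..."   # actual message text depends on the underlying error
```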

@@ -861,14 +861,11 @@ Pick one more of these and delete the rest.
- Details:
-->

HPA controller have no metrics in it now.
The following metrics will be implemented by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/115639))
1. How long does each metric type take to compute the ideal replica num.
- so that users can confirm the container resource metric doesn't take long time compared to other metrics.
2. How long does the HPA controller take to complete reconcile one HPA object.
- so that users can confirm the container resource metric doesn't increse the whole time of scaling.
3. Provide the metric to show error occurrence for each metric.
- so that users can confirm no much error occurrence on the container resource metric.
- [x] Metrics
- `metric_computation_duration_seconds`: The time (in seconds) that the HPA controller takes to calculate one metric.
- `metric_computation_total`: Number of metric computations.
- `reconciliations_total`: Number of reconciliations of the HPA controller.
- `reconciliation_duration_seconds`: The time (in seconds) that the HPA controller takes to reconcile once.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

@@ -996,6 +993,20 @@ Think through this both in small and large cases, again with respect to the

No.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

<!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?

Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->

No.

### Troubleshooting

<!--
@@ -1037,6 +1048,10 @@ For each of them, fill in the following information by copying the below templat

###### What steps should be taken if SLOs are not being met to determine the problem?

Check `metric_computation_duration_seconds` and `reconciliation_duration_seconds` to see which metric type is causing the latency.
If the latency problem is specific to `type: ContainerResource`,
you can opt out of this feature by removing the `type: ContainerResource` metric from the affected HPA(s).
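
To make the opt-out step concrete, here is a minimal sketch of an `autoscaling/v2` HPA that uses a `type: ContainerResource` metric. The object name, target Deployment, container name, replica bounds, and utilization target are all hypothetical. Opting out would mean deleting the `ContainerResource` entry; whether to keep a pod-level `type: Resource` entry as a replacement is a judgment call that depends on the workload.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app            # hypothetical target
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Remove this entry to stop using container-level metrics for this HPA.
    - type: ContainerResource
      containerResource:
        name: cpu
        container: application   # hypothetical container name
        target:
          type: Utilization
          averageUtilization: 60
    # Pod-level alternative that averages over all containers in the pod.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```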

## Implementation History

* 2020-04-03 Initial KEP merged
@@ -12,16 +12,16 @@ approvers:
- "@josephburnett"
- "@gjtempleton"
creation-date: 2020-02-18
last-updated: 2023-02-02
last-updated: 2024-01-15
status: implementable

latest-milestone: "1.27"
stage: "beta"
latest-milestone: "1.30"
stage: "stable"

milestone:
alpha: "v1.20"
beta: "v1.27"
stable: "v1.29"
stable: "v1.30"

feature-gates:
- name: HPAContainerMetrics