Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[addons/CA] Add support for specifying resources and metrics #10281

Merged

Conversation

dntosas
Copy link
Contributor

@dntosas dntosas commented Nov 20, 2020

In this PR, we make some additions on Cluster Autoscaler Addon Specs

Resources

We enable users to set their desired capacity for cluster-autoscaler addon.

There are edge cases, especially in big clusters, where autoscaler needs
to reconcile a large number of objects thus may need increased memory to
avoid OOMkills or increased cpu to avoid saturation.

Metrics

Cluster autoscaler provides valuable insights for monitoring capacity
allocation and scheduling aspects of a cluster. In this commit, we
enable users to add proper annotation on deployment to scrape metrics
via Prometheus.

Signed-off-by: dntosas ntosas@gmail.com

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 20, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @dntosas. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 20, 2020
@dntosas dntosas force-pushed the cluster-autoscaler-improvements branch from 79851ef to ba32976 Compare November 20, 2020 19:22
@hakman
Copy link
Member

hakman commented Nov 20, 2020

/ok-to-test
/cc @olemarkus

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 20, 2020
@dntosas dntosas force-pushed the cluster-autoscaler-improvements branch from ba32976 to ec2b3e2 Compare November 20, 2020 19:27
@dntosas dntosas force-pushed the cluster-autoscaler-improvements branch from ec2b3e2 to 0b98e56 Compare November 20, 2020 19:28
@@ -134,6 +134,12 @@ spec:
metadata:
labels:
app: cluster-autoscaler
{{ if .Metrics }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These annotations are not much in use anymore and we tend not to set them. Instead we expect users to create pod/service monotors themselves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases like coreDNS where these annotations seem useful. Do you suggest completely drop from here or migrating to a ServiceMonitor? I am cool with both options ^^

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that since Metrics is a pointer you need to use WithDefaultBool.

@olemarkus
Copy link
Member

I am not sure we should set any limits. It is rather dangerous to have CAS crashing due to oom. Maybe it is better to remove limits alltoghether.

@dntosas dntosas force-pushed the cluster-autoscaler-improvements branch from 0b98e56 to 6f09f45 Compare November 20, 2020 19:35
@dntosas
Copy link
Contributor Author

dntosas commented Nov 20, 2020

I am not sure we should set any limits. It is rather dangerous to have CAS crashing due to oom. Maybe it is better to remove limits alltoghether.

Will do that ^

Two more thoughts to consider:

  • Do you think is worth make replicas editable?
  • Should we also bump container image patch versions?
  • If yes, could these two included on this PR or we need separate ones?

Sorry for the many questions 😋

@olemarkus
Copy link
Member

Bumping patch versions is good. I didn't do that as I suspect more patch versions will be released before kops 1.20. but feel free to add to this PR.

I don't think making replicas configurable is that useful. But if someone makes a good case for it, I will certainly consider it.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 5, 2020
@olemarkus
Copy link
Member

@dntosas do you have time to look at wrapping this one up?

@dntosas
Copy link
Contributor Author

dntosas commented Jan 10, 2021

@dntosas do you have time to look at wrapping this one up?

sure ^^ i let it stale because i didn't understand if we wanted to close this PR as not needed

@dntosas dntosas force-pushed the cluster-autoscaler-improvements branch from 6f09f45 to 48be140 Compare January 10, 2021 22:02
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 10, 2021
@dntosas dntosas force-pushed the cluster-autoscaler-improvements branch from 48be140 to d1db788 Compare January 10, 2021 22:04
@k8s-ci-robot k8s-ci-robot removed the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 10, 2021
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 10, 2021
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jan 10, 2021

@dntosas: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kops-verify-hashes 6f09f45 link /test pull-kops-verify-hashes

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dntosas dntosas force-pushed the cluster-autoscaler-improvements branch from d1db788 to ac11e07 Compare January 10, 2021 22:51
Copy link
Member

@olemarkus olemarkus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion. My intention was to only provide feedback on improving this.

@@ -134,6 +134,12 @@ spec:
metadata:
labels:
app: cluster-autoscaler
{{ if WithDefaultBool .Metrics false }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the only thing this flag does is add the annotation, it is not that useful. You could just add the annotations without the flag. Typically such a flag would be used for enabling/disabling CAS serving metrics in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed flag also ^^

@@ -83,6 +83,9 @@ func (b *ClusterAutoscalerOptionsBuilder) BuildOptions(o interface{}) error {
if cas.NewPodScaleUpDelay == nil {
cas.NewPodScaleUpDelay = fi.String("0s")
}
if cas.Metrics == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already default this one to false in the manifest. So you do not need this bit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it ^^

@dntosas dntosas force-pushed the cluster-autoscaler-improvements branch 2 times, most recently from dc0cc6b to 8b7fb57 Compare January 11, 2021 17:53
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 11, 2021
skipNodesWithSystemPods: true
cpuRequest: "100m"
memoryRequest: "300Mi"
metrics: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove this one, we should be good to go

- Resources
We enable users to set their desired capacity for cluster-autoscaler addon.
There are edge cases, especially in big clusters, where autoscaler needs
to reconcile a large number of objects thus may need increased memory or
increased cpu to avoid saturation.

- Metrics
Cluster autoscaler provides valuable insights for monitoring capacity
allocation and scheduling aspects of a cluster. In this commit, we
add proper annotation on deployment to enable Prometheus scrape metrics.

We also bump patch version of container images.

Signed-off-by: dntosas <ntosas@gmail.com>
@dntosas dntosas force-pushed the cluster-autoscaler-improvements branch from 8b7fb57 to 56fe4ba Compare January 11, 2021 18:53
@olemarkus
Copy link
Member

Cool. Thanks
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dntosas, olemarkus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2021
@k8s-ci-robot k8s-ci-robot merged commit 695be26 into kubernetes:master Jan 11, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Jan 11, 2021
@dntosas dntosas deleted the cluster-autoscaler-improvements branch January 11, 2021 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/addons area/api area/documentation cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants