Bug 1992823: rebase on top of kubernetes/autoscaler 1.22 #209
Conversation
This change adds four metrics that can be used to monitor the minimum and maximum limits for CPU and memory, as well as the current counts in cores and bytes, respectively. The four metrics added are:

* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my cluster through metrics to determine when the cluster is nearing its limits for cores and memory usage.
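As a hedged illustration of the user story, the new metrics could back a Prometheus alert like the one below. This is not part of the PR; in particular, the `direction` label (values `minimum`/`maximum`) is an assumption about how the limits metric is partitioned.

```yaml
# Illustrative alerting rule, assuming the limits metric carries a
# "direction" label distinguishing the minimum and maximum limits.
groups:
- name: cluster-autoscaler-capacity
  rules:
  - alert: ClusterNearCoreLimit
    expr: |
      cluster_autoscaler_cluster_cpu_current_cores
        / on() cluster_autoscaler_cpu_limits_cores{direction="maximum"} > 0.9
    for: 15m
    annotations:
      summary: Cluster is using more than 90% of the autoscaler's core limit
```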
Now supported by magnum. https://review.opendev.org/c/openstack/magnum/+/737580/ If using node group autodiscovery, older versions of magnum will still forbid scaling to zero or setting the minimum node count to zero.
Force-refreshing everything on every DeleteNodes call causes slowdowns and throttling on large clusters with many ASGs (and a lot of activity). That function might be called several times in a row during scale-down (once for each ASG having a node to be removed), and each time the forced refresh re-discovers all ASGs and LaunchConfigurations, then re-lists all instances from the discovered ASGs. That immediate refresh isn't required anyway: the cache's DeleteInstances concrete implementation will decrement the nodegroup size, and we can schedule a grouped refresh for the next loop iteration.
Sets the `kubernetes.io/arch` (and legacy `beta.kubernetes.io/arch`) label to the proper instance architecture. While at it, re-generates the instance types list, adding new instance types that were missing.
The current implementation assumes MIG ids have the "https://content.googleapis.com" prefix, while the canonical id format seems to begin with "https://www.googleapis.com". Both formats work while talking to the GCE API, but the API returns the latter and other GCP services seem to assume it as well.
Cluster Autoscaler GCE: change the format of MIG id
cloudprovider: add Bizflycloud provider
Remove vivekbagade, add towca as an approver in cluster-autoscaler/OWNERS
Enable magnum provider scale to zero
aws: Don't pile up successive full refreshes during AWS scaledowns
Release leader election lock on shutdown
FetchAllMigs (an unfiltered InstanceGroupManagers.List) is costly, as it isn't bounded to MIGs attached to the current cluster but rather lists all MIGs in the project/zone, and therefore equally affects all clusters in that project/zone. Running the calls concurrently over the region's zones (so at most 4 concurrent API calls, about once per minute) contains that impact.

findMigsInRegion might be scoped to the current cluster (name pattern), but it also benefits from the same improvement, as it's also costly and called at each refreshInterval (1 min).

Also: we're calling the GCE mig.Get() API again for each MIG (at ~300ms per API call, in my tests), sequentially and with the global cache lock held (when updateClusterState -> ... -> GetMigForInstance kicks in). Yet we already get that bit of information (the MIG's basename) from any other mig.Get or mig.List call, like the one fetching target sizes. Leveraging this helps significantly on large fleets (for instance, it shaves 8 min of startup time on the large cluster I tested on).
Support "/" separators in custom allocatable overrides via VMSS tags
add stable zone labels in azure template generation
Document that TLS bootstrapping may be necessary for scale-up
Enable custom k8s fork in update-vendor.sh
Right now the file is breaking `go mod` commands.
@elmiko: This pull request references Bugzilla bug 1992823, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
not quite sure what this means:
i do see it locally while testing though, going to dig in a little bit.
it looks like the unit test is failing with a panic in the azure provider suite, all the capi tests pass.
/test e2e-aws-operator
i have a feeling the unit test failures are related to the timeout we set in
confirmed my suspicions, i am updating the
This change carries files and modifications that are used by OpenShift release infrastructure and related files:

* spec file
* dockerfiles
  * vertical-pod-autoscaler/Dockerfile.rhel
  * images/cluster-autoscaler/Dockerfile
  * images/cluster-autoscaler/Dockerfile.rhel
* hack scripts (ci and build related)
* Makefile
* JUnit tools
* update gitignore
* update/remove OWNERS files
* ci-operator config yaml

Co-authored-by: Avesh Agarwal <avagarwa@redhat.com>
Co-authored-by: Jan Chaloupka <jchaloup@redhat.com>
Co-authored-by: Clayton Coleman <ccoleman@redhat.com>
Co-authored-by: Andrew McDermott <amcdermo@redhat.com>
Co-authored-by: Michael Gugino <mgugino@redhat.com>
Co-authored-by: paulfantom <pawel@krupa.net.pl>
Co-authored-by: Joel Smith <joelsmith@redhat.com>
Co-authored-by: Enxebre <alberto.garcial@hotmail.com>
Co-authored-by: Yaakov Selkowitz <yselkowi@redhat.com>
… use machine.openshift.io

This change updates the default annotations and cluster name label to use the OpenShift-specific values. It also updates the unit tests to incorporate the changed API group and adds CAPI group environment variable awareness to the tests.

Co-authored-by: Michael McCune <elmiko@redhat.com>
This reverts commit 6601bf0. See kubernetes#2495
This allows a Machine{Set,Deployment} to scale up/down from 0, provided the following annotations are set:

```yaml
apiVersion: v1
items:
- apiVersion: machine.openshift.io/v1beta1
  kind: MachineSet
  metadata:
    annotations:
      machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "0"
      machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"
      machine.openshift.io/vCPU: "2"
      machine.openshift.io/memoryMb: 8G
      machine.openshift.io/GPU: "1"
      machine.openshift.io/maxPods: "100"
```

Note that `machine.openshift.io/GPU` and `machine.openshift.io/maxPods` are optional. For autoscaling from zero, the autoscaler should convert the memory value received in the appropriate annotation to bytes, using powers of two consistently with other providers, and fail if the format received is not expected. This gives robust behaviour consistent with cloud provider APIs and provider implementations.

https://cloud.google.com/compute/all-pricing
https://www.iec.ch/si/binary.htm
https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L366

Co-authored-by: Enxebre <alberto.garcial@hotmail.com>
Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk>
Co-authored-by: Michael McCune <elmiko@redhat.com>
@elmiko: The following test failed.
Full PR test history. Your PR dashboard.
/test e2e-azure-operator
Process looks good, tests seem to be passing for the most part
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: JoelSpeed. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
hold is for e2e-azure-operator to pass
/lgtm
cancelling hold as all tests are passing
/hold cancel
@elmiko: All pull requests linked via external trackers have merged: Bugzilla bug 1992823 has been moved to the MODIFIED state. In response to this:
1.22 autoscaler rebase process
Inspired by the commit description for the 1.21 rebase (PR #201).
Identify carry commits. Usually this step would involve looking through the history to determine the carry commits, but in this case I have squashed many of the carry commits as described in OCPCLOUD-1207. The carried commits have been sourced from this branch: https://github.com/elmiko/kubernetes-autoscaler/tree/openshift-1.21-history-cleanup
After identifying the carry commits, the next step is to create the new commit-tree that
will be used for the rebase and then cherry pick the carry commits into the new branch.
The following commands cover these steps:
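The commands themselves were not captured in this extract. A hedged sketch of what the commit-tree step typically looks like is below; remote names, branch names, and the carry-commit range are placeholders and assumptions, not the commands actually run.

```shell
# Illustrative sketch only; remotes "upstream" (kubernetes/autoscaler)
# and "openshift", and all ranges, are assumed.
git fetch upstream
git checkout -b merge-1.22 upstream/cluster-autoscaler-release-1.22

# Create a merge commit whose tree exactly matches upstream 1.22 but
# which keeps both histories as parents, then point merge-1.22 at it.
TREE=$(git rev-parse "upstream/cluster-autoscaler-release-1.22^{tree}")
MERGE=$(git commit-tree "$TREE" \
        -p openshift/master \
        -p upstream/cluster-autoscaler-release-1.22 \
        -m "Merge upstream cluster-autoscaler 1.22")
git reset --hard "$MERGE"

# Cherry-pick the carry commits from the history cleanup branch.
git cherry-pick <first-carry>..<last-carry>
```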
Process
With the `merge-1.22` branch in place, I cherry-picked the carry commits from my history cleanup branch.

Carried Commits
These commits are for features which have not yet been accepted upstream, are integral to our CI platform, or are
specific to the releases we create for OpenShift.
Dropped Commits
These commits were present in the 1.21 branch; most of them have been squashed into the carried commits listed above.