Bug 1992823: rebase on top of kubernetes/autoscaler 1.22 #209

Merged

merged 154 commits into openshift:master on Aug 12, 2021

Conversation

@elmiko commented Aug 11, 2021

1.22 autoscaler rebase process

inspired by the commit description for the 1.21 rebase.
pr #201

identify carry commits. usually this step would involve looking through the history to determine the carry commits, but in this case i have squashed many of the carry commits as described in OCPCLOUD-1207. the carried commits have been sourced from this branch https://github.com/elmiko/kubernetes-autoscaler/tree/openshift-1.21-history-cleanup

After identifying the carry commits, the next step is to create the new commit-tree that
will be used for the rebase and then cherry pick the carry commits into the new branch.
The following commands cover these steps:

Process

$ git remote update # make sure we update our refs
$ git checkout cluster-autoscaler-1.22.0
$ git checkout -b merge-tmp # create a branch to do our merge work from
$ git checkout openshift/master # we want to be at the tip of the openshift master branch when we run the next command
$ echo 'merge upstream/cluster-autoscaler-1.22' | git commit-tree merge-tmp^{tree} -p HEAD -p merge-tmp -F -
deadbeef12345678 # id of new merge commit
$ git branch merge-1.22 deadbeef12345678 # create a new branch for the cherry-pick work
$ git checkout merge-1.22 # make sure we are on the proper branch
$ git cherry-pick <carry commits>

With the merge-1.22 branch in place, I cherry-picked the carry commits from my history cleanup branch.

Carried Commits

These commits are for features which have not yet been accepted upstream, are integral to our CI platform, or are
specific to the releases we create for OpenShift.

f2519d93d UPSTREAM: <carry>: Implement scale from zero
8f4a01718 UPSTREAM: <carry>: Rename FailureMessage to ErrorMessage
1868d6a46 UPSTREAM: <carry>: Revert "Adding config for event filtering"
8bce568f3 UPSTREAM: <carry>: Set default annotations, and cluster name label to use machine.openshift.io
a5e56e715 UPSTREAM: <carry>: configure repository for OpenShift releases

Dropped Commits

These commits were present in the 1.21 branch; most of them have been squashed into the carried commits listed above.

fbe533c75 UPSTREAM: <carry>: openshift: Fix build with Go 1.16
bb49a8586 UPSTREAM: <carry>: Updating vertical-pod-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/e3b61ff1dde2bfc2bc7f2b5efdd155d1d3299cd7/images/vertical-pod-autoscaler.yml
c6f7a0a95 UPSTREAM: <carry>: VPA: Add Chen to approvers
b6e96d848 UPSTREAM: <carry>: VPA: Update Dockerfiles to use new registry
c99b47a92 UPSTREAM: <carry>: Updating vertical-pod-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/5a1293dd0f380abf50c12d65c36655486d7745d0/images/vertical-pod-autoscaler.yml
1a977f016 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/5a1293dd0f380abf50c12d65c36655486d7745d0/images/atomic-openshift-cluster-autoscaler.yml
1baf08f09 UPSTREAM: <carry>: openshift: add component and subcomponent to OWNERS
be51a45f9 UPSTREAM: <carry>: openshift: add fixes for unstructured scalable resources and node group discovery
024d0fa13 UPSTREAM: <carry>: Updating vertical-pod-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/vertical-pod-autoscaler.yml
3bebbaec8 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/atomic-openshift-cluster-autoscaler.yml
c20293a38 UPSTREAM: <carry>: openshift: Add tests for nodegroup TemplateNodeInfo
65b1b7d48 UPSTREAM: <carry>: openshift: Copy standard labels from an existing node if one exists
c0a685b68 UPSTREAM: <carry>: openshift: Ensure stable versions of generic labels are also included
8f67a8c4d UPSTREAM: <carry>: openshift: Updating vertical-pod-autoscaler/Dockerfile.rhel baseimages to match ocp-build-data config
1eba04782 UPSTREAM: <carry>: openshift: Updating images/cluster-autoscaler/Dockerfile.rhel baseimages to match ocp-build-data config
3bcdabda5 UPSTREAM: <carry>: openshift: Add vertical-pod-autoscaler/Dockerfile.rhel to match build configuration in ocp-build-data
24a45451c UPSTREAM: <carry>: openshift: Add images/cluster-autoscaler/Dockerfile.rhel to match build configuration in ocp-build-data
15a768ff9 UPSTREAM: <carry>: Convert the mem value consistently with other providers
bb9b3826b UPSTREAM: <carry>: openshift: Implement scale from zero
9bbf2eeb8 UPSTREAM: <carry>: openshift: Rename FailureMessage to ErrorMessage
f302c07bd UPSTREAM: <carry>: openshift: update cluster-autoscaler OWNERS
7f192ac8a UPSTREAM: <carry>: openshift: Revert "Adding config for event filtering"
aef98374e UPSTREAM: <carry>: openshift: Set default annotations, and cluster name label to use machine.openshift.io
e4c5171bd UPSTREAM: <carry>: Bump scripts go version to 1.13 and update path to cloudprovider/clusterapi
2839e4250 UPSTREAM: <carry>: openshift: Extend makefile with 'make goimports' target
589021834 UPSTREAM: <carry>: Fix git commit message verification script
d55f2e8f3 UPSTREAM: <carry>: openshift: Switch builds to use Go 1.12
bb81822af UPSTREAM: <carry>: openshift: create git history verification script
1994c5ee3 UPSTREAM: <carry>: openshift: Add fmt, lint, vet scripts/Makefile
7ad1cdce2 UPSTREAM: <carry>: openshift: cluster-autoscaler.spec: set golang_version to 1.10
5a6b51716 UPSTREAM: <carry>: openshift: cluster-autoscaler.spec: bump golang_version to 1.10.4
0ccae207e UPSTREAM: <carry>: openshift: Fix spec file to be consistent
56e629f74 UPSTREAM: <carry>: openshift: Bump embedded tools
0d6ceb9f0 UPSTREAM: <carry>: openshift: Fix the spec and hack scripts so the package can be built in both build systems
f66358622 UPSTREAM: <carry>: openshift: Add openshift/release Makefile and hack scripts
d5d0ca03c UPSTREAM: <carry>: openshift: Add dockerfile for cluster autoscaler.
356231f8c UPSTREAM: <carry>: openshift: Add spec file for cluster-autoscaler.

mcristina422 and others added 30 commits March 12, 2021 12:51
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.

The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
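
The commit message doesn't include the registration code itself, but the pattern is standard Prometheus gauge wiring. A minimal sketch of one of the four gauges, using the plain Prometheus Go client (the autoscaler routes metrics through its own metrics package, so the variable and function names here are hypothetical):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// clusterCPUCurrentCores backs the cluster_autoscaler_cluster_cpu_current_cores gauge.
var clusterCPUCurrentCores = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "cluster_autoscaler",
	Name:      "cluster_cpu_current_cores",
	Help:      "Current number of cores in the cluster.",
})

func init() {
	prometheus.MustRegister(clusterCPUCurrentCores)
}

// UpdateClusterCPUCurrentCores records the currently observed core count.
func UpdateClusterCPUCurrentCores(cores int64) {
	clusterCPUCurrentCores.Set(float64(cores))
}
```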
Now supported by magnum.
https://review.opendev.org/c/openstack/magnum/+/737580/

If using node group autodiscovery, older versions
of magnum will still forbid scaling to zero or
setting the minimum node count to zero.
Force-refreshing everything at every DeleteNodes call causes slowdown
and throttling on large clusters with many ASGs (and lots of activity).

That function might be called several times in a row during scale-down
(once for each ASG having a node to be removed). Each time the forced
refresh will re-discover all ASGs, all LaunchConfigurations, then re-list all
instances from discovered ASGs.

That immediate refresh isn't required anyway, as the cache's DeleteInstances
concrete implementation will decrement the nodegroup size, and we can
schedule a grouped refresh for the next loop iteration.
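
A sketch of that deferred-refresh idea; the types and helpers below are hypothetical, heavily simplified stand-ins for the AWS provider's real cache, not its actual API:

```go
package aws

import "sync"

// Hypothetical types for illustration only.
type instance struct{ id string }

type asg struct{ curSize int }

type asgCache struct {
	mutex sync.Mutex
	owner map[*instance]*asg // which ASG each cached instance belongs to
}

// terminate is a stub standing in for the real EC2 termination call.
func (c *asgCache) terminate(inst *instance) error { return nil }

// DeleteInstances decrements the cached group size in place instead of
// forcing a full refresh; the next regular refresh loop reconciles ASGs,
// LaunchConfigurations and instances in one grouped pass.
func (c *asgCache) DeleteInstances(instances []*instance) error {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	for _, inst := range instances {
		if err := c.terminate(inst); err != nil {
			return err
		}
		if group, ok := c.owner[inst]; ok {
			group.curSize-- // keep the cache consistent without re-listing
		}
	}
	return nil
}
```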
Sets the `kubernetes.io/arch` (and legacy `beta.kubernetes.io/arch`) labels
to the proper instance architecture.

While at it, re-gen the instance types list (adding new instance types
that were missing).
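
For illustration, setting both labels on a generated node template might look like this (helper name hypothetical, not the provider's actual code):

```go
package aws

// setArchLabels applies the stable and legacy architecture labels.
func setArchLabels(labels map[string]string, arch string) {
	labels["kubernetes.io/arch"] = arch      // stable label
	labels["beta.kubernetes.io/arch"] = arch // legacy label, kept for compatibility
}
```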
The current implementation assumes MIG ids have the
"https://content.googleapis.com" prefix, while the
canonical id format seems to begin with "https://www.googleapis.com".
Both formats work while talking to the GCE API, but
the API returns the latter and other GCP services seem to
assume it as well.
Cluster Autoscaler GCE: change the format of MIG id
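
One way to express that normalization (package and helper name hypothetical, not the provider's actual code):

```go
package gce

import "strings"

// canonicalMigID rewrites the legacy prefix to the canonical one so that
// ids from either source compare equal.
func canonicalMigID(id string) string {
	return strings.Replace(id,
		"https://content.googleapis.com/", "https://www.googleapis.com/", 1)
}
```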
…ycloud-provider

cloudprovider: add Bizflycloud provider
Remove vivekbagade, add towca as an approver in cluster-autoscaler/OWNERS
…piles

aws: Don't pile up successive full refreshes during AWS scaledowns
Release leader election lock on shutdown
FetchAllMigs (unfiltered InstanceGroupManagers.List) is costly as it's not
bounded to MIGs attached to the current cluster, but rather lists all MIGs
in the project/zone, and therefore equally affects all clusters in that
project/zone. Running the calls concurrently over the region's zones (so at
most, 4 concurrent API calls, about once per minute) contains that impact.

findMigsInRegion might be scoped to the current cluster (name pattern),
but also benefits from the same improvement, as it's also costly and
called at each refreshInterval (1min).

Also: we're calling the GCE mig.Get() API again for each MIG (at ~300ms per
API call, in my tests), sequentially and with the global cache lock held
(when updateClusterState -> ... -> GetMigForInstance kicks in). Yet we
already get that bit of information (the MIG's basename) from any other
mig.Get or mig.List call, like the one fetching target sizes. Leveraging
this helps significantly on large fleets (for instance, it shaves 8min off
startup time on the large cluster I tested on).
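
A sketch of the per-zone fan-out described above, using errgroup; the Mig type and listZone callback are hypothetical stand-ins for the real InstanceGroupManagers.List plumbing:

```go
package gce

import (
	"sync"

	"golang.org/x/sync/errgroup"
)

type Mig struct{ Name, Zone string }

// listMigsInRegion runs one list call per zone concurrently (at most ~4
// zones per region), merging the results under a mutex.
func listMigsInRegion(zones []string, listZone func(zone string) ([]Mig, error)) ([]Mig, error) {
	var (
		g    errgroup.Group
		mu   sync.Mutex
		migs []Mig
	)
	for _, zone := range zones {
		zone := zone // capture loop variable for the goroutine
		g.Go(func() error {
			zoneMigs, err := listZone(zone)
			if err != nil {
				return err
			}
			mu.Lock()
			migs = append(migs, zoneMigs...)
			mu.Unlock()
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return migs, nil
}
```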
…locatables

support "/"separators in custom allocatable overrides via vmss tags
add stable zone labels in azure template generation
Document that TLS bootstrapping may be necessary for scale-up
Right now the file is breaking `go mod` commands.
openshift-ci bot commented Aug 11, 2021

@elmiko: This pull request references Bugzilla bug 1992823, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

Bug 1992823: rebase on top of kubernetes/autoscaler 1.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@elmiko (author) commented Aug 11, 2021

not quite sure what this means:

[WARNING] While the jUnit report found no failed tests, the `go test` process failed.
[WARNING] This usually means that the unit test suite failed to compile.
[ERROR] hack/test-go.sh exited with code 1 after 00h 03m 24s 

i do see it locally while testing though, going to dig in a little bit.

@elmiko (author) commented Aug 11, 2021

it looks like the unit tests are failing with a panic in the azure provider suite; all the capi tests pass.

@elmiko (author) commented Aug 11, 2021

/test e2e-aws-operator

@elmiko (author) commented Aug 11, 2021

i have a feeling the unit test failures are related to the timeout we set in hack/test-go.sh. it is currently set to 120 seconds, but i see the azure test taking around 150-160 seconds. this test passes upstream, so i am running some local tests with our timeout set to 180 seconds.

@elmiko (author) commented Aug 11, 2021

confirmed my suspicions; i am updating the hack/test-go.sh script to have a 180 second timeout.

elmiko and others added 5 commits August 11, 2021 17:03
This change carries files and modifications that are used by OpenShift
release infrastructure and related files.

* spec file
* dockerfiles
  * vertical-pod-autoscaler/Dockerfile.rhel
  * images/cluster-autoscaler/Dockerfile
  * images/cluster-autoscaler/Dockerfile.rhel
* hack scripts (ci and build related)
* Makefile
* JUnit tools
* update gitignore
* update/remove OWNERS files
* ci-operator config yaml

Co-authored-by: Avesh Agarwal <avagarwa@redhat.com>
Co-authored-by: Jan Chaloupka <jchaloup@redhat.com>
Co-authored-by: Clayton Coleman <ccoleman@redhat.com>
Co-authored-by: Andrew McDermott <amcdermo@redhat.com>
Co-authored-by: Michael Gugino <mgugino@redhat.com>
Co-authored-by: paulfantom <pawel@krupa.net.pl>
Co-authored-by: Joel Smith <joelsmith@redhat.com>
Co-authored-by: Enxebre <alberto.garcial@hotmail.com>
Co-authored-by: Yaakov Selkowitz <yselkowi@redhat.com>
… use machine.openshift.io

This change updates the default annotations and cluster name label to
use the openshift specific values.

Also updates the unit tests to incorporate the changed api group and
adds the capi group environment variable awareness to the tests.

Co-authored-by: Michael McCune <elmiko@redhat.com>
This allows a Machine{Set,Deployment} to scale up/down from 0,
providing the following annotations are set:

```yaml
apiVersion: v1
items:
- apiVersion: machine.openshift.io/v1beta1
  kind: MachineSet
  metadata:
    annotations:
      machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "0"
      machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"
      machine.openshift.io/vCPU: "2"
      machine.openshift.io/memoryMb: 8G
      machine.openshift.io/GPU: "1"
      machine.openshift.io/maxPods: "100"
```

Note that `machine.openshift.io/GPU` and `machine.openshift.io/maxPods`
are optional.

For autoscaling from zero, the autoscaler should convert the mem value
received in the appropriate annotation to bytes using powers of two,
consistently with other providers, and fail if the received format is not
as expected. This gives robust behaviour consistent with cloud providers'
APIs and provider implementations.

https://cloud.google.com/compute/all-pricing
https://www.iec.ch/si/binary.htm
https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L366

Co-authored-by: Enxebre <alberto.garcial@hotmail.com>
Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk>
Co-authored-by: Michael McCune <elmiko@redhat.com>
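
A minimal sketch of that conversion, assuming the powers-of-two convention the commit message describes (package and function names are hypothetical, not the provider's actual code):

```go
package clusterapi

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryBytes converts an annotation value such as "8G" or "8192M"
// to bytes, treating G as 2^30 and M as 2^20, and fails on any other format.
func parseMemoryBytes(val string) (int64, error) {
	units := []struct {
		suffix string
		mult   int64
	}{
		{"G", 1 << 30},
		{"M", 1 << 20},
	}
	for _, u := range units {
		if strings.HasSuffix(val, u.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(val, u.suffix), 10, 64)
			if err != nil {
				return 0, fmt.Errorf("unexpected memory value %q: %v", val, err)
			}
			return n * u.mult, nil
		}
	}
	return 0, fmt.Errorf("unexpected memory value %q", val)
}
```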
openshift-ci bot commented Aug 12, 2021

@elmiko: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/git-history
Commit: f3b5e63
Rerun command: /test git-history

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@JoelSpeed commented Aug 12, 2021

/test e2e-azure-operator

Process looks good, tests seem to be passing for the most part
/approve

openshift-ci bot commented Aug 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Aug 12, 2021
@Fedosin commented Aug 12, 2021

/lgtm
/hold

hold is for e2e-azure-operator to pass

openshift-ci bot added the do-not-merge/hold label Aug 12, 2021
openshift-ci bot added the lgtm label Aug 12, 2021
@lobziik commented Aug 12, 2021

/lgtm

@elmiko (author) commented Aug 12, 2021

cancelling hold as all tests are passing

/hold cancel

openshift-ci bot removed the do-not-merge/hold label Aug 12, 2021
openshift-ci bot merged commit 68fe93a into openshift:master on Aug 12, 2021
openshift-ci bot commented Aug 12, 2021

@elmiko: All pull requests linked via external trackers have merged:

Bugzilla bug 1992823 has been moved to the MODIFIED state.

In response to this:

Bug 1992823: rebase on top of kubernetes/autoscaler 1.22


elmiko deleted the openshift-rebase-1.22 branch August 12, 2021 18:14