Bug 1992823: rebase on top of kubernetes/autoscaler 1.22 #209

Merged

merged 154 commits into openshift:master on Aug 12, 2021

Conversation

@elmiko commented Aug 11, 2021

1.22 autoscaler rebase process

inspired by the commit description for the 1.21 rebase.
pr #201

identify carry commits. usually this step would involve looking through the history to determine the carry commits, but in this case i have squashed many of the carry commits as described in OCPCLOUD-1207. the carried commits have been sourced from this branch https://github.com/elmiko/kubernetes-autoscaler/tree/openshift-1.21-history-cleanup

After identifying the carry commits, the next step is to create the new commit-tree that
will be used for the rebase and then cherry pick the carry commits into the new branch.
The following commands cover these steps:

Process

$ git remote update # make sure we update our refs
$ git checkout cluster-autoscaler-1.22.0
$ git checkout -b merge-tmp # create a branch to do our merge work from
$ git checkout openshift/master # we want to be at the tip of the openshift master branch when we run the next command
$ echo 'merge upstream/cluster-autoscaler-1.22' | git commit-tree merge-tmp^{tree} -p HEAD -p merge-tmp -F -
deadbeef12345678 # id of new merge commit
$ git branch merge-1.22 deadbeef12345678 # create a new branch for the cherry-pick work
$ git checkout merge-1.22 # make sure we are on the proper branch
$ git cherry-pick <carry commits>

With the merge-1.22 branch in place, I cherry-picked the carry commits from my history cleanup branch.

Carried Commits

These commits are for features which have not yet been accepted upstream, are integral to our CI platform, or are
specific to the releases we create for OpenShift.

f2519d93d UPSTREAM: <carry>: Implement scale from zero
8f4a01718 UPSTREAM: <carry>: Rename FailureMessage to ErrorMessage
1868d6a46 UPSTREAM: <carry>: Revert "Adding config for event filtering"
8bce568f3 UPSTREAM: <carry>: Set default annotations, and cluster name label to use machine.openshift.io
a5e56e715 UPSTREAM: <carry>: configure repository for OpenShift releases

Dropped Commits

These commits were present in the 1.21 branch; most of them have been squashed into the carried commits listed above.

fbe533c75 UPSTREAM: <carry>: openshift: Fix build with Go 1.16
bb49a8586 UPSTREAM: <carry>: Updating vertical-pod-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/e3b61ff1dde2bfc2bc7f2b5efdd155d1d3299cd7/images/vertical-pod-autoscaler.yml
c6f7a0a95 UPSTREAM: <carry>: VPA: Add Chen to approvers
b6e96d848 UPSTREAM: <carry>: VPA: Update Dockerfiles to use new registry
c99b47a92 UPSTREAM: <carry>: Updating vertical-pod-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/5a1293dd0f380abf50c12d65c36655486d7745d0/images/vertical-pod-autoscaler.yml
1a977f016 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/5a1293dd0f380abf50c12d65c36655486d7745d0/images/atomic-openshift-cluster-autoscaler.yml
1baf08f09 UPSTREAM: <carry>: openshift: add component and subcomponent to OWNERS
be51a45f9 UPSTREAM: <carry>: openshift: add fixes for unstructured scalable resources and node group discovery
024d0fa13 UPSTREAM: <carry>: Updating vertical-pod-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/vertical-pod-autoscaler.yml
3bebbaec8 UPSTREAM: <carry>: Updating atomic-openshift-cluster-autoscaler builder & base images to be consistent with ART Reconciling with https://github.com/openshift/ocp-build-data/tree/f82a216a6a3707b80a635bace9367f1a8288b7a7/images/atomic-openshift-cluster-autoscaler.yml
c20293a38 UPSTREAM: <carry>: openshift: Add tests for nodegroup TemplateNodeInfo
65b1b7d48 UPSTREAM: <carry>: openshift: Copy standard labels from an existing node if one exists
c0a685b68 UPSTREAM: <carry>: openshift: Ensure stable versions of generic labels are also included
8f67a8c4d UPSTREAM: <carry>: openshift: Updating vertical-pod-autoscaler/Dockerfile.rhel baseimages to match ocp-build-data config
1eba04782 UPSTREAM: <carry>: openshift: Updating images/cluster-autoscaler/Dockerfile.rhel baseimages to match ocp-build-data config
3bcdabda5 UPSTREAM: <carry>: openshift: Add vertical-pod-autoscaler/Dockerfile.rhel to match build configuration in ocp-build-data
24a45451c UPSTREAM: <carry>: openshift: Add images/cluster-autoscaler/Dockerfile.rhel to match build configuration in ocp-build-data
15a768ff9 UPSTREAM: <carry>: Convert the mem value consistently with other providers
bb9b3826b UPSTREAM: <carry>: openshift: Implement scale from zero
9bbf2eeb8 UPSTREAM: <carry>: openshift: Rename FailureMessage to ErrorMessage
f302c07bd UPSTREAM: <carry>: openshift: update cluster-autoscaler OWNERS
7f192ac8a UPSTREAM: <carry>: openshift: Revert "Adding config for event filtering"
aef98374e UPSTREAM: <carry>: openshift: Set default annotations, and cluster name label to use machine.openshift.io
e4c5171bd UPSTREAM: <carry>: Bump scripts go version to 1.13 and update path to cloudprovider/clusterapi
2839e4250 UPSTREAM: <carry>: openshift: Extend makefile with 'make goimports' target
589021834 UPSTREAM: <carry>: Fix git commit message verification script
d55f2e8f3 UPSTREAM: <carry>: openshift: Switch builds to use Go 1.12
bb81822af UPSTREAM: <carry>: openshift: create git history verification script
1994c5ee3 UPSTREAM: <carry>: openshift: Add fmt, lint, vet scripts/Makefile
7ad1cdce2 UPSTREAM: <carry>: openshift: cluster-autoscaler.spec: set golang_version to 1.10
5a6b51716 UPSTREAM: <carry>: openshift: cluster-autoscaler.spec: bump golang_version to 1.10.4
0ccae207e UPSTREAM: <carry>: openshift: Fix spec file to be consistent
56e629f74 UPSTREAM: <carry>: openshift: Bump embedded tools
0d6ceb9f0 UPSTREAM: <carry>: openshift: Fix the spec and hack scripts so the package can be built in both build systems
f66358622 UPSTREAM: <carry>: openshift: Add openshift/release Makefile and hack scripts
d5d0ca03c UPSTREAM: <carry>: openshift: Add dockerfile for cluster autoscaler.
356231f8c UPSTREAM: <carry>: openshift: Add spec file for cluster-autoscaler.

mcristina422 and others added 30 commits March 12, 2021 12:51
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.

The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
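
The commit message doesn't include the registration code itself, but the pattern is standard Prometheus gauge wiring. A minimal sketch of one of the four gauges, using the plain Prometheus Go client (the autoscaler routes metrics through its own metrics package, so the variable and function names here are hypothetical):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// clusterCPUCurrentCores backs the cluster_autoscaler_cluster_cpu_current_cores gauge.
var clusterCPUCurrentCores = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "cluster_autoscaler",
	Name:      "cluster_cpu_current_cores",
	Help:      "Current number of cores in the cluster.",
})

func init() {
	prometheus.MustRegister(clusterCPUCurrentCores)
}

// UpdateClusterCPUCurrentCores records the currently observed core count.
func UpdateClusterCPUCurrentCores(cores int64) {
	clusterCPUCurrentCores.Set(float64(cores))
}
```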
Now supported by magnum.
https://review.opendev.org/c/openstack/magnum/+/737580/

If using node group autodiscovery, older versions
of magnum will still forbid scaling to zero or
setting the minimum node count to zero.
Force-refreshing everything at every DeleteNodes call causes slowdown
and throttling on large clusters with many ASGs (and lots of activity).

That function might be called several times in a row during scale-down
(once for each ASG having a node to be removed). Each time the forced
refresh will re-discover all ASGs, all LaunchConfigurations, then re-list all
instances from discovered ASGs.

That immediate refresh isn't required anyway, as the cache's DeleteInstances
concrete implementation will decrement the nodegroup size, and we can
schedule a grouped refresh for the next loop iteration.
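
A sketch of that deferred-refresh idea; the types and helpers below are hypothetical, heavily simplified stand-ins for the AWS provider's real cache, not its actual API:

```go
package aws

import "sync"

// Hypothetical types for illustration only.
type instance struct{ id string }

type asg struct{ curSize int }

type asgCache struct {
	mutex sync.Mutex
	owner map[*instance]*asg // which ASG each cached instance belongs to
}

// terminate is a stub standing in for the real EC2 termination call.
func (c *asgCache) terminate(inst *instance) error { return nil }

// DeleteInstances decrements the cached group size in place instead of
// forcing a full refresh; the next regular refresh loop reconciles ASGs,
// LaunchConfigurations and instances in one grouped pass.
func (c *asgCache) DeleteInstances(instances []*instance) error {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	for _, inst := range instances {
		if err := c.terminate(inst); err != nil {
			return err
		}
		if group, ok := c.owner[inst]; ok {
			group.curSize-- // keep the cache consistent without re-listing
		}
	}
	return nil
}
```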
Sets the `kubernetes.io/arch` (and legacy `beta.kubernetes.io/arch`) labels
to the proper instance architecture.

While at it, re-gen the instance types list (adding new instance types
that were missing).
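
For illustration, setting both labels on a generated node template might look like this (helper name hypothetical, not the provider's actual code):

```go
package aws

// setArchLabels applies the stable and legacy architecture labels.
func setArchLabels(labels map[string]string, arch string) {
	labels["kubernetes.io/arch"] = arch      // stable label
	labels["beta.kubernetes.io/arch"] = arch // legacy label, kept for compatibility
}
```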
The current implementation assumes MIG ids have the
"https://content.googleapis.com" prefix, while the
canonical id format seems to begin with "https://www.googleapis.com".
Both formats work while talking to the GCE API, but
the API returns the latter and other GCP services seem to
assume it as well.
Cluster Autoscaler GCE: change the format of MIG id
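
One way to express that normalization (package and helper name hypothetical, not the provider's actual code):

```go
package gce

import "strings"

// canonicalMigID rewrites the legacy prefix to the canonical one so that
// ids from either source compare equal.
func canonicalMigID(id string) string {
	return strings.Replace(id,
		"https://content.googleapis.com/", "https://www.googleapis.com/", 1)
}
```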
…ycloud-provider

cloudprovider: add Bizflycloud provider
Remove vivekbagade, add towca as an approver in cluster-autoscaler/OWNERS
…piles

aws: Don't pile up successive full refreshes during AWS scaledowns
Release leader election lock on shutdown
FetchAllMigs (unfiltered InstanceGroupManagers.List) is costly as it's not
bounded to MIGs attached to the current cluster, but rather lists all MIGs
in the project/zone, and therefore equally affects all clusters in that
project/zone. Running the calls concurrently over the region's zones (so at
most, 4 concurrent API calls, about once per minute) contains that impact.

findMigsInRegion might be scoped to the current cluster (name pattern),
but also benefits from the same improvement, as it's also costly and
called at each refreshInterval (1min).

Also: we're calling the GCE mig.Get() API again for each MIG (at ~300ms per
API call, in my tests), sequentially and with the global cache lock held
(when updateClusterState -> ... -> GetMigForInstance kicks in). Yet we
already get that bit of information (the MIG's basename) from any other
mig.Get or mig.List call, like the one fetching target sizes. Leveraging
this helps significantly on large fleets (for instance, it shaves 8min off
startup time on the large cluster I tested on).
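
A sketch of the per-zone fan-out described above, using errgroup; the Mig type and listZone callback are hypothetical stand-ins for the real InstanceGroupManagers.List plumbing:

```go
package gce

import (
	"sync"

	"golang.org/x/sync/errgroup"
)

type Mig struct{ Name, Zone string }

// listMigsInRegion runs one list call per zone concurrently (at most ~4
// zones per region), merging the results under a mutex.
func listMigsInRegion(zones []string, listZone func(zone string) ([]Mig, error)) ([]Mig, error) {
	var (
		g    errgroup.Group
		mu   sync.Mutex
		migs []Mig
	)
	for _, zone := range zones {
		zone := zone // capture loop variable for the goroutine
		g.Go(func() error {
			zoneMigs, err := listZone(zone)
			if err != nil {
				return err
			}
			mu.Lock()
			migs = append(migs, zoneMigs...)
			mu.Unlock()
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return migs, nil
}
```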
…locatables

support "/"separators in custom allocatable overrides via vmss tags
add stable zone labels in azure template generation
Document that TLS bootstrapping may be necessary for scale-up
Right now the file is breaking `go mod` commands.
openshift-ci bot commented Aug 11, 2021

@elmiko: This pull request references Bugzilla bug 1992823, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

Bug 1992823: rebase on top of kubernetes/autoscaler 1.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@elmiko (author) commented Aug 11, 2021

not quite sure what this means:

[WARNING] While the jUnit report found no failed tests, the `go test` process failed.
[WARNING] This usually means that the unit test suite failed to compile.
[ERROR] hack/test-go.sh exited with code 1 after 00h 03m 24s 

i do see it locally while testing though, going to dig in a little bit.

@elmiko (author) commented Aug 11, 2021

it looks like the unit tests are failing with a panic in the azure provider suite; all the capi tests pass.

@elmiko (author) commented Aug 11, 2021

/test e2e-aws-operator

@elmiko (author) commented Aug 11, 2021

i have a feeling the unit test failures are related to the timeout we set in hack/test-go.sh. it is currently set to 120 seconds, but i see the azure test taking around 150-160 seconds. this test passes upstream, so i am running some local tests with our timeout set to 180 seconds.

@elmiko (author) commented Aug 11, 2021

confirmed my suspicions; i am updating the hack/test-go.sh script to have a 180 second timeout.

elmiko and others added 5 commits August 11, 2021 17:03
This change carries files and modifications that are used by OpenShift
release infrastructure and related files.

* spec file
* dockerfiles
  * vertical-pod-autoscaler/Dockerfile.rhel
  * images/cluster-autoscaler/Dockerfile
  * images/cluster-autoscaler/Dockerfile.rhel
* hack scripts (ci and build related)
* Makefile
* JUnit tools
* update gitignore
* update/remove OWNERS files
* ci-operator config yaml

Co-authored-by: Avesh Agarwal <avagarwa@redhat.com>
Co-authored-by: Jan Chaloupka <jchaloup@redhat.com>
Co-authored-by: Clayton Coleman <ccoleman@redhat.com>
Co-authored-by: Andrew McDermott <amcdermo@redhat.com>
Co-authored-by: Michael Gugino <mgugino@redhat.com>
Co-authored-by: paulfantom <pawel@krupa.net.pl>
Co-authored-by: Joel Smith <joelsmith@redhat.com>
Co-authored-by: Enxebre <alberto.garcial@hotmail.com>
Co-authored-by: Yaakov Selkowitz <yselkowi@redhat.com>
… use machine.openshift.io

This change updates the default annotations and cluster name label to
use the openshift specific values.

Also updates the unit tests to incorporate the changed api group and
adds the capi group environment variable awareness to the tests.

Co-authored-by: Michael McCune <elmiko@redhat.com>
This allows a Machine{Set,Deployment} to scale up/down from 0,
providing the following annotations are set:

```yaml
apiVersion: v1
items:
- apiVersion: machine.openshift.io/v1beta1
  kind: MachineSet
  metadata:
    annotations:
      machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "0"
      machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"
      machine.openshift.io/vCPU: "2"
      machine.openshift.io/memoryMb: 8G
      machine.openshift.io/GPU: "1"
      machine.openshift.io/maxPods: "100"
```

Note that `machine.openshift.io/GPU` and `machine.openshift.io/maxPods`
are optional.

For autoscaling from zero, the autoscaler should convert the mem value
received in the appropriate annotation to bytes using powers of two,
consistently with other providers, and fail if the received format is not
as expected. This gives robust behaviour consistent with cloud providers'
APIs and provider implementations.

https://cloud.google.com/compute/all-pricing
https://www.iec.ch/si/binary.htm
https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L366

Co-authored-by: Enxebre <alberto.garcial@hotmail.com>
Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk>
Co-authored-by: Michael McCune <elmiko@redhat.com>
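
A minimal sketch of that conversion, assuming the powers-of-two convention the commit message describes (package and function names are hypothetical, not the provider's actual code):

```go
package clusterapi

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryBytes converts an annotation value such as "8G" or "8192M"
// to bytes, treating G as 2^30 and M as 2^20, and fails on any other format.
func parseMemoryBytes(val string) (int64, error) {
	units := []struct {
		suffix string
		mult   int64
	}{
		{"G", 1 << 30},
		{"M", 1 << 20},
	}
	for _, u := range units {
		if strings.HasSuffix(val, u.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(val, u.suffix), 10, 64)
			if err != nil {
				return 0, fmt.Errorf("unexpected memory value %q: %v", val, err)
			}
			return n * u.mult, nil
		}
	}
	return 0, fmt.Errorf("unexpected memory value %q", val)
}
```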
openshift-ci bot commented Aug 12, 2021

@elmiko: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/git-history
Commit: f3b5e63
Rerun command: /test git-history

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@JoelSpeed commented Aug 12, 2021

/test e2e-azure-operator

Process looks good, tests seem to be passing for the most part
/approve

openshift-ci bot commented Aug 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Aug 12, 2021
@Fedosin commented Aug 12, 2021

/lgtm
/hold

hold is for e2e-azure-operator to pass

openshift-ci bot added the do-not-merge/hold label Aug 12, 2021
openshift-ci bot added the lgtm label Aug 12, 2021
@lobziik commented Aug 12, 2021

/lgtm

@elmiko (author) commented Aug 12, 2021

cancelling hold as all tests are passing

/hold cancel

openshift-ci bot removed the do-not-merge/hold label Aug 12, 2021
openshift-ci bot merged commit 68fe93a into openshift:master on Aug 12, 2021
openshift-ci bot commented Aug 12, 2021

@elmiko: All pull requests linked via external trackers have merged:

Bugzilla bug 1992823 has been moved to the MODIFIED state.

In response to this:

Bug 1992823: rebase on top of kubernetes/autoscaler 1.22


elmiko deleted the openshift-rebase-1.22 branch August 12, 2021 18:14