Acquire install lease when provisioning a cluster#76238
danilo-gemoli wants to merge 1 commit into openshift:main
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danilo-gemoli

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@danilo-gemoli: The following test failed, say

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
[REHEARSALNOTIFIER]

A total of 28371 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. A full list of affected jobs can be found here.

Interacting with pj-rehearse: Once you are satisfied with the results of the rehearsals, comment:
/hold
    cp -rfpv "$backup" "$dir"
else
    date "+%F %X" > "${SHARED_DIR}/CLUSTER_INSTALL_START_TIME"
    acquire_install_lease_atomic
- acquire_install_lease_atomic
+ acquire_install_lease_atomic || true
Perhaps? I really don't care if this fails. If it works, it will have a positive impact. If it doesn't, it shouldn't change the situation.
That's fine, this is just an optimization and the whole installation process shouldn't fail if we can't acquire such a lease.
}
export -f release_and_acquire_install_lease_atomic

trap 'release_install_lease_atomic' EXIT TERM INT
This trap overwrites previously set traps (prepare_next_steps), which I think will cause problems. A simple example:
#!/bin/bash
function prepare_next_steps() {
echo "prepare_next_steps called"
}
function release_install_lease_atomic() {
echo "release_install_lease_atomic called"
}
trap 'prepare_next_steps' EXIT TERM INT
trap 'release_install_lease_atomic' EXIT TERM INT
echo "End, watch which trap fires"
Agree. We can use trap chaining. Something like:
add_trap() {
local new_cmd="$1"
local signal="$2"
# Extract the current trap command
local existing_cmd
existing_cmd=$(trap -p "$signal" | sed "s/trap -- '\(.*\)' $signal/\1/")
if [[ -z "$existing_cmd" ]]; then
trap "$new_cmd" "$signal"
else
# Prepend or append; usually appending is safer for cleanup
trap "$existing_cmd; $new_cmd" "$signal"
fi
}
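To illustrate, the chaining helper can be exercised end to end. The standalone demo below (not code from the PR) restates add_trap verbatim and registers two cleanup functions on EXIT; both should fire, in registration order, instead of the second overwriting the first:

```shell
#!/bin/bash
# Standalone demo of the add_trap chaining helper suggested above.

add_trap() {
  local new_cmd="$1"
  local signal="$2"
  local existing_cmd
  # Extract the currently installed trap command, if any.
  existing_cmd=$(trap -p "$signal" | sed "s/trap -- '\(.*\)' $signal/\1/")
  if [[ -z "$existing_cmd" ]]; then
    trap "$new_cmd" "$signal"
  else
    # Append the new command so cleanup handlers run in registration order.
    trap "$existing_cmd; $new_cmd" "$signal"
  fi
}

prepare_next_steps() { echo "prepare_next_steps called"; }
release_install_lease_atomic() { echo "release_install_lease_atomic called"; }

# Run the demo in a subshell so its EXIT trap fires when the subshell
# returns; both functions should run, in registration order.
demo_output=$(
  add_trap 'prepare_next_steps' EXIT
  add_trap 'release_install_lease_atomic' EXIT
  echo "end of subshell"
)
echo "$demo_output"
```

Note that chaining relies on parsing `trap -p` output, so it assumes the previously installed trap command contains no unescaped single quotes.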
fi
export INSTALL_LEASE_ENABLED

export RELEASE_LEASE_DELAY=20m
If 50 jobs start at the exact same time (e.g. release controller) and all get a lease, the installers will likely all be doing the same things at the same time on the same cloud provider. They'll then all release around the exact same time, and another 50 might start.
We might want to stagger lease acquisition randomly, by having a delay before we acquire the lease. I tried to solve a similar problem in openshift/release-controller#737, but it is simpler to do here.
delay=$(( RANDOM % 901 ))
printf 'Waiting %dm%ds before acquiring install lease\n' $(( delay / 60 )) $(( delay % 60 ))
sleep $delay
acquire_install_lease_atomic
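As a standalone sanity check of the jitter arithmetic above (not code from the PR): `RANDOM % 901` yields 0-900 seconds, i.e. at most 15 minutes of delay, and a fixed delay of 754 seconds formats as 12m34s:

```shell
#!/bin/bash
# Standalone check of the jitter snippet's arithmetic and formatting.

# Bound check: RANDOM % 901 can never exceed 900 seconds (15 minutes).
max_delay=$(( 900 / 60 ))
echo "max jitter: ${max_delay}m"

# Formatting check, using a fixed value instead of $RANDOM.
delay=754
msg=$(printf 'Waiting %dm%ds before acquiring install lease' $(( delay / 60 )) $(( delay % 60 )))
echo "$msg"
```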
That could be an optimization, but the intent here is to ratchet down to a known sustainable number of concurrent installers. Both in terms of count & duration. Once we find the sweet spot, introducing jitter could allow us to increase count, but at this point, non-determinism could confuse the ratcheting process.
> non-determinism could confuse the ratcheting process
What would be confused? Jitter before install start would be more valuable than limiting concurrent installs, IMHO. There's a huge number of aggregated jobs that hit 1-2 minute blips in build infrastructure that end up killing the entire payload because we go below the threshold we need for statistical confidence. Not to mention the thundering herd on specific cloud resources (e.g. all RC jobs creating load balancers at the same time)
Without jitter, I think this PR will make the situation worse. As it applies globally to everything installed for that cloud provider, we're going to start triggering more installs to occur in simultaneous waves.

Imagine we limit to 50 concurrent installs.

- RC triggers 50 jobs. Over those 20 minutes, 200 more pile up. 200 jobs are now sitting in a "wait" state (instead of starting at least offset from each other a little bit).
- The moment those first 50 leases expire (at exactly t + 20 minutes), another 50 will start at the exact same time.
- 20 minutes later, 50 more start.

Instead of a chaotic but distributed flow -- with peaks and valleys -- you've created a square wave pattern. You will see 100% utilization of the lease bucket, and always have high numbers of installs starting at the exact same time, compounding the problem of "all the installers doing the same thing in the cloud at the same time".
This PR tries to mitigate the rate-limiting issue we are facing on several cloud accounts, Azure being the most affected one.

The core idea is to limit the number of jobs, per cluster profile (hence per cloud account), that are allowed to provision a cluster. We achieve that by acquiring an install lease (see #76230) from a small pool, and holding it only for about 20m. During the first 20m openshift-install makes a lot of requests to a cloud provider, therefore increasing the odds of being rate-limited, particularly during the CI rush hours.

There is a lot going on in this script, but reviewing it is much easier assuming this mental model: ipi-install-install makes several attempts to create a cluster, and with regard to install lease acquisition, the execution flow performs what follows:

1. The nth iteration starts.
2. If a lease is already held, release it and acquire a new one with release_and_acquire_install_lease_atomic. 2a. Otherwise acquire one with acquire_install_lease_atomic.
3. Schedule the delayed release with release_install_lease_delayed_atomic &.
4. Run openshift-install create cluster.
5. Release on exit via trap 'release_install_lease_atomic' EXIT TERM INT.

Since at least two processes are involved in this workflow, the functions acquire_lease and release_lease are atomic and they rely on the flock synchronization primitive.

The lease proxy client scripts are always available via source "$LEASE_PROXY_CLIENT_SH" (see openshift/ci-tools#5010). They have been defined in #75306.
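The actual atomic helpers live in the lease proxy client from #75306 and are not reproduced in this PR body. As a rough illustration only, a flock-based acquire/release pair could look like the sketch below; the function bodies, LOCK_FILE, and LEASE_STATE are hypothetical and are not the ci-tools implementation:

```shell
#!/bin/bash
# Hypothetical sketch of flock-backed atomic lease helpers; NOT the actual
# ci-tools implementation. LOCK_FILE and LEASE_STATE are made-up paths.
LOCK_FILE="${TMPDIR:-/tmp}/install-lease.lock"
LEASE_STATE="${TMPDIR:-/tmp}/install-lease.state"
: > "$LEASE_STATE"  # start with a free lease

acquire_install_lease_atomic() {
  (
    flock -x 9                           # serialize against other processes
    if [[ -s "$LEASE_STATE" ]]; then
      echo "install lease already held" >&2
      exit 1                             # exits the subshell -> nonzero return
    fi
    echo "held-by-$$" > "$LEASE_STATE"
  ) 9>"$LOCK_FILE"
}

release_install_lease_atomic() {
  (
    flock -x 9
    : > "$LEASE_STATE"                   # mark the lease as free again
  ) 9>"$LOCK_FILE"
}
```

With this sketch, a second acquire while the lease is held returns nonzero, which is why an `|| true` on the call site matters if the caller runs under `set -e`.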