Acquire install lease when provisioning a cluster#76238
danilo-gemoli wants to merge 1 commit into openshift:main
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danilo-gemoli

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@danilo-gemoli: The following test failed, say

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
[REHEARSALNOTIFIER]

A total of 28371 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. A full list of affected jobs can be found here.

Interacting with pj-rehearse: Once you are satisfied with the results of the rehearsals, comment:
/hold
    cp -rfpv "$backup" "$dir"
else
    date "+%F %X" > "${SHARED_DIR}/CLUSTER_INSTALL_START_TIME"
    acquire_install_lease_atomic
- acquire_install_lease_atomic
+ acquire_install_lease_atomic || true
Perhaps? I really don't care if this fails. If it works, it will have a positive impact. If it doesn't, it shouldn't change the situation.
That's fine, this is just an optimization and the whole installation process shouldn't fail if we can't acquire such a lease.
}
export -f release_and_acquire_install_lease_atomic

trap 'release_install_lease_atomic' EXIT TERM INT
This trap overwrites previously set traps (prepare_next_steps), which I think will cause problems. A simple example:
#!/bin/bash
function prepare_next_steps() {
echo "prepare_next_steps called"
}
function release_install_lease_atomic() {
echo "release_install_lease_atomic called"
}
trap 'prepare_next_steps' EXIT TERM INT
trap 'release_install_lease_atomic' EXIT TERM INT
echo "End, watch which trap fires"
Agree. We can use trap chaining. Something like:
add_trap() {
local new_cmd="$1"
local signal="$2"
# Extract the current trap command
local existing_cmd
existing_cmd=$(trap -p "$signal" | sed "s/trap -- '\(.*\)' $signal/\1/")
if [[ -z "$existing_cmd" ]]; then
trap "$new_cmd" "$signal"
else
# Prepend or append; usually appending is safer for cleanup
trap "$existing_cmd; $new_cmd" "$signal"
fi
}
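To illustrate, the chaining helper can be exercised end to end. The standalone demo below (not code from the PR) restates add_trap verbatim and registers two cleanup functions on EXIT; both should fire, in registration order, instead of the second overwriting the first:

```shell
#!/bin/bash
# Standalone demo of the add_trap chaining helper suggested above.

add_trap() {
  local new_cmd="$1"
  local signal="$2"
  local existing_cmd
  # Extract the currently installed trap command, if any.
  existing_cmd=$(trap -p "$signal" | sed "s/trap -- '\(.*\)' $signal/\1/")
  if [[ -z "$existing_cmd" ]]; then
    trap "$new_cmd" "$signal"
  else
    # Append the new command so cleanup handlers run in registration order.
    trap "$existing_cmd; $new_cmd" "$signal"
  fi
}

prepare_next_steps() { echo "prepare_next_steps called"; }
release_install_lease_atomic() { echo "release_install_lease_atomic called"; }

# Run the demo in a subshell so its EXIT trap fires when the subshell
# returns; both functions should run, in registration order.
demo_output=$(
  add_trap 'prepare_next_steps' EXIT
  add_trap 'release_install_lease_atomic' EXIT
  echo "end of subshell"
)
echo "$demo_output"
```

Note that chaining relies on parsing `trap -p` output, so it assumes the previously installed trap command contains no unescaped single quotes.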
fi
export INSTALL_LEASE_ENABLED

export RELEASE_LEASE_DELAY=20m
If 50 jobs start at the exact same time (e.g. release controller) and all get a lease, the installers will likely all be doing the same things at the same time on the same cloud provider. They'll then all release around the exact same time, and another 50 might start.
We might want to stagger lease acquisition randomly, by having a delay before we acquire the lease. I tried to solve a similar problem in openshift/release-controller#737, but it is simpler to do here.
delay=$(( RANDOM % 901 ))
printf 'Waiting %dm%ds before acquiring install lease\n' $(( delay / 60 )) $(( delay % 60 ))
sleep $delay
acquire_install_lease_atomic
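As a standalone sanity check of the jitter arithmetic above (not code from the PR): `RANDOM % 901` yields 0-900 seconds, i.e. at most 15 minutes of delay, and a fixed delay of 754 seconds formats as 12m34s:

```shell
#!/bin/bash
# Standalone check of the jitter snippet's arithmetic and formatting.

# Bound check: RANDOM % 901 can never exceed 900 seconds (15 minutes).
max_delay=$(( 900 / 60 ))
echo "max jitter: ${max_delay}m"

# Formatting check, using a fixed value instead of $RANDOM.
delay=754
msg=$(printf 'Waiting %dm%ds before acquiring install lease' $(( delay / 60 )) $(( delay % 60 )))
echo "$msg"
```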
That could be an optimization, but the intent here is to ratchet down to a known sustainable number of concurrent installers. Both in terms of count & duration. Once we find the sweet spot, introducing jitter could allow us to increase count, but at this point, non-determinism could confuse the ratcheting process.
> non-determinism could confuse the ratcheting process
What would be confused? Jitter before install start would be more valuable than limiting concurrent installs, IMHO. There's a huge number of aggregated jobs that hit 1-2 minute blips in build infrastructure that end up killing the entire payload because we go below the threshold we need for statistical confidence. Not to mention the thundering herd on specific cloud resources (e.g. all RC jobs creating load balancers at the same time)
Without jitter, I think this PR will make the situation worse. As it applies globally to everything installed for that cloud provider, we're going to start triggering more installs to occur in simultaneous waves.

Imagine we limit to 50 concurrent installs.

- RC triggers 50 jobs. Over those 20 minutes, 200 more pile up. 200 jobs are now sitting in a "wait" state (instead of starting at least offset from each other a little bit).
- The moment those first 50 leases expire (at exactly t + 20 minutes), another 50 will start at the exact same time.
- 20 minutes later, 50 more start.

Instead of a chaotic but distributed flow -- with peaks and valleys -- you've created a square wave pattern. You will see 100% utilization of the lease bucket, and always have high numbers of installs starting at the exact same time, compounding the problem of "all the installers doing the same thing in the cloud at the same time".
This PR tries to mitigate the rate-limiting issue we are facing on several cloud accounts, Azure being the most affected one.

The core idea is to limit the number of jobs, per cluster profile (hence per cloud account), that are allowed to provision a cluster. We achieve that by acquiring an install lease (see #76230) from a small pool, and holding it only for about 20m. During the first 20m openshift-install makes a lot of requests to a cloud provider, therefore increasing the odds of being rate-limited, particularly during the CI rush hours.

There is a lot going on in this script, but reviewing it is much easier assuming this mental model: ipi-install-install makes several attempts to create a cluster, and with regard to install lease acquisition, the execution flow performs what follows:

1. The nth iteration starts.
2. If a lease is already held, release it and acquire a new one with release_and_acquire_install_lease_atomic. 2a. Otherwise acquire one with acquire_install_lease_atomic.
3. Schedule the delayed release with release_install_lease_delayed_atomic &.
4. Run openshift-install create cluster.
5. Release on exit via trap 'release_install_lease_atomic' EXIT TERM INT.

Since at least two processes are involved in this workflow, the functions acquire_lease and release_lease are atomic and they rely on the flock synchronization primitive.

The lease proxy client scripts are always available via source "$LEASE_PROXY_CLIENT_SH" (see openshift/ci-tools#5010). They have been defined in #75306.
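The actual atomic helpers live in the lease proxy client from #75306 and are not reproduced in this PR body. As a rough illustration only, a flock-based acquire/release pair could look like the sketch below; the function bodies, LOCK_FILE, and LEASE_STATE are hypothetical and are not the ci-tools implementation:

```shell
#!/bin/bash
# Hypothetical sketch of flock-backed atomic lease helpers; NOT the actual
# ci-tools implementation. LOCK_FILE and LEASE_STATE are made-up paths.
LOCK_FILE="${TMPDIR:-/tmp}/install-lease.lock"
LEASE_STATE="${TMPDIR:-/tmp}/install-lease.state"
: > "$LEASE_STATE"  # start with a free lease

acquire_install_lease_atomic() {
  (
    flock -x 9                           # serialize against other processes
    if [[ -s "$LEASE_STATE" ]]; then
      echo "install lease already held" >&2
      exit 1                             # exits the subshell -> nonzero return
    fi
    echo "held-by-$$" > "$LEASE_STATE"
  ) 9>"$LOCK_FILE"
}

release_install_lease_atomic() {
  (
    flock -x 9
    : > "$LEASE_STATE"                   # mark the lease as free again
  ) 9>"$LOCK_FILE"
}
```

With this sketch, a second acquire while the lease is held returns nonzero, which is why an `|| true` on the call site matters if the caller runs under `set -e`.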