ci-operator: retry infra-failed builds immediately #2648

Conversation

jupierce left a comment
One thought. Otherwise, looks great!
```go
	return fmt.Errorf("could not create build %s: %w", build.Name, err)
var buildErr error
attempts := 3
if boErr := wait.ExponentialBackoff(wait.Backoff{Duration: 1 * time.Minute, Factor: 2, Steps: attempts}, func() (bool, error) {
```
Seems like it takes about 5 minutes for a new node to come online. This backoff gives us a chance at one new node arriving in time. Given the low cost of the kubelet rejecting us, I'd consider attempts=5 and factor=1.5, which would give us a few chances for new nodes joining the cluster.
I've always found the apimachinery parameters confusing, but note that these values will give us:
```lua
n, d, f, t = 5, 1, 1.5, 0
print("i", "t", "d")
for i = 0, n - 1 do
  print(i, t, d)
  t, d = t + d, d * f
end
```

```
i  t      d
0  0      1
1  1      1.5
2  2.5    2.25
3  4.75   3.375
4  8.125  5.0625
```
Which will give us a single retry (the last) after a new node arrives at t ~ 5, unless the combined retries take more than 1.875s (I don't know how long they take and could not find an example).
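The same attempt schedule can be reproduced with a self-contained Go sketch that mirrors how apimachinery's `wait.ExponentialBackoff` spaces attempts (assumptions: each attempt itself takes negligible time, and `schedule` is a hypothetical helper written here for illustration, not part of apimachinery):

```go
package main

import "fmt"

// schedule returns the start time of each attempt (in minutes) for an
// exponential backoff with the given initial duration, factor, and step
// count. This mirrors wait.ExponentialBackoff's spacing, assuming each
// attempt itself takes negligible time.
func schedule(duration, factor float64, steps int) []float64 {
	times := make([]float64, 0, steps)
	t := 0.0
	for i := 0; i < steps; i++ {
		times = append(times, t)
		t += duration       // next attempt starts after the current delay
		duration *= factor  // delay grows by the backoff factor
	}
	return times
}

func main() {
	// The suggested parameters: Duration=1min, Factor=1.5, Steps=5.
	for i, t := range schedule(1, 1.5, 5) {
		fmt.Printf("attempt %d at t=%g min\n", i, t)
	}
}
```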
I think that we do not necessarily need to retry until there is a new node - that's the scheduler's job. We just need to give the scheduler a chance to retry the bad placement that results in OutOfCpu. It is fine if the retried Pod ends up unschedulable; that makes it wait for the node.
`ci-operator` was already able to recognize infrastructure-failed builds from previous runs and retry them. This is an attempt to reuse that code to retry such failed builds immediately, with two attempts in an exponential backoff. The backoff has an intentionally long starting delay of 1 minute to give the infrastructure problem a chance to go away. The way the code is structured makes it less optimal for the case where we are retrying infra failures from the previous executions: it will eat one of the backoff iterations, but such cases should be rare because ci-op runs should not result in failures caused by infrastructure failures anymore (because they are retried immediately).
Force-pushed from efee15c to c056a8c.
/test integration
```go
		return false, nil
	}); boErr != nil {
		if boErr == wait.ErrWaitTimeout {
			return fmt.Errorf("build not successful after %d attempts, last error: %w", attempts, buildErr)
```
It may be more informative to wrap an aggregate with all errors, instead of just the last one.
Good point, will address
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jupierce, petr-muller.
@petr-muller: all tests passed!
/hold
🖕 Tide
/cc @bbguimaraes @jupierce @openshift/test-platform