Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flightcontrol: protect contention timeouts #5010

Merged
merged 1 commit into from
Jun 10, 2024

Conversation

tonistiigi
Copy link
Member

We had a report of #1822 error showing up in some logs.

Possibly important aspect of this race seems to be that callbacks returning errors are handled differently and retry happens right away when previous callback has errored in the beginning of wait(). This was not handled by previous contention test. With the new test it is easy to reach timeout with ~100 goroutines (and smaller with reduced probability).

This patch reduces the backoff factor so it does not increase too quickly and adds a random factor to initial timeout so that when lots of goroutines hit the error at the same time, they do not all retry at the same time as well. This seems to dramatically reduce the maximum backoff that can be reached by generating contention with the new testcase.

var retryCount int64
g := &Group[string]{}
eg, ctx := errgroup.WithContext(context.Background())
for i := 0; i < 1000; i++ {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With old configuration this test reaches 15 second limit ~75% with N=100 and ~25% with N=50

New configuration
N=10000 max-backoff=300ms
N=1000, max-backoff=70ms
N=500, max-backoff=40ms
N=100, max-backoff=12ms


These numbers are not exactly correct(and would be different on different machines) as they changed when I removed debug and cleaned up some code. The updated code does not fail as easily anymore if N<100 but the relative difference between the reachable backoff is still same.

if errors.Is(err, errRetryTimeout) {
atomic.AddInt64(&retryCount, 1)
}
return err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning nil in here changes timing for the test so that it almost never produces the timeout anymore.

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
@tonistiigi tonistiigi merged commit 5760de7 into moby:master Jun 10, 2024
75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants