Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

static stride scheduler is flakey/not thread-safe #10433

Closed
tonyjongyoonan opened this issue Jul 30, 2023 · 3 comments · Fixed by #10437 or #10438
Closed

static stride scheduler is flakey/not thread-safe #10433

tonyjongyoonan opened this issue Jul 30, 2023 · 3 comments · Fixed by #10437 or #10438
Assignees
Milestone

Comments

@tonyjongyoonan
Copy link
Contributor

tonyjongyoonan commented Jul 30, 2023

It seems like the static stride scheduler in the WeightedRoundRobinLoadBalancer is flakey/not thread-safe for a different reason as the bug described in #10366 (and fixed in #10370).

In rare cases, WeightedRoundRobinLoadBalancerTest.pickFromOtherThread requires two pass throughs when we have multiple threads, but we should ever only need at most one pass through for a pick. But more importantly, it times out in rare cases.

The issue looks like it's not with scheduler itself but with the testing. It seems like the assert statement keeping track of the iterations is what causes this issue, as removing it solves all instances of the timeouts. I am guessing that the scheduler may not actually require two pass throughs even with multiple threads since the sequence is atomically increased, but that something involving the counting logic is not thread-safe.

@ejona86
Copy link
Member

ejona86 commented Jul 31, 2023

requires two pass throughs (3 loops instead of 2) when we have multiple threads, but we should ever only need one pass through for a pick.

That makes sense. Because the threads would share the atomic integer. The pick rate is still "at most one iteration through the list," but since there are two threads it may need two total iterations.

@tonyjongyoonan
Copy link
Contributor Author

You're right and I overlooked that. I'm still not sure what could've caused the timeout--are assert statements not thread-safe? Also, should this changed be merged into master as well as backported into v1.57?

@ejona86
Copy link
Member

ejona86 commented Jul 31, 2023

Yeah, I think a backport is appropriate.

When an LB throws an exception, it puts the Channel in panic mode. It's fairly common for tests to time out on failure because the test is written "wait for event X" and then that event never happens.

@ejona86 ejona86 modified the milestones: Next, 1.58 Aug 24, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.