The static stride scheduler in the WeightedRoundRobinLoadBalancer appears to be flaky, or not thread-safe, for a different reason than the bug described in #10366 (and fixed in #10370).
In rare cases, WeightedRoundRobinLoadBalancerTest.pickFromOtherThread requires two pass-throughs when multiple threads are involved, even though a pick should only ever need at most one. More importantly, the test occasionally times out.
The issue looks like it is not with the scheduler itself but with the test. The assert statement that keeps track of the iterations seems to be the cause, since removing it eliminates every instance of the timeout. My guess is that the scheduler may not actually require two pass-throughs even with multiple threads, since the sequence is incremented atomically, but that something in the counting logic is not thread-safe.
> requires two pass throughs (3 loops instead of 2) when we have multiple threads, but we should ever only need one pass through for a pick.
That makes sense, because the threads share the atomic integer. The pick rate is still "at most one iteration through the list," but since there are two threads, a single pick may observe two total iterations.
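To make the shared-counter point concrete, here is a minimal sketch. It is not the real scheduler: the class name, the three-entry list, and the acceptance rule (only values that map to index 0 are accepted) are all assumptions made for illustration. It replays one fixed interleaving of two callers on a single thread, so the result is deterministic.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SharedSequenceSketch {
  // Hypothetical list size; not taken from the real scheduler.
  static final int BACKENDS = 3;

  // Hypothetical acceptance rule: only sequence values that map to index 0 are accepted.
  static boolean accepts(int sequence) {
    return sequence % BACKENDS == 0;
  }

  // Draws from the shared counter until a value is accepted; returns the number of draws.
  static int drawsForOnePick(AtomicInteger sequence) {
    int draws = 1;
    while (!accepts(sequence.getAndIncrement())) {
      draws++;
    }
    return draws;
  }

  // Replays one fixed interleaving of two callers, A and B, on a single thread.
  static int drawsSeenByInterleavedCaller() {
    AtomicInteger sequence = new AtomicInteger(1);
    int drawsByA = 0;
    // A draws 1 and 2; both are rejected.
    for (int i = 0; i < 2; i++) {
      sequence.getAndIncrement();
      drawsByA++;
    }
    // B draws 3, which is accepted, so B's pick completes in one draw.
    drawsForOnePick(sequence);
    // A resumes and draws 4 and 5 (rejected), then 6 (accepted).
    drawsByA += drawsForOnePick(sequence);
    return drawsByA;
  }

  public static void main(String[] args) {
    System.out.println("single caller: " + drawsForOnePick(new AtomicInteger(1)));
    System.out.println("interleaved caller A: " + drawsSeenByInterleavedCaller());
  }
}
```

Under these assumptions a lone caller needs 3 draws (one pass), while caller A needs 5 because B consumed the accepted value in between. Together the two picks still use 6 draws, i.e. one pass per pick on average, so a per-thread loop counter that asserts "at most one pass" can fail even though the counter itself is updated atomically.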
You're right, and I overlooked that. I'm still not sure what could have caused the timeout. Are assert statements not thread-safe? Also, should this change be merged into master as well as backported to v1.57?
When an LB throws an exception, it puts the Channel in panic mode. It's fairly common for tests to time out on failure because the test is written "wait for event X" and then that event never happens.
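The "wait for event X" failure mode can be sketched without gRPC at all. In the hypothetical example below, a worker thread either signals a latch or throws first (standing in for a failed assertion inside the code under test). The class and method names are assumptions for illustration, not part of the actual test.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class WaitForEventSketch {
  // Returns true if the event was observed before the timeout elapsed.
  static boolean awaitEvent(boolean workerFails, long timeoutMillis) {
    CountDownLatch eventX = new CountDownLatch(1);
    Thread worker = new Thread(() -> {
      if (workerFails) {
        // Stand-in for a failed assertion or other exception in the worker.
        throw new IllegalStateException("simulated failure before event X");
      }
      eventX.countDown();
    });
    // Swallow the failure so the waiter never learns the real cause,
    // which mirrors a test that only waits for the event.
    worker.setUncaughtExceptionHandler((thread, error) -> { });
    worker.start();
    try {
      return eventX.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("healthy worker observed: " + awaitEvent(false, 2000));
    System.out.println("failing worker observed: " + awaitEvent(true, 200));
  }
}
```

When the worker throws, the waiter only sees the wait expire, so the test reports a timeout rather than the underlying exception. That would be consistent with removing the assert making the timeouts disappear, without implying that assert itself is not thread-safe.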