Skip to content

Conversation

@glbrntt
Copy link
Collaborator

@glbrntt glbrntt commented Jan 29, 2024

Motivation:

The 'testRetriesCantBeExecutedForTooManyRequestMessages' test wedges infrequently. After some diagnosis this appears to be due to the intersection of two bugs. The first is that when yielding an element to the sequence, if some consumers exist, of which a subset are waiting for the next element and the remainder are not 'slow', the continuations of the fastest consumers would not be resumed when a new element was added.

The other bug is a timing issue: when waiting for the next element a subscriber may be told to suspend. However, the subscriber must drop and then reacquire the lock to store the continuation. The state wasn't re-checked on storing the continuation which opened up a window where the element may have become present between asking for it and storing the continuation.

Modifications:

  • Don't release the lock between attempting to consume an element and suspending to wait for it.
  • Resume waiting consumers more frequently.
  • Make a few tests less flaky

Result:

Test didn't wedge in 10k iterations

Motivation:

The 'testRetriesCantBeExecutedForTooManyRequestMessages' test wedges
infrequently. After some diagnosis this appears to be due to the
intersection of two bugs. The first is that when yielding an element to
the sequence, if some consumers exist, of which a subset are waiting for
the next element and the remainder are not 'slow', the continuations of
the fastest consumers would not be resumed when a new element was added.

The other bug is a timing issue: when waiting for the next element a
subscriber may be told to suspend. However, the subscriber must drop and
then reacquire the lock to store the continuation. The state wasn't
re-checked on storing the continuation which opened up a window where
the element may have become present between asking for it and storing
the continuation.

Modifications:

- Don't release the lock between attempting to consume an element and
  suspending to wait for it.
- Resume waiting consumers more frequently.
- Make a few tests less flaky

Result:

Test didn't wedge in 10k iterations
@glbrntt glbrntt requested a review from gjcairo January 29, 2024 17:27
@glbrntt glbrntt added the version/v2 Relates to v2 label Jan 29, 2024
@glbrntt glbrntt enabled auto-merge (squash) January 29, 2024 17:42
@glbrntt glbrntt merged commit 2c27ad5 into grpc:main Jan 30, 2024
@glbrntt glbrntt deleted the fix-flaky-tests branch January 30, 2024 10:54
glbrntt added a commit to glbrntt/grpc-swift that referenced this pull request Feb 5, 2024
…grpc#1781)

Motivation:

The 'testRetriesCantBeExecutedForTooManyRequestMessages' test wedges
infrequently. After some diagnosis this appears to be due to the
intersection of two bugs. The first is that when yielding an element to
the sequence, if some consumers exist, of which a subset are waiting for
the next element and the remainder are not 'slow', the continuations of
the fastest consumers would not be resumed when a new element was added.

The other bug is a timing issue: when waiting for the next element a
subscriber may be told to suspend. However, the subscriber must drop and
then reacquire the lock to store the continuation. The state wasn't
re-checked on storing the continuation which opened up a window where
the element may have become present between asking for it and storing
the continuation.

Modifications:

- Don't release the lock between attempting to consume an element and
  suspending to wait for it.
- Resume waiting consumers more frequently.
- Make a few tests less flaky

Result:

Test didn't wedge in 10k iterations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

version/v2 Relates to v2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants