New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Verification] Fixes flakey tests and race conditions #919
Conversation
…yahya/4313-fix-run-method
|
||
return request.Attempt, request.LastAttempt, request.RetryAfter, true | ||
err := cs.Backend.Run(func(backdata map[flow.Identifier]flow.Entity) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to fix the race condition between this method and UpdateRequestHistory
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this fix the race condition? Both this and Backend.ByID
acquire the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the race condition is between IncrementAttempt
and RequestHistory
.
In the previous approach, when RequestHistory
got the request
object from ByID
, it inspects the Attempt
field, meanwhile IncrementAttempt
could concurrently updating the Attempt
field, causing race condition.
If we use Backend.Run
instead, then IncrementAttempt
will not be blocked waiting until RequestHistory
is done with inspecting the fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ByID
returns the reference (entity) to the chunk request status, and releases the lock right upon return. So the following piece of code is not protected by any lock. Let's call this part "unprotected":
if !exists {
return 0, time.Time{}, time.Duration(0), false
}
request := toChunkRequestStatus(entity)
return request.Attempt, request.LastAttempt, request.RetryAfter, true
The race condition happens when the request for the same chunk ID is updated concurrently by the UpdateRequestHistory
while the RequestHistory
is in its unprotected part (though its ByID
call returned). Since the ByID
released the lock, the UpdateRequestHistory
can obtain the lock immediately and starts updating the request for the same chunk ID that is read by the unprotected part of concurrent RequestHistory
invocation -> race condition. I examined this hypothesis by -race
flag as well.
Now by using the Run
method, we encapsulate the entire body of RequestHistory
logic by a mutex lock. So, no UpdateHistory
can get through while a RequestHistory
call is in process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the race condition is between IncrementAttempt and RequestHistory .
@zhangchiqing in the current code IncrementAttempt
is never used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea I thought about this morning is to add a RWLock to Request
object, so that we can use ByID
to query the Request object, and use a Read lock to read the fields.
The good part is that we no longer need to lock the entire Backend (by Run
) if we only need to inspect one Request.
Codecov Report
@@ Coverage Diff @@
## master #919 +/- ##
==========================================
+ Coverage 54.90% 54.92% +0.01%
==========================================
Files 277 277
Lines 18331 18339 +8
==========================================
+ Hits 10065 10073 +8
Misses 6902 6902
Partials 1364 1364
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
||
return request.Attempt, request.LastAttempt, request.RetryAfter, true | ||
err := cs.Backend.Run(func(backdata map[flow.Identifier]flow.Entity) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this fix the race condition? Both this and Backend.ByID
acquire the lock.
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
|
||
return request.Attempt, request.LastAttempt, request.RetryAfter, true | ||
err := cs.Backend.Run(func(backdata map[flow.Identifier]flow.Entity) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the race condition is between IncrementAttempt
and RequestHistory
.
In the previous approach, when RequestHistory
got the request
object from ByID
, it inspects the Attempt
field, meanwhile IncrementAttempt
could concurrently updating the Attempt
field, causing race condition.
If we use Backend.Run
instead, then IncrementAttempt
will not be blocked waiting until RequestHistory
is done with inspecting the fields.
I just wonder - is it worth contributing back to testify? This seems like bug they might be interested to get fixed |
This PR addresses three issues that cause, or can cause race condition:
The
Run
method oftestify
is not concurrency-safe. Hence, we audit the code and found places where it can cause race conditions and fixed those.There is an inherited race condition between
Run
andAssertExpectationForObjects
methods oftestify
which can cause the test to jump into expectation assertion while the mocked method is not yet returned, and result in a flakey behavior. Hence, we audited the code and found places that we evaluate the expected number of calls by bothWaitGroup
s (within aRun
) andAssertExpectationForObjects
, e.g.,In such scenarios, we normally use
WaitGroup
s to avoid indefinite blocking of the code in case the expected number of calls is not made. However, the sameWaitGroup
also instruments the expected number of calls, i.e., it does not unblock until it receivesx
expected calls. But the moment it unblocks, the process proceeds to theAssertExpectationsForObjects
, while the original mock method may have not yet returned (inherited race condition of testify), so it have not yet satisfied the expected number of completely returned calls and can cause a falkey behavior when combined withAssertExpectationsForObjects
.To fix this issue, we remove
mock.AssertExpectationsForObjects(t, verifier)
for places like the example above where we can verify the expected number of calls as a side-benefit ofWaitGroup
s. Note that we do not useWaitGroup
s primarily to assert expectations, rather we use them to imply a timeout.We instrument the code through
-race
flag ofgo
and found one place that we cause a race condition between theRequestHistory
andUpdateRequestHistory
methods ofChunkRequests
, and fixed it.