[Verification] Fixes flakey tests and race conditions #919

Merged
merged 14 commits into master from yahya/4313-fix-run-method on Jul 9, 2021

Conversation

Contributor

@yhassanzadeh13 yhassanzadeh13 commented Jul 2, 2021

This PR addresses three issues that cause, or can cause, race conditions:

  • The Run method of testify is not concurrency-safe. Hence, we audited the code, found the places where it can cause race conditions, and fixed them.

  • There is an inherent race condition between the Run and AssertExpectationsForObjects methods of testify, which can cause the test to jump into the expectation assertion while the mocked method has not yet returned, resulting in flaky behavior. Hence, we audited the code and found places where we evaluate the expected number of calls through both WaitGroups (within a Run) and AssertExpectationsForObjects, e.g.,

      wg := &sync.WaitGroup{}
      wg.Add(x)
      verifier.On("ProcessLocal", mock.Anything).
          Run(func(args mock.Arguments) {
              // do some processing
              wg.Done()
          }).Return(nil).Times(x)

      unittest.RequireReturnsBefore(t, wg.Wait, 1*time.Second, "could not process chunks on time")
      mock.AssertExpectationsForObjects(t, verifier)

    In such scenarios, we normally use WaitGroups to avoid blocking the code indefinitely in case the expected number of calls is not made. However, the same WaitGroup also instruments the expected number of calls, i.e., it does not unblock until it receives x expected calls. But the moment it unblocks, the process proceeds to AssertExpectationsForObjects while the original mocked method may not have returned yet (the inherent race condition of testify), so the expected number of completely returned calls has not yet been satisfied, which can cause flaky behavior when combined with AssertExpectationsForObjects.
    To fix this issue, we remove mock.AssertExpectationsForObjects(t, verifier) in places like the example above, where we can verify the expected number of calls as a side benefit of the WaitGroup (see the sketch after this list). Note that we do not use WaitGroups primarily to assert expectations; rather, we use them to impose a timeout.

  • We instrumented the code with Go's -race flag, found one place where we cause a race condition between the RequestHistory and UpdateRequestHistory methods of ChunkRequests, and fixed it.
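
For concreteness, the fixed version of the example in the second bullet keeps the WaitGroup and the Times(x) expectation but drops the trailing assertion. This is a minimal sketch that reuses the names from that example (verifier, x, unittest.RequireReturnsBefore); the surrounding test setup is omitted:

	wg := &sync.WaitGroup{}
	wg.Add(x)
	verifier.On("ProcessLocal", mock.Anything).
		Run(func(args mock.Arguments) {
			// do some processing
			wg.Done()
		}).Return(nil).Times(x)

	// The WaitGroup already implies that x calls were made; the timeout only guards
	// against blocking forever if some calls never arrive. Omitting
	// mock.AssertExpectationsForObjects(t, verifier) means the test cannot race
	// against mocked calls that have released the WaitGroup but not yet returned.
	unittest.RequireReturnsBefore(t, wg.Wait, 1*time.Second, "could not process chunks on time")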

@yhassanzadeh13 yhassanzadeh13 marked this pull request as ready for review July 2, 2021 23:39
@yhassanzadeh13 yhassanzadeh13 requested review from jordanschalm and zhangchiqing and removed request for ramtinms July 2, 2021 23:40

return request.Attempt, request.LastAttempt, request.RetryAfter, true
err := cs.Backend.Run(func(backdata map[flow.Identifier]flow.Entity) error {
Contributor Author

to fix the race condition between this method and UpdateRequestHistory.

Member

How does this fix the race condition? Both this and Backend.ByID acquire the lock.

Member

I think the race condition is between IncrementAttempt and RequestHistory.

In the previous approach, when RequestHistory got the request object from ByID, it inspected the Attempt field; meanwhile IncrementAttempt could be concurrently updating the Attempt field, causing a race condition.

If we use Backend.Run instead, then IncrementAttempt will now be blocked, waiting until RequestHistory is done inspecting the fields.

Contributor Author

ByID returns the reference (entity) to the chunk request status and releases the lock immediately upon return. So the following piece of code is not protected by any lock; let's call this part "unprotected":

	if !exists {
		return 0, time.Time{}, time.Duration(0), false
	}
	request := toChunkRequestStatus(entity)

	return request.Attempt, request.LastAttempt, request.RetryAfter, true

The race condition happens when the request for the same chunk ID is updated concurrently by UpdateRequestHistory while RequestHistory is in its unprotected part (even though its ByID call has returned). Since ByID has released the lock, UpdateRequestHistory can obtain the lock immediately and start updating the request for the same chunk ID that is being read by the unprotected part of the concurrent RequestHistory invocation -> race condition. I verified this hypothesis with the -race flag as well.

Now, by using the Run method, we encapsulate the entire body of the RequestHistory logic within the mutex lock, so no UpdateRequestHistory can get through while a RequestHistory call is in progress.
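
For concreteness, a minimal sketch of that shape (the method signature, the uint64 attempt counter, and the ignored error are assumptions; the Backend.Run closure, toChunkRequestStatus, and the returned fields come from the snippets above, and the exact code in the PR may differ):

	// RequestHistory reads the attempt history of a chunk request while holding the
	// backend lock for the entire read, so a concurrent UpdateRequestHistory cannot
	// mutate the request between the lookup and the field accesses.
	func (cs *ChunkRequests) RequestHistory(chunkID flow.Identifier) (uint64, time.Time, time.Duration, bool) {
		var attempt uint64
		var lastAttempt time.Time
		var retryAfter time.Duration
		exists := false

		_ = cs.Backend.Run(func(backdata map[flow.Identifier]flow.Entity) error {
			entity, ok := backdata[chunkID]
			if !ok {
				return nil
			}
			request := toChunkRequestStatus(entity)
			attempt = request.Attempt
			lastAttempt = request.LastAttempt
			retryAfter = request.RetryAfter
			exists = true
			return nil
		})

		return attempt, lastAttempt, retryAfter, exists
	}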

cc @zhangchiqing

Contributor Author

I think the race condition is between IncrementAttempt and RequestHistory.

@zhangchiqing in the current code IncrementAttempt is never used.

Member

Another idea I thought about this morning is to add an RWLock to the Request object, so that we can use ByID to query the Request object and use a read lock to read the fields.

The good part is that we no longer need to lock the entire Backend (by Run) if we only need to inspect one Request.
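
A hypothetical sketch of that alternative (the struct name, fields, and the History/IncrementAttempt helpers here are illustrative, not code from this PR): guard each request's mutable fields with a per-request sync.RWMutex, so readers take only a read lock instead of the backend-wide lock:

	// chunkRequestStatus guards its mutable fields with its own RWMutex, so readers
	// do not need to hold the backend-wide lock while inspecting them.
	type chunkRequestStatus struct {
		mu          sync.RWMutex
		Attempt     uint64
		LastAttempt time.Time
		RetryAfter  time.Duration
	}

	// History reads the fields under a read lock; concurrent readers do not block each other.
	func (r *chunkRequestStatus) History() (uint64, time.Time, time.Duration) {
		r.mu.RLock()
		defer r.mu.RUnlock()
		return r.Attempt, r.LastAttempt, r.RetryAfter
	}

	// IncrementAttempt mutates the fields under the write lock.
	func (r *chunkRequestStatus) IncrementAttempt() {
		r.mu.Lock()
		defer r.mu.Unlock()
		r.Attempt++
		r.LastAttempt = time.Now()
	}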


codecov-commenter commented Jul 2, 2021

Codecov Report

Merging #919 (2035d69) into master (bb2d0da) will increase coverage by 0.01%.
The diff coverage is 84.61%.


@@            Coverage Diff             @@
##           master     #919      +/-   ##
==========================================
+ Coverage   54.90%   54.92%   +0.01%     
==========================================
  Files         277      277              
  Lines       18331    18339       +8     
==========================================
+ Hits        10065    10073       +8     
  Misses       6902     6902              
  Partials     1364     1364              
Flag Coverage Δ
unittests 54.92% <84.61%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
module/mempool/stdmap/chunk_requests.go 55.55% <84.61%> (+6.46%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@yhassanzadeh13 yhassanzadeh13 merged commit f7c951f into master Jul 9, 2021
@yhassanzadeh13 yhassanzadeh13 deleted the yahya/4313-fix-run-method branch July 9, 2021 00:29
Contributor

m4ksio commented Jul 9, 2021

I just wonder - is it worth contributing back to testify? This seems like a bug they might be interested in getting fixed.
