[Verification] Fixes flakey tests and race conditions #919

Merged
merged 14 commits into master from yahya/4313-fix-run-method on Jul 9, 2021

Conversation

Contributor

@yhassanzadeh13 yhassanzadeh13 commented Jul 2, 2021

This PR addresses three issues that cause, or can cause, race conditions:

  • The Run method of testify is not concurrency-safe. Hence, we audited the code, found the places where it can cause race conditions, and fixed them.

  • There is an inherent race condition between the Run and AssertExpectationsForObjects methods of testify, which can cause the test to jump into the expectation assertion while the mocked method has not yet returned, resulting in flaky behavior. Hence, we audited the code and found places where we evaluate the expected number of calls through both WaitGroups (within a Run) and AssertExpectationsForObjects, e.g.,

      wg := &sync.WaitGroup{}
      wg.Add(x)
      verifier.On("ProcessLocal", mock.Anything).
          Run(func(args mock.Arguments) {
              // do some processing
              wg.Done()
          }).Return(nil).Times(x)

      unittest.RequireReturnsBefore(t, wg.Wait, 1*time.Second, "could not process chunks on time")
      mock.AssertExpectationsForObjects(t, verifier)

    In such scenarios, we normally use WaitGroups to avoid blocking the code indefinitely in case the expected number of calls is not made. However, the same WaitGroup also instruments the expected number of calls, i.e., it does not unblock until it receives x expected calls. But the moment it unblocks, the process proceeds to AssertExpectationsForObjects while the original mocked method may not have returned yet (the inherent race condition of testify), so the expected number of completely returned calls has not yet been satisfied, which can cause flaky behavior when combined with AssertExpectationsForObjects.
    To fix this issue, we remove mock.AssertExpectationsForObjects(t, verifier) in places like the example above, where we can verify the expected number of calls as a side benefit of the WaitGroup (see the sketch after this list). Note that we do not use WaitGroups primarily to assert expectations; rather, we use them to impose a timeout.

  • We instrumented the code with Go's -race flag, found one place where we cause a race condition between the RequestHistory and UpdateRequestHistory methods of ChunkRequests, and fixed it.
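
For concreteness, the fixed version of the example in the second bullet keeps the WaitGroup and the Times(x) expectation but drops the trailing assertion. This is a minimal sketch that reuses the names from that example (verifier, x, unittest.RequireReturnsBefore); the surrounding test setup is omitted:

	wg := &sync.WaitGroup{}
	wg.Add(x)
	verifier.On("ProcessLocal", mock.Anything).
		Run(func(args mock.Arguments) {
			// do some processing
			wg.Done()
		}).Return(nil).Times(x)

	// The WaitGroup already implies that x calls were made; the timeout only guards
	// against blocking forever if some calls never arrive. Omitting
	// mock.AssertExpectationsForObjects(t, verifier) means the test cannot race
	// against mocked calls that have released the WaitGroup but not yet returned.
	unittest.RequireReturnsBefore(t, wg.Wait, 1*time.Second, "could not process chunks on time")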

@yhassanzadeh13 yhassanzadeh13 marked this pull request as ready for review July 2, 2021 23:39
@yhassanzadeh13 yhassanzadeh13 requested review from jordanschalm and zhangchiqing and removed request for ramtinms July 2, 2021 23:40

return request.Attempt, request.LastAttempt, request.RetryAfter, true
err := cs.Backend.Run(func(backdata map[flow.Identifier]flow.Entity) error {
Contributor Author

to fix the race condition between this method and UpdateRequestHistory.

Member

How does this fix the race condition? Both this and Backend.ByID acquire the lock.

Member

I think the race condition is between IncrementAttempt and RequestHistory.

In the previous approach, when RequestHistory got the request object from ByID, it inspected the Attempt field; meanwhile IncrementAttempt could be concurrently updating the Attempt field, causing a race condition.

If we use Backend.Run instead, then IncrementAttempt will now be blocked, waiting until RequestHistory is done inspecting the fields.

Contributor Author

ByID returns the reference (entity) to the chunk request status and releases the lock immediately upon return. So the following piece of code is not protected by any lock; let's call this part "unprotected":

	if !exists {
		return 0, time.Time{}, time.Duration(0), false
	}
	request := toChunkRequestStatus(entity)

	return request.Attempt, request.LastAttempt, request.RetryAfter, true

The race condition happens when the request for the same chunk ID is updated concurrently by UpdateRequestHistory while RequestHistory is in its unprotected part (even though its ByID call has returned). Since ByID has released the lock, UpdateRequestHistory can obtain the lock immediately and start updating the request for the same chunk ID that is being read by the unprotected part of the concurrent RequestHistory invocation -> race condition. I verified this hypothesis with the -race flag as well.

Now, by using the Run method, we encapsulate the entire body of the RequestHistory logic within the mutex lock, so no UpdateRequestHistory can get through while a RequestHistory call is in progress.
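
For concreteness, a minimal sketch of that shape (the method signature, the uint64 attempt counter, and the ignored error are assumptions; the Backend.Run closure, toChunkRequestStatus, and the returned fields come from the snippets above, and the exact code in the PR may differ):

	// RequestHistory reads the attempt history of a chunk request while holding the
	// backend lock for the entire read, so a concurrent UpdateRequestHistory cannot
	// mutate the request between the lookup and the field accesses.
	func (cs *ChunkRequests) RequestHistory(chunkID flow.Identifier) (uint64, time.Time, time.Duration, bool) {
		var attempt uint64
		var lastAttempt time.Time
		var retryAfter time.Duration
		exists := false

		_ = cs.Backend.Run(func(backdata map[flow.Identifier]flow.Entity) error {
			entity, ok := backdata[chunkID]
			if !ok {
				return nil
			}
			request := toChunkRequestStatus(entity)
			attempt = request.Attempt
			lastAttempt = request.LastAttempt
			retryAfter = request.RetryAfter
			exists = true
			return nil
		})

		return attempt, lastAttempt, retryAfter, exists
	}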

cc @zhangchiqing

Contributor Author

I think the race condition is between IncrementAttempt and RequestHistory.

@zhangchiqing in the current code IncrementAttempt is never used.

Member

Another idea I thought about this morning is to add an RWLock to the Request object, so that we can use ByID to query the Request object and use a read lock to read the fields.

The good part is that we no longer need to lock the entire Backend (by Run) if we only need to inspect one Request.
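
A hypothetical sketch of that alternative (the struct name, fields, and the History/IncrementAttempt helpers here are illustrative, not code from this PR): guard each request's mutable fields with a per-request sync.RWMutex, so readers take only a read lock instead of the backend-wide lock:

	// chunkRequestStatus guards its mutable fields with its own RWMutex, so readers
	// do not need to hold the backend-wide lock while inspecting them.
	type chunkRequestStatus struct {
		mu          sync.RWMutex
		Attempt     uint64
		LastAttempt time.Time
		RetryAfter  time.Duration
	}

	// History reads the fields under a read lock; concurrent readers do not block each other.
	func (r *chunkRequestStatus) History() (uint64, time.Time, time.Duration) {
		r.mu.RLock()
		defer r.mu.RUnlock()
		return r.Attempt, r.LastAttempt, r.RetryAfter
	}

	// IncrementAttempt mutates the fields under the write lock.
	func (r *chunkRequestStatus) IncrementAttempt() {
		r.mu.Lock()
		defer r.mu.Unlock()
		r.Attempt++
		r.LastAttempt = time.Now()
	}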


codecov-commenter commented Jul 2, 2021

Codecov Report

Merging #919 (2035d69) into master (bb2d0da) will increase coverage by 0.01%.
The diff coverage is 84.61%.


@@            Coverage Diff             @@
##           master     #919      +/-   ##
==========================================
+ Coverage   54.90%   54.92%   +0.01%     
==========================================
  Files         277      277              
  Lines       18331    18339       +8     
==========================================
+ Hits        10065    10073       +8     
  Misses       6902     6902              
  Partials     1364     1364              
Flag Coverage Δ
unittests 54.92% <84.61%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
module/mempool/stdmap/chunk_requests.go 55.55% <84.61%> (+6.46%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@yhassanzadeh13 yhassanzadeh13 merged commit f7c951f into master Jul 9, 2021
@yhassanzadeh13 yhassanzadeh13 deleted the yahya/4313-fix-run-method branch July 9, 2021 00:29
Contributor

m4ksio commented Jul 9, 2021

I just wonder - is it worth contributing back to testify? This seems like a bug they might be interested in getting fixed.
