Skip to content

Fix deadlock in proposal rate limiter #4453

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jan 9, 2020
Merged

Conversation

ashish-goswami
Copy link
Contributor

@ashish-goswami ashish-goswami commented Dec 20, 2019

Currently we use channel based rate limiter to limit number of pending proposals. Each node before proposing, sends some fixed number of struct{} to proposal channel. If sending is successful, it tries to perform proposal. Number of structs{} to be send to proposal channel, increases exponentially with each errInternalRetry.
Current implementation can lead to deadlock in some cases. Consider a case when multiple routines are trying to send more than one struct{} (it is possible after first try) to proposal channel. All go routines can block if they have only sent partial number of struct{} and proposal channel is full. This can lead to deadlock.
This PR replaces channel based rate limiter with conditional variable based rate limiter. Instead of sending values one by one to channel, it adds whole count in one go. Any goroutine can do this if current counter for pending proposals is less than max pending proposals allowed. After successful proposal or each try, goroutine reduces counter by same value.

This change is Reviewable

Copy link
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 5 files at r4.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @animesh2049, @ashish-goswami, and @manishrjain)


worker/proposal.go, line 100 at r4 (raw file):

	rl.c.L.Lock()
	rl.iou -= weight

you might want to add a comment explaining why we are subtracting the weight instead of adding it like before.


worker/proposal_test.go, line 19 at r4 (raw file):

package worker

// import (

is this meant to be uncommented at some point?


worker/worker.go, line 55 at r4 (raw file):

func Init(ps *badger.DB) {
	pstore = ps
	// needs to be initialized after group config

is this comment still valid?

Copy link
Contributor Author

@ashish-goswami ashish-goswami left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 4 files reviewed, 3 unresolved discussions (waiting on @animesh2049, @manishrjain, and @martinmr)


worker/proposal.go, line 100 at r4 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

you might want to add a comment explaining why we are subtracting the weight instead of adding it like before.

Done.


worker/proposal_test.go, line 19 at r4 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

is this meant to be uncommented at some point?

This was supposed to be a long running test to detect deadlock, I changed to run with normal tests.


worker/worker.go, line 55 at r4 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

is this comment still valid?

I think yes as it was initialising proposal chan before. Now it doing the same for limiter.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add benchmarks and gather CPU profiling.

Reviewed 2 of 5 files at r4, 2 of 2 files at r5.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @animesh2049, @ashish-goswami, and @martinmr)


worker/proposal_test.go, line 34 at r5 (raw file):

// which returns errInternalRetry 50% of the time. Rest of the time it just sleeps for 1 second
// to emulate successful response.
func propposeAndWaitEmulator() error {

proposeAndWait...


worker/proposal_test.go, line 66 at r5 (raw file):

// multiple goroutines. At the end it matches if sum of completed and aborted proposals is
// equal to tried proposals or not.
func TestLimiter(t *testing.T) {

TestLimiterDeadlock


worker/proposal_test.go, line 69 at r5 (raw file):

	rand.Seed(time.Now().UnixNano())

	toTry := int64(1000) // total proposals count to propose.

Maybe increase to 3 or 5 to 10k.


worker/proposal_test.go, line 72 at r5 (raw file):

	var currentCount, pending, completed, aborted int64

	limiter = rateLimiter{c: sync.NewCond(&sync.Mutex{}), max: 256}

We also need a benchmark for rate limiter. And ideally you do some profiling to ensure that it is optimized.

Copy link
Contributor Author

@ashish-goswami ashish-goswami left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 4 files reviewed, 7 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @manishrjain, and @martinmr)


worker/proposal_test.go, line 34 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

proposeAndWait...

Done.


worker/proposal_test.go, line 66 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

TestLimiterDeadlock

Done.


worker/proposal_test.go, line 69 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Maybe increase to 3 or 5 to 10k.

Done.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 4 files at r6.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @manishrjain, and @martinmr)


worker/proposal.go, line 70 at r6 (raw file):

		rl.c.L.Unlock()
		// Pending proposals is tracking ious.
		ostats.Record(context.Background(), x.PendingProposals.M(int64(iou)))

Check if ious are going to add up, or be seen as the absolute measurement.

Copy link
Contributor Author

@ashish-goswami ashish-goswami left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @animesh2049, @manishrjain, and @martinmr)


worker/proposal.go, line 70 at r6 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Check if ious are going to add up, or be seen as the absolute measurement.

checked it, its going to be absolute measurement.


worker/proposal_test.go, line 72 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

We also need a benchmark for rate limiter. And ideally you do some profiling to ensure that it is optimized.

Done.

@ashish-goswami ashish-goswami merged commit 04eafb1 into master Jan 9, 2020
@ashish-goswami ashish-goswami deleted the mrjn/rate-limiter branch January 9, 2020 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants