-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix deadlock in proposal rate limiter #4453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 5 files at r4.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @animesh2049, @ashish-goswami, and @manishrjain)
worker/proposal.go, line 100 at r4 (raw file):
rl.c.L.Lock() rl.iou -= weight
you might want to add a comment explaining why we are subtracting the weight instead of adding it like before.
worker/proposal_test.go, line 19 at r4 (raw file):
package worker // import (
is this meant to be uncommented at some point?
worker/worker.go, line 55 at r4 (raw file):
func Init(ps *badger.DB) { pstore = ps // needs to be initialized after group config
is this comment still valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 4 files reviewed, 3 unresolved discussions (waiting on @animesh2049, @manishrjain, and @martinmr)
worker/proposal.go, line 100 at r4 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
you might want to add a comment explaining why we are subtracting the weight instead of adding it like before.
Done.
worker/proposal_test.go, line 19 at r4 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
is this meant to be uncommented at some point?
This was supposed to be a long running test to detect deadlock, I changed to run with normal tests.
worker/worker.go, line 55 at r4 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
is this comment still valid?
I think yes as it was initialising proposal chan before. Now it doing the same for limiter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add benchmarks and gather CPU profiling.
Reviewed 2 of 5 files at r4, 2 of 2 files at r5.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @animesh2049, @ashish-goswami, and @martinmr)
worker/proposal_test.go, line 34 at r5 (raw file):
// which returns errInternalRetry 50% of the time. Rest of the time it just sleeps for 1 second // to emulate successful response. func propposeAndWaitEmulator() error {
proposeAndWait...
worker/proposal_test.go, line 66 at r5 (raw file):
// multiple goroutines. At the end it matches if sum of completed and aborted proposals is // equal to tried proposals or not. func TestLimiter(t *testing.T) {
TestLimiterDeadlock
worker/proposal_test.go, line 69 at r5 (raw file):
rand.Seed(time.Now().UnixNano()) toTry := int64(1000) // total proposals count to propose.
Maybe increase to 3 or 5 to 10k.
worker/proposal_test.go, line 72 at r5 (raw file):
var currentCount, pending, completed, aborted int64 limiter = rateLimiter{c: sync.NewCond(&sync.Mutex{}), max: 256}
We also need a benchmark for rate limiter. And ideally you do some profiling to ensure that it is optimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 4 files reviewed, 7 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @manishrjain, and @martinmr)
worker/proposal_test.go, line 34 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
proposeAndWait...
Done.
worker/proposal_test.go, line 66 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
TestLimiterDeadlock
Done.
worker/proposal_test.go, line 69 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Maybe increase to 3 or 5 to 10k.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 4 files at r6.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @animesh2049, @ashish-goswami, @manishrjain, and @martinmr)
worker/proposal.go, line 70 at r6 (raw file):
rl.c.L.Unlock() // Pending proposals is tracking ious. ostats.Record(context.Background(), x.PendingProposals.M(int64(iou)))
Check if ious are going to add up, or be seen as the absolute measurement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @animesh2049, @manishrjain, and @martinmr)
worker/proposal.go, line 70 at r6 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Check if ious are going to add up, or be seen as the absolute measurement.
checked it, its going to be absolute measurement.
worker/proposal_test.go, line 72 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
We also need a benchmark for rate limiter. And ideally you do some profiling to ensure that it is optimized.
Done.
Currently we use channel based rate limiter to limit number of pending proposals. Each node before proposing, sends some fixed number of
struct{}toproposal channel. If sending is successful, it tries to perform proposal. Number ofstructs{}to be send toproposal channel, increases exponentially with eacherrInternalRetry.Current implementation can lead to deadlock in some cases. Consider a case when multiple routines are trying to send more than one
struct{}(it is possible after first try) toproposal channel. All go routines can block if they have only sent partial number ofstruct{}and proposal channel is full. This can lead to deadlock.This PR replaces channel based rate limiter with conditional variable based rate limiter. Instead of sending values one by one to channel, it adds whole count in one go. Any goroutine can do this if current counter for pending proposals is less than max pending proposals allowed. After successful proposal or each try, goroutine reduces counter by same value.
This change is