add backpressure when all connections are dead #698
Conversation
Thanks for taking a deep dive into this issue. I will have to review this carefully, so it might take a while. Busy times at my day job...
Allow the user to configure the deadline after which reconnection attempts for a network error are aborted. Abort reconnection attempts if an interrupt signal is detected. Only start the reconnection process if the error implements the net.Error interface.
bulk_processor.go
@@ -545,6 +545,7 @@ func (w *bulkWorker) waitForActiveConnection() {
	deadline := time.NewTimer(w.p.stopRecon)
	interrupt := make(chan os.Signal, 1)
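For readers skimming the diff, here is a self-contained sketch of the pattern those lines introduce: poll for a live connection until either a deadline expires or an interrupt signal arrives. The function and parameter names below are illustrative, not the actual bulkWorker code.

package main

import (
	"fmt"
	"os"
	"os/signal"
	"time"
)

// waitForActive demonstrates the pattern discussed here: keep re-checking a
// condition until it holds, a deadline expires, or an interrupt arrives.
// check, timeout and the one-second poll interval are illustrative only.
func waitForActive(check func() bool, timeout time.Duration) bool {
	deadline := time.NewTimer(timeout)
	defer deadline.Stop()

	interrupt := make(chan os.Signal, 1)
	signal.Notify(interrupt, os.Interrupt)
	defer signal.Stop(interrupt)

	for {
		if check() {
			return true
		}
		select {
		case <-deadline.C: // give up after the configured deadline
			return false
		case <-interrupt: // Ctrl-C: stop waiting immediately
			return false
		case <-time.After(time.Second): // otherwise re-check periodically
		}
	}
}

func main() {
	ok := waitForActive(func() bool { return false }, 5*time.Second)
	fmt.Println("active connection found:", ok)
}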
What are you trying to achieve with signals in Elastic?
In my test case I have a main process that listens on an event source and uses those events to add work to an Elastic bulk processor. This main process also listens for Ctrl-C signals in another goroutine and, in that case, exits the event loop and calls Close on the bulk processor.
Now, the reason I added signals to this PR is that when Elastic enters this new waiting state, calls to both Add and Close on the processor block. Thus in my test case, with Elasticsearch down, I could no longer quit the main process with Ctrl-C, since that path calls Close(), which blocks until the deadline. So I needed a way for waitForConnections to exit without having to restart the Elasticsearch server.
Maybe I could have achieved this by adding another channel to the bulk processor and pushing a message to it when Close() is called.
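A condensed sketch of that setup (the event-source channel and index names are made up, error handling is abbreviated, and the import path of olivere/elastic differs across major versions):

package main

import (
	"context"
	"os"
	"os/signal"

	"github.com/olivere/elastic"
)

func main() {
	client, err := elastic.NewClient() // assumes a local Elasticsearch
	if err != nil {
		panic(err)
	}
	p, err := client.BulkProcessor().Name("demo").Workers(2).Do(context.Background())
	if err != nil {
		panic(err)
	}

	// Ctrl-C handling in a separate goroutine, as described above.
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, os.Interrupt)

	events := make(chan map[string]interface{}) // stand-in for the real event source
	for {
		select {
		case ev := <-events:
			p.Add(elastic.NewBulkIndexRequest().Index("events").Type("doc").Doc(ev))
		case <-quit:
			// With all connections dead, Close() blocks until the deadline,
			// which is why this PR originally watched for signals internally.
			p.Close()
			return
		}
	}
}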
I see. Then we need to change Close in a way that it doesn't block, or at least times out, without relying on signals.
I see your use case. But I think there should be a more general mechanism (besides Ctrl-C) to stop/close the bulk processor cleanly. I see signals typically used with console-based apps, and don't see a good reason to rely on them outside of main.
I have added a recipe for testing the bulk processor under recipes/bulk_processor. I just added it for blackbox-testing this PR. If you're using this with Elastic before this PR, it looks like this (with the number of queued requests slowly increasing):
If this PR is applied, it looks like this (stopping ES after a few iterations, the number of queued requests in the workers no longer increases):
So from the outside, it's looking good. I have two more questions wrt this PR: Will your PR ever give up retrying, or does it keep trying to commit in an infinite loop? And what is the need for waiting for an external signal?
Hmm... there's something wrong after applying your patch. Simply run the bulk_processor recipe, then stop ES after ~10s. It starts well, then starts to enqueue again, behaving just like before:
You need to run it for a few minutes, though. Notice that I didn't start ES again. Can you please test this with your PR?
Thanks for creating the recipe, that should really help figure out what's going on. I will take a look this afternoon and see if I can clean it up and remove the external signal check.

Wrt your questions: currently the PR defaults to a 5-minute deadline, after which it gives up trying to find connections. The signal check was there so that someone who didn't want to wait for this deadline could also hit Ctrl-C to force it to stop trying immediately. I hope to remove this signal check and instead use a read/write lock to ensure that the external commands Stop/Close() and Flush() do not block on dead connections: they will set a flag that each connection checker reads to decide whether it should give up. After that, hopefully only Add() calls and the flushTicker will block on dead connections, while external Flush() and Stop/Close() calls will not.
Sorry for not digging into the details yet. From what you describe, you would want to tell a list of goroutines to exit when calling Close().
Yes, closing a channel is idiomatic. I will try doing that to signal stop to the connection checkers.
Remove the use of OS signals. Use a closed channel to notify connection checkers to stop. Keep checking until an active connection is found or the processor is stopped.
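For anyone following along, the close-a-channel idiom referenced here looks roughly like this; the connection check is stubbed out and all names are illustrative, not the actual bulkWorker code:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	stop := make(chan struct{}) // closed once to tell every checker to quit
	var wg sync.WaitGroup

	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				select {
				case <-stop: // a closed channel is readable by every goroutine
					fmt.Println("checker", id, "stopping")
					return
				case <-time.After(250 * time.Millisecond):
					// Placeholder for "is any connection alive yet?" check.
				}
			}
		}(i)
	}

	time.Sleep(time.Second)
	close(stop) // Stop()/Close() would do this to unblock all checkers at once
	wg.Wait()
}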
Refactored the PR. This is simpler and seems to work well against the test. Let me know whenever you get a chance to take a look. Thanks!
That's looking good. I've been testing the patch with race detector enabled. It found one issue:
Can you reproduce that?
Thanks, I’ll take a look when I get a chance. At first glance, and after re-reading the Godoc, it looks like you should never call Wait before Add; I didn’t realize that was not allowed. I think if I ensure all the new wg.Wait calls happen after the Add/go calls, this will go away. The behavior might be slightly different, as each worker would then independently find out about ES being down, so you would see each worker pausing before Add calls start blocking.
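For reference, the rule referred to here comes from the sync.WaitGroup documentation: Add calls that raise the counter from zero must happen before Wait. The safe shape looks like this (trivial workers stand in for the bulk workers):

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup

	// Correct ordering: every Add happens before the corresponding Wait.
	// Calling Wait concurrently with an Add that raises the counter from
	// zero is exactly what the race detector flags.
	for i := 0; i < 3; i++ {
		wg.Add(1) // Add before the goroutine starts and before Wait
		go func(id int) {
			defer wg.Done()
			fmt.Println("worker", id, "done")
		}(i)
	}

	wg.Wait() // Wait only after all Adds have been issued
}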
After a network error in a worker, subsequent workers may get an ErrNoClient error, so check for that as well
This commit does a final clean up before merging the PR for adding back pressure into the BulkProcessor (#698). It switches from using `sync.WaitGroup` to a simple channel for signalling. It also uses a `time.Ticker` instead of a tight loop for periodically checking for active connections.
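A minimal sketch of the shape described in that commit message, assuming a stop channel closed by Stop()/Close() and a one-second check interval (both assumptions; the merged code may differ in detail):

package main

import (
	"fmt"
	"time"
)

// waitForActive polls for an active connection on a ticker instead of a
// tight loop, and returns early when stopC is closed (e.g. by Stop()/Close()).
func waitForActive(hasActiveConn func() bool, stopC <-chan struct{}) {
	t := time.NewTicker(time.Second)
	defer t.Stop()

	for {
		select {
		case <-stopC: // the processor is being stopped; give up waiting
			return
		case <-t.C:
			if hasActiveConn() {
				return
			}
		}
	}
}

func main() {
	stopC := make(chan struct{})
	go func() {
		time.Sleep(3 * time.Second)
		close(stopC)
	}()
	waitForActive(func() bool { return false }, stopC)
	fmt.Println("stopped waiting")
}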
@olivere this looks really good to me. Check the comment I added previously to the code which mentions the waitgroup; that should be removed or reworded. Thanks very much for all your help with this.
Oh yeah, you mean I should remove this block, right? What I find interesting is that if you keep poking, you often end up with much less code than what you started with. Anyway, I will release a new version later today. Thanks so much for your persistence and for being a good contributor to open source. 👍
I've noticed that if I have a BulkProcessor with a short FlushInterval and all connections to Elasticsearch become dead, bad things happen. Since the BulkService requests slice is only reset after a successful commit, the short flush interval causes the processor to bang the same set of requests over and over again against dead connections. Additionally, without any back pressure, producers can keep adding fuel to the fire by appending to the requests slice of the BulkService. This PR is an attempt to add some relief in this scenario. I think it can possibly be enhanced to only trigger liveness checks after errors of "type connection".

Originally, I was looking into why requests were being repeated so often on failure when I pulled the plug on Elasticsearch. My retrier was set to a StopRetrier, so something was fishy. Then I realized that the request state doesn't get cleared on failure and my FlushInterval is less than a second.

This PR can be tested by starting a BulkProcessor with a sub-second FlushInterval, adding some BulkableRequests, then stopping Elasticsearch, and finally adding a few more BulkableRequests; a rough sketch of that procedure is below.
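A hedged sketch of that reproduction, assuming a local Elasticsearch and olivere/elastic's BulkProcessor builder (index name, document shape, and the 500ms interval are arbitrary; the import path and the Type call vary by major version):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/olivere/elastic"
)

func main() {
	client, err := elastic.NewClient() // assumes Elasticsearch on localhost:9200
	if err != nil {
		panic(err)
	}

	p, err := client.BulkProcessor().
		Name("backpressure-test").
		Workers(2).
		FlushInterval(500 * time.Millisecond). // sub-second flush interval
		Do(context.Background())
	if err != nil {
		panic(err)
	}
	defer p.Close()

	// Add a few requests, then stop Elasticsearch by hand, then keep adding:
	// without this PR the same requests are retried on every flush and the
	// queue grows; with it, Add blocks until a connection is available again.
	for i := 0; i < 100; i++ {
		doc := map[string]interface{}{"n": i, "ts": time.Now()}
		p.Add(elastic.NewBulkIndexRequest().Index("test").Type("doc").Id(fmt.Sprint(i)).Doc(doc))
		time.Sleep(time.Second)
	}
}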