Exponential Backoff Broken with Batchable Workers #301
hi @sallustfire Thanks a lot for the very detailed explanation. <3 I think the possible solutions would be:
I'm torn between 1 and 2, more inclined toward 1. WDYT? cc @mariokostelac
Thanks for getting back to me. As an end user, I would strongly prefer 1 to 2. If exponential backoff is not working, that's pretty important, and I would very much want to know about it.

I do have some thoughts about 3, though. I should do more testing around this, but I believe the problem described in 3 is already happening and will continue to happen even with solution 1. Imagine a queue that only has 5 messages, all are grabbed by the worker, and one of them causes an exception when being processed. All of them will become available again after the default visibility timeout elapses and will be grabbed together by the worker again, which will fail because the bad message is still in the bunch. Unless something causes the good messages to end up in a different batch than the bad message, won't they all eventually end up in the dead letter queue together?

This would explain something we are seeing in our shoryuken deployments, but I haven't had a chance to test the mechanism here yet. I know very little about the shoryuken internals, so you may be able to tell me that what I just described is not possible, e.g. shoryuken batch workers are not always greedy when fetching messages. However, if it is happening, it may need to be addressed independently of whatever is decided for the exponential backoff problem.
@sallustfire you are right! So for batch workers, I think we should go with option 1 (if you want to submit a PR I can point you in the right direction, see this) and probably document the caveats with batch workers in the wiki, for example suggesting that people manually delete messages if they want more control over this process.
@sallustfire thinking more on this, I believe 2 should be the way to go. Raising an error would prevent Shoryuken from starting, which could cause more problems for people. I think the best thing to do would be to log a warning instead. WDYT? BTW I updated the wiki with the batch worker caveats.
@phstc Happy to defer to you here if you prefer 2 over 1. I would just reiterate that not having exponential backoff when you believe it is configured seems like a big problem to me. It also doesn't seem out of line with other checks that prevent shoryuken from loading, such as this one. Anyway, if you haven't already taken care of this, I am happy to work on a pull request. I agree that the two methods you've indicated are the ones to update.
Show a warn message when batch isn't supported (Fix #301)
Hi @sallustfire For now I only added the warning (#302), to avoid breaking changes.
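For reference, the "warn instead of raise" approach amounts to a guard roughly like the following. This is a hypothetical sketch to make the idea concrete, not the actual change in #302; the class name and log message are invented here.

```ruby
require 'shoryuken'

# Hypothetical guard (not the actual change in #302): batchable workers hand
# the middleware an Array of messages, which exponential backoff can't handle,
# so warn and skip the backoff instead of crashing the processor.
class BatchAwareBackoffGuard
  def call(worker, queue, sqs_msg, body)
    if sqs_msg.is_a?(Array)
      Shoryuken.logger.warn do
        "Exponential backoff is not supported for batchable workers; skipping it for queue '#{queue}'"
      end
      return yield
    end

    yield # single messages fall through to the normal backoff handling
  end
end
```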
I'm very open to adding a checker that prevents Shoryuken from starting; would you like to work on it?
We've been using shoryuken quite a bit for asynchronous jobs, and more recently we've been using batchable workers where appropriate to improve throughput by reducing the number of db transactions incurred by the application-specific worker logic.
However, I noticed that jobs in our queue that could never complete successfully were getting moved into the dead letter queue very quickly, and that the following error was appearing in our logs:
ERROR: Process died, reason: undefined method `attributes' for #<Array:0x007f8000042a68>
After a little sleuthing, I traced the error to the exponential backoff retry middleware; the relevant code is in exponential_backoff_retry.rb.
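The snippet from the original report isn't preserved above. As rough orientation, the middleware's shape is something like the following paraphrased sketch from memory, not the exact source; method and option names may differ between versions.

```ruby
# Paraphrased sketch of Shoryuken's ExponentialBackoffRetry middleware,
# not the exact source. The key assumption: sqs_msg is a single
# Shoryuken::Message, so calling #attributes raises NoMethodError when a
# batchable worker receives an Array of messages instead.
module Shoryuken
  module Middleware
    module Server
      class ExponentialBackoffRetry
        def call(worker, queue, sqs_msg, body)
          yield
        rescue
          retry_intervals = worker.class.get_shoryuken_options['retry_intervals']
          raise if retry_intervals.nil?

          # this is where the batch case blows up: Array has no #attributes
          receive_count = sqs_msg.attributes['ApproximateReceiveCount'].to_i
          backoff = retry_intervals[receive_count - 1] || retry_intervals.last

          # push the message's visibility out so SQS redelivers it later
          sqs_msg.change_visibility(visibility_timeout: backoff.to_i)
        end
      end
    end
  end
end
```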
The problem is clearly that sqs_msg is assumed to be a single Shoryuken::Message and not an Array of Shoryuken::Messages. This is even one of the watch-outs pointed out in the wiki (https://github.com/phstc/shoryuken/wiki/Middleware#be-careful-with-batchable-workers).
I'm happy to submit a pull request with a fix for this, but unfortunately there isn't one obvious solution. Naively, the thing to do would be to handle the array of messages and increase the visibility timeout on all of them, since there is no way to know which messages are responsible for the failure. The downside is that if there isn't enough stochasticity in the batching, some "good" messages may get pulled into the exponential backoff. I suspect that, even without the exponential backoff, this is already happening to us, because we regularly encounter messages in the dead letter queue that process fine on their own.
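To make that naive approach concrete, here is a hypothetical helper that could be called from the middleware's rescue path, reusing the attribute and option names from the sketch above; the method name is invented for illustration.

```ruby
# Hypothetical helper for the middleware's rescue path: when sqs_msg is an
# Array (batchable worker), extend the visibility timeout of every message
# in the batch, since there's no way to know which one caused the failure.
def extend_batch_backoff(sqs_msg, retry_intervals)
  Array(sqs_msg).each do |message|
    receive_count = message.attributes['ApproximateReceiveCount'].to_i
    backoff = retry_intervals[receive_count - 1] || retry_intervals.last

    message.change_visibility(visibility_timeout: backoff.to_i)
  end
end
```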
What we're probably going to do for our own applications is update the middleware to process each of the messages individually after an error occurs while processing them as a batch. While this works for us, I would think it isn't suitable for everyone.
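Roughly, a middleware variant along these lines is what I have in mind. This is a hypothetical sketch of the idea, not a tested implementation; it assumes batchable workers receive parallel arrays of messages and bodies and that deleting a message acknowledges it.

```ruby
# Hypothetical rescue-path variant for our own apps: when a batch fails,
# retry each message individually so one bad message cannot drag the good
# ones into the backoff (or the dead letter queue).
class RetryBatchIndividually
  def call(worker, queue, sqs_msg, body)
    yield
  rescue
    raise unless sqs_msg.is_a?(Array)

    sqs_msg.zip(body).each do |message, message_body|
      begin
        worker.perform(message, message_body)
        message.delete # acknowledge messages that succeed on their own
      rescue
        # leave this message for SQS to redeliver (and back off) individually
      end
    end
  end
end
```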
Happy to help think this through or provide a fix for this bug, if you guys can provide some direction.
Thanks