
Exponential Backoff Broken with Batchable Workers #301

Closed
sallustfire opened this issue Dec 31, 2016 · 6 comments

We've been using shoryuken quite a bit for asynchronous jobs, and more recently we've been using batchable workers where appropriate to improve throughput by reducing the number of DB transactions incurred by the application-specific worker logic.

However, I noticed that jobs in our queue that could never complete successfully were getting moved into the dead letter queue very quickly, and that the following error was appearing in our logs:

ERROR: Process died, reason: undefined method `attributes' for #<Array:0x007f8000042a68>

After a little sleuthing, I found the source of the error, as shown by the following stack trace:

shoryuken-2.1.2/lib/shoryuken/middleware/server/exponential_backoff_retry.rb:23:in `handle_failure'
shoryuken-2.1.2/lib/shoryuken/middleware/server/exponential_backoff_retry.rb:13:in `rescue in call'
shoryuken-2.1.2/lib/shoryuken/middleware/server/exponential_backoff_retry.rb:8:in `call'
shoryuken-2.1.2/lib/shoryuken/middleware/chain.rb:96:in `block in invoke'
shoryuken-2.1.2/lib/shoryuken/middleware/server/timing.rb:14:in `block in call'
shoryuken-2.1.2/lib/shoryuken/logging.rb:21:in `with_context'
shoryuken-2.1.2/lib/shoryuken/middleware/server/timing.rb:8:in `call'
shoryuken-2.1.2/lib/shoryuken/middleware/chain.rb:96:in `block in invoke'
shoryuken-2.1.2/lib/shoryuken/middleware/chain.rb:99:in `invoke'
shoryuken-2.1.2/lib/shoryuken/processor.rb:18:in `process'
celluloid-0.17.3/lib/celluloid/calls.rb:28:in `public_send'
celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
celluloid-0.17.3/lib/celluloid/call/async.rb:7:in `dispatch'
celluloid-0.17.3/lib/celluloid/cell.rb:50:in `block in dispatch'
celluloid-0.17.3/lib/celluloid/cell.rb:76:in `block in task'
celluloid-0.17.3/lib/celluloid/actor.rb:339:in `block in task'
celluloid-0.17.3/lib/celluloid/task.rb:44:in `block in initialize'
celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `block in create'

The relevant code in exponential_backoff_retry.rb is:

def handle_failure(sqs_msg, started_at, retry_intervals)
  return unless attempts = sqs_msg.attributes['ApproximateReceiveCount']

  attempts = attempts.to_i - 1
  ...

The problem is clearly that the middleware assumes sqs_msg is a single Shoryuken::Message and not an Array of Shoryuken::Messages. This is even one of the watch-outs pointed out in the wiki (https://github.com/phstc/shoryuken/wiki/Middleware#be-careful-with-batchable-workers).
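For context, here is a minimal sketch of a batchable worker (the class name, queue name, and Record.import call are hypothetical). With batch: true, Shoryuken hands perform an Array of messages and an Array of bodies, which is exactly what the middleware trips over:

require 'shoryuken'
require 'json'

class BulkImportWorker
  include Shoryuken::Worker

  # batch: true makes Shoryuken deliver a batch of messages per perform call.
  shoryuken_options queue: 'imports', batch: true, auto_delete: true

  # sqs_msgs is an Array of Shoryuken::Message and bodies an Array of Strings,
  # so middleware that calls sqs_msg.attributes on the first argument fails.
  def perform(sqs_msgs, bodies)
    rows = bodies.map { |body| JSON.parse(body) }
    Record.import(rows) # hypothetical application-specific bulk insert
  end
end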

I'm happy to submit a pull request with a fix for this, but unfortunately there isn't one obvious solution. Naively, the thing to do would be to handle the array of messages and increase the visibility timeout on all of them, since there is no way to know which messages are responsible for the failure. The downside is that if there isn't enough stochasticity in the batching, some "good" messages may get pulled into the exponential backoff. I suspect that even without the exponential backoff this is already happening to us, because we regularly encounter messages in the dead letter queue that process fine on their own.

What we're probably going to do for our own applications is update the middleware to process each of the messages individually after an error occurs while processing them as a batch. While this is ideal for us, I suspect it isn't suitable for everyone.
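For illustration, a rough, untested sketch of that per-message fallback as server middleware (assuming Shoryuken's usual middleware signature of call(worker, queue, sqs_msg, body) with a block); deletion and visibility handling are deliberately left out:

class RetryBatchIndividually
  # Server middleware sketch: when a batch raises, re-run the worker once per
  # message so the failure can be narrowed down to the actual bad messages.
  def call(worker, queue, sqs_msg, body)
    yield
  rescue StandardError
    # Only batch workers pass Arrays through the chain; re-raise otherwise.
    raise unless sqs_msg.is_a?(Array)

    # NOTE: deletion/visibility handling is omitted; a real implementation
    # would have to decide what to do with messages that still fail here.
    sqs_msg.zip(body).each do |message, message_body|
      worker.class.new.perform([message], [message_body])
    end
  end
end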

Happy to help think through or provide a fix for this bug, if you guys can provide some direction.

Thanks

phstc commented Jan 4, 2017

Hi @sallustfire

Thanks a lot for the very detailed explanation. <3

I think the possible solutions would be:

  1. Do not allow using exponential backoff for batchable workers (which is kind of what happens now because of the exception). We could raise an error along the lines of "You cannot use batch and exponential backoff together".
  2. Similar to 1, but instead log a warning and do nothing.
  3. Similar to 2, but update all messages. Unfortunately, some messages that processed fine would be requeued too, as you pointed out, which can be confusing and error-prone.

I'm in between 1 and 2, more inclined to 1. WDYT?

cc/ @mariokostelac

sallustfire commented Jan 5, 2017

@phstc

Thanks for getting back to me. As an end user, I would strongly prefer 1 to 2. If exponential backoff is not working, that's pretty important, and I would want to know about it right away.

I do have some thoughts about 3, though. I should do more testing around this, but I believe the problem described in 3 is already happening and will continue to happen even with solution 1. Imagine a queue that has only 5 messages: all are grabbed by the worker, and one of them causes an exception when being processed. All of them will become available again after the default visibility timeout elapses and will be grabbed together by the worker again, which will fail because the bad message is still in the bunch. Unless there is something to cause the good messages to end up in a different batch than the bad message, won't they all eventually end up in the dead letter queue together? This would explain something we are seeing in our shoryuken deployments, but I haven't had a chance to test the mechanism here yet.

I know very little about the shoryuken internals, so you may be able to tell me that what I just described is not possible, e.g. shoryuken batch workers are not always greedy when fetching messages. However, if it is happening, it may need to be addressed independently of whatever is decided for the exponential backoff problem.

phstc commented Jan 5, 2017

@sallustfire you are right! If auto_delete is set to true, Shoryuken only deletes the messages if the worker does not raise any exception; if it raises an error, none of them get deleted, so they go back to the main queue, or to the dead letter queue once max retries is reached, if one is configured.

So for batch workers, I think we should go with option 1 (if you want to submit a PR, I can point you in the right direction, see this) and probably document the caveats with batch workers in the wiki, for example suggesting that people manually delete messages when they want more control over this process.
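For reference, a hedged sketch of that manual-delete approach (assuming auto_delete: false and that Shoryuken::Message#delete is available; the queue name and process call are hypothetical):

require 'shoryuken'

class ManualDeleteWorker
  include Shoryuken::Worker

  # auto_delete: false leaves deletion entirely up to the worker.
  shoryuken_options queue: 'imports', batch: true, auto_delete: false

  def perform(sqs_msgs, bodies)
    sqs_msgs.zip(bodies).each do |sqs_msg, body|
      process(body)  # application-specific work (hypothetical)
      sqs_msg.delete # only the messages that actually succeeded are removed
    end
  end
end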

phstc commented Jan 5, 2017

@sallustfire thinking more about this, I believe 2 should be the way to go. Raising an error would prevent Shoryuken from starting, which could cause more problems for people.

I think the best thing to do would be to return if sqs_msg is an array here and log a warning, and probably do the same here.

WDYT?

BTW I updated the wiki for batch, auto_visibility_timeout and retry_intervals.
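For concreteness, a minimal sketch of what that guard in exponential_backoff_retry.rb could look like (an illustration of option 2, not necessarily the exact change):

def handle_failure(sqs_msg, started_at, retry_intervals)
  # Batch workers pass an Array of messages; the backoff only knows how to
  # change the visibility timeout of a single message, so warn and bail out.
  if sqs_msg.is_a?(Array)
    Shoryuken.logger.warn { 'Exponential backoff is not supported for batch workers' }
    return
  end

  return unless attempts = sqs_msg.attributes['ApproximateReceiveCount']

  attempts = attempts.to_i - 1
  # ... existing backoff logic continues unchanged
end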

sallustfire commented

@phstc Happy to defer to you here if you prefer 2 over 1. I would just reiterate that not having exponential backoff when you believe it is configured seems like a big problem to me. It also doesn't seem out of line with other warnings that prevent shoryuken from loading, such as this.

Anyway, if you haven't already taken care of this, I am happy to work on a pull request. I agree that the two methods you've indicated are the ones to update.

phstc added a commit that referenced this issue Jan 11, 2017
Show a warn message when batch isn't supported

Fix #301
phstc commented Jan 11, 2017

Hi @sallustfire

For now I added only the warning (#302), to avoid breaking changes.

> I am happy to work on a pull request. I agree that the two methods you've indicated are the ones to update.

I'm very open to adding a checker that prevents Shoryuken from starting; would you like to work on it?
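If that checker route is taken, it might look roughly like the sketch below (entirely hypothetical, including the class name and where it would be called from; it assumes worker classes expose their options via get_shoryuken_options):

module Shoryuken
  class BatchBackoffChecker
    # Fail fast at boot when a worker combines batch with retry_intervals,
    # since exponential backoff cannot handle an Array of messages.
    def self.validate!(worker_classes)
      worker_classes.each do |worker_class|
        options = worker_class.get_shoryuken_options

        next unless options['batch'] && options['retry_intervals']

        raise ArgumentError,
              "#{worker_class} cannot use `batch` together with `retry_intervals` (exponential backoff)"
      end
    end
  end
end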
