
Exponential Backoff Broken with Batchable Workers #301

Closed
sallustfire opened this issue Dec 31, 2016 · 6 comments

We've been using shoryuken quite a bit for asynchronous jobs, and more recently we've been using batchable workers where appropriate to improve throughput by reducing the number of DB transactions incurred by the application-specific worker logic.

However, I noticed that jobs in our queue that could never complete successfully were getting moved into the dead letter queue very quickly, and that the following error was appearing in our logs:

ERROR: Process died, reason: undefined method `attributes' for #<Array:0x007f8000042a68>

After a little sleuthing, I found the source of the error, as shown by the following stack trace:

shoryuken-2.1.2/lib/shoryuken/middleware/server/exponential_backoff_retry.rb:23:in `handle_failure'
shoryuken-2.1.2/lib/shoryuken/middleware/server/exponential_backoff_retry.rb:13:in `rescue in call'
shoryuken-2.1.2/lib/shoryuken/middleware/server/exponential_backoff_retry.rb:8:in `call'
shoryuken-2.1.2/lib/shoryuken/middleware/chain.rb:96:in `block in invoke'
shoryuken-2.1.2/lib/shoryuken/middleware/server/timing.rb:14:in `block in call'
shoryuken-2.1.2/lib/shoryuken/logging.rb:21:in `with_context'
shoryuken-2.1.2/lib/shoryuken/middleware/server/timing.rb:8:in `call'
shoryuken-2.1.2/lib/shoryuken/middleware/chain.rb:96:in `block in invoke'
shoryuken-2.1.2/lib/shoryuken/middleware/chain.rb:99:in `invoke'
shoryuken-2.1.2/lib/shoryuken/processor.rb:18:in `process'
celluloid-0.17.3/lib/celluloid/calls.rb:28:in `public_send'
celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
celluloid-0.17.3/lib/celluloid/call/async.rb:7:in `dispatch'
celluloid-0.17.3/lib/celluloid/cell.rb:50:in `block in dispatch'
celluloid-0.17.3/lib/celluloid/cell.rb:76:in `block in task'
celluloid-0.17.3/lib/celluloid/actor.rb:339:in `block in task'
celluloid-0.17.3/lib/celluloid/task.rb:44:in `block in initialize'
celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `block in create'

The relevant code in exponential_backoff_retry.rb is:

def handle_failure(sqs_msg, started_at, retry_intervals)
  return unless attempts = sqs_msg.attributes['ApproximateReceiveCount']

  attempts = attempts.to_i - 1
  ...

The problem is clearly that the middleware assumes sqs_msg is a single Shoryuken::Message and not an Array of Shoryuken::Messages. This is even one of the watch-outs pointed out in the wiki (https://github.com/phstc/shoryuken/wiki/Middleware#be-careful-with-batchable-workers).
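For context, here is a minimal sketch of a batchable worker (the class name, queue name, and Record.import call are hypothetical). With batch: true, Shoryuken hands perform an Array of messages and an Array of bodies, which is exactly what the middleware trips over:

require 'shoryuken'
require 'json'

class BulkImportWorker
  include Shoryuken::Worker

  # batch: true makes Shoryuken deliver a batch of messages per perform call.
  shoryuken_options queue: 'imports', batch: true, auto_delete: true

  # sqs_msgs is an Array of Shoryuken::Message and bodies an Array of Strings,
  # so middleware that calls sqs_msg.attributes on the first argument fails.
  def perform(sqs_msgs, bodies)
    rows = bodies.map { |body| JSON.parse(body) }
    Record.import(rows) # hypothetical application-specific bulk insert
  end
end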

I'm happy to submit a pull request with a fix for this, but unfortunately there isn't one obvious solution. Naively, the thing to do would be to handle the array of messages and increase the visibility timeout on all of them, since there is no way to know which messages are responsible for the failure. The downside is that if there isn't enough stochasticity in the batching, some "good" messages may get pulled into the exponential backoff. I suspect that even without the exponential backoff this is already happening to us, because we regularly encounter messages in the dead letter queue that process fine on their own.

What we're probably going to do for our own applications is update the middleware to process each of the messages individually after an error occurs while processing them as a batch. While this is ideal for us, I suspect it isn't suitable for everyone.
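For illustration, a rough, untested sketch of that per-message fallback as server middleware (assuming Shoryuken's usual middleware signature of call(worker, queue, sqs_msg, body) with a block); deletion and visibility handling are deliberately left out:

class RetryBatchIndividually
  # Server middleware sketch: when a batch raises, re-run the worker once per
  # message so the failure can be narrowed down to the actual bad messages.
  def call(worker, queue, sqs_msg, body)
    yield
  rescue StandardError
    # Only batch workers pass Arrays through the chain; re-raise otherwise.
    raise unless sqs_msg.is_a?(Array)

    # NOTE: deletion/visibility handling is omitted; a real implementation
    # would have to decide what to do with messages that still fail here.
    sqs_msg.zip(body).each do |message, message_body|
      worker.class.new.perform([message], [message_body])
    end
  end
end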

Happy to help think through or provide a fix for this bug, if you guys can provide some direction.

Thanks

phstc commented Jan 4, 2017

Hi @sallustfire

Thanks a lot for the very detailed explanation. <3

I think the possible solutions would be:

  1. Do not allow using exponential backoff for batchable workers (which is kind of what happens now because of the exception). We could raise an error along the lines of "You cannot use batch and exponential backoff together".
  2. Similar to 1, but instead log a warning and do nothing.
  3. Similar to 2, but update all messages. Unfortunately, some messages that processed fine would be requeued too, as you pointed out, which can be confusing and error-prone.

I'm in between 1 and 2, more inclined to 1. WDYT?

cc/ @mariokostelac

sallustfire commented Jan 5, 2017

@phstc

Thanks for getting back to me. As an end user, I would strongly prefer 1 to 2. If exponential backoff is not working, that's pretty important, and I would want to know about it right away.

I do have some thoughts about 3, though. I should do more testing around this, but I believe the problem described in 3 is already happening and will continue to happen even with solution 1. Imagine a queue that has only 5 messages: all are grabbed by the worker, and one of them causes an exception when being processed. All of them will become available again after the default visibility timeout elapses and will be grabbed together by the worker again, which will fail because the bad message is still in the bunch. Unless there is something to cause the good messages to end up in a different batch than the bad message, won't they all eventually end up in the dead letter queue together? This would explain something we are seeing in our shoryuken deployments, but I haven't had a chance to test the mechanism here yet.

I know very little about the shoryuken internals, so you may be able to tell me that what I just described is not possible, e.g. shoryuken batch workers are not always greedy when fetching messages. However, if it is happening, it may need to be addressed independently of whatever is decided for the exponential backoff problem.

phstc commented Jan 5, 2017

@sallustfire you are right! If auto_delete is set to true, Shoryuken only deletes the messages if the worker does not raise any exception; if it raises an error, none of them get deleted, so they go back to the main queue, or to the dead letter queue once max retries is reached, if one is configured.

So for batch workers, I think we should go with option 1 (if you want to submit a PR, I can point you in the right direction, see this) and probably document the caveats with batch workers in the wiki, for example suggesting that people manually delete messages when they want more control over this process.
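For reference, a hedged sketch of that manual-delete approach (assuming auto_delete: false and that Shoryuken::Message#delete is available; the queue name and process call are hypothetical):

require 'shoryuken'

class ManualDeleteWorker
  include Shoryuken::Worker

  # auto_delete: false leaves deletion entirely up to the worker.
  shoryuken_options queue: 'imports', batch: true, auto_delete: false

  def perform(sqs_msgs, bodies)
    sqs_msgs.zip(bodies).each do |sqs_msg, body|
      process(body)  # application-specific work (hypothetical)
      sqs_msg.delete # only the messages that actually succeeded are removed
    end
  end
end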

phstc commented Jan 5, 2017

@sallustfire thinking more about this, I believe 2 should be the way to go. Raising an error would prevent Shoryuken from starting, which could cause more problems for people.

I think the best thing to do would be to return if sqs_msg is an array here and log a warning, and probably do the same here.

WDYT?

BTW I updated the wiki for batch, auto_visibility_timeout and retry_intervals.
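For concreteness, a minimal sketch of what that guard in exponential_backoff_retry.rb could look like (an illustration of option 2, not necessarily the exact change):

def handle_failure(sqs_msg, started_at, retry_intervals)
  # Batch workers pass an Array of messages; the backoff only knows how to
  # change the visibility timeout of a single message, so warn and bail out.
  if sqs_msg.is_a?(Array)
    Shoryuken.logger.warn { 'Exponential backoff is not supported for batch workers' }
    return
  end

  return unless attempts = sqs_msg.attributes['ApproximateReceiveCount']

  attempts = attempts.to_i - 1
  # ... existing backoff logic continues unchanged
end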

sallustfire commented

@phstc Happy to defer to you here if you prefer 2 over 1. I would just reiterate that not having exponential backoff when you believe it is configured seems like a big problem to me. It also doesn't seem out of line with other warnings that prevent shoryuken from loading, such as this.

Anyway, if you haven't already taken care of this, I am happy to work on a pull request. I agree that the two methods you've indicated are the ones to update.

phstc added a commit that referenced this issue Jan 11, 2017
Show a warn message when batch isn't supported

Fix #301
phstc commented Jan 11, 2017

Hi @sallustfire

For now I added only the warning (#302), to avoid breaking changes.

> I am happy to work on a pull request. I agree that the two methods you've indicated are the ones to update.

I'm very open to adding a checker that prevents Shoryuken from starting; would you like to work on it?
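If that checker route is taken, it might look roughly like the sketch below (entirely hypothetical, including the class name and where it would be called from; it assumes worker classes expose their options via get_shoryuken_options):

module Shoryuken
  class BatchBackoffChecker
    # Fail fast at boot when a worker combines batch with retry_intervals,
    # since exponential backoff cannot handle an Array of messages.
    def self.validate!(worker_classes)
      worker_classes.each do |worker_class|
        options = worker_class.get_shoryuken_options

        next unless options['batch'] && options['retry_intervals']

        raise ArgumentError,
              "#{worker_class} cannot use `batch` together with `retry_intervals` (exponential backoff)"
      end
    end
  end
end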
