Similar retry times for multiple failed jobs cause jobs to fail again #480

Closed
bbhoss opened this Issue Oct 30, 2012 · 6 comments

@bbhoss
Contributor
bbhoss commented Oct 30, 2012

I'm running into a problem where I have a full queue of jobs that are extremely DB-intensive. This situation is very rare in my app, but a specific kind of failure would trigger it. When these jobs execute, some of them time out because the combined load on the DB causes the queries to either lock on one another or simply take much longer. When this happens, the jobs exceed their execution time (as specified by the timeout option) and go into the retries "queue."

The real problem comes when the jobs are retried. Since they are all retried at around the same time, the exact same thing happens, and the cycle continues. One possible solution is to add a "splay" to the retry time so that the actual retry times spread further apart with each failure. This seems simple enough to add: it should just be a matter of multiplying in a random number in the algorithm that determines the retry time. Would you be interested in this feature?
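
To make the idea concrete, here is a minimal sketch of a retry delay with splay. The helper names `base_delay` and `splay_delay` are hypothetical, and the constants are illustrative assumptions rather than the formula Sidekiq actually uses.

```ruby
# Hypothetical base delay that grows with the number of prior failures.
# The constants are illustrative assumptions, not Sidekiq's actual formula.
def base_delay(count)
  (count ** 4) + 15
end

# Adds a random splay whose upper bound grows with the failure count,
# so jobs that failed together are retried at increasingly different times.
def splay_delay(count, rng = Random.new)
  base_delay(count) + rng.rand(30) * (count + 1)
end

# Five jobs that have each failed three times get different retry delays.
rng = Random.new(42)
delays = Array.new(5) { splay_delay(3, rng) }
puts delays.inspect
```

Because the random term is scaled by `count + 1`, jobs that keep failing together drift further apart on each attempt instead of colliding again.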

@jc00ke
Collaborator
jc00ke commented Oct 30, 2012

There is already a delay for retries, but you're probably running into an edge case where the retries, all running at once, themselves cause the problem.

I wonder if you could handle this by using another queue (do we have custom retry queues?) and firing up another sidekiq server to serially process the retries?

@mperham
Owner
mperham commented Oct 30, 2012

@bbhoss That would be rad. I've run into the same issue before but not badly enough for me to fix it.

@bbhoss
Contributor
bbhoss commented Oct 30, 2012

Cool, I'll work on a pull request then. It looks like I only need to tweak the formula here, correct?

@mperham
Owner
mperham commented Oct 30, 2012

Exactly.

@bbhoss
Contributor
bbhoss commented Oct 30, 2012

@mperham It doesn't look like you have any existing test infrastructure for the retry_at time. Would you like me to add some or is a simple 1-line patch to the DELAY proc ok?

@mperham
Owner
mperham commented Oct 30, 2012

I'm ok with a one line patch.

@mperham mperham pushed a commit that closed this issue Oct 30, 2012
@bbhoss bbhoss Added a splay to the delay time for retried jobs. [Fixes #480]
The amount of splay increases as the number of failures increases. The maximum amount of splay is equal to 30 * (the number of times the job has failed, plus 1). For the default max retries setting, the maximum splay would be 780 seconds.
b08696b
@mperham mperham closed this in b08696b Oct 30, 2012
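
The upper bound quoted in the commit message can be checked with simple arithmetic. This is only a sketch: `max_splay` is a hypothetical helper, and it assumes the failure count reaches 25 under the default setting, which is the value that yields the 780 seconds stated above.

```ruby
# Upper bound on the splay as described in the commit message:
# 30 * (number of times the job has failed, plus 1).
# `max_splay` is a hypothetical helper used only for illustration.
def max_splay(failures)
  30 * (failures + 1)
end

puts max_splay(0)   # first retry: up to 30 seconds of splay
puts max_splay(25)  # assumed final retry count: up to 780 seconds of splay
```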