So I'm running into a problem where I have a full queue of jobs that are all super DB-intensive. This situation is very rare for my app, but a specific failure would cause me to hit it. The issue is that when these jobs execute, some of them time out because the combined load on the DB causes the queries to either lock on one another or just take much longer in general. When that happens, the jobs exceed their allowed execution time (as specified by the timeout option) and go into the retries "queue."
The actual issue is when the jobs are retried. Since they are all retried around the same time, the exact same problem happens, and the cycle continues. To me, a possible solution is to add a "splay" to the retry time so that the actual retry times spread increasingly apart. It seems like this would be simple enough to add: it should just be a matter of multiplying in a random number in the formula that determines the retry time. Would you be interested in this feature?
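To illustrate the idea (all names and numbers here are made up, not Sidekiq's API): without a splay, jobs that failed together retry in lockstep; with one, their retry times fan out.

```ruby
# Hypothetical sketch: compute when a failed job should retry.
# A random splay, multiplied by the failure count, de-synchronizes
# jobs that all failed at the same moment.
def retry_at(failed_at, failure_count, splay: true)
  delay = 60                                   # some fixed base delay (seconds)
  delay += rand(30) * (failure_count + 1) if splay  # random jitter, grows per failure
  failed_at + delay
end

now = Time.now
lockstep = Array.new(10) { retry_at(now, 0, splay: false) } # all identical times
spread   = Array.new(10) { retry_at(now, 0) }               # jittered over a 30s window
```

Ten jobs that fail at once would all land on the same retry tick in the `lockstep` case, but get scattered across the splay window in the `spread` case.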
There is already a delay for retries, but you're probably running into an edge case where, when all of your retries run, they themselves cause the problem.
I wonder if you could handle this by using another queue (do we have custom retry queues?) and firing up another sidekiq server to serially process the retries?
@bbhoss That would be rad. I've run into the same issue before but not badly enough for me to fix it.
Cool, I'll work on a pull request then. It looks like I only need to tweak the formula here, correct?
@mperham It doesn't look like you have any existing test infrastructure for the retry_at time. Would you like me to add some or is a simple 1-line patch to the DELAY proc ok?
I'm ok with a one line patch.
Added a splay to the delay time for retried jobs. [Fixes #480]
The amount of splay increases with the number of failures: the maximum splay is 30 * (the number of times the job has failed, plus 1) seconds. With the default max retries setting of 25, the maximum splay would be 780 seconds.
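Concretely, the one-line patch amounts to something like this (assuming the `count ** 4 + 15` base, which I believe is Sidekiq's pre-existing backoff; the `rand(30) * (count + 1)` term is the new splay):

```ruby
# count = number of times the job has already failed.
# The base backoff grows polynomially with count; the splay term adds
# up to (just under) 30 * (count + 1) seconds of random jitter so
# simultaneously failed jobs don't all retry at the same instant.
DELAY = proc { |count| (count ** 4) + 15 + (rand(30) * (count + 1)) }
```

With the default of 25 retries, `count` tops out at 25, so the splay is bounded by 30 * 26 = 780 seconds.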