More atomic poller operations #843

Merged
merged 2 commits into mperham:master from weheartit:master on Apr 12, 2013

Conversation

3 participants
@bdurand
Contributor

bdurand commented Apr 11, 2013

I recently had an issue where my application lost a very large number of jobs. The problem was with future scheduled jobs being popped from the scheduled queue but never pushed onto one of the worker queues. Here are the steps that happened:

  1. We performed some redis maintenance which took the server down for a few minutes.
  2. Due to a bug in scheduled.rb (since fixed; see issue #309) the poller stopped querying the scheduled queue.
  3. When redis came back online, sidekiq started processing all jobs enqueued with perform_async. Any jobs scheduled with perform_in were put on the scheduled queue but never popped off because the poller was stopped.
  4. Because sidekiq was processing some jobs just fine, monitoring did not pick up that there was a problem.
  5. After almost 20 hours sidekiq got restarted. This started the poller again.
  6. At this point there were over 2 million jobs in the scheduled queue.
  7. The poller on one of the sidekiq processes would have tried to pop all 2 million jobs off the scheduled queue at once. It then would have tried to insert them one at a time onto the appropriate worker queues. Something happened in the middle of this that aborted the process somewhere between the pop and finishing the pushes. It could have been an out-of-memory error, or it could have been monitoring thinking there was a runaway process due to the ballooning memory.

The end result was that most of the jobs were irretrievably lost. This had also happened two weeks earlier during some other redis maintenance.

This pull request changes the logic in Poller to pop messages one at a time from the retry and schedule queues and immediately push them to the appropriate worker queue. The new logic is (see the sketch after this list):

  1. Get the next item in the queue if its score (time to execute) is <= now
  2. If an item exists, then try to remove the item from the queue
  3. If item successfully removed, then add it to the appropriate worker queue
  4. If item not successfully removed, then try again (it was likely removed by another process)
  5. Repeat until no items have a score <= now
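
A minimal sketch of that loop, based on the diff in this PR, assuming a redis-rb connection `conn`; the helper name `enqueue_due_jobs` and the parameters `sorted_set` and `now` are illustrative:

```ruby
require 'sidekiq'

# Pop due jobs one at a time so an aborted run loses at most one job.
def enqueue_due_jobs(conn, sorted_set, now)
  # Fetch at most one job whose score (time to execute) is <= now.
  while message = conn.zrangebyscore(sorted_set, '-inf', now, :limit => [0, 1]).first
    # zrem returns true only for the caller that actually removed the
    # member, so concurrent pollers cannot enqueue the same job twice.
    next unless conn.zrem(sorted_set, message)
    msg = Sidekiq.load_json(message)
    conn.sadd('queues', msg['queue'])
    conn.rpush("queue:#{msg['queue']}", message)
  end
end
```

A crash between the zrem and the rpush can now strand at most the single message in flight, instead of an entire popped batch.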
Brian Durand
Change poller to pop messages one at a time so they don't get lost if something goes wrong during a large pop operation.
@coveralls

coveralls commented Apr 11, 2013

Coverage increased (+0.42%) when pulling 0c76c3b on weheartit:master into 387a646 on mperham:master.

@mperham
Owner

mperham commented Apr 11, 2013

@bdurand Ugh, sorry to hear about that.

What do you think about using a small batch size of, say, 20 rather than one at a time?

@bdurand
Contributor

bdurand commented Apr 11, 2013

I couldn't figure out a way to do small batches since the zremrangebyscore command doesn't take a limit. I think popping them one at a time would do the most to reduce race conditions. The best way to solve the issue would be an atomic pop-and-push Lua script, but that would tie sidekiq to redis 2.6 (maybe a feature for 3.0).
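
For reference, the Lua approach mentioned above might look roughly like this; a sketch only, assuming Redis >= 2.6 and redis-rb's eval, with the script and key name as illustrative choices rather than part of this PR:

```ruby
# Atomically pop one due job from the sorted set and push it to its queue.
# KEYS[1] = sorted set name (e.g. 'schedule'); ARGV[1] = current time as a score.
POP_AND_PUSH = <<-LUA
  local msgs = redis.call('zrangebyscore', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, 1)
  if msgs[1] then
    local queue = cjson.decode(msgs[1])['queue']
    redis.call('zrem', KEYS[1], msgs[1])
    redis.call('sadd', 'queues', queue)
    redis.call('rpush', 'queue:' .. queue, msgs[1])
  end
  return msgs[1]
LUA

message = conn.eval(POP_AND_PUSH, :keys => ['schedule'], :argv => [Time.now.to_f])
```

Each eval either moves one job or returns nil, so a poller crash can no longer strand a job between the pop and the push.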

lib/sidekiq/scheduled.rb
- conn.sadd('queues', msg['queue'])
- conn.rpush("queue:#{msg['queue']}", message)
+ while message = conn.zrangebyscore(sorted_set, '-inf', now, :limit => [0, 1]).first do
+ if message

@mperham

mperham Apr 11, 2013

Owner

I think the if is redundant, yes? The while should stop if message is falsy.

@bdurand

bdurand Apr 11, 2013

Contributor

Yes, the if predates the while. I'll clean it up.

lib/sidekiq/scheduled.rb
+ conn.sadd('queues', msg['queue'])
+ conn.rpush("queue:#{msg['queue']}", message)
+ end
+ logger.debug("enqueued #{sorted_set}: #{message}") if logger.debug?

@mperham

mperham Apr 11, 2013

Owner

Use the block form: logger.debug { " ... " } so you don't need the if.

@bdurand

bdurand Apr 11, 2013

Contributor

Actually the if is significantly more efficient than the block form. It still has to create the block object to call debug with. I figured since I was making the loop less efficient I'd gain a microsecond back by making the logging more efficient 😉
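
For context, the two forms under discussion, using the line from the diff above:

```ruby
# Guard form: the interpolated string is only built when debug is enabled,
# at the cost of an explicit level check at the call site.
logger.debug("enqueued #{sorted_set}: #{message}") if logger.debug?

# Block form: the block is only invoked (and the string only built) when
# the logger's level allows it.
logger.debug { "enqueued #{sorted_set}: #{message}" }
```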

@mperham
Owner

mperham commented Apr 11, 2013

Yeah, you're right about the Redis commands. Maybe it's by design, but this loop feels dirty and I guess we can't make it any cleaner.

Would you add a comment with your five-item list so the scheduler loop is documented in the code? Tests look fine.

@coveralls

coveralls commented Apr 11, 2013

Coverage increased (+0.45%) when pulling 8918096 on weheartit:master into 387a646 on mperham:master.

mperham added a commit that referenced this pull request Apr 12, 2013

Merge pull request #843 from weheartit/master
More atomic poller operations

@mperham mperham merged commit 5a2ea7a into mperham:master Apr 12, 2013

1 check failed: default (The Travis build failed)
@mperham
Owner

mperham commented Apr 12, 2013

Thank you for the hard work in finding and fixing this issue!
