Pro Reliability Server

Mike Perham edited this page Nov 22, 2016 · 15 revisions

Sidekiq does what it can to avoid losing jobs. When it shuts down, it pushes any unfinished jobs back to Redis. 99% of the time, that's sufficient. But there are limits: a job is removed from Redis and held only in process memory while it executes, so if the process crashes or network connectivity goes down, the job can be lost.

To handle those edge cases, the job must remain in Redis while Sidekiq executes it. Sidekiq Pro provides three different algorithms to do just that. To activate one, add this to your initializer:

Sidekiq.configure_server do |config|
  # uncomment one!
  # config.reliable_fetch!
  # config.timed_fetch!
  # config.super_fetch!
end

reliable_fetch

This is the algorithm that Sidekiq Pro has provided from Day 1. It uses the Redis RPOPLPUSH command to move each job into a private queue for its process, where the job stays while it executes.
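To make the private-queue idea concrete, here is a minimal sketch of the RPOPLPUSH pattern. It is not Sidekiq Pro's implementation: `FakeRedis` is a hypothetical in-memory stand-in that mimics only the list commands needed, and the queue names are made up for illustration.

```ruby
# Hypothetical in-memory stand-in for a few Redis list commands.
# It is NOT the redis gem; it only mimics LPUSH, RPOPLPUSH and LREM (one match).
class FakeRedis
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  # LPUSH: add a value at the head of the list.
  def lpush(key, value)
    @lists[key].unshift(value)
  end

  # RPOPLPUSH: move the tail of `source` to the head of `destination`.
  def rpoplpush(source, destination)
    value = @lists[source].pop
    @lists[destination].unshift(value) unless value.nil?
    value
  end

  # LREM with a count of 1: remove the first matching value (the acknowledgement).
  def lrem(key, value)
    index = @lists[key].index(value)
    @lists[key].delete_at(index) if index
  end

  # Return a copy of the list for inspection.
  def list(key)
    @lists[key].dup
  end
end

redis         = FakeRedis.new
public_queue  = "queue:default"
private_queue = "queue:default_host1_0" # assumed naming: hostname plus process index

redis.lpush(public_queue, "job-1")

# Fetch: the job moves to the private queue instead of leaving the store.
job = redis.rpoplpush(public_queue, private_queue) # => "job-1"

# Suppose the process crashes here, before acknowledging with lrem.
redis.list(private_queue) # => ["job-1"], the job survived the crash

# On restart, a process with the same hostname and index pushes it back.
redis.rpoplpush(private_queue, public_queue)
redis.list(public_queue) # => ["job-1"]
```

The restart step only finds the job if the new process computes the same private queue name, which is why this approach needs stable hostnames and a unique per-process index.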

Pros

  • Scales to 10,000+ jobs/sec because it uses O(1) operations
  • Works with short or long jobs, 75ms or 75 minutes.
  • Old and battle tested

Cons

  • Requires stable hostnames and a unique index per-process
  • Does not work well with Heroku, Docker, Amazon's ECS or Elastic Beanstalk
  • Susceptible to "poison pill" jobs
  • Not easy to autoscale

Good choice if you are running in the traditional manner on your own servers, virtual or physical. Avoid it if you are using containers or a PaaS like Heroku. Because unfinished jobs are retried when the process restarts, a job that can crash the Ruby VM (a "poison pill") will crash your processes non-stop until it is removed manually.

timed_fetch

This is a new algorithm introduced in Sidekiq Pro v3.1. It stores jobs within a "pending" area with a timeout. If the job execution is not finished and acknowledged by the Sidekiq process within that timeout period, the job can be pushed back onto the queue for another process to pick up.
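The pending-area behaviour can be sketched in plain Ruby. This is an illustration under assumptions, not Sidekiq Pro's code: `TimedFetchSketch` and its method names are hypothetical, the pending area is a Hash standing in for whatever Redis structure the real implementation uses, and time is passed in as a number of seconds so the example is deterministic.

```ruby
# Hypothetical model of a fetch with a per-job timeout.
class TimedFetchSketch
  attr_reader :queue, :pending

  def initialize(timeout:)
    @timeout = timeout # seconds a job may stay pending before it is requeued
    @queue   = []      # public queue: push at the front, fetch from the back
    @pending = {}      # job => deadline (assumed representation of the pending area)
  end

  def push(job)
    @queue.unshift(job)
  end

  # Take a job and record the time by which it must be acknowledged.
  def fetch(now)
    job = @queue.pop
    @pending[job] = now + @timeout unless job.nil?
    job
  end

  # Acknowledge a finished job so it is never requeued.
  def ack(job)
    @pending.delete(job)
  end

  # Push every job whose deadline has passed back onto the public queue.
  def requeue_expired(now)
    expired = @pending.select { |_job, deadline| deadline <= now }.keys
    expired.each do |job|
      @pending.delete(job)
      @queue.unshift(job)
    end
    expired
  end
end

fetcher = TimedFetchSketch.new(timeout: 3600)
fetcher.push("job-1")
fetcher.fetch(0)              # => "job-1"; suppose the process crashes before ack
fetcher.requeue_expired(1800) # => [] (timeout not reached yet)
fetcher.requeue_expired(3600) # => ["job-1"] (available to another process)
```

The same mechanism explains the main drawback: a healthy job that simply runs longer than the timeout is indistinguishable from a lost one, so it would be requeued and executed again.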

Pros

  • No special configuration required
  • Works in every deployment environment, containers or not
  • Handles "poison pills" gracefully
  • Works well with autoscaling

Cons

  • Less scalable because it uses O(log N) operations
  • All jobs must finish within the global job timeout or they can be re-executed

Good choice for anyone processing fewer than 50M jobs/day or wanting to use containers. Jobs which crash the Ruby VM ("poison pills") are not retried until the timeout is up (one hour by default), so they can crash Sidekiq only once per hour rather than non-stop.

super_fetch

This is the newest algorithm and attempts to solve the drawbacks of the other two. It stores jobs within a private queue for each process while they execute. If a process dies, its private queues are cleaned up and any lingering jobs are pushed back to the public queues for re-execution.
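The cleanup step can be sketched as follows. This is a rough model under stated assumptions, not Sidekiq Pro's implementation: `SuperFetchSketch`, its methods, and the worker names are hypothetical, and the way a dead process is detected (here, an explicit `mark_dead` call standing in for something like a missed heartbeat) is an assumption the text above does not specify.

```ruby
# Hypothetical model of private queues with orphan recovery.
class SuperFetchSketch
  attr_reader :public_queue, :private_queues

  def initialize
    @public_queue   = []
    @private_queues = Hash.new { |hash, id| hash[id] = [] }
    @alive          = {} # process id => true while the process is considered alive
  end

  def register(process_id)
    @alive[process_id] = true
  end

  # Assumed stand-in for liveness detection, e.g. an expired heartbeat.
  def mark_dead(process_id)
    @alive.delete(process_id)
  end

  def push(job)
    @public_queue.unshift(job)
  end

  # Move a job from the public queue into the caller's private queue.
  def fetch(process_id)
    job = @public_queue.pop
    @private_queues[process_id].unshift(job) unless job.nil?
    job
  end

  # Acknowledge a finished job by removing it from the private queue.
  def ack(process_id, job)
    queue = @private_queues[process_id]
    index = queue.index(job)
    queue.delete_at(index) if index
  end

  # Any surviving process can run this: jobs left in the private queues
  # of dead processes go back to the public queue.
  def recover_orphans
    recovered = []
    dead_ids = @private_queues.keys.reject { |id| @alive[id] }
    dead_ids.each do |id|
      orphans = @private_queues.delete(id)
      @public_queue.unshift(*orphans)
      recovered.concat(orphans)
    end
    recovered
  end
end

sketch = SuperFetchSketch.new
sketch.register("worker-a")
sketch.register("worker-b")
sketch.push("job-1")
sketch.fetch("worker-a")     # => "job-1", now in worker-a's private queue
sketch.mark_dead("worker-a") # e.g. the container was killed
sketch.recover_orphans       # => ["job-1"]
sketch.fetch("worker-b")     # => "job-1"
```

Because recovery is driven by liveness rather than by a restarted process reclaiming a queue under the same name, nothing here depends on stable hostnames, and no job timeout is involved.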

Pros

  • No special configuration required
  • Works in every deployment environment, containers or not
  • Handles "poison pills" gracefully
  • Works well with autoscaling
  • Uses O(1) operations
  • Jobs can take any amount of time

Cons

  • Young: introduced in Sidekiq Pro 3.4. It will replace reliable_fetch in Sidekiq Pro 4.0.