Reliability

Mike Perham edited this page Jun 5, 2018 · 64 revisions

There are three aspects of reliability with Sidekiq and Redis:

  1. pushing jobs to Redis with the client, see the client reliability page.
  2. fetching jobs from Redis with the server, see below.
  3. scheduling jobs, see below.

Setup

TL;DR To use the Reliability features in Sidekiq Pro, add this to your initializer:

Sidekiq::Client.reliable_push! unless Rails.env.test?

Sidekiq.configure_server do |config|
  config.super_fetch!
  config.reliable_scheduler!
end

Read on for more detail.

Using super_fetch

Sidekiq uses BRPOP to fetch a job from the queue in Redis. This is efficient and simple, but it has one drawback: once fetched, the job is removed from Redis. If Sidekiq crashes while processing that job, it is lost forever. This is not a problem for many applications, but some businesses need absolute reliability when processing jobs.

Sidekiq does its best to never lose jobs but it can't guarantee it; the only way to guarantee job durability is to not remove a job from Redis until it is complete. For instance, if Sidekiq is restarted mid-job, it will try to push the unfinished jobs back to Redis, but networking issues can prevent this.

Sidekiq Pro offers an alternative fetch strategy, super_fetch, for job processing using Redis' RPOPLPUSH command which keeps jobs in Redis. To enable super_fetch:

Sidekiq.configure_server do |config|
  # This needs to be within the configure_server block
  config.super_fetch!
end

When Sidekiq starts, you should see SuperFetch activated:

INFO: Sidekiq Pro 3.5.0, commercially licensed.  Thanks for your support!
INFO: Booting Sidekiq 5.0.0 with redis options {:url=>nil}
INFO: Starting processing, hit Ctrl-C to stop
INFO: SuperFetch activated
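
The difference between the two fetch strategies can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the Hash of Arrays and the three lambdas are hypothetical stand-ins for the Redis list commands, the "queue:default:working" list name is made up for illustration, and none of it is Sidekiq Pro's actual implementation.

```ruby
# Toy in-memory model of Redis lists: a Hash of Arrays (not a real Redis client).
lists = Hash.new { |hash, key| hash[key] = [] }

# LPUSH: add a value to the head of a list.
lpush = ->(key, value) { lists[key].unshift(value) }

# RPOP: remove and return the tail of a list (what BRPOP yields once a job is available).
rpop = ->(key) { lists[key].pop }

# RPOPLPUSH: move the tail of `source` to the head of `destination` as one operation.
rpoplpush = lambda do |source, destination|
  value = lists[source].pop
  lists[destination].unshift(value) unless value.nil?
  value
end

# Basic fetch: after the pop, the job exists only in the worker's memory.
lpush.call("queue:default", "job-1")
basic_job = rpop.call("queue:default")
basic_left_in_redis = lists["queue:default"].dup

# RPOPLPUSH-style fetch: the job moves to a working list (name is an assumption)
# and stays in the data store until the worker finishes and removes it.
lpush.call("queue:default", "job-2")
super_job = rpoplpush.call("queue:default", "queue:default:working")
still_in_redis = lists["queue:default:working"].dup
```

In this model, a crash after the basic fetch leaves no trace of "job-1" in the data store, while "job-2" is still present in the working list and can be found again later.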

Recovering Jobs

When a Sidekiq process dies, its jobs in progress become orphans. On process startup, super_fetch will look for orphaned jobs:

  1. if the process's heartbeat has expired (it takes 60 seconds to expire); AND
  2. if an hour has passed since the last orphan check

The orphan check requires a complete SCAN of the Redis database; it can take a substantial amount of time (i.e. over a few seconds) if your Redis database has a lot of keys. As always, I recommend using a separate Redis database or instance for cache data vs job data. The hour buffer prevents Sidekiq from slamming Redis with constant SCANs and ensures that you don't have a continual cycle of process death due to poison pill jobs.

In summary, super_fetch might recover jobs in 5 minutes or 3 hours; there's no guarantee. Restarting a process is the best way to signal Sidekiq Pro to look for orphans.
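
The two conditions above can be read as a single predicate. The following is a minimal sketch, not Sidekiq Pro's code: the method name and its keyword arguments are hypothetical, and only the 60 second heartbeat expiry and the one hour check interval come from the text.

```ruby
# Hypothetical helper: true when both conditions from the text hold.
# heartbeat_ttl and check_interval default to the values given in the text.
def recover_orphans?(now:, last_heartbeat_at:, last_orphan_check_at:,
                     heartbeat_ttl: 60, check_interval: 3600)
  heartbeat_expired = (now - last_heartbeat_at) >= heartbeat_ttl
  check_due = (now - last_orphan_check_at) >= check_interval
  heartbeat_expired && check_due
end

now = Time.at(10_000)

# Dead process, but the last orphan check ran 10 minutes ago: wait.
too_soon = recover_orphans?(now: now, last_heartbeat_at: now - 120,
                            last_orphan_check_at: now - 600)

# Dead process and more than an hour since the last check: recover.
eligible = recover_orphans?(now: now, last_heartbeat_at: now - 120,
                            last_orphan_check_at: now - 4000)

# Heartbeat seen 5 seconds ago: the process is still considered alive.
alive = recover_orphans?(now: now, last_heartbeat_at: now - 5,
                         last_orphan_check_at: now - 4000)
```

Because both conditions must hold, the time between a process dying and its jobs being picked up again depends on when the last check happened, which is why recovery time varies so widely.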

Fetch algorithms

super_fetch supports the same two queue prioritization mechanisms as Sidekiq's basic fetch: strict priority and weighted random.

Strict ordering

sidekiq -e production -q critical -q default -q bulk

Beware that strict ordering can lead to starvation: bulk jobs will only be processed once the critical and default queues are empty. You can reverse the ordering for some processes to ensure every queue gets processed:

sidekiq -e production -q critical -q default -q bulk
sidekiq -e production -q bulk -q default -q critical
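
As a rough illustration of why strict ordering starves later queues, here is a toy model that always takes from the first non-empty queue in the configured order. The `fetch_strict` helper and the in-memory Hash of queues are hypothetical, not Sidekiq's fetch code.

```ruby
# Hypothetical helper: return [queue_name, job] from the first non-empty
# queue in `order`, or nil when every queue is empty.
def fetch_strict(queues, order)
  order.each do |name|
    job = queues.fetch(name, []).shift
    return [name, job] if job
  end
  nil
end

queues = { "critical" => ["c1"], "default" => ["d1"], "bulk" => ["b1", "b2"] }
order = %w[critical default bulk]

# Drain everything and record the order in which jobs were picked up.
processed = []
while (hit = fetch_strict(queues, order))
  processed << hit.last
end
```

Here `processed` ends up as `["c1", "d1", "b1", "b2"]`: no bulk job runs until critical and default are empty. A second process started with the reversed order would drain bulk first.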

Weighted random

sidekiq -e production -q critical,3 -q default,2 -q bulk,1

When using weighted ordering, sidekiq will randomly choose a queue to check, without blocking, using weighted random choice. For example, in the command given above, sidekiq will sample from the array ["critical", "critical", "critical", "default", "default", "bulk"] so critical will be checked first 50% of the time.
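
The sampling described above can be reproduced in a few lines. This is a sketch of the idea only: the variable names are illustrative, and producing a full check order by shuffling the expanded array and dropping repeats is an assumption about one reasonable approach, not a statement of how super_fetch is implemented.

```ruby
weights = { "critical" => 3, "default" => 2, "bulk" => 1 }

# Expand each queue name by its weight: the array the text describes.
pool = weights.flat_map { |name, weight| [name] * weight }

# Probability that a given queue is the first one checked.
first_check_odds = weights.transform_values { |w| Rational(w, pool.size) }

# One possible check order for a single fetch: shuffle the pool, then drop
# repeats so each queue appears once. The seed only makes the run repeatable.
check_order = pool.shuffle(random: Random.new(42)).uniq
```

With these weights `first_check_odds["critical"]` is 1/2, matching the 50% figure above, while default and bulk are checked first 1/3 and 1/6 of the time respectively.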

Scheduler

Sidekiq's default scheduler is not atomic: it pops jobs off the scheduled queue and enqueues them with two network round trips. Sidekiq Pro offers a reliable scheduler which uses Lua to perform the same task atomically:

Sidekiq.configure_server do |config|
  config.reliable_scheduler!
end

This feature is optional but highly recommended. Its drawback is that client-side middleware is not invoked when enqueuing the scheduled jobs, since the entire operation takes place within Redis. It is not safe to enable if you are running Redis Cluster.
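
To see why the two round trips matter, consider a toy model in which the schedule is an Array of [run_at, job] pairs and the queue is a plain Array. Both helpers below are hypothetical illustrations, not Sidekiq's scheduler: the first simulates a process dying between the two steps, and the second stands in for a Lua script that Redis runs as a single unit.

```ruby
# Two-step version: remove from the schedule, then push onto the queue.
# A crash between the two steps loses the job.
def enqueue_due_two_step(scheduled, queue, now, crash_between_steps: false)
  due = scheduled.select { |run_at, _job| run_at <= now }
  due.each do |entry|
    scheduled.delete(entry)                      # round trip 1
    raise "process died" if crash_between_steps  # simulated crash
    queue << entry.last                          # round trip 2
  end
end

# Single-step version: nothing can interrupt the move from schedule to queue
# in this model, mirroring a script executed atomically inside Redis.
def enqueue_due_atomic(scheduled, queue, now)
  due, pending = scheduled.partition { |run_at, _job| run_at <= now }
  scheduled.replace(pending)
  queue.concat(due.map(&:last))
end

scheduled = [[100, "email"], [900, "report"]]
queue = []
begin
  enqueue_due_two_step(scheduled, queue, 500, crash_between_steps: true)
rescue RuntimeError
  # The simulated crash: "email" is now in neither the schedule nor the queue.
end
lost = scheduled.none? { |_run_at, job| job == "email" } && !queue.include?("email")

scheduled_atomic = [[100, "email"], [900, "report"]]
queue_atomic = []
enqueue_due_atomic(scheduled_atomic, queue_atomic, 500)
```

In the first run `lost` is true, since the job was removed from the schedule but never enqueued. In the second, the due job lands on the queue and the future job stays scheduled.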

Notes

  • super_fetch is more sensitive to Redis network latency than Sidekiq's default basic_fetch, especially if you have lots of queues and high concurrency. This can result in idle processor threads, starved for jobs. See the Using Redis page for tips on measuring Redis latency.
  • Older versions of Sidekiq Pro offered reliable_fetch and timed_fetch. These algorithms are now deprecated and no longer documented.