Ent Rate Limiting

Third-party APIs often enforce a rate limit, meaning you cannot call them faster than your SLA allows. Sidekiq Enterprise contains a rate limiting API with three styles of rate limiting: concurrent, window, and bucket. This feature requires Redis 2.8+.

Note: limiters are somewhat heavyweight to create, requiring a round-trip to Redis. If possible, create limiters once during startup and reuse them (as with ERP_THROTTLE below). They are thread-safe and designed to be shared.

Concurrent

The concurrent style means that only N concurrent operations can happen at any moment in time. For instance, I've used an ERP SaaS which limited each customer to 50 concurrent operations. Use a concurrent rate limiter to ensure your processes stay within that rate limit:

ERP_THROTTLE = Sidekiq::Limiter.concurrent('erp', 50, wait_timeout: 5, lock_timeout: 30)

class ErpWorker
  include Sidekiq::Worker

  def perform(*args)
    ERP_THROTTLE.within_limit do
      # call the ERP API; at most 50 calls run concurrently across all processes
    end
  end
end

Since concurrent access has to hold a lock, the lock_timeout option ensures a crashed Ruby process does not hold a lock forever. You must ensure that your operations take less than this number of seconds. After lock_timeout seconds, the lock can be reclaimed by another thread wanting to perform an operation.
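
For example, if a single ERP call can take up to a minute, size lock_timeout above that worst case. A minimal sketch; the name and values here are illustrative:

# calls may run up to ~60 seconds, so allow 90 before the lock can be reclaimed
SLOW_ERP = Sidekiq::Limiter.concurrent('slow-erp', 10, wait_timeout: 5, lock_timeout: 90)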

You can use a concurrent limiter of size 1 to make a distributed mutex, ensuring that only one process can execute a block at a time.
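
A minimal sketch of that pattern; the 'nightly-report' name and timeout values are illustrative:

REPORT_MUTEX = Sidekiq::Limiter.concurrent('nightly-report', 1, wait_timeout: 10, lock_timeout: 60)

# at most one thread across the entire cluster runs this block at a time
REPORT_MUTEX.within_limit do
  # generate the nightly report exactly once
end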

Concurrent limiters will pause up to wait_timeout seconds for a lock to become available. This API is blocking and as efficient as possible: unlike most other Redis locking or mutex libraries, it does not poll. Blocking ensures the lock is handed to a waiter within milliseconds of being released.

Concurrent Metrics

The concurrent rate limiter tracks the following metrics:

  • Held - the number of times this limiter held a lock, meaning the block was executed.
  • Held Time - total time locks were held, in seconds.
  • Immediate - the number of times a lock was available immediately, without waiting.
  • Waited - the number of times a worker had to wait for a lock to become available.
  • Wait Time - total time workers waited for a lock.
  • Overages - the number of times a block took longer than lock_timeout to execute. This is bad.
  • Reclaimed - the number of times another worker reclaimed a lock that was past its timeout. This is very bad and can lead to rate limit violations.

Bucket

Bucket means that each interval is a discrete bucket: you can perform 5 operations at 12:42:51.999 and then another 5 operations at 12:42:52.000 because the two timestamps fall into different buckets.

Here's an example using a bucket limiter of 30 per second (notice how the name includes the user's ID, making it a user-specific limiter). Let's say we want to call Stripe on behalf of a user:

def perform(user_id)
  user_throttle = Sidekiq::Limiter.bucket("stripe-#{user_id}", 30, :second, wait_timeout: 5)
  user_throttle.within_limit do
    # call stripe with user's account creds
  end
end

The limiter will try to perform the operation once per second until wait_timeout has elapsed or the rate limit is satisfied. It calls sleep to achieve this, so the worker thread is paused during that sleep.

You can also use :minute, :hour or :day buckets, but they will not sleep until the next interval and retry the operation; instead they immediately raise Sidekiq::Limiter::OverLimit.
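
When you call one of these larger-interval limiters outside of a Sidekiq job, you must handle that error yourself. A minimal sketch, assuming an illustrative 'reports' limiter of 100 per hour:

hourly = Sidekiq::Limiter.bucket('reports', 100, :hour)

begin
  hourly.within_limit do
    # call the reporting API
  end
rescue Sidekiq::Limiter::OverLimit
  # hourly budget spent; inside a Sidekiq job the middleware
  # catches this and reschedules for you (see "Take it to the Limit")
end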

You can see recent usage history for bucket limiters in the Web UI.

Window

Window means that each interval is a sliding window: you can perform N operations at 12:42:51.999 but can't perform another N operations until 12:42:52.999.

Here's an example using a window limiter of 5 per second (notice how the name includes the user's ID, making it a user-specific limiter). Let's say we want to call Stripe on behalf of a user:

def perform(user_id)
  user_throttle = Sidekiq::Limiter.window("stripe-#{user_id}", 5, :second, wait_timeout: 5)
  user_throttle.within_limit do
    # call stripe with user's account creds
  end
end

A :second limiter will try to perform the operation every half second until wait_timeout has elapsed or the rate limit is satisfied. It calls sleep to achieve this, so the worker thread is paused during that sleep.

You can also use :minute, :hour or :day windows, but they will not sleep until the next interval and retry the operation; instead they immediately raise Sidekiq::Limiter::OverLimit.

In addition to the :second, :minute, :hour and :day symbols, window limiters can accept an arbitrary number of seconds for the window:

# allow 5 operations within a 30 second window
Sidekiq::Limiter.window("stripe-#{user_id}", 5, 30)

Limiting is not Throttling

Rate limiters do not slow down Sidekiq's job processing. If you push 1000 jobs to Redis, Sidekiq will run those jobs as fast as possible, which may cause many of them to fail with an OverLimit error. If you want to trickle jobs into Sidekiq slowly, the only way to do that is with manual scheduling. Here's how to schedule one job per second so that Sidekiq doesn't run all the jobs immediately:

1000.times do |index|
  # run the Nth job N seconds from now, i.e. one job per second
  SomeWorker.perform_in(index, some_args)
end

Remember that Sidekiq's scheduler only checks for scheduled jobs every 15 seconds on average, so you can still get a small clump of jobs running concurrently.
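
If that clumping matters, you can space the jobs wider than the scheduler's polling interval. An illustrative variation on the loop above:

1000.times do |index|
  # roughly one job every 20 seconds, wider than the ~15s scheduler poll
  SomeWorker.perform_in(index * 20, some_args)
end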

Take it to the Limit

If the rate limit is breached and cannot be satisfied within wait_timeout, the Limiter will raise Sidekiq::Limiter::OverLimit.

If you violate a rate limit within a Sidekiq job, Sidekiq will reschedule the job to run again soon using a linear backoff policy. After 20 rate limit failures (approximately one day), the middleware will treat the failing job as a regular retry, handled by Sidekiq's retry subsystem.

2015-05-28T23:25:23.159Z 73456 TID-oxf94yioo LimitedWorker JID-41c51a2123eef30dbad4544a INFO: erp over rate limit, rescheduling for later

Advanced Options

Place the Sidekiq::Limiter.configure block in your initializer to configure these options.

Back off

You can configure how the limiter subsystem backs off by providing your own custom proc:

Sidekiq::Limiter.configure do |config|
  # limiter is the Limiter instance that raised the OverLimit error.
  # job is the job hash; job['overrated'] is the number of times this
  # job has failed due to rate limiting.
  # By default, back off 5 minutes per rate limit failure, plus jitter.
  config.backoff = ->(limiter, job) do
    (300 * job['overrated']) + rand(300) + 1
  end
end

Redis

Rate limiting is unusually hard on Redis for a Sidekiq feature. For this reason, you might want to use a different Redis instance for the rate limiting subsystem as you scale up.

Rate limiting is shared by ALL processes using the same Redis configuration. If you have 50 Ruby processes connected to the same Redis instance, they will all use the same rate limits. You can configure the Redis instance used by rate limiting:

Sidekiq::Limiter.configure do |config|
  config.redis = { size: 10, url: 'redis://localhost/15' }
end

By default, the Sidekiq::Limiter API uses Sidekiq's default Redis pool so you don't need to configure anything.

Custom Errors

If you have a library which raises a custom exception to signify a rate limit failure, you can add it to the list of errors which trigger backoff:

Sidekiq::Limiter.configure do |config|
  config.errors << SomeLib::SlowDownPlease
end

TTL

By default, Limiter metadata expires after 90 days. If you are creating lots of dynamic limiters and want to minimize the memory overhead of millions of unused limiters, you can pass a ttl option with the number of seconds the limiter's data should live. I don't recommend a value lower than 24 hours.

# 2.weeks requires ActiveSupport; use 14 * 24 * 60 * 60 in plain Ruby
Sidekiq::Limiter.window("stripe-#{user_id}", 5, 30, ttl: 2.weeks)

Web UI

The Web UI contains a "Limits" tab which lists all limits configured in the system. Enable the tab by requiring the Enterprise web extensions:

require 'sidekiq/web'
require 'sidekiq-ent/web'

Concurrent limiters track a number of metrics and expose those metrics in the UI.

Bucket limiters track recent history so you can see a graph of recent usage.
