Ent Rate Limiting

Mike Perham edited this page Jun 21, 2018 · 29 revisions

Often 3rd party APIs will enforce a rate limit, meaning you cannot call them faster than your SLA allows. Sidekiq Enterprise contains a rate limiting API with three styles of rate limiting: concurrent, window and bucket. This feature requires Redis 2.8+.

The rate limiting API works in any Ruby process. It's not specific to Sidekiq jobs or limited to use within perform. For example, you can use this API to rate limit requests within Puma.
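As a sketch of using a limiter outside of a job, here is a minimal Rack app that guards an upstream call with a window limiter. This assumes sidekiq-ent is loaded; the 'search-api' name, the limits, and the 429 handling are all illustrative choices, not part of Sidekiq's API.

```ruby
# Sketch: rate limiting inside a Rack/Puma app rather than a Sidekiq job.
# Assumes sidekiq-ent is loaded. Name and limits are illustrative.
SEARCH_LIMIT = Sidekiq::Limiter.window('search-api', 10, :second, wait_timeout: 2)

class SearchProxy
  def call(env)
    SEARCH_LIMIT.within_limit do
      # call the upstream search API here
      [200, { 'Content-Type' => 'text/plain' }, ['ok']]
    end
  rescue Sidekiq::Limiter::OverLimit
    # Outside of Sidekiq there is no middleware to reschedule anything,
    # so handle the exception yourself, e.g. with an HTTP 429.
    [429, { 'Content-Type' => 'text/plain' }, ['Too Many Requests']]
  end
end
```

Note that outside of a Sidekiq job there is no retry middleware, so you must rescue Sidekiq::Limiter::OverLimit yourself.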

Note: create limiters once during startup and reuse them (as with ERP_LIMIT below). They are thread-safe and designed to be shared.

Concurrent

The concurrent style means that only N concurrent operations can happen at any moment in time. For instance, I've used an ERP SaaS which limited each customer to 50 concurrent operations. Use a concurrent rate limiter to ensure your processes stay within that rate limit:

ERP_LIMIT = Sidekiq::Limiter.concurrent('erp', 50, wait_timeout: 5, lock_timeout: 30)

def perform(...)
  ERP_LIMIT.within_limit do
    # call ERP
  end
end

Since concurrent access has to hold a lock, the lock_timeout option ensures a crashed Ruby process does not hold a lock forever. You must ensure that your operations take less than this number of seconds. After lock_timeout seconds, the lock can be reclaimed by another thread wanting to perform an operation.

You can use a concurrent limiter of size 1 to make a distributed mutex, ensuring that only one process can execute a block at a time.
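A sketch of that mutex pattern (assumes sidekiq-ent is loaded; the limiter name and timeout values are illustrative):

```ruby
# A distributed mutex: a concurrent limiter of size 1.
# Only one thread across the entire cluster runs the block at a time.
NIGHTLY_MUTEX = Sidekiq::Limiter.concurrent('nightly-report', 1,
                                            wait_timeout: 10, lock_timeout: 60)

def perform
  NIGHTLY_MUTEX.within_limit do
    # exclusive section: at most one process generates the report at once.
    # Keep this well under lock_timeout (60s here) or the lock may be reclaimed.
  end
end
```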

Concurrent limiters will pause up to wait_timeout seconds for a lock to become available. This API is blocking and as efficient as possible: unlike most other Redis locking or mutex libraries, it does not poll. Blocking ensures the lock is handed to a waiter within milliseconds of its release.

Concurrent Metrics

The concurrent rate limiter tracks the following metrics:

  • Held - the number of times this limiter held a lock, meaning the block was executed.
  • Held Time - total time locks were held, in seconds.
  • Immediate - the number of times a lock was available immediately, without waiting.
  • Waited - the number of times a worker had to wait for a lock to become available.
  • Wait Time - total time workers spent waiting for a lock.
  • Overages - the number of times a block took longer than lock_timeout to execute; this is bad.
  • Reclaimed - the number of times another worker reclaimed a lock that was past its timeout; this is very bad and can lead to rate limit violations.

Bucket

Bucket means that each interval is a bucket: you can perform 5 operations at 12:42:51.999 and then another 5 operations at 12:42:52.000 because they are tracked in a different bucket.
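That boundary behavior can be sketched in plain Ruby. This models the per-second bucketing semantics only; it is not Sidekiq's internal key format.

```ruby
require 'time'

# For a :second bucket, operations are counted per wall-clock second,
# so the bucket key is effectively the timestamp truncated to the second.
def second_bucket(time)
  time.to_i # drops sub-second precision
end

a = Time.parse('2018-06-21 12:42:51.999')
b = Time.parse('2018-06-21 12:42:52.000')
second_bucket(a) == second_bucket(b) # => false: 1ms apart, different buckets
```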

Here's an example using a bucket limiter of 30 per second (notice how the name includes the user's ID, making it a user-specific limiter). Let's say we want to call Stripe on behalf of a user:

def perform(user_id)
  user_throttle = Sidekiq::Limiter.bucket("stripe-#{user_id}", 30, :second, wait_timeout: 5)
  user_throttle.within_limit do
    # call stripe with user's account creds
  end
end

The limiter will retry the operation once per second until wait_timeout has elapsed or the rate limit is satisfied. It calls sleep to achieve this, so the worker thread is paused during that time. If wait_timeout elapses, the limiter raises Sidekiq::Limiter::OverLimit. That exception is caught in middleware, which automatically reschedules the job for the future based on the limiter's config.backoff result. If an individual job is rescheduled by the limiter more than 20 times, the OverLimit is re-raised as if it were a job failure and the job is retried as usual.

You can also use :minute, :hour, or :day buckets, but these will not sleep until the next interval and retry the operation; instead they immediately raise Sidekiq::Limiter::OverLimit, and the job is rescheduled as above, subject to the limit of 20 reschedules.

You can see recent usage history for bucket limiters in the Web UI.

Window

Window means that each interval is a sliding window: you can perform N operations at 12:42:51.999 but can't perform another N operations until 12:42:52.999.

Here's an example using a window limiter of 5 per second (notice how the name includes the user's ID, making it a user-specific limiter). Let's say we want to call Stripe on behalf of a user:

def perform(user_id)
  user_throttle = Sidekiq::Limiter.window("stripe-#{user_id}", 5, :second, wait_timeout: 5)
  user_throttle.within_limit do
    # call stripe with user's account creds
  end
end

In addition to :second, you can also use :minute, :hour, or :day intervals. No matter which interval is used, the limiter calls sleep(0.5) repeatedly until wait_timeout has elapsed or the rate limit is satisfied; the worker thread is paused during that sleep time. If wait_timeout elapses, the limiter raises Sidekiq::Limiter::OverLimit. That exception is caught in middleware, which automatically reschedules the job for the future based on the limiter's config.backoff result. If an individual job is rescheduled by the limiter more than 20 times, the OverLimit is re-raised as if it were a job failure and the job is retried as usual.

Note that if the wait_timeout value is shorter than the interval in seconds, the limiter will immediately raise Sidekiq::Limiter::OverLimit and the job will be rescheduled as above, subject to the limit of 20 reschedules. For example, with an interval of :minute, any wait_timeout value below 60 will cause an immediate OverLimit.

In addition to the :second, :minute, :hour and :day symbols, window limiters can accept an arbitrary number of seconds for the window:

# allow 5 operations within a 30 second window
Sidekiq::Limiter.window("stripe-#{user_id}", 5, 30)

Unlimited

The unlimited limiter is a rate limiter which always executes its block. This is useful for conditional rate limiting -- for example, admin users or customers at a certain tier of service don't have a rate limit.

ERP = Sidekiq::Limiter.concurrent("erp", 10)

def perform(...)
  lmtr = current_user.admin? ? Sidekiq::Limiter.unlimited : ERP
  lmtr.within_limit do
    # always executes for admins
  end
end

Limiting is not Throttling

Rate limiters do not slow down Sidekiq's job processing. If you push 1000 jobs to Redis, Sidekiq will run those jobs as fast as possible which may cause many of those jobs to fail with an OverLimit error. If you want to trickle jobs into Sidekiq slowly, the only way to do that is with manual scheduling. Here's how you can schedule 1 job per second to ensure that Sidekiq doesn't run all jobs immediately:

1000.times do |index|
  SomeWorker.perform_in(index, some_args)
end

Remember that Sidekiq's scheduler checks every 5 seconds on average, so you can still get a small clump of jobs running concurrently.
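The one-job-per-second pattern generalizes to other rates by dividing the index by the desired jobs-per-second. Here trickle_delays is a hypothetical helper, not part of Sidekiq:

```ruby
# Spread `count` jobs so roughly `per_second` of them run each second.
def trickle_delays(count, per_second)
  (0...count).map { |index| index / per_second } # delay in whole seconds
end

delays = trickle_delays(1000, 5)
delays.first # => 0
delays.last  # => 199, so 1000 jobs span ~200 seconds
# In a real app:
# delays.each { |d| SomeWorker.perform_in(d, some_args) }
```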

Take it to the Limit

If the rate limit is breached and cannot be satisfied within wait_timeout, the Limiter will raise Sidekiq::Limiter::OverLimit.

If you violate a rate limit within a Sidekiq job, Sidekiq will reschedule the job to run again soon using a linear backoff policy. After 20 rate limit failures (approx one day), the middleware will treat the failing job as a retry.
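The "approx one day" figure follows from the default backoff shown under Advanced Options, (300 * overrated) + rand(300) + 1 seconds. A quick sanity check, assuming overrated counts 1 through 20:

```ruby
# Deterministic part of the default backoff, summed over 20 reschedules.
base = (1..20).sum { |overrated| 300 * overrated }
base          # => 63000 seconds
base / 3600.0 # => 17.5 hours
# Each reschedule adds rand(300) + 1 extra seconds (at most 300 + 1),
# so the worst case is about (63000 + 20 * 301) / 3600.0, roughly 19.2 hours --
# on the order of a day.
```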

2015-05-28T23:25:23.159Z 73456 TID-oxf94yioo LimitedWorker JID-41c51a2123eef30dbad4544a INFO: erp over rate limit, rescheduling for later

Advanced Options

Place the Sidekiq::Limiter.configure block in your initializer to configure these options.

Back off

You can configure how the limiter subsystem backs off by providing your own custom proc:

Sidekiq::Limiter.configure do |config|
  # job is the job hash, 'overrated' is the number of times we've failed due to rate limiting
  # limiter is the associated limiter that raised the OverLimit error
  # By default, back off 5 minutes for each rate limit failure
  config.backoff = ->(limiter, job) do
    (300 * job['overrated']) + rand(300) + 1
  end
end

Redis

Rate limiting is unusually hard on Redis for a Sidekiq feature. For this reason, you might want to use a different Redis instance for the rate limiting subsystem as you scale up.

Rate limiting is shared by ALL processes using the same Redis configuration. If you have 50 Ruby processes connected to the same Redis instance, they will all use the same rate limits. You can configure the Redis instance used by rate limiting:

Sidekiq::Limiter.configure do |config|
  config.redis = { size: 10, url: 'redis://localhost/15' }
end

By default, the Sidekiq::Limiter API uses Sidekiq's default Redis pool so you don't need to configure anything.

Testing

The unlimited limiter does not use Redis so you can conditionally use it anywhere (like a test suite) where you don't want to require Redis or accidentally trip rate limits.

def test_myworker
  my = MyWorker.new
  my.limiter = Sidekiq::Limiter.unlimited
  my.perform(...)
end

Custom Errors

If you have a library which raises a custom exception to signify a rate limit failure, you can add it to the list of errors which trigger backoff:

Sidekiq::Limiter.configure do |config|
  config.errors << SomeLib::SlowDownPlease
end

TTL

By default, Limiter metadata expires after 90 days. If you are creating lots of dynamic limiters and want to minimize the memory overhead of having millions of unused limiters, you can pass in a ttl option with the number of seconds to live. I don't recommend a value lower than 24 hours.

Sidekiq::Limiter.window("stripe-#{user_id}", 5, 30, ttl: 2.weeks) # 2.weeks requires ActiveSupport; a plain Integer number of seconds also works

Web UI

The Web UI contains a "Limits" tab which lists all limits configured in the system. Enable the tab by requiring the Enterprise web extensions:

require 'sidekiq/web'
require 'sidekiq-ent/web'

Concurrent limiters track a number of metrics and expose those metrics in the UI.


Bucket limiters track recent history so you can see a graph of recent usage.

