Redesign Sidekiq internals #2593

merged 58 commits into master from internal_rewrite on Oct 28, 2015


mperham commented Oct 7, 2015

My recent experiment to remove Celluloid has reaped huge rewards. This branch does several things:

  1. Rewrite Sidekiq actors to use raw Threads and data structures.
  2. Make Processor#stats asynchronous, which removes two Redis round trips per job.
  3. Redesign job fetch so each Processor fetches its own job in parallel; this (along with (2)) makes the system much more resilient to higher Redis latency.
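The "raw Threads and data structures" actor pattern in (1) can be sketched roughly like this; the class and method names here are illustrative only, not Sidekiq's actual internals:

```ruby
# Hypothetical sketch: a plain Thread draining a Queue, instead of a
# Celluloid actor with a mailbox. Terminate wakes the thread with a
# sentinel so it can observe the done flag and exit cleanly.
class TinyActor
  def initialize(mailbox)
    @mailbox = mailbox
    @done = false
    @thread = nil
  end

  def start
    @thread ||= Thread.new { process }
  end

  def terminate
    @done = true
    @mailbox << nil   # wake the thread so it notices @done
    @thread.join
  end

  private

  def process
    until @done
      msg = @mailbox.pop
      msg.call if msg
    end
  end
end

results = Queue.new
mailbox = Queue.new
actor = TinyActor.new(mailbox)
actor.start
3.times { |i| mailbox << -> { results << i * 2 } }
out = Array.new(3) { results.pop }.sort
actor.terminate
```

The appeal of this shape over an actor framework is that there is no mailbox proxy or supervision machinery between the thread and its work queue, just `Queue#pop`.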

Processing 100,000 no-op jobs with one process and 25 threads on MRI 2.2:

  • Master: 125 sec
  • Without Celluloid: 57 sec
  • Without Celluloid and Async stats: 20 sec
  • With 1ms of latency, sequential fetch: 7 min
  • With 1ms of latency, parallel fetch: 20 sec

This work would be released as Sidekiq 4.0, with an ETA of Thanksgiving.


@ryansch ryansch commented on the diff Oct 7, 2015
- def initialize(boss)
- @boss = boss
+ def kill(wait=false)
+ # unlike the other actors, terminate does not wait
+ # for the thread to finish because we don't know how
+ # long the job will take to finish. Instead we
+ # provide a `kill` method to call after the shutdown
+ # timeout passes.
+ @thread.raise ::Sidekiq::Shutdown
ryansch Oct 7, 2015 Collaborator

I've read in a number of places that Thread#raise and Thread#kill are unsafe. Has that changed? If so, what would the minimum ruby version be?

mperham Oct 7, 2015 Owner

It's not safe but it's as safe as I can make it. Processor#kill is only called when the hard shutdown timeout has passed and the process must exit.

ryansch Oct 7, 2015 Collaborator

Gotcha. The whole building is coming down anyway.
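To make the tradeoff concrete, here is a minimal sketch of the hard-shutdown mechanism under discussion, modeled on Sidekiq's `Shutdown < Interrupt`. `Thread#raise` is unsafe in general (the exception can land inside an `ensure` block or mid-write), which is why it is reserved for the moment the process is about to exit anyway:

```ruby
# Illustrative only: inject a Shutdown exception into a busy thread
# after the hard shutdown timeout has passed.
class Shutdown < Interrupt; end

worker = Thread.new do
  begin
    sleep            # stands in for a job of unknown duration
  rescue Shutdown
    :killed
  end
end

sleep 0.01 until worker.status == "sleep"   # wait until the "job" is running
worker.raise Shutdown                       # the hard-timeout kill
result = worker.value
```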

added some commits Oct 7, 2015
@mperham Fix tests
@mperham Move fetching into the processor
This removes thread context switching and network delay.
@mperham fix merge bug 561d9d4
@mperham Add 1ms of latency to benchmarks 3d80580
@mperham merge load changes
@mperham load changes 862d44b
@mperham Bump to 4.0

Maciek Sakrejda commented Oct 7, 2015

Looks great! Did you by chance measure memory usage as well?

mperham commented Oct 8, 2015

8x less garbage generated!


added some commits Oct 8, 2015
@mperham Cleanup, tests passing
@mperham Fix quick shutdown, more cleanup 4a43865
@mperham Merge branch 'master' into internal_rewrite
@mperham Pick up concurrent-ruby fix
@mperham Fix tests
@tenderlove tenderlove commented on the diff Oct 9, 2015
- logger.error ex.message
- logger.error ex.backtrace.first
+ # Shut down this Fetcher instance, will pause until
+ # the thread is dead.
+ def terminate
+ @done = true
+ if @thread
+ t = @thread
+ @thread = nil
+ @queue << 0
+ t.value
+ end
+ end
+ def start
+ @thread ||= safe_thread("scheduler") do
tenderlove Oct 9, 2015

Is start meant to be thread safe? There's a read-check-write race here along with a race in terminate. Also since @thread isn't initialized, it will cause warnings.

tenderlove Oct 9, 2015

(probably should also raise an exception if @done is true when this method gets called)

mperham Oct 9, 2015 Owner

Poller#start should only ever be called once by the main thread when calling Launcher#start, so thread safety is not necessary here. There's definitely room for more defensive design.

tenderlove Oct 9, 2015

Presumably the allocated thread will just sleep while the queue is empty, so it might be possible to just make start private and call it from initialize. Then you can eliminate the conditional here and in terminate. < /armchairengineering >

mperham Oct 9, 2015 Owner

I like to be able to test the object's APIs without a background thread spinning up in the test suite.
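For reference, one hypothetical way to close the read-check-write race tenderlove points out (not the code Sidekiq shipped, per the exchange above) is to initialize `@thread` up front, which also avoids the uninitialized-ivar warning, and guard it with a Mutex:

```ruby
# Illustrative sketch: Mutex-guarded start/terminate so concurrent
# callers cannot spawn two threads, and start after terminate raises.
class Poller
  def initialize
    @thread = nil
    @done = false
    @lock = Mutex.new
  end

  def start
    @lock.synchronize do
      raise "already terminated" if @done
      @thread ||= Thread.new { sleep 0.01 until @done }
    end
  end

  def terminate
    t = @lock.synchronize do
      @done = true
      thread, @thread = @thread, nil
      thread
    end
    t&.join
  end
end

poller = Poller.new
5.times.map { Thread.new { poller.start } }.each(&:join)
poller.terminate
```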

tarcieri commented Oct 9, 2015

Have you considered using a ThreadPoolExecutor? It would give you a very fast and robust implementation on JRuby (now).

mperham commented Oct 9, 2015

@tarcieri I don't understand how a TPE would help vs me spinning up threads.

tarcieri commented Oct 9, 2015

@mperham on JRuby at least, it's backed by a java.util.concurrent.ThreadPoolExecutor and should have both lower thread-coordination overhead and lower memory overhead than anything that can be done in pure Ruby.
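As a rough illustration of what a thread-pool executor provides, here is a stdlib-only fixed pool draining one shared queue. `Concurrent::ThreadPoolExecutor` from concurrent-ruby has this same shape, and on JRuby delegates to java.util.concurrent, which is tarcieri's point:

```ruby
# Illustrative fixed-size pool: N worker threads pull tasks from a
# shared queue; shutdown sends one nil "poison pill" per worker.
class FixedPool
  def initialize(size)
    @queue = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        while (task = @queue.pop)
          task.call
        end
      end
    end
  end

  def post(&task)
    @queue << task
  end

  def shutdown
    @workers.size.times { @queue << nil }
    @workers.each(&:join)
  end
end

counter = Queue.new
pool = FixedPool.new(4)
100.times { pool.post { counter << 1 } }
pool.shutdown
total = counter.size
```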

added some commits Oct 9, 2015
@mperham merge master
@mperham code shuffling and cleanup
@mperham Promote Shutdown exception into main module
@mperham Move proctitles into CLI, so they can be accessed in startup events

Mikael Henriksson commented Oct 12, 2015

Solid! Looks like you were able to delete more code than you added 👍 Anything I need to worry about with the middleware? I couldn't see anything while going over the code.

mperham commented Oct 12, 2015

Most public APIs should be unchanged, including middleware. As of today. :-)



@mperham Great work here. 💃

@mperham Free lifecycle procs so they GC

@mperham What happens if the status updater/heartbeat thread dies? Could it be holding status info about finished jobs that never gets sent back to Redis? It seems like buffering stats creates a hole where data can be lost in a crash, resulting in even more double-run or stuck jobs.


@mperham Glad to see that you are dropping Celluloid :) -- good move!

@mperham Make global stats processing gracefully handle errors
mperham commented Oct 14, 2015

@drewblas I'm not sure what you are referring to. If you can be more explicit about a particular failure case, I can explain what should happen.

added some commits Oct 14, 2015
@mperham Remove old poll_interval attribute 9a062a4
@mperham initial 4.0 upgrade notes
@mperham Formatting


8x less garbage generated!

Is that because Celluloid was removed?

added some commits Oct 20, 2015
@mperham touch the pool before startup event
@mperham Verify Redis 2.8+ on startup d285f19
@mperham Verify server connection pool is large enough for new fetch design 907828c
@mperham Rework notes
@mperham For safety, flushdb Redis before each test 9736321
@mperham Found and killed the test suite instability.
Manager#processor_died test was spinning up a Processor in the background which ate jobs created in other tests!
mperham commented Oct 22, 2015

I did a quick Resque 1.25.2 benchmark for comparison. Time to process 10,000 noop jobs:

1 process, 96 sec
25 processes, 31 sec

Extrapolating to 100,000 jobs would be ~300 seconds. Sidekiq does the same in 20 seconds so Sidekiq's job overhead appears to be 15x less.
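A quick check of that extrapolation from the 25-process Resque numbers:

```ruby
# 31 sec for 10,000 jobs scales linearly to 100,000 jobs.
resque_100k  = 31.0 * (100_000 / 10_000)   # roughly 310 seconds
sidekiq_100k = 20.0
speedup = resque_100k / sidekiq_100k       # roughly 15x
```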

@mperham merge master
@JustinLove JustinLove referenced this pull request in JustinLove/autoscaler Oct 23, 2015

Sidekiq is moving to threads #55

added some commits Oct 23, 2015
@mperham Merge branch 'master' into internal_rewrite
@mperham Remove deprecated method 2c81c48
@mperham cleanup 18a513c
@mperham Change old "message" terminology to "job" 93dddd7
@mperham Prerelease version
@mperham Sidekiq Pro 3.0 release notes
@mperham Merge branch 'master' into internal_rewrite
added some commits Oct 28, 2015
@mperham ent changes
@mperham Merge branch 'master' into internal_rewrite
@mperham merge master
@mperham 4.0 changes
@mperham mperham merged commit 6ad6a3a into master Oct 28, 2015

0 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
continuous-integration/travis-ci/push The Travis CI build is in progress


narapon commented Oct 28, 2015

Excited to try this, good work!

hecbuma commented Oct 29, 2015

pretty nice 👍


@mperham Can we start using v4.0 now?

Thank you for all this work!! 👍

@mperham mperham deleted the internal_rewrite branch Oct 29, 2015

@mperham I guess the Processor:: namespace is missing here and below.

@fedenusy fedenusy referenced this pull request in alloy/lowdown May 12, 2016

Battle Testing #11
