Redesign Sidekiq internals #2593

Merged
merged 58 commits into from Oct 28, 2015

Projects

None yet
@mperham
Owner
mperham commented Oct 7, 2015

My recent experiment to remove Celluloid has reaped huge rewards. This branch does several things:

  1. Rewrite Sidekiq actors to use raw Threads and data structures.
  2. Make Processor#stats asynchronous, which removes two Redis round trips per job.
  3. Redesign job fetch so each Processor fetches its own job in parallel, this (along with (2)) makes the system much more resilient to higher Redis latency

Processing 100,000 no-op jobs with one process and 25 threads on MRI 2.2:

  • Master: 125 sec
  • Without Celluloid: 57 sec
  • Without Celluloid and Async stats: 20 sec
  • With 1ms of latency, sequential fetch: 7 min
  • With 1ms of latency, parallel fetch: 20 sec

This work would be released as Sidekiq 4.0, with an ETA of Thanksgiving.

@jonhyman
Collaborator
jonhyman commented Oct 7, 2015

😮

@ryansch ryansch commented on the diff Oct 7, 2015
lib/sidekiq/processor.rb
- def initialize(boss)
- @boss = boss
+ def kill(wait=false)
+ # unlike the other actors, terminate does not wait
+ # for the thread to finish because we don't know how
+ # long the job will take to finish. Instead we
+ # provide a `kill` method to call after the shutdown
+ # timeout passes.
+ @thread.raise ::Sidekiq::Shutdown
@ryansch
ryansch Oct 7, 2015 Collaborator

I've read in a number of places that Thread#raise and Thread#kill are unsafe. Has that changed? If so, what would the minimum ruby version be?

@mperham
mperham Oct 7, 2015 Owner

It's not safe but it's as safe as I can make it. Processor#kill is only called when the hard shutdown timeout has passed and the process must exit.

@ryansch
ryansch Oct 7, 2015 Collaborator

Gotcha. The whole building is coming down anyway.

added some commits Oct 7, 2015
@mperham Fix tests
89a1914
@mperham Move fetching into the processor
This removes thread context switching and network delay.
56ea001
@mperham fix merge bug 561d9d4
@mperham Add 1ms of latency to benchmarks 3d80580
@mperham merge load changes
f8e6275
@mperham load changes 862d44b
@mperham Bump to 4.0
f9653e8
@uhoh-itsmaciek

Looks great! Did you by chance measure memory usage as well?

@mperham
Owner
mperham commented Oct 8, 2015

8x less garbage generated!

On Oct 7, 2015, at 22:15, Maciek Sakrejda notifications@github.com wrote:

Looks great. Did you by chance measure memory usage as well?


Reply to this email directly or view it on GitHub.

added some commits Oct 8, 2015
@mperham Cleanup, tests passing
da02fcb
@mperham Fix quick shutdown, more cleanup 4a43865
@mperham Merge branch 'master' into internal_rewrite
f4c29b1
@mperham Pick up concurrent-ruby fix
5e88414
@mperham Fix tests
20fecf5
@tenderlove tenderlove commented on the diff Oct 9, 2015
lib/sidekiq/scheduled.rb
- logger.error ex.message
- logger.error ex.backtrace.first
+ # Shut down this Fetcher instance, will pause until
+ # the thread is dead.
+ def terminate
+ @done = true
+ if @thread
+ t = @thread
+ @thread = nil
+ @queue << 0
+ t.value
+ end
+ end
+
+ def start
+ @thread ||= safe_thread("scheduler") do
@tenderlove
tenderlove Oct 9, 2015

Is start meant to be thread safe? There's a read-check-write race here along with a race in terminate. Also since @thread isn't initialized, it will cause warnings.

@tenderlove
tenderlove Oct 9, 2015

(probably should also raise an exception if @done is true when this method gets called)

@mperham
mperham Oct 9, 2015 Owner

Poller#start should only ever be called once by the main thread when calling Launcher#start, so thread safety is not necessary here. There's definitely room for more defensive design.

@tenderlove
tenderlove Oct 9, 2015

Presumably the allocated thread will just sleep while the queue is empty, so it might be possible to just make start private and call it from initialize. Then you can eliminate the conditional here and in terminate. < /armchairengineering >

@mperham
mperham Oct 9, 2015 Owner

I like to be able to test the object's APIs without a background thread spinning up in the test suite.

@tarcieri
Collaborator
tarcieri commented Oct 9, 2015

Have you considered using a ThreadPoolExecutor? This will give you a very fast and robust implementation on JRuby (now):

https://ruby-concurrency.github.io/concurrent-ruby/Concurrent/ThreadPoolExecutor.html

@mperham
Owner
mperham commented Oct 9, 2015

@tarcieri I don't understand how a TPE would help vs me spinning up threads.

@tarcieri
Collaborator
tarcieri commented Oct 9, 2015

@mperham on JRuby at least, it's backed by a java.util.concurrent.ThreadPoolExecutor and should have both lower thread coordination and memory overhead than anything that can be done in pure Ruby

added some commits Oct 9, 2015
@mperham merge master
1e9e209
@mperham code shuffling and cleanup
8bc677f
@mperham Promote Shutdown exception into main module
3e20710
@mperham Move proctitles into CLI, so they can be accessed in startup events
9fc8267
@mhenrixon
Contributor

Solid! Looks like you were able to delete more code than you added 👍 Anything I need to worry about with the middleware? Couldn't see anything while going over the code.

@mperham
Owner
mperham commented Oct 12, 2015

Most public APIs should be unchanged, including middleware. As of today. :-)

On Oct 12, 2015, at 10:16, Mikael Henriksson notifications@github.com wrote:

Solid! Looks like you were able to delete more code than you added Anything I need to worry about with the middleware? Couldn't see anything while going over the code.


Reply to this email directly or view it on GitHub.

@halorgium
Collaborator

@mperham very great work here. 💃

@mperham Free lifecycle procs so they GC
4fd83bf
@drewblas

@mperham What happens if the status updator/heartbeat thread dies? Will it have status info about jobs that finished but it never sends that info back to Redis? It seems like the buffering of stats creates a hole for data to get lost in a crash, resulting in even more double-run or stuck jobs.

@nviennot

@mperham Glad to see that you are dropping Celluloid :) -- good move!

@mperham Make global stats processing gracefully handle errors
99a3194
@mperham
Owner
mperham commented Oct 14, 2015

@drewblas I'm not sure what you are referring to. If you can be more explicit about a particular failure case, I can explain what should happen.

added some commits Oct 14, 2015
@mperham Remove old poll_interval attribute 9a062a4
@mperham initial 4.0 upgrade notes
789643c
@mperham Formatting
9a88a1f
@Magicdream

@mperham

8x less garbage generated!

Is it because of removed Celluloid?

added some commits Oct 20, 2015
@mperham touch the pool before startup event
6f1e0b0
@mperham Verify Redis 2.8+ on startup d285f19
@mperham Verify server connection pool is large enough for new fetch design 907828c
@mperham Rework notes
879041b
@mperham For safety, flushdb Redis before each test 9736321
@mperham Found and killed the test suite instability.
Manager#processor_died test was spinning up a Processor in the background which ate jobs created in other tests!
58b2696
@mperham
Owner
mperham commented Oct 22, 2015

I did a quick Resque 1.25.2 benchmark for comparison. Time to process 10,000 noop jobs:

1 process, 96 sec
25 processes, 31 sec

Extrapolating to 100,000 jobs would be ~300 seconds. Sidekiq does the same in 20 seconds so Sidekiq's job overhead appears to be 15x less.

@mperham merge master
87df35a
@JustinLove JustinLove referenced this pull request in JustinLove/autoscaler Oct 23, 2015
Merged

Sidekiq is moving to threads #55

added some commits Oct 23, 2015
@mperham Merge branch 'master' into internal_rewrite
061bcde
@mperham Remove deprecated method 2c81c48
@mperham cleanup 18a513c
@mperham Change old "message" terminology to "job" 93dddd7
@mperham Prerelease version
a04e9db
@mperham Sidekiq Pro 3.0 release notes
d239027
@mperham Merge branch 'master' into internal_rewrite
eedd34b
added some commits Oct 28, 2015
@mperham ent changes
a1270a2
@mperham Merge branch 'master' into internal_rewrite
7149c9b
@mperham merge master
3318b89
@mperham 4.0 changes
6ad6a3a
@mperham mperham merged commit 6ad6a3a into master Oct 28, 2015

0 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
continuous-integration/travis-ci/push The Travis CI build is in progress
Details
@halorgium
Collaborator

👍

@narapon
narapon commented Oct 28, 2015

Excited to try this, good work!
On 29 Oct 2015 03:05, "Tim Carey-Smith" notifications@github.com wrote:

[image: 👍]


Reply to this email directly or view it on GitHub
#2593 (comment).

@hecbuma
hecbuma commented Oct 29, 2015

pretty nice 👍

@esbanarango

@mperham Can we start using v4.0 now?

Thank you for all this work!! 👍

@mperham mperham deleted the internal_rewrite branch Oct 29, 2015
@take-five

@mperham I guess Processor:: namespace missed here and below

@fedenusy fedenusy referenced this pull request in alloy/lowdown May 12, 2016
Closed

Battle Testing #11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment