Fibers don't seem to reuse threads #1443

Closed
cheald opened this Issue Jan 26, 2014 · 7 comments

@cheald
Contributor
cheald commented Jan 26, 2014

I'm working on speeding up a Celluloid app, and I've noticed that JRuby spends an awful lot of time in ThreadFiber.createThread(). JRuby is supposed to use a thread pool for fibers, but despite the live thread count staying roughly constant, the total created-thread count balloons rapidly, with steadily incrementing thread IDs.

This is jruby-1.7.9 and the behavior happens on both openjdk-1.6.0 and openjdk-1.7.0.

Minimum reproduction case: https://gist.github.com/cheald/8626163
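
Simplified, the loop looks something like this (the gist above has the real script; on JRuby 1.7, Thread.current inside a fiber reflects the fiber's backing thread, which is what makes the ids visible):

    count, start = 0, Time.now
    loop do
      inner_thread = Fiber.new { Thread.current }.resume
      count += 1
      if count % 1000 == 0
        puts format("%s - %.1f/s", inner_thread.inspect, count / (Time.now - start))
      end
    end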

Output is something like:

2360 <Thread 1978280> - 99.1/s
2388 <Thread 1978286> - 99.1/s
2380 <Thread 1978294> - 99.1/s
2370 <Thread 1978298> - 99.1/s
2400 <Thread 1978304> - 99.1/s
2360 <Thread 1978312> - 99.1/s
2388 <Thread 1978316> - 99.1/s
2380 <Thread 1978324> - 99.1/s
2370 <Thread 1978330> - 99.1/s
2400 <Thread 1978334> - 99.1/s
2360 <Thread 1978342> - 99.1/s
2388 <Thread 1978346> - 99.1/s
2380 <Thread 1978354> - 99.1/s
2370 <Thread 1978360> - 99.1/s
2400 <Thread 1978366> - 99.1/s
2360 <Thread 1978372> - 99.1/s
2388 <Thread 1978376> - 99.1/s
2380 <Thread 1978384> - 99.1/s
2370 <Thread 1978390> - 99.1/s
2400 <Thread 1978396> - 99.1/s
2360 <Thread 1978400> - 99.1/s
2388 <Thread 1978406> - 99.1/s
2380 <Thread 1978414> - 99.1/s
2370 <Thread 1978420> - 99.1/s
2400 <Thread 1978426> - 99.1/s
2360 <Thread 1978430> - 99.1/s
2388 <Thread 1978436> - 99.1/s
2380 <Thread 1978444> - 99.1/s
2370 <Thread 1978450> - 99.1/s

You'll notice that the thread IDs are monotonically increasing. This happens consistently.

VisualVM CPU sampling shows that a tremendous amount of time is being spent in createThread():

[screenshot: VisualVM CPU sampling results]

VM stats show a ton of threads created, even though the live threads are stable:

[screenshot: Java Monitoring & Management Console thread stats]
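
(For anyone reproducing this without attaching JConsole: the same counters are reachable from inside JRuby via the standard JMX thread bean. This snippet isn't part of the gist, it's just for convenience.)

    require 'java'

    bean = java.lang.management.ManagementFactory.getThreadMXBean
    puts "live threads:          #{bean.getThreadCount}"
    puts "total threads started: #{bean.getTotalStartedThreadCount}"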

This feels an awful lot like Fibers are not correctly using a thread pool, and are instead creating a new short-lived thread for every fiber, which results in an inordinate amount of overhead.

@cheald
Contributor
cheald commented Jan 26, 2014

The test case provided here doesn't cause the monotonically increasing thread issue under jruby-1.7.4, but it does happen under 1.7.5. Hopefully that helps narrow the search!

@cheald
Contributor
cheald commented Jan 28, 2014

Per the referenced Celluloid issue, I've implemented a fiber pool in Celluloid that improved the performance of my app (in terms of jobs/sec) by ~35x; it completely eliminated the thread-creation thrashing. That seems like pretty conclusive evidence that there's an issue with JRuby's fiber implementation.
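
The pool I ended up with is more involved than this (it has to cope with tasks that themselves suspend), but the core idea fits in a few lines. This is a simplified sketch with illustrative names, not the actual Celluloid code, and it only works when all fibers are driven from a single thread:

    require 'fiber' # for Fiber.current on 1.9-era Rubies

    class FiberPool
      def initialize(max_idle = 32)
        @idle = []
        @max_idle = max_idle
      end

      # Run the block on a pooled fiber, creating one only if none is idle.
      def run(&block)
        fiber = @idle.pop || create_fiber
        fiber.resume(block)
      end

      private

      def create_fiber
        Fiber.new do |job|
          loop do
            job.call
            break if @idle.size >= @max_idle # pool full: let this fiber finish
            @idle << Fiber.current           # park for reuse...
            job = Fiber.yield                # ...until the next run(&block)
          end
        end
      end
    end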

One issue I've run into in the pooled implementation is that if you end up with a suspended fiber that belongs to a dead thread, there's no way to resume the fiber and let it finish executing. I think the fix here is probably for JRuby to go ahead and terminate any fibers left outstanding when their owning thread terminates. Otherwise, you end up with Java threads that seem to never go away.
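
A stripped-down demonstration of the stuck-fiber scenario (on MRI, resuming a fiber across threads raises FiberError; on JRuby 1.7, per the above, the fiber's backing native thread just lingers):

    fiber = nil
    t = Thread.new do
      fiber = Fiber.new { Fiber.yield; :done }
      fiber.resume # run to the first yield, leaving the fiber suspended
    end
    t.join         # the owning thread is now dead

    begin
      fiber.resume # there is no safe way to finish this fiber
    rescue => e
      puts "can't resume: #{e.class}: #{e.message}"
    end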

@headius
Member
headius commented Jan 28, 2014

I have a patch that makes fibers use a thread pool built into JRuby again. I'm not sure whether it should improve perf more than this, but here are my numbers after about a minute running your script:

2036 <Thread 20512> - 92.5/s
2032 <Thread 20514> - 92.6/s
2048 <Thread 20516> - 92.6/s
2044 <Thread 20518> - 92.6/s
2040 <Thread 20520> - 92.5/s
2036 <Thread 20522> - 92.5/s
2032 <Thread 20524> - 92.6/s
2048 <Thread 20526> - 92.6/s
2044 <Thread 20528> - 92.6/s
2040 <Thread 20530> - 92.5/s
2036 <Thread 20532> - 92.5/s
2032 <Thread 20534> - 92.6/s
2048 <Thread 20536> - 92.6/s
2044 <Thread 20538> - 92.6/s
2040 <Thread 20540> - 92.5/s
2036 <Thread 20542> - 92.5/s
2032 <Thread 20544> - 92.6/s

Patch is here: https://gist.github.com/headius/40ed4a4435e0a4e0ba44

Note that there are several levels of reuse possible for our native-threaded fibers. This patch implements only the first level, reusing native threads.

  1. Reuse native threads. Avoids the overhead of spinning up native threads and the load on the system and JVM if they start to pile up. (A plain-Ruby sketch of this pattern follows this comment.)
  2. Reuse ThreadContext instances. ThreadContext maintains per-thread runtime state like frames, heap-scoped variables, etc. Reusing would require doing some scrubbing.
  3. Reuse RubyThread instances. This is what you're seeing increase above. If we do (1), there's not a 1:1 correlation between RubyThread and native thread. Some scrubbing is necessary here as well, e.g. thread-local variables.
  4. Reuse Fiber instances. Unsure how much this would gain us, but it would avoid recreating the fiber object and the control structures that go with it.

Try out my patch and see how much it reduces the thrashing. This is not a final patch; we should probably have a separate thread pool for fibers so we can label them appropriately.
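
To make (1) concrete, here is the shape of the handoff as a generic worker-reuse pattern in plain Ruby. This is not the actual patch (which lives in Java inside ThreadFiber); it just illustrates parking idle workers instead of calling Thread.new per task:

    require 'thread'

    IDLE = Queue.new

    def with_pooled_thread(&work)
      mailbox = begin
        IDLE.pop(true)  # grab an idle worker's mailbox if one exists
      rescue ThreadError
        q = Queue.new   # none idle: spin up one new worker
        Thread.new do
          loop do
            q.pop.call  # run a job...
            IDLE << q   # ...then go back on the idle list
          end
        end
        q
      end
      mailbox << work
    end

After a brief warm-up, Thread.current.object_id inside the work blocks starts repeating instead of growing.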

@headius
Member
headius commented Jan 28, 2014

A small clarification: JRuby 1.7.4 and earlier did reuse the native thread, ThreadContext, and RubyThread for rapid-fire creation of fibers, but did not do any of the scrubbing necessary to ensure data from one Fiber did not leak into the next. My patch re-enables reuse of the native thread, but explicitly dumps the ThreadContext and RubyThread. That's why you still see the Ruby Thread object_id increasing with my patch.
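
You can watch both effects from Ruby. Assuming, as the output above suggests, that Thread.current inside a fiber reflects the fiber's backing thread, something like this should show the native thread name repeating under the patch while the Ruby Thread object_id keeps climbing:

    require 'java'

    5.times do
      ruby_id, native_name = Fiber.new do
        [Thread.current.object_id, java.lang.Thread.currentThread.getName]
      end.resume
      puts "RubyThread #{ruby_id} on native thread #{native_name}"
    end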

@cheald
Contributor
cheald commented Jan 28, 2014

Got it. Cool, this looks like it should work nicely. I'll give this a shot and let you know how it works.

@cheald
Contributor
cheald commented Jan 30, 2014

Using your patch and this crappy benchmark I'm getting the following:

1.7.10      1.7.10-patched   1.7.0-fiber-pool   1.7.0-patched-fiber-pool
26.57 sec   12.09 sec        9.32 sec           9.01 sec

The thread pooling definitely makes a big difference, but it looks like fiber pooling is still a big win. The combination of the two is even better.
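
For reference, a throwaway fiber-churn loop in the same spirit as the benchmark (not the exact script; absolute numbers will vary by machine):

    require 'benchmark'

    elapsed = Benchmark.realtime do
      50_000.times { Fiber.new { :noop }.resume } # create, run, and retire a fiber
    end
    puts format("%.2f sec", elapsed)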

@headius
Member
headius commented Jan 30, 2014

I will look at getting this patch in place and tested for 1.7.11.

@headius headius closed this in e10e419 Feb 13, 2014
@enebo enebo added this to the JRuby 1.7.11 milestone Feb 21, 2014