Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread leak with the `biz` gem #5671

Open
guizmaii opened this Issue Mar 29, 2019 · 7 comments

Comments

Projects
None yet
2 participants
@guizmaii
Copy link

guizmaii commented Mar 29, 2019

Environment

➜  jruby -v
jruby 9.2.6.0 (2.5.3) 2019-02-11 15ba00b OpenJDK 64-Bit Server VM 11.0.2+9 on 11.0.2+9 +jit [darwin-x86_64]
➜  uname -a
Darwin Juless-MacBook-Pro 18.2.0 Darwin Kernel Version 18.2.0: Thu Dec 20 20:46:53 PST 2018; root:xnu-4903.241.1~1/RELEASE_X86_64 x86_64

biz (1.8.2) and Rails 4.2.11.1

This code: https://github.com/zendesk/biz/blob/master/lib/biz/periods/linear.rb#L23-L37
creates an Enumerator which will instantiate a thread (actually a Fiber backed by a thread) and that fiber is never released.

@headius

This comment has been minimized.

Copy link
Member

headius commented Mar 29, 2019

I did a little experimenting with brutally creating fibers and Enumerator#next threads using a snippit of code similar to this:

Fiber:

loop {
  e = Fiber.new {
    1.times { Fiber.yield 1 }
  }
  e.resume
  e.resume
  JRuby.gc
}

Enumerator:

loop {
  e = Enumerator.new {|y|
    1.times { y << 1 }
  }
  e.next
  JRuby.gc
}

In the Fiber case, we "resume" twice because the first time starts the fiber and yields "1", which pauses the fiber. The additional "resume" allows the fiber to terminate. Running with this code yields no thread leaks...only one worker thread is ever used because the fibers finish and return the thread to its pool.

If I remove one or both resume calls, the fibers do still get cleaned up, but this loop appears to be too fast for the JVM to keep up with. In a normal application I'd expect it to keep up better, but it's not advisable to spin up scads of fibers very rapidly on JRuby.

The Enumerator#next version only calls next once, because a subsequent all would raise StopIteration and that kills the loop. With the code as shown, it's just like the case of a Fiber that has not been allowed to completely drain: the old enums are getting collected and the threads are getting returned to the pool, but not rapidly enough to keep up with this benchmark.

The code @guizmaii linked in zendesk/biz uses the generic Enumerator.new form to create a virtual enumerable. This enumerator is stored in the "Linear" object, but it's not clear what the lifespan of that object or its referrents might be. An analysis with Eclipse Memory Analyzer could only find thread roots (i.e. a local variable on a live stack somewhere) for these enumerator objects.

There could be a bug in how we're handling the reference chain from this sort of enumerator to its "looper".

@headius

This comment has been minimized.

Copy link
Member

headius commented Mar 29, 2019

My comment above is a long way of saying that if this weren't eventually cleaning up those enumerators, it should blow up pretty quickly if this logic is hit frequently. The fact that you could run it in production at all makes me think this may simply be a problem of the enumerators being left dangling, rather than a fixable bug in JRuby. Still investigating though.

@headius

This comment has been minimized.

Copy link
Member

headius commented Mar 29, 2019

This code is also related: https://github.com/zendesk/biz/blob/b10ccfe55b61ef586907d36bbbfd4b7ede9d29b8/lib/biz/periods/abstract.rb#L26-L33

The logic here takes this special hand-made enumerator, makes it lazy, stacks a couple levels of logic on that, and then returns the still-uninitiated enumerator. Anywhere along this chain it could be introducing a leak, possibly in the lazy enumerator logic.

@headius

This comment has been minimized.

Copy link
Member

headius commented Mar 29, 2019

Here's example code for a lazy enumerator that does not use next, and fully runs the chain of transforms:

loop {
  e = (1..5).each.lazy.map{|i| i+1}.map{|i| i-1}.to_a
}

This maintains a steady number of threads and does not leak.

Replace to_s with next and it dies very quickly of thread starvation.

Another version that does use next but completely drains the enumerator works properly. The relevant code:

...
while e.next; end rescue nil
...

The rescue is for the final StopIteration that indicates the enumerator is done, which means we'll release the thread.

@headius

This comment has been minimized.

Copy link
Member

headius commented Mar 29, 2019

Circling back to the "biz" library, the following code blows up very quickly:

loop { Biz.periods.on(Date.today).first }

Biz.periods is the lazy enumerator I showed in #5671 (comment), which has at least three threads associated with it for map, flat_map, and the original yielder loop. The thread is started, a single result is retrieved, and then the whole shebang is walked away from. It eventually does GC (adding JRuby.gc does keep it running for a while) but this example is creating them way too fast for the GC and finalizer logic to keep up.

headius added a commit to headius/jruby that referenced this issue Mar 30, 2019

Use pure-Ruby Fiber-based Enumerator#next.
This code has been passed around multiple implementations as a
pure-Ruby version of Enumerator#next that just uses fibers. We
have had many issues with the separate fiber-like impl we used
for Enumerator#next, and the logical path forward is to unify that
with Fiber via this code.

There are a few caveats to this migration:

* Three new introduced failures appear to be shared with other
  impls using this logic (rbx, truffleruby).
* This commit breaks the fast-path Array logic for Enumerator#next
  so it's always going through a thread. However the protocol here
  would allow us to implement #to_generator on any object, which
  should make it possible to short-circuit more easily (in Ruby!).
* I have not measured performance. However the logic surrounding
  the Fiber calls should not be particularly heavy.
* I have not vetted the cleanup side of this, except to confirm
  that this implementation seems about as good at jruby#5671
  in that it does appear to GC, but for a high-churn benchmark it
  eventually collapses without explicit GCs. So...no worse?

This code is modified from TruffleRuby's copy of Rubinius's code,
with Rubinius's license being very liberal and TruffleRuby's being
identical to JRuby's.
@headius

This comment has been minimized.

Copy link
Member

headius commented Mar 30, 2019

@guizmaii I have created #5672 which eliminates the duplicate "fiber" implementation that we maintained solely for Enumerable, but I'm not sure it mitigates your issue. If you're able to test that next week with a HEAD build of JRuby, that would be great.

At the moment I'm leaning toward this being a systemic problem with the extensive use of Enumerator.new in the biz gem; I do not know how to fix it yet.

@headius

This comment has been minimized.

Copy link
Member

headius commented Apr 3, 2019

Ok, so here's the status as of today.

Pure Ruby Enumerator#next

This appears to be mostly working, but one of our tests for leaking threads got flagged. More work is required. @guizmaii reported that this does not really improve their situation, which is what I expected.

Biz library use of Enumerators

After some review of the library, it appears that most of the users of Enumerator do not require "next" behavior. The leaking thread showing up in @guizmaii's app appear to mostly be from a secondary Enumerator used as a factory.

A script created by @guizmaii or his team that illustrates a typical case that's leaking:

require 'biz'

class BizThreadLeak
 def self.biz_schedule
  Biz::Schedule.new do |config|
   config.hours = { mon: { '00:00' => '00:01' } }
   config.time_zone = "Europe/Paris"
  end
 end

 def self.time_segments(earliest_start, latest_end)
  biz_schedule
   .periods
   .after(earliest_start)
   .timeline
   .until(latest_end)
   .to_a
 end
end

DAYS_IN_SECONDS = 86400 # 24 * 60 * 60

earliest_start = Time.now
latest_end = Time.now + (15 * DAYS_IN_SECONDS)

loop do
 res = BizThreadLeak.time_segments(earliest_start, latest_end)
 puts "res is : #{res}"
 sleep 1
end

As you can see here, the time_segments method does attempt to drain the Linear biz enumerator that it drives, by calling to_a at the end. In my testing, to_a does properly run out and terminate threaded enumerators. However I'm not even sure this enumerator gets a thread; it is run entirely using internal iteration down to to_a, which just does a normal, Enumerator-draining each loop.

The leaking enumerator appears to be the "periods" used by this enumerator as part of a flat mapping.

The periods call, creates a Biz::Periods::Proxy: https://github.com/zendesk/biz/blob/master/lib/biz/periods/proxy.rb

The after call triggers that proxy to use an Biz::Periods::After: https://github.com/zendesk/biz/blob/master/lib/biz/periods/after.rb

This in turn has the following stack leading into Linear#linear_periods:

/Users/headius/projects/jruby/lib/ruby/gems/shared/gems/biz-1.8.2/lib/biz/periods/linear.rb:24:in `linear_periods'
/Users/headius/projects/jruby/lib/ruby/gems/shared/gems/biz-1.8.2/lib/biz/periods/linear.rb:13:in `initialize'
/Users/headius/projects/jruby/lib/ruby/gems/shared/gems/biz-1.8.2/lib/biz/periods/abstract.rb:28:in `periods'
/Users/headius/projects/jruby/lib/ruby/gems/shared/gems/biz-1.8.2/lib/biz/periods/abstract.rb:11:in `initialize'
/Users/headius/projects/jruby/lib/ruby/gems/shared/gems/biz-1.8.2/lib/biz/periods/after.rb:12:in `initialize'
/Users/headius/projects/jruby/lib/ruby/gems/shared/gems/biz-1.8.2/lib/biz/periods/proxy.rb:12:in `after'

And that code drives the periods and shifts enumerators via next: https://github.com/zendesk/biz/blob/master/lib/biz/periods/linear.rb#L23-L37

It appears to be those inner enumerators that are leaking. Most of the period generators are set up to run for either an infinite range, or a range beginning or ending at the ends of the universe. They do not ever get drained completely, nor are they ever reset or rewound, which ultimately causes them to take the slow path for cleanup.

I think it would be helpful to get the Biz folks involved. It may be possible for us to make all these intermediate enumerators behave better by cleaning up partial results.

headius added a commit to headius/jruby that referenced this issue Apr 6, 2019

Provide better thread name for enumerator/fiber workers.
This patch sets the name of the threads used to drive fibers (or
the similar Enumerator#next coroutines) in order to more easily
audit and debug those threads. Relates to jruby#5671 and fixes jruby#5670.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.