Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Anonymous class-type being raised by Timeout/Queue in both 9x and 1x. #3834

Closed
digitalextremist opened this Issue Apr 28, 2016 · 18 comments

Comments

Projects
None yet
3 participants
@digitalextremist
Copy link
Contributor

digitalextremist commented Apr 28, 2016

  • Linux 3.13.0-48-generic #80-Ubuntu SMP x86_64

Both...

  • jruby 1.7.22 (2.0.0p598) 2016-03-25 c28f492 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03 +jit [linux-amd64]
  • jruby 9.0.5.0 (2.2.3) 2016-03-30 7bee00d Java HotSpot(TM) 64-Bit Server VM 25.77-b03 on 1.8.0_77-b03 +jit [linux-amd64]

Working with Celluloid, and the http.rb gem, using Timeout.timeout on HTTP calls; in specific cases I can always crash an actor because I am unable to catch the exception raised.

I expected Timeout.timeout to raise Timeout::Error ...not this:

E, [2016-04-28T05:11:24.836000 #8088] ERROR -- : Actor crashed!
#<Class:0x915809c>: execution expired
  org/jruby/ext/thread/Queue.java:152:in `pop'
  org/jruby/ext/thread/Queue.java:127:in `pop'
  org/jruby/ext/thread/SizedQueue.java:111:in `pop'
  org/jruby/ext/fiber/ThreadFiber.java:183:in `yield'
  org/jruby/ext/fiber/ThreadFiber.java:166:in `yield'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:21:in `signal'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/task.rb:72:in `suspend'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/task.rb:11:in `suspend'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid.rb:139:in `suspend'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/call/sync.rb:41:in `response'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/call/sync.rb:45:in `value'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/proxy/sync.rb:22:in `method_missing'
  /usr/local/rvm/rubies/jruby-1.7.22/lib/ruby/1.9/forwardable.rb:201:in `get_json'
  /home/filter/filter/preprocessor/analyzer/tester.rb:60:in `grey_white?'
  /home/filter/filter/preprocessor/analyzer.rb:121:in `process'
  org/jruby/ext/timeout/Timeout.java:115:in `timeout'
  /home/filter/filter/preprocessor/analyzer.rb:52:in `process'
  org/jruby/RubyKernel.java:1930:in `public_send'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/call/async.rb:7:in `dispatch'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/cell.rb:50:in `dispatch'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/cell.rb:76:in `task'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/actor.rb:339:in `task'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/task.rb:44:in `initialize'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `create'
E, [2016-04-28T05:11:24.946000 #8088] ERROR -- : thread crashed
#<Class:0x915809c>: execution expired
  org/jruby/ext/thread/Queue.java:152:in `pop'
  org/jruby/ext/thread/Queue.java:127:in `pop'
  org/jruby/ext/thread/SizedQueue.java:111:in `pop'
  org/jruby/ext/fiber/ThreadFiber.java:183:in `yield'
  org/jruby/ext/fiber/ThreadFiber.java:166:in `yield'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:21:in `signal'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/task.rb:72:in `suspend'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/task.rb:11:in `suspend'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid.rb:139:in `suspend'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/call/sync.rb:41:in `response'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/call/sync.rb:45:in `value'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/proxy/sync.rb:22:in `method_missing'
  /usr/local/rvm/rubies/jruby-1.7.22/lib/ruby/1.9/forwardable.rb:201:in `get_json'
  /home/filter/filter/preprocessor/analyzer/tester.rb:60:in `grey_white?'
  /home/filter/filter/preprocessor/analyzer.rb:121:in `process'
  org/jruby/ext/timeout/Timeout.java:115:in `timeout'
  /home/filter/filter/preprocessor/analyzer.rb:52:in `process'
  org/jruby/RubyKernel.java:1930:in `public_send'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/call/async.rb:7:in `dispatch'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/cell.rb:50:in `dispatch'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/cell.rb:76:in `task'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/actor.rb:339:in `task'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/task.rb:44:in `initialize'
  /usr/local/rvm/gems/jruby-1.7.22/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `create'

You can see by the message that it's a timeout, but the class which is raised bearing that message and backtrace is not Timeout::Error.

I've tried everything to catch that class type. Including using this sort of rescue:

rescue StandardError, Exception, Timeout::Error, Celluloid::TaskTerminated => ex

... which ought to be redundant in the extreme, but the class still escapes capture.

Unfortunately, to post a reproducible piece of code I would need to extract some from client code, but also replicate conditions that I've isolated in a certain context:

This is no normal HTTP timeout, though it does happen in that case at times -- less reliably. But it always happens when we take all nodes out of a load balancer, and then hit the load balancer with a GET request.

headius added a commit that referenced this issue Apr 28, 2016

When duping Struct accessor methods, ensure they actually dup.
Partial fix for #3834

The main issue here is that we frequently return this for
DynamicMethod#dup, rather than return a new instance, and then
this gets modified by code expecting it to be something different.
This breaks reflective code because many reflective methods will
expect the methods in a module's method table to say their
implementationClass is that module.

The case in #3834 broke because the accessor methods in Struct
have this "fake" dup logic, so when the struct class itself was
duplicated, the methods got modified implementationClass = clone
and seemed to "disappear" from the original.

This is only a partial fix because it just fixes those accessors
in Struct. The proper fix would be to make sure DynamicMethod#dup
returns a real new object for all DynamicMethod (except "dead"
ones like UndefinedMethod). This may also fix (or make easier to
fix) issues like #2988, where a dup'ed module's methods do not
reflect their new implementationClass.

Specs pending.
@digitalextremist

This comment has been minimized.

Copy link
Contributor Author

digitalextremist commented Apr 28, 2016

@headius

This comment has been minimized.

Copy link
Member

headius commented Apr 28, 2016

@digitalextremist Ok, yeah...the exception seems to be raised by someone else via our thread eventing. That's harder to investigate since we don't have logging of cross-thread exception raises right now.

The exception itself looks like it's probably the anonymous exception Timeout throws.

It's unusual to me that it's bubbling out of a yield. Timing out in yield would mean that fiber A resumes fiber B, fiber B yields back to A, and A times out and bubbles this exception back to B. That's an unusual direction for exceptions to flow...generally they only come out of resume, not out of yield.

@headius

This comment has been minimized.

Copy link
Member

headius commented Apr 28, 2016

Ok, in order to continue on this I think we need a reproduction, or @digitalextremist will need to do some debugging locally.

A patch to start investigating this would be to add logging to ThreadFiber.exchangeWithFiber's catch block to print out all exceptions. I suspect it is propagating something the wrong direction.

@headius headius added this to the JRuby 9.1.1.0 milestone Apr 28, 2016

@headius

This comment has been minimized.

Copy link
Member

headius commented Apr 28, 2016

I remember there used to be a Thread-based mode for Celluloid too...perhaps it still exists and we can try it?

@headius

This comment has been minimized.

Copy link
Member

headius commented Apr 28, 2016

Another place to debug would be RubyThread.receiveMail. This method is called when thread A wants to deliver a kill or raise event to thread B, and it runs in the context of thread A. We could see who is triggering this cross-thread (cross-fiber) exception then.

@digitalextremist

This comment has been minimized.

Copy link
Contributor Author

digitalextremist commented Apr 28, 2016

@headius I was able to work around this for now by cutting out Timeout.timeout and using the HTTP timeouts :per_operation which I believe triggers a timeout on a socket-level, rather than using the infamous Timeout API.

Either way I remember the situation clearly and will try to come back with a repro that works. I think the key here in reproducing it reliably would be to have a TCPServer in the repro itself that opens a connection but then does nothing, so there's not an immediate error of connection refused or similar.

Thank you very much for looking through this. Our interactions caused me to think of the workaround, brought up things in Celluloid that are fruitful. Like I said, I'm definitely down to ship beer :)

@headius

This comment has been minimized.

Copy link
Member

headius commented May 11, 2016

Still a bit of a mystery and unlikely to be investigated for the quick 9.1.1 release. Punting.

@headius headius modified the milestones: JRuby 9.1.2.0, JRuby 9.1.1.0 May 11, 2016

@digitalextremist

This comment has been minimized.

Copy link
Contributor Author

digitalextremist commented May 13, 2016

@headius this resurfaced with Rubinius, except the class did bubble up in that case.

In that incarnation, it appears to be Mutex#synchronize being bombarded... timing out at this specific line:

Which makes sense, because here is what complains:

@headius

This comment has been minimized.

Copy link
Member

headius commented May 13, 2016

Perhaps I missed it, but what was the actual exception class that bubbled out?

@digitalextremist

This comment has been minimized.

Copy link
Contributor Author

digitalextremist commented May 13, 2016

Pardon me, it's Timeout::Error

@headius

This comment has been minimized.

Copy link
Member

headius commented May 13, 2016

Ok...so is this just getting through because nobody expects to rescue it? Is there anything we need to fix on the JRuby side? I assume since the same issue appeared on Rubinius, it may be an exception-handling gap in Celluloid...

@digitalextremist

This comment has been minimized.

Copy link
Contributor Author

digitalextremist commented May 13, 2016

I'm troubleshooting the Celluloid end -- but the fact that the class being bounced on the jRuby side is an anonymous class still persists; although yes, something is falling through on the Celluloid side.

@headius

This comment has been minimized.

Copy link
Member

headius commented May 16, 2016

Ok, I can at least investigate why the class is anonymous.

@headius

This comment has been minimized.

Copy link
Member

headius commented May 17, 2016

It looks like the anonymous exception code is very old...no longer present in MRI's timeout.rb after August 2013. I will do a pass over timeout.rb and our version and try to ensure they match.

@headius

This comment has been minimized.

Copy link
Member

headius commented May 17, 2016

Ok, so here's the deal.

If you provide a specific exception type to Timeout::timout, it will raise a new instance of that exception when the timer runs out. If you do not provide an error. It eventually raises Timeout::Error. I say eventually, because MRI's timeout.rb has used various other exceptions to actually unroll the timeout-protected code back to the code in timeout.rb that decides how to propagate the exception.

At one point, it did Class.new(Exception) every time you called for a timeout guard. This is the anonymous exception we emulate currently (inside an exception type called AnonymousException but with its name removed to match MRI.

Later, they realized the high cost of creating a new class every time and instead created a single ExitException type they used to unroll.

And most recently it appears they now simply raise Timeout::Error right away but post-process it a bit to scrub out the backtrace lines related to Timeout::timeout.

The original intent was that this should be an exception users never see and never rescue so it propagates all the way out to the timeout call.

Now, the fact that you saw this exception makes me think we're not properly handling it when it bubbles out, and it's continuing to propagate into user code. That could be a bug, or more likely it's a problem with the asynchronous nature of Timeout causing the anon exception to be raised outside the timeout call.

I will do some experimenting and see if I can get this to reproduce based on my theory.

headius added a commit to headius/jruby that referenced this issue May 17, 2016

Test out MRI's timeout.rb unmodified to fix jruby#3834.
Our native Timeout has not been updated in a long time, and for

* A separate anonymous exception is not used to trigger the
  timeout anymore; instead, they use catch/throw.
* The exception message string is pre-allocated and frozen.
* Various other miscellaneous changes.

This version has *not* been tested for performance, and it will
initially be much slower than our scheduled executor-based version
due to it always spinning up a timeout thread. But we can see how
it goes in test suites and then evaluate how to move forward.

Note that the change to test/jruby/test_timeout.rb was to bring
behavior in line with MRI 2.3.
@headius

This comment has been minimized.

Copy link
Member

headius commented May 17, 2016

I have pushed a new_timeout branch to my fork (which reverts us back to using MRI's timeout.rb) and opened #3895 to test it out. If you get a chance, you could try to reproduce your issues on a build from that branch.

I doubt will stick with unmodified timeout.rb since it has a few big performance issues, but this will at least let us know how well it integrates into the rest of the system.

headius added a commit to headius/jruby that referenced this issue May 18, 2016

Test out MRI's timeout.rb unmodified to fix jruby#3834.
Our native Timeout has not been updated in a long time, and for

* A separate anonymous exception is not used to trigger the
  timeout anymore; instead, they use catch/throw.
* The exception message string is pre-allocated and frozen.
* Various other miscellaneous changes.

This version has *not* been tested for performance, and it will
initially be much slower than our scheduled executor-based version
due to it always spinning up a timeout thread. But we can see how
it goes in test suites and then evaluate how to move forward.

Note that the change to test/jruby/test_timeout.rb was to bring
behavior in line with MRI 2.3.

headius added a commit to headius/jruby that referenced this issue May 18, 2016

Test out MRI's timeout.rb unmodified to fix jruby#3834.
Our native Timeout has not been updated in a long time, and for

* A separate anonymous exception is not used to trigger the
  timeout anymore; instead, they use catch/throw.
* The exception message string is pre-allocated and frozen.
* Various other miscellaneous changes.

This version has *not* been tested for performance, and it will
initially be much slower than our scheduled executor-based version
due to it always spinning up a timeout thread. But we can see how
it goes in test suites and then evaluate how to move forward.

Note that the change to test/jruby/test_timeout.rb was to bring
behavior in line with MRI 2.3.

headius added a commit to headius/jruby that referenced this issue May 20, 2016

Test out MRI's timeout.rb unmodified to fix jruby#3834.
Our native Timeout has not been updated in a long time, and for

* A separate anonymous exception is not used to trigger the
  timeout anymore; instead, they use catch/throw.
* The exception message string is pre-allocated and frozen.
* Various other miscellaneous changes.

This version has *not* been tested for performance, and it will
initially be much slower than our scheduled executor-based version
due to it always spinning up a timeout thread. But we can see how
it goes in test suites and then evaluate how to move forward.

Note that the change to test/jruby/test_timeout.rb was to bring
behavior in line with MRI 2.3.

@enebo enebo modified the milestones: JRuby 9.1.2.0, JRuby 9.1.3.0 May 23, 2016

@enebo enebo modified the milestones: JRuby 9.1.3.0, JRuby 9.1.2.0 May 23, 2016

@headius

This comment has been minimized.

Copy link
Member

headius commented Aug 22, 2016

Workarounds include not using timeout and using it with an explicit exception, so I think we can punt any remaining rewrite of timeout.rb (or re-incorporation of MRI's version, with bottlenecks fixed) to 9.1.4.0.

@headius headius modified the milestones: JRuby 9.1.4.0, JRuby 9.1.3.0 Aug 22, 2016

@headius headius modified the milestones: JRuby 9.2.0.0, JRuby 9.1.6.0 Nov 8, 2016

@headius headius modified the milestones: JRuby 9.2.0.0, JRuby 9.2.1.0 May 15, 2018

headius added a commit to headius/jruby that referenced this issue Oct 11, 2018

Realign timeout with MRI.
Fixes jruby#3834.

* Use Timeout::Error instead of leaky anonymous exception.
* Move executor to instance data from static.
* Clean up redundant, outdated code.
* Move sharable pieces back into timeout.rb.
* Eliminate builtin and load ext from timeout.rb.
@headius

This comment has been minimized.

Copy link
Member

headius commented Oct 11, 2018

I've pushed an updated implementation of Timeout that shares more code with MRI and always uses the same Timeout::Error exception rather than our weird anonymous one. I believe this will resolve this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.