OOM due to unbounded rescuePCs growth #4865

Closed
andrewdotn opened this Issue Nov 23, 2017 · 16 comments


andrewdotn commented Nov 23, 2017

Environment

Provide at least:

  • JRuby version (jruby -v) and command line (flags, JRUBY_OPTS, etc)

      $ jruby -v
      jruby 9.1.14.0 (2.3.3) 2017-11-08 2176f24 Java HotSpot(TM) 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [darwin-x86_64]
    
  • Operating system and platform (e.g. uname -a)

    Observed on both CentOS 7 and Mac OS 10.13.1

Actual Behavior

The following short program, extracted from a much larger application, runs out of memory due to unbounded growth in StartupInterpreterEngine.rescuePCs.

$ cat foo.rb 
require_relative 'bar'

run(Queue.new)
$ cat bar.rb 
def run(work_queue)
  while true
    begin
      work = work_queue.pop(true)
    rescue ThreadError => e
      next
    end
  end
end
$ time JRUBY_OPTS="-w -J-Xmx32m" jruby foo.rb
Error: Your application used more memory than the safety cap of 32M.
Specify -J-Xmx####M to increase it (#### = cap size in MB).
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3210)
	at java.util.Arrays.copyOf(Arrays.java:3181)
	at java.util.Vector.grow(Vector.java:266)
	at java.util.Vector.ensureCapacityHelper(Vector.java:246)
	at java.util.Vector.addElement(Vector.java:620)
	at java.util.Stack.push(Stack.java:67)
	at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:101)
	at org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:84)
	at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:179)
	at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:165)
	at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:200)
	at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:318)
	at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:155)
	at foo.invokeOther3:run(foo.rb:3)
	at foo.RUBY$script(foo.rb:3)
	at java.lang.invoke.LambdaForm$DMH/989110044.invokeStatic_L7_L(LambdaForm$DMH)
	at java.lang.invoke.LambdaForm$BMH/1182461167.reinvoke(LambdaForm$BMH)
	at java.lang.invoke.LambdaForm$MH/198761306.invoker(LambdaForm$MH)
	at java.lang.invoke.LambdaForm$MH/1058025095.invokeExact_MT(LambdaForm$MH)
	at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
	at org.jruby.ir.Compiler$1.load(Compiler.java:95)
	at org.jruby.Ruby.runScript(Ruby.java:828)
	at org.jruby.Ruby.runNormally(Ruby.java:747)
	at org.jruby.Ruby.runNormally(Ruby.java:765)
	at org.jruby.Ruby.runFromMain(Ruby.java:578)
	at org.jruby.Main.doRunFromMain(Main.java:417)
	at org.jruby.Main.internalRun(Main.java:305)
	at org.jruby.Main.run(Main.java:232)
	at org.jruby.Main.main(Main.java:204)

real	0m41.864s
user	0m51.344s
sys	0m0.495s
Returned 1.
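The failure mode can be modeled outside JRuby. A minimal Ruby sketch (illustrative only — `rescue_pcs` here is a plain array standing in for the interpreter's actual stack, not JRuby source): each iteration pushes a rescue PC on entering the protected region, and `next` inside the rescue branches back to the loop head without ever reaching the balancing pop.

```ruby
# Toy model of the leak: `next` jumps out of the protected region,
# skipping the instruction that would pop the rescue PC, so one stale
# entry is left behind per iteration.
rescue_pcs = []

1000.times do
  begin
    rescue_pcs.push(:rescue_label)  # region start: remember the handler PC
    raise "oops"                    # e.g. Queue#pop(true) on an empty queue
  rescue RuntimeError
    next                            # branches to loop head -- no pop runs
  end
  rescue_pcs.pop                    # balancing pop, never reached here
end

puts rescue_pcs.size  # => 1000, one leaked entry per iteration
```

With a bounded heap, the real interpreter's `java.util.Stack` grows the same way until `Arrays.copyOf` finally fails, matching the trace above.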
Member

enebo commented Nov 24, 2017

Wow, this one is pretty odd. It does not happen without the second file? I really expected this to be a case where the `next` somehow skips the pop that matches our push, but it looks like it works OK when called from a lexical parent and fails only when it is not.

@enebo enebo added the ir label Nov 24, 2017

@enebo enebo added this to the JRuby 9.1.15.0 milestone Nov 24, 2017

andrewdotn commented Nov 24, 2017

It doesn’t seem related to lexical scope, but to compiling. If I disable compilation, a second file isn’t needed:

$ cat foo.rb
while true
  begin
    raise "oops"
  rescue => e
    next
  end
end
$ jruby -J-Djruby.ir.debug -J-Djruby.compile.mode=OFF foo.rb 2>&1 | grep RPCs=1000
[main] INFO Interpreter : I: {15} toggle_backtrace(;true); <#RPCs=1000>
...
Member

enebo commented Nov 24, 2017

@andrewdotn ah yeah we compile main file by default. Thanks for that update.

andrewdotn commented Nov 24, 2017

I also see leaks with

while true
  while true
    begin
      raise "oops"
    rescue => e
      break
    end
  end
end

and

while true
  begin
    raise "oops"
  rescue => e
    redo
  end
end

I’ve been trying to make IRBuilder add an extra ExceptionRegionEndMarkerInstr before the jump in these cases. While that stops the leaks in the simple examples here, it causes real applications to break later during CFG construction, and I haven’t dug into that yet.

Member

headius commented Nov 27, 2017

This seems like a basic VM-level issue in our IR interpreter. The logic as compiled into JVM bytecode works just fine, because the management of try/catch regions is handled properly by the JVM. Any branch out of the "try" area automatically leaves that area, while for us it must be done manually at each exit point.

Member

enebo commented Nov 27, 2017

At first glance: we create a begin/end exception range, which is great for building our CFG, but then we try to leverage these as actual executable instrs in our startup interp. Their design is not really intended to pair up (e.g. start called, then end called); they are really for marking regions.

This is really only a problem in cases like these that loop forever. It is also mildly ironic that we do not compile methods like this because they never exit, so it is an unhappy combination.

Member

enebo commented Nov 27, 2017

More notes. I tried pushing a new start-exception entry only if it was different from the top of the stack (based on the idea that, lexically, you could never raise to the same place twice), but this had issues with us prematurely popping too many elements. I cannot just ignore an empty stack, because I might have prematurely nuked a rescue IP we still need.

@subbuss suggested another hack would be for start/end to both record the label they pair with and make sure to only pop if the push label matches.
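The label-matching idea could be sketched like so (hypothetical helper names; the real markers are IR instructions, not lambdas): both markers record the label they pair with, and the end marker pops only when the top of the stack carries its own label, so a stray end marker for an unrelated region becomes a no-op. Note this simple check still assumes labels are unique, which later comments show is not always true.

```ruby
rescue_pcs = []

# region start: push the handler label
start_region = lambda { |label| rescue_pcs.push(label) }

# region end: pop only if the top entry is the label we are paired with
end_region = lambda do |label|
  rescue_pcs.pop if rescue_pcs.last == label
end

start_region.call(:outer_rescue)
start_region.call(:inner_rescue)
end_region.call(:outer_rescue)   # mismatch: top is :inner_rescue, no pop
end_region.call(:inner_rescue)   # matches: pops :inner_rescue
end_region.call(:outer_rescue)   # matches: pops :outer_rescue
puts rescue_pcs.size  # => 0
```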

Contributor

subbuss commented Nov 27, 2017

Deleted my previous comment because I don't think that simplified logic is right ... but yes, this is a problem of retrofitting the startup interpreter on top of the IR meant for CFG construction. The hacky soln should work for now. Will let @enebo test it. If there is a cleaner non-hacky soln that we can think of, we can use it instead.

@enebo enebo closed this in 98d1074 Nov 27, 2017

enebo added a commit that referenced this issue Nov 27, 2017

Fixes #4865. OOM due to unbounded rescuePCs growth.
The comment in this commit hopefully spells out how the solution works, but the tl;dr is that we cannot guarantee start and end exception region instructions will both occur in the presence of branches or jumps. The workaround is to clean up the stack whenever we decide to add a new exception region.

This had a small side benefit of removing a hack we must have added upon seeing a jump-like instr: just doing a pop and hoping for the best.
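The pruning workaround described in the commit can be sketched in Ruby (illustrative only; the real change lives in the Java interpreter): since an end marker may never execute after a branch, the push itself takes responsibility for cleanup — before adding a region, it drops any stale copy already on the stack, so re-entering the same lexical region cannot grow the stack.

```ruby
rescue_pcs = []

# push a region, first pruning any stale occurrence of the same label
push_region = lambda do |label|
  if (i = rescue_pcs.index(label))
    rescue_pcs.slice!(i..-1)  # drop the stale entry and everything above it
  end
  rescue_pcs.push(label)
end

# an endless loop re-enters the same protected region every iteration
1000.times { push_region.call(:loop_rescue) }
puts rescue_pcs.size  # => 1, where the old behavior grew without bound
```

The addendum commit below refines this to key on the instruction itself rather than the label, since nested regions can share a destination label.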
Member

enebo commented Nov 27, 2017

@andrewdotn since you were digging in last week I take it you can test this and give it the gold seal of approval :)

Looking at this problem in retrospect, it seems obvious our solution was pretty lacking, but we would only be able to observe it in an endlessly looping method outside the main program file (which we compile by default). Even a huge finite loop, I bet, would take a long time to OOM.

Member

headius commented Nov 27, 2017

@andrewdotn Can you provide a test or spec that exercises this? Probably as part of either spec/compiler/general_spec.rb or something in test/jruby.

Member

enebo commented Nov 27, 2017

I don't think this is amenable to testing. Even with a 32m heap it takes about 40 seconds to OOME.

Member

headius commented Nov 28, 2017

@enebo We can write a test that checks the stack size, though. Maybe it should be in Java.

Member

enebo commented Nov 28, 2017

@headius it is a local variable.

Member

headius commented Nov 28, 2017

@enebo Ahh I didn't notice it was not part of the interpreter's instance state.

I tweaked the given script a bit to reduce allocation and eliminate backtrace generation, and it only takes about 5s to OOM with -Xmx32m.

args = [Exception, "foo", []]
while true
  begin
    raise *args
  rescue java::lang::OutOfMemoryError
    exit 1
  rescue Exception
    next
  end
end

It's still a bit clunky for a test.

andrewdotn commented Nov 28, 2017

Wow, thanks for the amazing turnaround time on this! I can confirm that your fix works. When I was debugging this, I added a check inside StartupInterpreterEngine.interpret() that would panic if rescuePCs grew larger than 100 elements. I think there could be a Java unit test along those lines, e.g., by checking that after 100 loops rescuePCs was a sane size. It would need rescuePCs to be exposed for testing.

If I was adding a test for my own code here, and I wasn’t worried about performance, I’d add a Consumer<Integer> rescuePCsListenerForTesting parameter, along with an overload that would keep the existing signature by defaulting the new parameter to null. If non-null, the consumer would be passed the size of rescuePCs on every push, so that the test could verify it didn’t grow excessively during the loop. I know that would be inelegant but it would be fast and it would work.
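A Ruby analogue of that proposed `Consumer<Integer>` hook (all names here are hypothetical; the real field is a Java local inside the interpreter, so this only illustrates the shape): the interpreter reports the stack size to an optional listener on every push, and a test asserts the maximum stays bounded across many iterations.

```ruby
# Toy interpreter with an optional test-only push listener.
class ToyInterpreter
  def initialize(on_push: nil)
    @rescue_pcs = []
    @on_push = on_push
  end

  def push_region(label)
    @rescue_pcs.push(label)
    @on_push&.call(@rescue_pcs.size)  # report size to the test, if hooked
  end

  def pop_region
    @rescue_pcs.pop
  end
end

max_seen = 0
interp = ToyInterpreter.new(on_push: ->(size) { max_seen = [max_seen, size].max })
100.times { interp.push_region(:r); interp.pop_region }
puts max_seen  # => 1, the stack never grew past one entry
```

When the listener is nil, the hook costs one nil check per push, which is the inelegance the comment concedes.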

enebo added a commit that referenced this issue Nov 28, 2017

Addendum fix to #4865 from running out of stack.

The thing I did not understand in the original fix is that the GEB and the
exception regions for exceptions raised in ensures all push the same label and
nest. The fix is simple: we now capture the instr itself, since it is unique,
and continue using the pruning technique. The original solution's only mistake
was not realizing we would nest regions to the same destination.

Member

enebo commented Nov 28, 2017

OK, so I found another issue, but my confidence is quite a bit higher now. The reason I see no value in testing the stack size for unbounded growth: on insertion, the algorithm checks whether the entry is already on the stack and, if so, potentially shrinks the stack (it only grows when the entry is not present). So there should never be unbounded growth from that alone. Moreover, we only add these stack elements when we traverse a lexical section of code, so I am super confident the stack will never be deeper than n lexical nestings.
