Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OOM due to unbounded rescuePCs growth #4865

Closed
andrewdotn opened this issue Nov 23, 2017 · 16 comments
Closed

OOM due to unbounded rescuePCs growth #4865

andrewdotn opened this issue Nov 23, 2017 · 16 comments
Labels

Comments

@andrewdotn
Copy link

@andrewdotn andrewdotn commented Nov 23, 2017

Environment

Provide at least:

  • JRuby version (jruby -v) and command line (flags, JRUBY_OPTS, etc)

      $ jruby -v
      jruby 9.1.14.0 (2.3.3) 2017-11-08 2176f24 Java HotSpot(TM) 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [darwin-x86_64]
    
  • Operating system and platform (e.g. uname -a)

    Observed on both CentOS 7 and Mac OS 10.13.1

Actual Behavior

The following short program, extracted from a much larger application, runs out of memory due to unbounded growth in StartupInterpreterEngine.rescuePCs.

$ cat foo.rb 
require_relative 'bar'

run(Queue.new)
$ cat bar.rb 
def run(work_queue)
  while true
    begin
      work = work_queue.pop(true)
    rescue ThreadError => e
      next
    end
  end
end
$ time JRUBY_OPTS="-w -J-Xmx32m" jruby foo.rb
Error: Your application used more memory than the safety cap of 32M.
Specify -J-Xmx####M to increase it (#### = cap size in MB).
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3210)
	at java.util.Arrays.copyOf(Arrays.java:3181)
	at java.util.Vector.grow(Vector.java:266)
	at java.util.Vector.ensureCapacityHelper(Vector.java:246)
	at java.util.Vector.addElement(Vector.java:620)
	at java.util.Stack.push(Stack.java:67)
	at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:101)
	at org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:84)
	at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:179)
	at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:165)
	at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:200)
	at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:318)
	at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:155)
	at foo.invokeOther3:run(foo.rb:3)
	at foo.RUBY$script(foo.rb:3)
	at java.lang.invoke.LambdaForm$DMH/989110044.invokeStatic_L7_L(LambdaForm$DMH)
	at java.lang.invoke.LambdaForm$BMH/1182461167.reinvoke(LambdaForm$BMH)
	at java.lang.invoke.LambdaForm$MH/198761306.invoker(LambdaForm$MH)
	at java.lang.invoke.LambdaForm$MH/1058025095.invokeExact_MT(LambdaForm$MH)
	at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
	at org.jruby.ir.Compiler$1.load(Compiler.java:95)
	at org.jruby.Ruby.runScript(Ruby.java:828)
	at org.jruby.Ruby.runNormally(Ruby.java:747)
	at org.jruby.Ruby.runNormally(Ruby.java:765)
	at org.jruby.Ruby.runFromMain(Ruby.java:578)
	at org.jruby.Main.doRunFromMain(Main.java:417)
	at org.jruby.Main.internalRun(Main.java:305)
	at org.jruby.Main.run(Main.java:232)
	at org.jruby.Main.main(Main.java:204)

real	0m41.864s
user	0m51.344s
sys	0m0.495s
Returned 1.
@enebo
Copy link
Member

@enebo enebo commented Nov 24, 2017

Wow this one is pretty odd. It does not happen without the second file??? I really expected this to be something where the next somehow is missing hitting a matching push to our pop but it looks like it works ok if called from a lexical parent but not if it is not.

@enebo enebo added the ir label Nov 24, 2017
@enebo enebo added this to the JRuby 9.1.15.0 milestone Nov 24, 2017
@andrewdotn
Copy link
Author

@andrewdotn andrewdotn commented Nov 24, 2017

It doesn’t seem related to lexical scope, but to compiling. If I disable compilation, a second file isn’t needed:

$ cat foo.rb
while true
  begin
    raise "oops"
  rescue => e
    next
  end
end
$ jruby -J-Djruby.ir.debug -J-Djruby.compile.mode=OFF foo.rb 2>&1 | grep RPCs=1000
[main] INFO Interpreter : I: {15} toggle_backtrace(;true); <#RPCs=1000>
...

@enebo
Copy link
Member

@enebo enebo commented Nov 24, 2017

@andrewdotn ah yeah we compile main file by default. Thanks for that update.

@andrewdotn
Copy link
Author

@andrewdotn andrewdotn commented Nov 24, 2017

I also see leaks with

while true
  while true
    begin
      raise "oops"
    rescue => e
      break
    end
  end
end

and

while true
  begin
    raise "oops"
  rescue => e
    redo
  end
end

I’ve been trying to make IRBuilder add an extra ExceptionRegionEndMarkerInstr before the jump in these cases, but while they stop the leaks on the simple examples here, it causes real applications to break later in CFG, and I haven’t dug into that yet.

@headius
Copy link
Member

@headius headius commented Nov 27, 2017

This seems like a basic VM-level issue in our IR interpreter. The logic as compiled into JVM bytecode works just fine, because the management of try/catch regions is handled properly by the JVM. Any branch out of the "try" area automatically leaves that area, while for us it must be done manually at each exit point.

@enebo
Copy link
Member

@enebo enebo commented Nov 27, 2017

At first glance we create begin/end exc range which is great for building our CFG but then we try to leverage these as actual usable instrs in our startup interp, Their design is not really intended to pair up (e.g. start called then end called) as they are really for marking regions.

This is really only a problem in these cases that are infinite. It is also mildly ironic that we do not compile methods like this because they never exit so it is sort of an unhappy combination.

@enebo
Copy link
Member

@enebo enebo commented Nov 27, 2017

More notes. I tried pushing a new start exc only if it was different than top of stack (based on idea that lexically you could never raise to the same place twice) but this had issues with us prematurely popping too many elements. I cannot just ignore empty stack because I might have prematurely nuked a rescue ip we need.

@subbuss suggested another hack would be for start/end to both record the label they pair with and make sure to only pop if the push label matches.

@subbuss
Copy link
Contributor

@subbuss subbuss commented Nov 27, 2017

Deleted my previous comment because I don't think that simplified logic is right ... but yes, this is a problem of retrofitting the startup interpreter on top of the IR meant for CFG construction. The hacky soln should work for now. Will let @enebo test it. If there is a cleaner non-hacky soln that we can think of, we can use it instead.

@enebo enebo closed this in 98d1074 Nov 27, 2017
enebo added a commit that referenced this issue Nov 27, 2017
The comment in this commit hopefully spells out how the solution works but one
tl;dr is we cannot guarantee start and end exception region instructions will
occur in the presence of branches or jumps.  The workaround is to clean up the
stack whenever we decide to add a new exception region.

This had a small side-benefit of removing a hack we must have added when seeing
a jump-like instr just doing a pop and hoping for the best.
@enebo
Copy link
Member

@enebo enebo commented Nov 27, 2017

@andrewdotn since you were digging in last week I take it you can test this and give it the gold seal of approval :)

Looking at this problem in retrospect it seems obvious our solution was pretty lacking but we would only be able to observe this in an endless loop function which was not the main program file (which we compile by default). Even a huge finite loop I bet took a long time to OOM.

enebo added a commit that referenced this issue Nov 27, 2017
enebo added a commit that referenced this issue Nov 27, 2017
@headius
Copy link
Member

@headius headius commented Nov 27, 2017

@andrewdotn Can you provide a test or spec that exercises this? Probably as part of either spec/compiler/general_spec.rb or something in test/jruby.

@enebo
Copy link
Member

@enebo enebo commented Nov 27, 2017

I don't think this is amenable to testing. Even with 32m heap it takes like 40 seconds to OOME

@headius
Copy link
Member

@headius headius commented Nov 28, 2017

@enebo We can write a test that checks the stack size, though. Maybe it should be in Java.

@enebo
Copy link
Member

@enebo enebo commented Nov 28, 2017

@headius it is a local variable.

@headius
Copy link
Member

@headius headius commented Nov 28, 2017

@enebo Ahh I didn't notice it was not part of the interpreter's instance state.

I tweaked the given script a bit to reduce allocation and eliminate backtrace generation, and it only takes about 5s to OOM with -Xmx32m.

args = [Exception, "foo", []]
while true
  begin
    raise *args
  rescue java::lang::OutOfMemoryError
    exit 1
  rescue Exception
    next
  end
end

It's still a bit clunky for a test.

@andrewdotn
Copy link
Author

@andrewdotn andrewdotn commented Nov 28, 2017

Wow, thanks for the amazing turnaround time on this! I can confirm that your fix works. When I was debugging this, I added a check inside StartupInterpreterEngine.interpret() that would panic if rescuePCs grew larger than 100 elements. I think there could be a Java unit test along those lines, e.g., by checking that after 100 loops rescuePCs was a sane size. It would need rescuePCs to be exposed for testing.

If I was adding a test for my own code here, and I wasn’t worried about performance, I’d add a Consumer<Integer> rescuePCsListenerForTesting parameter, along with an overload that would keep the existing signature by defaulting the new parameter to null. If non-null, the consumer would be passed the size of rescuePCs on every push, so that the test could verify it didn’t grow excessively during the loop. I know that would be inelegant but it would be fast and it would work.

enebo added a commit that referenced this issue Nov 28, 2017
…understand

in original fix is that GEB and exception region for exceptions raised in
ensures is that the all push the same label and nest.  The fix is simple we
now capture the instr itself since it is unique and continue using the pruning
technique.  The original solutions only mistake was not realizing we would
nest regions to the same destination.
enebo added a commit that referenced this issue Nov 28, 2017
…understand

in original fix is that GEB and exception region for exceptions raised in
ensures is that the all push the same label and nest.  The fix is simple we
now capture the instr itself since it is unique and continue using the pruning
technique.  The original solutions only mistake was not realizing we would
nest regions to the same destination.
@enebo
Copy link
Member

@enebo enebo commented Nov 28, 2017

Ok so I found another issue but my confidence is quite a bit higher now. The reason I see no value in testing the stack size for unbounded growth is the algorithm on insertion will look to see if it is already on the stack and if so potentially reduce the size of the stack (but only grow if not present). So there should never be unbounded growth from that alone. However, we only bother to add these stack elements when we traverse a lexical section of code. I am super confident we will never be more than n lexical nestings of stack size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants