Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NPE in ruby block called concurrently from Java #6282

Closed
jmalves opened this issue Jun 15, 2020 · 8 comments · Fixed by #6288
Closed

NPE in ruby block called concurrently from Java #6282

jmalves opened this issue Jun 15, 2020 · 8 comments · Fixed by #6288

Comments

@jmalves
Copy link
Contributor

jmalves commented Jun 15, 2020

Environment Information

  • JRuby 9.2.7.0

Expected Behavior

No error when calling ruby block concurrently from Java

I have not been able to reproduce this error locally at all but seems to happen sporadically in our testing pipeline. The failing tests are different but a common pattern is passing a ruby block to java which is called concurrently. An example similar to our tests could be:

 results = java.util.concurrent.ConcurrentHashMap.new
 calculate_something_in_parallel(input1, input2, -> (id, result) { results[id] = result })

The stack trace then indicates that it fails when calling the lambda.

Actual Behavior

Returns an NPE with the following stack trace:

java.lang.NullPointerException 
     org.jruby.runtime.InterpretedIRBlockBody.commonYieldPath(InterpretedIRBlockBody.java:127)
     org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:79)
     org.jruby.runtime.Block.call(Block.java:124)
     org.jruby.RubyProc.call(RubyProc.java:295)
     org.jruby.RubyProc.call(RubyProc.java:274)
     org.jruby.RubyProc.call(RubyProc.java:270)
     org.jruby.javasupport.Java$ProcToInterface.callProc(Java.java:1131)
     org.jruby.javasupport.Java$ProcToInterface.access$300(Java.java:1108)
     org.jruby.javasupport.Java$ProcToInterface$ConcreteMethod.call(Java.java:1169)
     org.jruby.gen.InterfaceImpl630006984.consume(org/jruby/gen/InterfaceImpl630006984.gen:13)
@headius headius added this to the JRuby 9.3.0.0 milestone Jun 15, 2020
@headius
Copy link
Member

headius commented Jun 15, 2020

Trying to reproduce with this script on 9.2.7.0, using -X-C -Xjit.threshold=1 but no love so far:

loop do
  block = eval "proc { i = 0; 100000.times { i += 1 } }"
  100.times.map {
    Thread.new { 100.times { block.call } }
  }.each(&:join)
end

@headius
Copy link
Member

headius commented Jun 15, 2020

Ahh, @jmalves pointed out my example does not hit the same code path (commonYieldPath). Since it's a trivial block, it goes along the "direct" path.

This code is better, but still doesn't repro for me:

loop do
  ary = []
  block = eval "proc { ary[0] = 1 }"
  100.times.map {
    Thread.new { 100.times { block.call } }
  }.each(&:join)
end

The "ary[0] = 1" forces a frame (due to some core []= methods requiring frame access) which causes this logic to go down the non-direct path.

@jmalves
Copy link
Contributor Author

jmalves commented Jun 15, 2020

@headius got a repro! Here is my a bit silly code.

In java I have:

public static void pimpam(Consumer<Integer> consumer) {
        List<CompletableFuture<?>> list = new ArrayList<>(50);
        for(int i = 50; i > 0; i--) {
            final int x = i;
            list.add(CompletableFuture.runAsync(() -> {
                for (int j = 10; j > 0; j--) {
                    consumer.accept(j * j + x * x);
                }
            }));
        }

        try {
          list.forEach(CompletableFuture::join);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

In ruby I run:

    50_000.times do |i|
      results = java.util.concurrent.ConcurrentHashMap.new
      pimpam(eval " ->(iii) { results[iii] = iii } ")
      p i if i % 1000 == 0
    end

@jmalves
Copy link
Contributor Author

jmalves commented Jun 16, 2020

@headius I found a way that reproduces this bug very rarely and I also found another sporadic error in the process. I attach in a zip the files for this reproduction. To run them: javac Repo.java and then jruby --dev script.rb

What I found was that it seems that the bug only reproduces using the "--dev" flag. Tested this flag with the snippet you pasted here and it did not seem to work for me (left it running ~15min)

The new error I found running my script has the following stacktrace and it seems to appear commonly:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at org.jruby.runtime.scope.ManyVarsDynamicScope.setValueDepthZeroVoid(ManyVarsDynamicScope.java:183)
at org.jruby.runtime.scope.ManyVarsDynamicScope.setValueVoid(ManyVarsDynamicScope.java:174)
at org.jruby.ir.interpreter.InterpreterEngine.setResult(InterpreterEngine.java:572)
at org.jruby.ir.interpreter.InterpreterEngine.receiveArg(InterpreterEngine.java:273)
at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:68)
at org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116)
at org.jruby.runtime.InterpretedIRBlockBody.callDirect(InterpretedIRBlockBody.java:102)
at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:58)
at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:52)
at org.jruby.runtime.Block.call(Block.java:139)
at org.jruby.RubyProc.call(RubyProc.java:318)
at org.jruby.javasupport.Java$ProcToInterface.callProc(Java.java:1137)
at org.jruby.javasupport.Java$ProcToInterface.access$300(Java.java:1114)
at org.jruby.javasupport.Java$ProcToInterface$ConcreteMethod.call(Java.java:1175)
at org.jruby.gen.InterfaceImpl714518509.accept(org/jruby/gen/InterfaceImpl714518509.gen:13)
at Repro.lambda$pimpam$0(Repro.java:18)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)

I tested with 9.2.7.0 and 9.2.11.0, curiously enough I was only able to get the NPE when running with 9.2.7.0 but might just be a question of leaving 9.2.11.0 running longer.

reproduction.zip

@headius headius modified the milestones: JRuby 9.3.0.0, JRuby 9.2.12.0 Jun 17, 2020
@headius
Copy link
Member

headius commented Jun 17, 2020

@jmalves This is great, thank you. As I mentioned on Matrix, I am now able to reproduce both errors against 9.2.7.0.

However so far I have not been able to reproduce the NPE against 9.2.11+. The AIOOB reproduces on both jruby-9.2 and master branches.

Do you think you could try to confirm 9.2.11 or a build off the 9.2.12 branch and see if you can confirm the NPE does not show up? This wouldn't mean it's guaranteed to be fixed, but it at least hints at it.

I will proceed with fixing the AIOOB, which may end up fixing any rare NPE or other concurrency issues in the same place.

@headius
Copy link
Member

headius commented Jun 18, 2020

Ok, I believe I have a patch that deals with the AIOOB, and may be additional insurance against the NPE. I'll push a PR and we'll see how it looks.

Incidentally I also tried to reproduce this using a method rather than a block, but I did not see any errors result.

obj = Object.new
50_000.times do |i|
  $results = java.util.concurrent.ConcurrentHashMap.new
  eval " def obj.accept(iii); $results[iii] = iii; end "
  Java::Blah.pimpam(obj)
  p i if i % 1000 == 0
end

headius added a commit to headius/jruby that referenced this issue Jun 18, 2020
The logic here used to go to interpreterContext to check for the
call protocol logic. The hasExplicitCallProtocol method returns a
cached value acquired from the encapsulated scope, via its flags.
However the enum of flags in IRScope may be mutated concurrently,
leading to a situation where we have seen the scope flip to having
call protocols installed, but the InterpreterContext in hand is
still the one without protocol instructions. As a result, we may
start interpreting using an unoptimized IR along the "direct" path
which does not push any scope state. This lead to the AIOOB
reported later in jruby#6282.

The NPE reported in jruby#6282 has not reproduced for me on the
jruby-9.2 branch, and may have been fixed through other means, but
this modified logic should also help any such cases that remain.
Instead of overwriting the two InterpreterContext fields, the non-
full version will always remain the non-full version, and the full
version will only be set into that field. Once set, neither are
ever cleared or re-set to a new value, avoiding a transient null
in the instructionContext field.

I ran the new reproduction from jruby#6282 for a few hours and saw no
errors. Adding a test that guarantees this is fixed may be
difficult, but I believe we've got it here.

This logic will be overhauled for 9.3, so we'll go with this
workaround for now and call it fixed.

Fixes jruby#6282.
headius added a commit to headius/jruby that referenced this issue Jun 18, 2020
The logic here used to go to interpreterContext to check for the
call protocol logic. The hasExplicitCallProtocol method returns a
cached value acquired from the encapsulated scope, via its flags.
However the enum of flags in IRScope may be mutated concurrently,
leading to a situation where we have seen the scope flip to having
call protocols installed, but the InterpreterContext in hand is
still the one without protocol instructions. As a result, we may
start interpreting using an unoptimized IR along the "direct" path
which does not push any scope state. This lead to the AIOOB
reported later in jruby#6282.

The NPE reported in jruby#6282 has not reproduced for me on the
jruby-9.2 branch, and may have been fixed through other means, but
this modified logic should also help any such cases that remain.
Instead of overwriting the two InterpreterContext fields, the non-
full version will always remain the non-full version, and the full
version will only be set into that field. Once set, neither are
ever cleared or re-set to a new value, avoiding a transient null
in the instructionContext field.

I ran the new reproduction from jruby#6282 for a few hours and saw no
errors. Adding a test that guarantees this is fixed may be
difficult, but I believe we've got it here.

This logic will be overhauled for 9.3, so we'll go with this
workaround for now and call it fixed.

Fixes jruby#6282.
@headius
Copy link
Member

headius commented Jun 18, 2020

This is now fixed, but I'm not sure if there's a good way to test it. In any case, JRuby 9.3 will do a much larger-scale cleanup of these transitions, so we avoid having jit threads stepping on executing threads.

@headius headius closed this as completed Jun 18, 2020
@headius headius linked a pull request Jun 18, 2020 that will close this issue
@jmalves
Copy link
Contributor Author

jmalves commented Jun 22, 2020

Do you think you could try to confirm 9.2.11 or a build off the 9.2.12 branch and see if you can confirm the NPE does not show up? I was not able to reproduce the NPE in 9.2.11

Thanks for this patch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants