New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Singleton on non-persistent Java type Java::JavaLang::NullPointerException #4304

Closed
tobymurray-nanometrics opened this Issue Nov 16, 2016 · 14 comments

Comments

Projects
None yet
4 participants
@tobymurray-nanometrics

tobymurray-nanometrics commented Nov 16, 2016

We just upgraded from 9.1.2.0 to 9.1.6.0, and we saw a CI build failure with the exception below. It's only failed once so far, but we like to stay on top of these things. Look familiar at all? Anything we can do to provide more helpful logs/details? I don't have a reproducible case, so I'm just looking for some guidance as to whether you think this is an us problem or a JRuby 9.1.6.0 problem.

Environment

JRuby 1.9.6.0
Rails 4.2.7.1

openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2~bpo8+1-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
Linux 468c25bf6746 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 GNU/Linux

Command executed

[INFO] Running precompile assets with command java -Xms2g -Xmx2g -jar /root/.m2/repository/org/jruby/jruby-complete/9.1.6.0/jruby-complete-9.1.6.0.jar -S /opt/atlassian/pipelines/agent/build/athena/.gems/bin/rake assets:precompile

Stacktrace

I, [2016-11-16T16:48:29.889000 #387]  INFO -- : Writing /opt/atlassian/pipelines/agent/build/athena/public/assets/pages/threshold_statuses/seismicity_plot-06f1df67a58d8f1dcfeddae2bab4912e0248623dce7f1b2b7927dffb0d1e13ce.js.gz
/opt/atlassian/pipelines/agent/build/athena/.gems/gems/rake-11.3.0/lib/rake/task.rb:197: warning: singleton on non-persistent Java type Java::JavaLang::NullPointerException (http://wiki.jruby.org/Persistence)
rake aborted!
Java::JavaLang::NullPointerException: 
org.jruby.ir.dataflow.DataFlowProblem.compute_MOP_Solution(DataFlowProblem.java:56)
org.jruby.ir.passes.LiveVariableAnalysis.execute(LiveVariableAnalysis.java:77)
org.jruby.ir.passes.CompilerPass.run(CompilerPass.java:111)
org.jruby.ir.passes.CompilerPass.makeSureDependencyHasRunOnce(CompilerPass.java:133)
org.jruby.ir.passes.CompilerPass.run(CompilerPass.java:101)
org.jruby.ir.passes.CompilerPass.run(CompilerPass.java:125)
org.jruby.ir.IRScope.runCompilerPasses(IRScope.java:502)
org.jruby.ir.IRScope.prepareForCompilation(IRScope.java:586)
org.jruby.runtime.MixedModeIRBlockBody.promoteToFullBuild(MixedModeIRBlockBody.java:169)
org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:126)
org.jruby.runtime.IRBlockBody.doYield(IRBlockBody.java:182)
org.jruby.runtime.BlockBody.yield(BlockBody.java:110)
org.jruby.runtime.Block.yield(Block.java:167)
org.jruby.RubyArray.collectBang(RubyArray.java:2505)
org.jruby.RubyArray.map_bang(RubyArray.java:2524)
org.jruby.RubyArray$INVOKER$i$0$0$map_bang.call(RubyArray$INVOKER$i$0$0$map_bang.gen)
org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroBlock.call(JavaMethod.java:497)
org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:77)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:83)
org.jruby.ir.instructions.CallBase.interpret(CallBase.java:428)
org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:356)
org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:73)
org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:109)
org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:95)
org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:174)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:67)
opt.atlassian.pipelines.agent.build.athena.$_dot_gems.gems.sass_minus_3_dot_4_dot_22.lib.sass.selector.sequence.invokeOther2:do_extend(/opt/atlassian/pipelines/agent/build/athena/.gems/gems/sass-3.4.22/lib/sass/selector/sequence.rb:112)
opt.atlassian.pipelines.agent.build.athena.$_dot_gems.gems.sass_minus_3_dot_4_dot_22.lib.sass.selector.sequence.RUBY$block$do_extend$0(/opt/atlassian/pipelines/agent/build/athena/.gems/gems/sass-3.4.22/lib/sass/selector/sequence.rb:112)
org.jruby.runtime.CompiledIRBlockBody.yieldDirect(CompiledIRBlockBody.java:156)
org.jruby.runtime.MixedModeIRBlockBody.yieldDirect(MixedModeIRBlockBody.java:122)
org.jruby.runtime.BlockBody.yield(BlockBody.java:108)
org.jruby.runtime.Block.yield(Block.java:167)
org.jruby.RubyArray.collect(RubyArray.java:2478)
org.jruby.RubyArray.map19(RubyArray.java:2492)
org.jruby.RubyArray$INVOKER$i$0$0$map19.call(RubyArray$INVOKER$i$0$0$map19.gen)
org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroBlock.call(JavaMethod.java:497)
org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:77)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:83)
org.jruby.ir.instructions.CallBase.interpret(CallBase.java:428)
org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:356)
org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:73)
org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:109)
org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:95)
org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:77)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:83)
org.jruby.ir.instructions.CallBase.interpret(CallBase.java:428)
org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:356)
org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:73)
org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:132)
org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:148)
org.jruby.runtime.IRBlockBody.doYield(IRBlockBody.java:182)
org.jruby.runtime.BlockBody.yield(BlockBody.java:110)
org.jruby.runtime.Block.yield(Block.java:167)
org.jruby.specialized.RubyArrayOneObject.collect(RubyArrayOneObject.java:90)
org.jruby.RubyArray.map19(RubyArray.java:2492)
org.jruby.RubyArray$INVOKER$i$0$0$map19.call(RubyArray$INVOKER$i$0$0$map19.gen)
org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroBlock.call(JavaMethod.java:497)
org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:77)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:83)
org.jruby.ir.instructions.CallBase.interpret(CallBase.java:428)
org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:356)
org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:73)
org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:90)
org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:214)
org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:200)
org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:208)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:193)
org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:324)
org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:73)
org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:84)
org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:179)
org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:165)
org.jruby.RubyClass.finvoke(RubyClass.java:531)
org.jruby.RubyBasicObject.send19(RubyBasicObject.java:1668)
org.jruby.RubyKernel.send19(RubyKernel.java:1991)
org.jruby.RubyKernel$INVOKER$s$send19.call(RubyKernel$INVOKER$s$send19.gen)
org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:203)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:209)
opt.atlassian.pipelines.agent.build.athena.$_dot_gems.gems.sass_minus_3_dot_4_dot_22.lib.sass.tree.visitors.base.invokeOther7:send(/opt/atlassian/pipelines/agent/build/athena/.gems/gems/sass-3.4.22/lib/sass/tree/visitors/base.rb:36)
opt.atlassian.pipelines.agent.build.athena.$_dot_gems.gems.sass_minus_3_dot_4_dot_22.lib.sass.tree.visitors.base.RUBY$method$visit$0(/opt/atlassian/pipelines/agent/build/athena/.gems/gems/sass-3.4.22/lib/sass/tree/visitors/base.rb:36)
...

@enebo enebo added this to the JRuby 9.1.7.0 milestone Nov 16, 2016

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Dec 13, 2016

Member

It appears the worklist for the data flow analysis problem is getting populated with a null. That appears to possible if DataFlowProblem.getFlowGraphNode returns null in generateWorkList. That in turn can happen if the given block is not in the basicBlockToFlowGraph collection, or if a null gets inserted into that collection in buildFlowGraph. Looking at the logic there, it does not appear possible that a null is being inserted, so I suspect the worklist is traversing a basic block that is not part of the control flow graph.

Member

headius commented Dec 13, 2016

It appears the worklist for the data flow analysis problem is getting populated with a null. That appears to possible if DataFlowProblem.getFlowGraphNode returns null in generateWorkList. That in turn can happen if the given block is not in the basicBlockToFlowGraph collection, or if a null gets inserted into that collection in buildFlowGraph. Looking at the logic there, it does not appear possible that a null is being inserted, so I suspect the worklist is traversing a basic block that is not part of the control flow graph.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Dec 13, 2016

Member

Here's a patch that would at least prevent that null entry in the worklist, but I'm not sure why the DF problem tries to traverse a block that is no longer in the CFG. Perhaps this is related to the warning about caching the post-order traversal as @subbuss mentioned in CFG.postOrderList?

diff --git a/core/src/main/java/org/jruby/ir/dataflow/DataFlowProblem.java b/core/src/main/java/org/jruby/ir/dataflow/DataFlowProblem.java
index 6a46213..aa0fcfd 100644
--- a/core/src/main/java/org/jruby/ir/dataflow/DataFlowProblem.java
+++ b/core/src/main/java/org/jruby/ir/dataflow/DataFlowProblem.java
@@ -67,7 +67,9 @@ public abstract class DataFlowProblem<T extends DataFlowProblem<T, U>, U extends
                 scope.getCFG().getReversePostOrderTraverser() : scope.getCFG().getPostOrderTraverser();
 
         while (it.hasNext()) {
-            wl.add(getFlowGraphNode(it.next()));
+            U flowGraphNode = getFlowGraphNode(it.next());
+            if (flowGraphNode == null) continue;
+            wl.add(flowGraphNode);
         }
 
         return wl;
Member

headius commented Dec 13, 2016

Here's a patch that would at least prevent that null entry in the worklist, but I'm not sure why the DF problem tries to traverse a block that is no longer in the CFG. Perhaps this is related to the warning about caching the post-order traversal as @subbuss mentioned in CFG.postOrderList?

diff --git a/core/src/main/java/org/jruby/ir/dataflow/DataFlowProblem.java b/core/src/main/java/org/jruby/ir/dataflow/DataFlowProblem.java
index 6a46213..aa0fcfd 100644
--- a/core/src/main/java/org/jruby/ir/dataflow/DataFlowProblem.java
+++ b/core/src/main/java/org/jruby/ir/dataflow/DataFlowProblem.java
@@ -67,7 +67,9 @@ public abstract class DataFlowProblem<T extends DataFlowProblem<T, U>, U extends
                 scope.getCFG().getReversePostOrderTraverser() : scope.getCFG().getPostOrderTraverser();
 
         while (it.hasNext()) {
-            wl.add(getFlowGraphNode(it.next()));
+            U flowGraphNode = getFlowGraphNode(it.next());
+            if (flowGraphNode == null) continue;
+            wl.add(flowGraphNode);
         }
 
         return wl;
@subbuss

This comment has been minimized.

Show comment
Hide comment
@subbuss

subbuss Dec 13, 2016

Contributor

Could be. Not sure if the list is updated when a BB is deleted for whatever reason. And there is code that deletes BBs as far as I remember.

Contributor

subbuss commented Dec 13, 2016

Could be. Not sure if the list is updated when a BB is deleted for whatever reason. And there is code that deletes BBs as far as I remember.

@tobymurray-nanometrics

This comment has been minimized.

Show comment
Hide comment
@tobymurray-nanometrics

tobymurray-nanometrics Dec 13, 2016

For what it's worth, looking over the last week+ we haven't seen this re-occur. It's certainly not blocking us on anything in its current state.

For what it's worth, looking over the last week+ we haven't seen this re-occur. It's certainly not blocking us on anything in its current state.

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Dec 13, 2016

Member

@headius @subbuss I think a null check will wallpaper over the actual problem.

So I figured out two races:

  1. we promote to full build on parent scope but then on another thread we promote a child closure scope to full build. Since they happen on different threads AND CFG is a mutable data structure we end up with nonexistent BBs. Either parent or child can also be JITing and deciding to do LVA or anything which changes the CFG.
  2. if jit.background=false we compile any two things at same time across threads. Same parent/child relationship can occur.

I think this ends up being fairly rare because the race is not probably all that common. You need a closure and a parent (which could be closure/method/whatever) deciding to build at the same time but from different threads. Even if that does happen the CFG might not change in any meaningful way.

CFG should be immutable? That certainly would get rid of the most obvious reported issue but is it a correct solution? If I LVA on a parent and it assumes something about the child but the child is optimizing and changing itself can we depend on the older CFG? Maybe for LVA (conservatively speaking) but in general?

Member

enebo commented Dec 13, 2016

@headius @subbuss I think a null check will wallpaper over the actual problem.

So I figured out two races:

  1. we promote to full build on parent scope but then on another thread we promote a child closure scope to full build. Since they happen on different threads AND CFG is a mutable data structure we end up with nonexistent BBs. Either parent or child can also be JITing and deciding to do LVA or anything which changes the CFG.
  2. if jit.background=false we compile any two things at same time across threads. Same parent/child relationship can occur.

I think this ends up being fairly rare because the race is not probably all that common. You need a closure and a parent (which could be closure/method/whatever) deciding to build at the same time but from different threads. Even if that does happen the CFG might not change in any meaningful way.

CFG should be immutable? That certainly would get rid of the most obvious reported issue but is it a correct solution? If I LVA on a parent and it assumes something about the child but the child is optimizing and changing itself can we depend on the older CFG? Maybe for LVA (conservatively speaking) but in general?

@tobymurray-nanometrics

This comment has been minimized.

Show comment
Hide comment
@tobymurray-nanometrics

tobymurray-nanometrics Jan 9, 2017

After a month of no issues (hundreds of successful builds), we just saw another occurrence of this - almost exactly the same place, during asset precompilation. Exactly the same stacktrace as far as I can tell, although this time running Rake 12.0.0.

tobymurray-nanometrics commented Jan 9, 2017

After a month of no issues (hundreds of successful builds), we just saw another occurrence of this - almost exactly the same place, during asset precompilation. Exactly the same stacktrace as far as I can tell, although this time running Rake 12.0.0.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jan 9, 2017

Member

@tobymurray-nanometrics Thanks for the update. We're still debating the right fix for this.

Member

headius commented Jan 9, 2017

@tobymurray-nanometrics Thanks for the update. We're still debating the right fix for this.

@enebo enebo modified the milestones: JRuby 9.1.8.0, JRuby 9.1.7.0 Jan 10, 2017

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Jan 10, 2017

Member

This is bad bug but I am pushing to 9.1.8. I will land my PR after 9.1.7.0 and we can beat on or improve the obvious concurrency issue we have when compiling child closures before their parents. The worst part of this issue is my inability to get a good test case.

Member

enebo commented Jan 10, 2017

This is bad bug but I am pushing to 9.1.8. I will land my PR after 9.1.7.0 and we can beat on or improve the obvious concurrency issue we have when compiling child closures before their parents. The worst part of this issue is my inability to get a good test case.

@tobymurray-nanometrics

This comment has been minimized.

Show comment
Hide comment
@tobymurray-nanometrics

tobymurray-nanometrics Jan 10, 2017

No complaints here from pushing the bug. It has only manifested for us during CI builds, even then only twice that I can find. and we can just run the build over again in the meantime. The asset precompilation stage of our build (the only place we've seen this) is pretty opaque, but is there anything we can do to help here? Any additional logs or information that would help build a reproduction?

No complaints here from pushing the bug. It has only manifested for us during CI builds, even then only twice that I can find. and we can just run the build over again in the meantime. The asset precompilation stage of our build (the only place we've seen this) is pretty opaque, but is there anything we can do to help here? Any additional logs or information that would help build a reproduction?

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Jan 10, 2017

Member

@tobymurray-nanometrics if I had a smallish script I could torture test which would manifest on occasion it would definitely help. My other thought is I could write Java which explicitly calls into our compiler from two threads to make the race much more likely. On your part, I am pretty sure I understand the actual issue. It is verifying any fix I have works which is harder part. My PR probably fixes the problem too although I want it to bake first.

Member

enebo commented Jan 10, 2017

@tobymurray-nanometrics if I had a smallish script I could torture test which would manifest on occasion it would definitely help. My other thought is I could write Java which explicitly calls into our compiler from two threads to make the race much more likely. On your part, I am pretty sure I understand the actual issue. It is verifying any fix I have works which is harder part. My PR probably fixes the problem too although I want it to bake first.

@tobymurray-nanometrics

This comment has been minimized.

Show comment
Hide comment
@tobymurray-nanometrics

tobymurray-nanometrics Jan 10, 2017

I can spend a little time torturing the precompile stage to see if it consistently fails occasionally, and if that works I can post it. Any ideas as to what may contribute to eliciting this (e.g. does RAM allocated to the process have any impact etc.)?

tobymurray-nanometrics commented Jan 10, 2017

I can spend a little time torturing the precompile stage to see if it consistently fails occasionally, and if that works I can post it. Any ideas as to what may contribute to eliciting this (e.g. does RAM allocated to the process have any impact etc.)?

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Jan 10, 2017

Member

@tobymurray-nanometrics In theory anything which has more than one thread calling a parent scope (closure or method) and a child closure from the other thread. The error condition is explained in the PR explanation. Largely it is both compiling at the same time on different threads. The parent thinks something on the child is usable but then by the time it asks for it again it has changed.

Member

enebo commented Jan 10, 2017

@tobymurray-nanometrics In theory anything which has more than one thread calling a parent scope (closure or method) and a child closure from the other thread. The error condition is explained in the PR explanation. Largely it is both compiling at the same time on different threads. The parent thinks something on the child is usable but then by the time it asks for it again it has changed.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jan 17, 2017

Member

@tobymurray-nanometrics Just so you know, PR #4392 was merged to master a couple days ago, so master builds should have our expected fix for this. If you're able to test it out that would help us verify that the issue is fixed.

Member

headius commented Jan 17, 2017

@tobymurray-nanometrics Just so you know, PR #4392 was merged to master a couple days ago, so master builds should have our expected fix for this. If you're able to test it out that would help us verify that the issue is fixed.

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Feb 14, 2017

Member

Resolving based on feedback in a related issue. @tobymurray-nanometrics if you still see this then please reopen it.

Member

enebo commented Feb 14, 2017

Resolving based on feedback in a related issue. @tobymurray-nanometrics if you still see this then please reopen it.

@enebo enebo closed this Feb 14, 2017

@enebo enebo added the ir label Feb 14, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment