New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jruby > 9.1.13.0 crashing on bundler when certain jvm flags are enabled #5138

Closed
jsvd opened this Issue Apr 12, 2018 · 22 comments

Comments

Projects
None yet
4 participants
@jsvd

jsvd commented Apr 12, 2018

The following crashes on jruby 9.1.16.0 but not on 9.1.13.0:

wget --quiet https://gist.githubusercontent.com/jsvd/d15dc213a969f286ec18b65e7dceb725/raw/573e4a340c699135ef9880a4e3c75de00deb55e5/Gemfile
wget --quiet https://gist.githubusercontent.com/jsvd/d15dc213a969f286ec18b65e7dceb725/raw/573e4a340c699135ef9880a4e3c75de00deb55e5/Gemfile.lock
GEM_HOME=vendor/bundle/jruby/2.3.0 GEM_PATH=vendor/bundle/jruby/2.3.0 bin/jgem install bundler -v "1.9.10"                                   
GEM_HOME=vendor/bundle/jruby/2.3.0 GEM_PATH=vendor/bundle/jruby/2.3.0 JRUBY_OPTS="-X-C -J-Djruby.jit.threshold=0" ./bin/jruby -rbundler -e Bundler.setup

JRuby 9.1.16.0 raises a NoMethodError exception on a nil object while 9.1.13.0 works correctly (i.e. expected Bundler::GemNotFound exception).

This happens only when both -X-C and -J-Djruby.jit.threshold=0 are enabled. Dropping either one will trigger the normal GemNotFound exception.
Also, a newer bundler (like 1.16.1) doesn't trigger the NoMethodError, just the old 1.9.10.

With jruby-9.1.13.0:

/tmp/jruby-9.1.13.0 % wget --quiet https://gist.githubusercontent.com/jsvd/d15dc213a969f286ec18b65e7dceb725/raw/573e4a340c699135ef9880a4e3c75de00deb55e5/Gemfile
wget --quiet https://gist.githubusercontent.com/jsvd/d15dc213a969f286ec18b65e7dceb725/raw/573e4a340c699135ef9880a4e3c75de00deb55e5/Gemfile.lock
GEM_HOME=vendor/bundle/jruby/2.3.0 GEM_PATH=vendor/bundle/jruby/2.3.0 bin/jgem install bundler -v "1.9.10"                 
GEM_HOME=vendor/bundle/jruby/2.3.0 GEM_PATH=vendor/bundle/jruby/2.3.0 JRUBY_OPTS="-X-C -J-Djruby.jit.threshold=0" ./bin/jruby -rbundler -e Bundler.setup

Fetching: bundler-1.9.10.gem (100%)
Successfully installed bundler-1.9.10
1 gem installed
Bundler::GemNotFound: Could not find rake-12.3.1 in any of the sources
  block in materialize at /private/tmp/jruby-9.1.13.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/spec_set.rb:92
                  map! at org/jruby/RubyArray.java:2518
           materialize at /private/tmp/jruby-9.1.13.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/spec_set.rb:85
                 specs at /private/tmp/jruby-9.1.13.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/definition.rb:132
             specs_for at /private/tmp/jruby-9.1.13.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/definition.rb:177
       requested_specs at /private/tmp/jruby-9.1.13.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/definition.rb:166
       requested_specs at /private/tmp/jruby-9.1.13.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/environment.rb:18
                 setup at /private/tmp/jruby-9.1.13.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/runtime.rb:13
                 setup at /private/tmp/jruby-9.1.13.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler.rb:122
                <main> at -e:1

But with jruby-9.1.16.0:

 /tmp/jruby-9.1.16.0 % wget --quiet https://gist.githubusercontent.com/jsvd/d15dc213a969f286ec18b65e7dceb725/raw/573e4a340c699135ef9880a4e3c75de00deb55e5/Gemfile
wget --quiet https://gist.githubusercontent.com/jsvd/d15dc213a969f286ec18b65e7dceb725/raw/573e4a340c699135ef9880a4e3c75de00deb55e5/Gemfile.lock
GEM_HOME=vendor/bundle/jruby/2.3.0/ GEM_PATH=vendor/bundle/jruby/2.3.0/ bin/jgem install bundler -v "1.9.10"               
GEM_HOME=vendor/bundle/jruby/2.3.0 GEM_PATH=vendor/bundle/jruby/2.3.0 JRUBY_OPTS="-X-C -J-Djruby.jit.threshold=0" ./bin/jruby -rbundler -e Bundler.setup

Fetching: bundler-1.9.10.gem (100%)
Successfully installed bundler-1.9.10
1 gem installed
NoMethodError: undefined method `first' for nil:NilClass
                    for at /private/tmp/jruby-9.1.16.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/spec_set.rb:22
  converge_locked_specs at /private/tmp/jruby-9.1.16.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/definition.rb:554
                resolve at /private/tmp/jruby-9.1.16.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/definition.rb:187
                  specs at /private/tmp/jruby-9.1.16.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/definition.rb:132
              specs_for at /private/tmp/jruby-9.1.16.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/definition.rb:177
        requested_specs at /private/tmp/jruby-9.1.16.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/definition.rb:166
        requested_specs at /private/tmp/jruby-9.1.16.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/environment.rb:18
                  setup at /private/tmp/jruby-9.1.16.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler/runtime.rb:13
                  setup at /private/tmp/jruby-9.1.16.0/vendor/bundle/jruby/2.3.0/gems/bundler-1.9.10/lib/bundler.rb:122
                 <main> at -e:1
% java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)

This is the minimal reproduction scenario I could find, it is surfacing in Logstash in a more complex scenario: https://logstash-ci.elastic.co/job/elastic+logstash+pull-request+multijob-integration-2/294/console

@jsvd

This comment has been minimized.

jsvd commented Apr 12, 2018

stacktrace with -Xbacktrace.style=full can be found here

@headius

This comment has been minimized.

Member

headius commented Apr 12, 2018

Can you get us a full backtrace with -Xjit.threshold=0? It should show us where this "first" call is happening and we can sort out why it's nil.

@headius

This comment has been minimized.

Member

headius commented Apr 12, 2018

Oh also...Bundler 1.9.10 is several years old. Perhaps it simply doesn't work on newer Rubies? Or perhaps this should not be a priority to "fix" in JRuby if a newer Bundler works fine?

@jsvd

This comment has been minimized.

jsvd commented Apr 12, 2018

backtrace here

Oh also...Bundler 1.9.10 is several years old. Perhaps it simply doesn't work on newer Rubies? Or perhaps this should not be a priority to "fix" in JRuby if a newer Bundler works fine?

I understand it may not work with an old version, I just found it strange that jvm flags trigger a no method exception within a very obscure backtrace (there is no .first or [] in the line that raised the exception)

@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

Yes, it could indicate a problem that should be fixed.

@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

Sorry, the trace you gave me probably isn't showing what we need...but I have managed to reproduce locally!

@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

Ok, I've figured out the line that's actually causing this. It's in the same method, further down (like line 48 or so):

if spec = lookup['bundler'].first
  specs << spec
end

I'm not sure why the line number has not been updated for the trace; that would be a separate bug in IR, or possibly only in the full interpreter.

So it gets to this point, but lookup['bundler'] is returning a nil result for whatever reason (and presumably it shouldn't since we're obviously inside bundler).

@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

Ok, I think I can track the line number issue like this...

  1. It enters the until.
  2. It hits the next that's on the false line (23 or so) and exits the until.
  3. Something about the if a = b[].first ... logic does not properly update line number before it runs, so we still have the old line number.

That will need to be fixed as a separate issue.

By moving the lookup['bundler'] to a separate line, I get a correct trace.

@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

Yeah this seems like some obscure IR bug. I am having trouble locating it, because whatever logging code I add causes the error to move around.

@enebo Maybe you have an idea here? It seems like a variable is not getting assigned the right value, but I can't sort out why.

@headius headius added this to the JRuby 9.2.0.0 milestone Apr 13, 2018

@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

@jsvd For what it's worth, this combination of flags probably isn't very useful to you. Is there a reason you used them?

@jsvd

This comment has been minimized.

jsvd commented Apr 13, 2018

@headius we're seeing this in the logstash plugin manager where we optimize for faster startup vs performance because it performs short tasks like install/list/remove plugins. The list of flags (which is more than those just two) was compiled from sources like https://github.com/jruby/jruby/wiki/Improving-startup-time

@enebo

This comment has been minimized.

Member

enebo commented Apr 13, 2018

@headius yeah the line number bug is probably a little juggling in IR builder. It could even be improper line number in ifnode in AST but as you said this is not that issue.

next handling I guess I need to run it as well to get more insight. Nothing pops out. It is interesting that JIT works and full interp fails. The thing is in a simple loop a next gets made into just a jump instr....but why would it get triggered if it is not supposed to be? Another wonderment...does this work in startup interpreter? That is pre-CFG and so boundaries are done differently but it may be a data point. We are talking about a jump or something weird with until vs loop (newer code uses loop + break vs until).

@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

@enebo I'm not sure the bogus value has anything directly to do with the next or line number stuff. It seems to move around as I change the code; a value I'll confirm isn't nil from a called method will end up nil a line later in the variable. It's very strange.

@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

Example: in the for method in Bundler 1.9.10's spec_set.rb, it makes a call to lookup to get a hash of specs. The following logging shows the type of value it returns before it returns. The final NilClass is the same value within for, literally the next line. It either never gets assigned properly to the local variable, or it never makes it back through the call pipeline. The former seems more likely, since the latter should not be affected significantly by IR mode.

In for:

...
      l = lookup
      p l.class
      var = l['bundler']
      first = var.first
      if spec = first
        specs << spec
      end
...

And in lookup in its entirety:

    def lookup
      @lookup ||= begin
        lookup = Hash.new { |h,k| h[k] = [] }
        specs = @specs.sort_by do |s|
          s.platform.to_s == 'ruby' ? "\0" : s.platform.to_s
        end
        specs.reverse_each do |s|
          lookup[s.name] << s
        end
        lookup
      end
      p "returning from lookup: " + @lookup.class.inspect
      @lookup
    end
@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

Oh, and the output...

"returning from lookup: Hash"
"returning from lookup: Hash"
"returning from lookup: Hash"
"returning from lookup: Hash"
"returning from lookup: Hash"
"returning from lookup: Hash"
"returning from lookup: Hash"
"returning from lookup: Hash"
"returning from lookup: Hash"
"returning from lookup: Hash"
"returning from lookup: Hash"
"returning from lookup: Hash"
NilClass
NoMethodError: undefined method `[]' for nil:NilClass
                    for at /Users/headius/projects/jruby/lib/ruby/gems/shared/gems/bundler-1.9.10/lib/bundler/spec_set.rb:50
@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

Debug IR output for the relevant section of for:

BB [23:LBL_3:84]
	%v_12 = copy(nil)
	line_num(;n: 47)
	l(0:9) = call_0o(%self ;n:lookup, t:VA, cl:false)
	line_num(;n: 48)
	%v_26 = call_0o(l(0:9) ;n:class, t:NO, cl:false)
	%v_27 = call_1o(%self, %v_26 ;n:p, t:FU, cl:false)
	line_num(;n: 49)
	var(0:10) = array_deref(l(0:9), frozen:"bundler" ;n:[], t:FU, cl:false)
	line_num(;n: 50)
	first(0:11) = call_0o(var(0:10) ;n:first, t:NO, cl:false)
	line_num(;n: 51)
	spec(0:8) = copy(first(0:11))
	b_false(LBL_11:102, first(0:11))
@headius

This comment has been minimized.

Member

headius commented Apr 13, 2018

Ok, after pairing with @enebo for a while we're working up a theory.

He bisected the regression to a CFG optimization in ffa0151.

We're still sorting out the details, but my current theory is that the loop eventually ends up with an empty basic block that just dumps off into another block which itself only holds a jump to the loop condition. Because the source block is empty, CFG.optimize crushes the target block into it and kills the target block...without checking that some other block might branch to it. That is the case here, because the next also branches to the condition. Now, with the condition branching block removed, the jump for next goes to the wrong instruction and starts executing after the call to lookup. Whatever the next call is, that's where it starts. That's why adding code before the lookup call often fixes it. It just gets skipped.

Here's my working patch to CFG.optimize that avoids removing a target block if there's more than one place branching to it. This works, but it may be defeating the optimization @enebo intended.

This is a very good find, and we'll have it fixed for 9.1.17 one way or another.

Here's my patch:

diff --git a/core/src/main/java/org/jruby/ir/representations/CFG.java b/core/src/main/java/org/jruby/ir/representations/CFG.java
index 7340837e62..7bb8a9a517 100644
--- a/core/src/main/java/org/jruby/ir/representations/CFG.java
+++ b/core/src/main/java/org/jruby/ir/representations/CFG.java
@@ -476,15 +476,18 @@ public class CFG {
                 addEdge(a, e.getDestination().getData(), e.getType());
             }
 
-            // Delete bb
-            removeBB(b);
-
             // Update rescue map
             if (aR == null && bR != null) {
                 setRescuerBB(a, bR);
             }
 
-            return true;
+            // Delete bb
+            if (inDegree(b) == 1) {
+                removeBB(b);
+                return true;
+            } else {
+                return false; // target only swallowed because someone else needs it
+            }
         } else {
             return false;
         }
diff --git a/rakelib/commands.rake b/rakelib/commands.rake
index 7b6ee3f439..03fe5fa00d 100644
--- a/rakelib/commands.rake
+++ b/rakelib/commands.rake
@@ -21,6 +21,25 @@ def jvm_model
   end
 end
 
+class JRubyLauncher
+  def initialize
+    @classpath = [
+        File.join(BUILD_LIB_DIR, 'junit.jar'),
+        File.join(BUILD_LIB_DIR, 'bsf.jar'),
+        TEST_CLASSES_DIR,
+        File.join(TEST_DIR, 'jruby', 'requireTest.jar'),
+        TEST_DIR
+    ]
+    @java_options = {}
+  end
+
+  def run(options, &code)
+    @java_options.merge(options)
+    instance_eval(&code) if code
+
+  end
+end
+
 def initialize_paths
   self.class.const_set(:JVM_MODEL, jvm_model)
 

enebo added a commit that referenced this issue Apr 14, 2018

Fix #5138. jruby > 9.1.13.0 crashing on bundler when certain jvm flag…
…s are

enabled.

Better title would be CFG merges and deletes a BB other BBs still think they
are pointing to.

This seems reasonable to me at the moment and I want to land so if someone is
enterprising they can try and find more problems.  I will have people look at
this.

@enebo enebo modified the milestones: JRuby 9.2.0.0, JRuby 9.1.17.0 Apr 14, 2018

@enebo

This comment has been minimized.

Member

enebo commented Apr 14, 2018

Ok a little dinner and my code just works! @subbuss can you review this change?

The base problem was that there was an optimization which would merge an empty BB that it falls through to into a second BB. The issue was that our mergeBB merges the second BB into the first BB. The second BB has incoming edges from jumps or branches but now the label of those are pointing to a non-existent BB (merge deletes it). Also CFG has outgoing edges point to this old deleted BB. A bad time for all. The actual execution was even weirder since it would pick an ipc for the jump based on where the old Label's BB had been. So we would just randomly jump into an instr without executing
other critical instrs right before it.

Solution (A -fallthrough-> B with mergeBBs(A, B) will merge B into A):

  1. all incoming edges to B get redone as edges to A
  2. jump/branch instr targetLabel is set to A

So an obvious question to all this is why don't we just merge A into B. In how we use mergeBBs today that might be a simpler solution but this code is more robust since it can handle the relocation. This means when we find other mergeable conditions it will handle fixing up the labels for us.

@jsvd can you try and kick the tires A LOT with your configuration. This issue made me realize that our Full Interpreter only coverage is lacking.

PS- Another mystery @headius and I wondered about for a little bit was why JIT worked. The short answer is that it calculates where to jump and does not rely on ipc offsets. It is still a little unclear how a jump to a nonexistant label works but perhaps the CFG is still maintaining enough info after the merge?

@enebo enebo closed this Apr 14, 2018

@subbuss

This comment has been minimized.

Contributor

subbuss commented Apr 14, 2018

I was trying to figure out how I botched the mergeBBs so badly. Looking at git history, I realized I named it pretty badly and didn't document it properly either. It was only being used to merge straight line basic blocks where this scenario wouldn't have happened. I should have named it mergeStraightLineBBs since it was being used in collapseStraightLineBBs. In ffa0151, you used it without realizing that mergeBBs isn't set up to handle general BB merges.

Your fix is correct. It generalizes merging for non straight line BBs by fixing up incoming edges of 'b', not just outgoing edges of 'b', as well as the jumps that jump along those incoming edges. So, mergeBBs now lives up to its name.

Nice sleuthing and fixing!

@enebo

This comment has been minimized.

Member

enebo commented Apr 14, 2018

@subbuss thanks for checking. Yeah, I really should have realized we did nothing with incoming edge to 'b' in that method. I even had another regression earlier and missed that fact the second time I looked at that code.

@headius

This comment has been minimized.

Member

headius commented Apr 16, 2018

@enebo In the JIT I use the location in a hash of bytecode labels, so I presume that the target label still exists at the right logical location in the instructions. If it didn't, I'd get to the end of the code with a jump to a label that was never again encountered, and ASM would refuse to emit the bytecode.

@jsvd Note that this fix explains all the odd behavior we saw: line number was off (because it skipped over a line number instruction), adding code changed behavior (because it shifted instructions forward beyond the jump location), and of course the original issue where it appeared to be getting nil when it should not. A very good discovery...thank you for helping us reduce it!

@jsvd

This comment has been minimized.

jsvd commented Apr 16, 2018

Many thanks @headius @enebo @subbuss for the quick response, and great job tracking the bug down 💪

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment