Code behavior changes after being JITted at runtime #4425

Closed
mvdamme opened this Issue Jan 7, 2017 · 8 comments

Projects

None yet

3 participants

@mvdamme
mvdamme commented Jan 7, 2017

Hi, I'm having a strange issue while upgrading an old Rails application to JRuby 9.1.6.0 (latest released version, installed using rvm).

I have a long-running excel-export that used to work without issues under 1.7 (it uses apache POI to write an Excel file).
Using 9.1.6.0, part of the code seems to be JIT-compiled after writing 52 rows to the excel file (at least, that's what I suspect). As of row 53, the code behavior changes causing some columns not to be exported any longer. If I run the export a second time, the problem occurs from the start (i.e. as of row 1).

When I disable the JIT compiler (using JRUBY_OPTS= -J-Djruby.compile.mode=OFF in .ruby-env) the problem disappears entirely, and the export works correctly for all rows and it keeps working when running it again.

Environment

  • output of jruby -v: jruby 9.1.6.0 (2.3.1) 2016-11-09 0150a76 OpenJDK 64-Bit Server VM 24.121-b00 on 1.7.0_121-b00 [linux-x86_64]
  • JRUBY_OPTS not set (at least when the problem occurs, setting it to -J-Djruby.compile.mode=OFF makes the problem disappear)
  • output of uname -a: Linux xps 3.13.0-35-generic #62~precise1-Ubuntu SMP Mon Aug 18 14:52:04 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

The application is a Rails app (being upgraded to Rails 4.2, the problem occurred under 4.2.7.1).

Problem code

To give an idea of what the code is doing, here's a shortened version (not runnable):

def self.export_payments_general(from_date, thru_date, to_export, mode = :excel, filename = nil, &block)
  workbook  = Excel::ExcelWorkbook.new
  worksheet = workbook['book1']

  # BCCA fields (name and max length)
  fields = [
    [ 'Taal', 2 ],
    [ 'Prive persoon', nil ],
    [ 'Naam', 70 ],
    [ 'Vervolg naam', 80 ],
    [ 'Straat', 40 ],
    [ 'Vervolg straat', 50 ],
    [ 'Huisnr.', 7 ],
    [ 'Busnr.', 6 ],
    [ 'Postnr.', 10 ],
    # many other fields...
  ]

  row = 0
  col = 0
  # use block to auto increment col value
  set_cell = ->(value, bold = false, max_length = nil) do
    val = value
    val = (val.to_s)[0..max_length] if max_length
    worksheet[row][col].value = val
    make_bold(worksheet[row][col]) if bold
    col += 1
  end

  ## Create header
  fields.each do |field|
    set_cell[ field.first, true ]
  end

  ## create payment list
  if to_export == :payments
    applications = ...  # code to extract correct list of applications from the database
  end

  # loop over data
  total = applications.size
  I18n.with_locale( :en ) do
    row += 1
    applications.each_with_index do |app, index|
      app = app.specialize
      person = app.person
      col = 0

      inv_data = app.invoice_data
      set_cell[ person.language.locale[0].upcase ]
      set_cell[ bool_to_string( app.invoice_to == 'person' ) ]
      set_cell[ inv_data[:name], false, fields[col].last ]
      col += 1  # name_ext
      set_cell[ inv_data[:street], false, fields[col].last ]
      col += 1  # street_ext
      set_cell[ inv_data[:number], false, fields[col].last ]
      set_cell[ inv_data[:bus], false, fields[col].last ]
      set_cell[ inv_data[:zipcode], false, fields[col].last ]
      # ... many other calls to set_cell block

      row += 1
      
      # report progress
      block.call(index+1, total) if block_given?
    end
  end

  # code to write xls file (omitted)
end

Explanation:

set_cell is a lambda that writes a value (possibly shortened to respect constraints) to the cell with coordinates [row, col] (both row and col are created before set_cell and are hence available), after which it increments col. A successive call to set_cell then writes its value to the next column.
When looping over the data, for each row col is initialized to zero, and then incremented inside set_cell. Some columns have to remain empty, so then we just increment col ourselves.

This works only for a limited number of rows when the JIT compiler is enabled. What seems to happen after JIT compilation is the following:

  • When starting the export of a new row, col is initialized to zero
  • The set_cell lambda works, correctly incrementing col (which also starts at zero), but the link between col inside the lambda and col inside the loop (which should be the same variable) seems to be lost. Successive calls to set_cell still increment their own version of col, but this doesn't affect the value of col outside the lambda (i.e. the value of col as seen in the loop doesn't change).
  • When executing col += 1 inside the main loop for the first time, col takes the value of 1 (since this version of col wasn't changed by the lamba, although it should have been). For some reason, this restores the link between both col versions, and as of that moment incrementing col inside the lambda (by calls to set_cell is correctly reflected in col as available in the loop).
  • For the export, this implies that some columns are overwritten.
  • During successive calls to set_cell, the link between both versions of col is lost and then recovered again.

So far, I haven't been able to reproduce this issue outside the context of this application.

@mvdamme
mvdamme commented Jan 8, 2017

I have tried with some other JRuby releases, and the problem seems to have been introduced in version 9.1.0.0.
The application works normally under 9.0.5.0, but as of 9.1.0.0 the problem appears (when JIT is enabled). I've tested versions 9.1.0.0, 9.1.2.0 and 9.1.6.0, and they all exhibit the problem.

@enebo enebo added this to the JRuby 9.1.7.0 milestone Jan 9, 2017
@enebo enebo added the jit label Jan 9, 2017
@headius
Member
headius commented Jan 9, 2017

You could help confirm it's the JIT by running with -Xjit.threshold=0. This will attempt to jit everything immediately and turn off background compilation.

Your guess that it happens after the 52nd row is not far off; the JIT normally fires at 50 invocations, but it does so on a background thread. So it's reasonable to assume that it wasn't until the 53rd row that the jitted code is actually installed.

@enebo
Member
enebo commented Jan 9, 2017

Condensed version:

BIG_ARRAY = (0...1).to_a

def foo
  col = 0
  lamb = ->(a, b) do
    col += 1
    p col
  end

  BIG_ARRAY.each_with_index do |a, i|
    col = 0
    lamb[a, i]
    col += 1
    puts "HERE"
    lamb[a, i]
  end
  
  
end


foo

This does not occur after 50 times because we will compile the entire main script file by default. col reads look like they read from inner col and writes go to outer col (whereas lambda access to col I think is fine). Pretty weird. IR Dump time.

@enebo
Member
enebo commented Jan 9, 2017

ok OptDynScopePass is busted:

Before that pass:

Graph:
BB [1:CL2_LBL_7:-1]:>[2,6]
BB [2:CL2_LBL_8:-1]:>[3], <[1]
BB [3:_CLOSURE_START_0:8]:>[4,6], <[2]
BB [4:_GLOBAL_ENSURE_BLOCK__0:23]:>[6], <[3]
BB [6:CL2_LBL_9:-1]:<[1,3,4]

2017-01-09T15:31:32.275-06:00 [main] INFO BasicCompilerPassListener : 
Instructions[work/snippets/err12.rb#9#,IRClosure,DeadCodeElimination]:
BB [1:CL2_LBL_7:-1]
BB [2:CL2_LBL_8:-1]
	%self = recv_self()
	a(0:0:local=true) = recv_pre_reqd_arg(;i:0)
	i(0:1:local=true) = recv_pre_reqd_arg(;i:1)
BB [3:_CLOSURE_START_0:8]
	line_num(;n: 10)
	col(1:0) = copy(Fixnum:0)
	line_num(;n: 11)
	%cl_2_3 = call_2o(lamb(1:1), a(0:0:local=true), i(0:1:local=true) ;n:[], t:NO, cl:false)
	line_num(;n: 12)
	%cl_2_4 = call_1f(Fixnum:0, Fixnum:1 ;n:+, t:NO, cl:false)
	col(1:0) = copy(%cl_2_4)
	line_num(;n: 13)
	%cl_2_5 = copy(strdup(frozen:"HERE"))
	%cl_2_6 = call_1o(%self, %cl_2_5 ;n:puts, t:FU, cl:false)
	line_num(;n: 14)
	%cl_2_7 = call_2o(lamb(1:1), a(0:0:local=true), i(0:1:local=true) ;n:[], t:NO, cl:false)
	return(%cl_2_7)
BB [4:_GLOBAL_ENSURE_BLOCK__0:23]
	%cl_2_8 = recv_jruby_exc()
	%cl_2_9 = runtime_helper(%cl_2_8 ;method: HANDLE_BREAK_AND_RETURNS_IN_LAMBDA)
	return_or_rethrow_saved_exc(%cl_2_9)
BB [6:CL2_LBL_9:-1]

After:

Graph:
BB [1:CL2_LBL_7:-1]:>[2,6]
BB [2:CL2_LBL_8:-1]:>[3], <[1]
BB [3:_CLOSURE_START_0:8]:>[4,6], <[2]
BB [4:_GLOBAL_ENSURE_BLOCK__0:23]:>[6], <[3]
BB [6:CL2_LBL_9:-1]:<[1,3,4]

2017-01-09T15:31:32.275-06:00 [main] INFO BasicCompilerPassListener : 
Instructions[work/snippets/err12.rb#9#,IRClosure,OptimizeDynScopesPass]:
BB [1:CL2_LBL_7:-1]
BB [2:CL2_LBL_8:-1]
	%self = recv_self()
	%t_a_10 = recv_pre_reqd_arg(;i:0)
	%t_i_11 = recv_pre_reqd_arg(;i:1)
BB [3:_CLOSURE_START_0:8]
	line_num(;n: 10)
	col(0:0) = copy(Fixnum:0)
	line_num(;n: 11)
	%cl_2_3 = call_2o(lamb(0:1), %t_a_10, %t_i_11 ;n:[], t:NO, cl:false)
	line_num(;n: 12)
	%cl_2_4 = call_1f(Fixnum:0, Fixnum:1 ;n:+, t:NO, cl:false)
	col(0:0) = copy(%cl_2_4)
	line_num(;n: 13)
	%cl_2_5 = copy(strdup(frozen:"HERE"))
	%cl_2_6 = call_1o(%self, %cl_2_5 ;n:puts, t:FU, cl:false)
	line_num(;n: 14)
	%cl_2_7 = call_2o(lamb(0:1), %t_a_10, %t_i_11 ;n:[], t:NO, cl:false)
	return(%cl_2_7)
BB [4:_GLOBAL_ENSURE_BLOCK__0:23]
	%cl_2_8 = recv_jruby_exc()
	%cl_2_9 = runtime_helper(%cl_2_8 ;method: HANDLE_BREAK_AND_RETURNS_IN_LAMBDA)
	return_or_rethrow_saved_exc(%cl_2_9)
BB [6:CL2_LBL_9:-1]

So you can see the variables a and i end up become temps and col reduce its depth by one. It eliminated the scope. However, for this to be broken we must still be pushing a scope...

@enebo
Member
enebo commented Jan 9, 2017

Ok my debugging output is lying to me. I still have not quite figured this out but we are seeing full build output and if you look at this line in both:

	%cl_2_4 = call_1f(Fixnum:0, Fixnum:1 ;n:+, t:NO, cl:false)
	col(0:0) = copy(%cl_2_4)

You can see col(0:0) = 0 is propagating the 0 to this call and it definitely cannot do that because col might be modified by the lambda :)

@enebo enebo added a commit that closed this issue Jan 9, 2017
@enebo enebo Fixes #4425. Code behavior changes after being JITted at runtime
Revert "Address FIXMEs in LocalOptimizationPass"

This reverts commit d41fac5.

lvars were getting constant propagated via valueMap...
e136ac1
@enebo enebo closed this in e136ac1 Jan 9, 2017
@enebo enebo added the ir label Jan 9, 2017
@enebo
Member
enebo commented Jan 9, 2017

@subbuss I reverted your commit from June!!! Amazing we have not seen this reported before. Basic issue I think is that dataflow barrier was not considering localvariables properly since they can be changed in any call (we were prop'ing 0 from a captured lvar but a closure was modifying that lvar between the init and where it got propd to). I will leave it to you do reapply a better fix (or talk to me about how you think we should fix this).

@mvdamme
mvdamme commented Jan 10, 2017

I can confirm that it works well now when using master. Thanks for looking into this and fixing it so quickly!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment