JRuby's impl of FFI differs in behavior from MRI; segfaults/crashes #4413

Open
npomf opened this Issue Dec 28, 2016 · 13 comments

Projects

None yet

3 participants

@npomf
npomf commented Dec 28, 2016 edited

Environment

Environment details:

  • Java HotSpot(TM) 64-Bit Server VM (25.51-b03) for windows-amd64 JRE (1.8.0_51-b16), built on Jun 8 2015 18:03:07 by "java_re" with MS VC++ 10.0 (VS2010)
  • Windows 7 , 64 bit Build 7601 (6.1.7601.23572)
  • JRuby 9.1.6.0

Gems in use:

  • FFI 1.9.14-java

Additional source code:

Expected Behavior

  • The syntax of arg || CONSTANT (evaluates to nil || 0x00 # => 0) should not cause a TypeError.
  • When a Memory Pointer with a length which matches the source ASCII-8Bit -encoded string from which it is derived is passed, a NULL should not be received.
  • Execution of code which operates correctly in Ruby 2.3.3 (MRI) should not greatly differ when executed in JRuby 9k.

Actual Behavior

All observations are currently drawn from the source file environment.rb in the Hephaestus project, a wrapper for Vulkan written in Ruby. Insofar it works as expected on the pre-compiled mingw32-x64 build of ruby-ffi, but the -java derivative is encountering strange behavioral differences.

  • At line 245, from || VK_NULL_HANDLE, which by default translates to nil || 0x00 causes a TypeError ("no implicit conversion of FFI::MemoryPointer to Integer").
  • Likewise at line 318, pCode: bin.to_ptr assignment using the Struct constructor mixin to assign from kwargs causes the following, after which JRuby may segfault or simply crash with an OS-level "application not responding" warning:
ParameterValidation(ERROR): object: 0x0 type: 0 location: 190 msgCode: 4: vkCre
ateShaderModule: required parameter pCreateInfo->pCode specified as NULL
SC(ERROR): object: 0x0 type: 0 location: 10277 msgCode: 5: SPIR-V module not va
lid: Invalid SPIR-V magic number.

Inside the diagnostic output (enabled with -v on the jruby-fixes branch), the reference pCode is clearly identified as being an FFI::MemoryPointer with a real address and a length of 1281 bytes (+1 byte from appending \x00, which is reflected from core FFI behaviors).

The .to_ptr contrivance is another mix-in which allows most basic Ruby objects to be rendered into FFI::MemoryPointer instances by translating their contents. In the above case, it is shorthand for FFI::MemoryPointer.from_string( bin ).

Caveats

Redoing the test without the validation layer causes the crash regardless, making me suspect it is only potentially not an issue with Vulkan (as the SDK is now being bypassed) which is causing the crash. However, I do recall there having been a potential crash in MRI before when I had accidentally passed in the raw GLSL files (as opposed to the correct SPIR-V files).

This does not change the fact that identical syntax between the two implementations of Ruby are not behaving in an identical manner, or that a pointer appears to be shedding or losing its content without warning as though it were being released / GC'd while within the scope of its creation.

JVM Dump

I've captured some findings (replicated above), as well as the output from the JVM when the crash is encountered. You can view it in this gist here.

Test File

I've reproduced the test file below, which uses the raw source of the project (as opposed to the gem, to allow more rapid development). It has not been altered from the code I would use for my personal machine, and thus the adapter name and environment variables may not reflect what you would use.

# modify environment
ENV['PATH'] = './bin;' + ENV['PATH']
ENV['VK_LAYER_PATH'] = 'D:/VulkanSDK/1.0.33.0/Bin'

# load and track time
top = Time.now
require_relative '../vulkan'
require_relative '../lib/utils'
include Vulkan, GLFW3
puts "Loaded Vulkan in #{Time.now - top} seconds"

# mark head
$mark = Time.now
$failed = false

# setup
begin
    puts 'Creating environment ...'
    $env = Vulkan::Environment.new(
        'Test',
        adapter: 'GTX 660',
        validation: 'VK_LAYER_LUNARG_standard_validation'
    )

    puts 'Setting up window ...'
    win = Vulkan::Window.new( 1280, 800, $env )
    puts 'Finishing initialization: '

    puts ' - create rendering path ...'
    $env.create_default_rendering_path

    puts ' - create framebuffers ...'
    $env.create_framebuffers

    puts ' - create commandpool ...'
    $env.create_commandpool

    puts ' - create command buffers ...'
    $env.create_commandbuffers

    puts 'Done.'
rescue Exception => err
    root = Dir.pwd + '/'
    warn "#{err.class.name}: #{err}", *err.backtrace.map {|line| "    " + line.gsub( root, '') }
    $failed = true
ensure
    dT = Time.now - $mark
    puts "Initialization attempt #{$failed ? 'FAILED' : 'succeeded'} after #{dT} seconds",
         "Environment details:",
         "--------------------",
         $env.inspect
end
@headius headius added this to the JRuby 9.1.7.0 milestone Jan 4, 2017
@headius
Member
headius commented Jan 4, 2017

Looking into this a bit now, hopefully for 9.1.7.0.

@headius
Member
headius commented Jan 5, 2017

Ok, I'm going to need a little more help with this. I'd like to be able to reproduce, but getting this set up with appropriate paths to Vulkan is confusing. I'm also on Linux...so it may not even be possible to get it running before we need to push a 9.1.7.0 release.

Two things could help me here:

  • If you can sort out individual calls that are failing and simulate them with a simple C file and Ruby script, I can reproduce anywhere. I realize that's a bit of work.
  • Alternatively, if you can create a temporary repo that has all paths set up with a reproduction script, I can give it a shot. If Windows is the only option for you, that's fine too...it just means I'll have to take some extra time to set up a Windows VM.

The conversion error should be simple enough. Presumably the CRuby FFI just takes the Integer value and assumes you're giving it a valid pointer, while the JRuby version demands an actual MemoryPointer. Problem is, I can't see who would be raising it and I don't have a backtrace. Is it just trying to construct a struct?

If you can get backtraces for the non-crashers, with and without -Xbacktrace.style=full I might be able to fix it without even running it.

The other crasher is likely similar; some value is not coercing properly, so we end up with a null pointer that doesn't manifest until Vulkan eventually tries to access that field of the struct.

@npomf
npomf commented Jan 5, 2017

Error site details

As far as I can tell, the following is where our error happens:

# Create a shader module from a SPIR-V file.
def load_shader( path )
    bin = open( path, 'rb'){|io| io.read }
    smci = VkShaderModuleCreateInfo.new(
        sType: :VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
        codeSize: bin.length,
        pCode: bin.to_ptr   # FIXME jruby validation sees NULL; MemoryPointer exposed in diag
    )

    # create it
    shader = FFI::MemoryPointer.new( :pointer )
    ok = vkCreateShaderModule( @logical_device, smci, nil, shader )  # KABOOM
    raise LoadError, "Failed to load shader: #{ok}" unless ok.eql? :VK_SUCCESS
    VkShaderModule.new( shader.read_pointer )
end

The line with the FIXME comment will correctly show diagnostic data in debug mode of the expected pointer, which is only printed after assignment, so it's definitely not the FFI::Struct causing the fault.

Diag output
This output appears when the shader module is created in debug mode.

DIAG: #<Vulkan::VkShaderModuleCreateInfo:0x7cb502c>: sType (:sType) -> :VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO
DIAG: #<Vulkan::VkShaderModuleCreateInfo:0x7cb502c>: codeSize (:codeSize) -> 1280
DIAG: #<Vulkan::VkShaderModuleCreateInfo:0x7cb502c>: pCode (:pCode) -> #<FFI::MemoryPointer address=0x2d0000 size=1281>

From it we can tell that a pointer of 1,281 bytes in length is created, which matches the same length reported by MRI. So that's good.

The second comment, "KABOOM," is the site of the actual segmentation fault. For some reason the struct gets passed in with pCreateInfo->pCode = NULL, which is causing Vulkan to lose its mind, and is the error printed in the validation layer's diagnostics before failing.

LunarG validation output

ParameterValidation(ERROR): object: 0x0 type: 0 location: 190 msgCode: 4: vkCreateShaderModule: required parameter pCrea
teInfo->pCode specified as NULL
SC(ERROR): object: 0x0 type: 0 location: 10277 msgCode: 5: SPIR-V module not valid: Invalid SPIR-V magic number.

This, tied with the reference to nvoglv64.dll+0xedae47 as the problematic frame in question is making me think that the segfault is potentially part of Vulkan. The NULL content of the pointer, however, is definitely JRuby.

Predefined test script

Added the branch jruby-hotfix which includes a version of the file that will configure itself (hopefully correctly) for both standard Windows installs and Linux, presuming the instructions for LunarG's SDK on Linux are correct and they appear in the /home of a user. You might have to substitute version numbers, however.

If you can help me with producing the backtraces, I'll be happy to provide all details possible! JRuby prints the following (where "test/make_window.rb" is the original to the "preconf_test.rb" file):

D:\git\hephaestus>jruby -v -d -Xbacktrace.style=full -Xrunhprof test/make_window.rb
jruby: warning: unknown property jruby.runhprof
@npomf
npomf commented Jan 5, 2017

Just did a quick diagnostic test. This is what's printed out after our VkShaderModuleCreateInfo struct is created:

DIAG: #<Vulkan::VkShaderModuleCreateInfo:0x275bf9b3>: pCode (:pCode) -> #<FFI::MemoryPointer ad
dress=0x211ab0 size=1281>
INFO: {:sType=>:VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO, :pNext=>#<FFI::Pointer address=0x0
>, :flags=>0, :codeSize=>1280, :pCode=>#<FFI::Pointer address=0x211ab0>}

For some reason, dumping its values is reflecting an FFI::Pointer instead of the FFI::MemoryPointer passed in, but which has the same address?

Could that be what's causing our problem, or am I looking at the wrong thing?

@headius
Member
headius commented Jan 6, 2017

Your command line would need -J in front of -Xrunhprof if you want that, but I'm not sure what you are using that flag for (it's a JVM-level profiler). The backtrace flag looks correct.

The pointer inside the struct probably is an FFI::Pointer. FFI::MemoryPointer#to_ptr returns this. But the handling of the struct when passed to native is not consuming FFI::Pointer properly. What does it do if you yank the actual pointer integer out and pass that for pCode?

@npomf
npomf commented Jan 6, 2017 edited

Seems the segfault is preventing the JVM from properly dumping the backtraces; the file contains only the usual header and the .TMP of the output is empty. It's definitely working otherwise, though, as there is a noticeable performance difference as the profiler runs.

Using the corrected command line, when I swap the pointer to be a raw integer, I get the following. It fails to coerce the result of .to_i() on the pointer, and prevents the struct from being created:

DIAG: #<Vulkan::VkShaderModuleCreateInfo:0x2049a9c1>: codeSize (:codeSize) -> 1280
TypeError: no implicit conversion of FFI::MemoryPointer into Integer
    java/lang/Thread.java:1552:in `getStackTrace'
    org/jruby/runtime/backtrace/TraceType.java:246:in `getBacktraceData'
    org/jruby/runtime/backtrace/TraceType.java:47:in `getBacktrace'
    org/jruby/RubyException.java:236:in `prepareBacktrace'
    org/jruby/exceptions/RaiseException.java:216:in `preRaise'
    org/jruby/exceptions/RaiseException.java:183:in `preRaise'
    org/jruby/exceptions/RaiseException.java:111:in `<init>'
    org/jruby/Ruby.java:4182:in `newRaiseException'
    org/jruby/Ruby.java:3853:in `newTypeError'
    org/jruby/util/TypeConverter.java:400:in `handleImplicitlyUncoercibleObject'
    org/jruby/util/TypeConverter.java:396:in `handleUncoercibleObject'
    org/jruby/util/TypeConverter.java:99:in `convertToType19'
    org/jruby/util/TypeConverter.java:69:in `convertToType'
    org/jruby/RubyBasicObject.java:744:in `convertToInteger'
    org/jruby/RubyNumeric.java:219:in `other2long'
    org/jruby/RubyNumeric.java:207:in `num2long'
    org/jruby/ext/ffi/Util.java:100:in `longValue'
    org/jruby/ext/ffi/Util.java:75:in `int64Value'
    org/jruby/ext/ffi/StructLayout.java:1139:in `put'
    org/jruby/ext/ffi/StructLayout.java:495:in `put'
    org/jruby/ext/ffi/Struct.java:293:in `setFieldValue'
    org/jruby/ext/ffi/Struct$INVOKER$i$2$0$setFieldValue.gen:-1:in `call'
    org/jruby/internal/runtime/methods/JavaMethod.java:740:in `call'
    org/jruby/ir/targets/InvokeSite.java:152:in `fail'
    lib/structs.rb:49:in `block in new'
    org/jruby/runtime/CompiledIRBlockBody.java:156:in `yieldDirect'
    org/jruby/runtime/MixedModeIRBlockBody.java:122:in `yieldDirect'
    org/jruby/runtime/BlockBody.java:108:in `yield'
    org/jruby/runtime/Block.java:167:in `yield'
    org/jruby/RubyHash.java:1363:in `visit'
    org/jruby/RubyHash.java:1360:in `visit'
    org/jruby/RubyHash.java:669:in `visitLimited'
    org/jruby/RubyHash.java:654:in `visitAll'
    org/jruby/RubyHash.java:1320:in `iteratorVisitAll'
    org/jruby/RubyHash.java:1355:in `each_pairCommon'
    org/jruby/RubyHash.java:1344:in `each'
    org/jruby/RubyHash$INVOKER$i$0$0$each.gen:-1:in `call'
    org/jruby/internal/runtime/methods/JavaMethod.java:497:in `call'
    org/jruby/runtime/callsite/CachingCallSite.java:77:in `callBlock'
    org/jruby/runtime/callsite/CachingCallSite.java:83:in `call'
    org/jruby/ir/instructions/CallBase.java:428:in `interpret'
    org/jruby/ir/interpreter/InterpreterEngine.java:356:in `processCall'
    org/jruby/ir/interpreter/StartupInterpreterEngine.java:73:in `interpret'
    org/jruby/ir/interpreter/InterpreterEngine.java:84:in `interpret'
    org/jruby/internal/runtime/methods/MixedModeIRMethod.java:179:in `INTERPRET_METHOD'
    lib/structs.rb:48:in `new'
    org/jruby/internal/runtime/methods/MixedModeIRMethod.java:165:in `call'
    org/jruby/internal/runtime/methods/DynamicMethod.java:200:in `call'
    org/jruby/runtime/callsite/CachingCallSite.java:338:in `cacheAndCall'
    org/jruby/runtime/callsite/CachingCallSite.java:163:in `call'
    org/jruby/ir/interpreter/InterpreterEngine.java:315:in `processCall'
    org/jruby/ir/interpreter/StartupInterpreterEngine.java:73:in `interpret'
    org/jruby/ir/interpreter/InterpreterEngine.java:84:in `interpret'
    org/jruby/internal/runtime/methods/MixedModeIRMethod.java:179:in `INTERPRET_METHOD'
    lib/util/environment.rb:315:in `load_shader'
    org/jruby/internal/runtime/methods/MixedModeIRMethod.java:165:in `call'
    org/jruby/internal/runtime/methods/DynamicMethod.java:200:in `call'
    org/jruby/runtime/callsite/CachingCallSite.java:338:in `cacheAndCall'
    org/jruby/runtime/callsite/CachingCallSite.java:163:in `call'
    org/jruby/ir/interpreter/InterpreterEngine.java:315:in `processCall'
    org/jruby/ir/interpreter/StartupInterpreterEngine.java:73:in `interpret'
    org/jruby/internal/runtime/methods/MixedModeIRMethod.java:109:in `INTERPRET_METHOD'
    lib/util/environment.rb:591:in `create_rendering_path'
    org/jruby/internal/runtime/methods/MixedModeIRMethod.java:95:in `call'
    org/jruby/internal/runtime/methods/AliasMethod.java:101:in `call'
    org/jruby/ir/targets/InvokeSite.java:140:in `invoke'
    test/make_window.rb:30:in `<main>'
    java/lang/invoke/MethodHandle.java:625:in `invokeWithArguments'
    org/jruby/ir/Compiler.java:111:in `load'
    org/jruby/Ruby.java:846:in `runScript'
    org/jruby/Ruby.java:761:in `runNormally'
    org/jruby/Ruby.java:779:in `runNormally'
    org/jruby/Ruby.java:592:in `runFromMain'
    org/jruby/Main.java:425:in `doRunFromMain'
    org/jruby/Main.java:313:in `internalRun'
    org/jruby/Main.java:242:in `run'
    org/jruby/Main.java:204:in `main'

That's the same issue I had first encountered where the constant 0x00 would raise the error.

The profiler appears to have output 119MB of data. The files are here:

@npomf
npomf commented Jan 6, 2017

Re-ran without the profiler to validate that the integer value is correct. It appears to be:

INFO: creating VkShaderModuleCreateInfo with: #<FFI::MemoryPointer address=0x390000 size=1281> 
at 3735552
@headius
Member
headius commented Jan 6, 2017

I'm confused...that memory address doesn't look valid, and it seems like it's still trying to create the struct using the MemoryPointer rather than the Integer value from your logging output. Can you show me the updated code for clarification?

@npomf
npomf commented Jan 6, 2017 edited

Original code:

# Create a shader module from a SPIR-V file.
def load_shader( path )
    bin = open( path, 'rb'){|io| io.read }
    smci = VkShaderModuleCreateInfo.new(
        sType: :VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
        codeSize: bin.length,
        pCode: bin.to_ptr  # equivalent to FFI::MemoryPointer.from_string( bin )
    )

    # create it
    shader = FFI::MemoryPointer.new( :pointer )
    ok = vkCreateShaderModule( @logical_device, smci, nil, shader )
    raise LoadError, "Failed to load shader: #{ok}" unless ok.eql? :VK_SUCCESS
    VkShaderModule.new( shader.read_pointer )
end

Updated code to use the integer value:

# Create a shader module from a SPIR-V file.
def load_shader( path )
    bin = open( path, 'rb'){|io| io.read }
    ptr = bin.to_ptr  # equivalent to FFI::MemoryPointer.from_string( bin )
    loc = ptr.to_i    # pull integer address 
    puts "INFO: creating VkShaderModuleCreateInfo with: #{ptr} at #{loc}"
    smci = VkShaderModuleCreateInfo.new(
        sType: :VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
        codeSize: bin.length,
        pCode: loc  # pass in the integer address instead of the pointer; causes TypeError
    )
    puts "INFO: #{smci.to_h}"

    # create it
    shader = FFI::MemoryPointer.new( :pointer )
    ok = vkCreateShaderModule( @logical_device, smci, nil, shader )
    raise LoadError, "Failed to load shader: #{ok}" unless ok.eql? :VK_SUCCESS
    VkShaderModule.new( shader.read_pointer )
end

I'm still really unsure of why it's saying conversion of FFI::MemoryPointer into Integer when the opposite order is what's being passed in. Regardless, the TypeError from the previous comment occurs during the assignment of the pCode member of the struct. We aren't reaching the invocation of vkCreateShaderModule() anymore.

The long format of the code without my mixins would be:

def load_shader( path )
    bin = open( path, 'rb'){|io| io.read }
    ptr = FFI::MemoryPointer.from_string( bin )
    loc = ptr.to_i    # pull integer address 
    puts "INFO: creating VkShaderModuleCreateInfo with: #{ptr} at #{loc}"
    smci = VkShaderModuleCreateInfo.new()
    smci[:sType] = :VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO
    smci[:codeSize] = bin.length
    smci[:pCode] = loc # <- TypeError is raised here
    puts "INFO: #{smci.to_h}"

    # create it
    shader = FFI::MemoryPointer.new( :pointer )
    ok = vkCreateShaderModule( @logical_device, smci, nil, shader )
    raise LoadError, "Failed to load shader: #{ok}" unless ok.eql? :VK_SUCCESS
    VkShaderModule.new( shader.read_pointer )
end
@headius
Member
headius commented Jan 7, 2017

Ok, so the TypeError with the raw Integer is probably because the struct says it wants a :pointer. If you change that to something suitable for a numeric pointer on your system (long or ulong I'd guess) the modified code would work.

But it still begs the question why :pointer is having trouble with an FFI::MemoryPointer. Hmm.

@headius
Member
headius commented Jan 7, 2017

Am I right in thinking that all errors so far appear to be problems getting the pointer to propagate into native code properly? Values not coercing, null pointers getting dereferenced, etc?

@npomf
npomf commented Jan 7, 2017

I ... believe so? Insofar the show-stopper has been that weird null out of nowhere, so I would assume it would be an issue of translating the pointer to native code, as introspection does correctly show what I would expect once it's loaded into the FFI Struct.

Once it moves to native code and is consumed, however, we get the issue. So yes, that could well be it.

@npomf
npomf commented Jan 7, 2017 edited

Interesting. I used .to_java( :long ) on the address and got a different error:

INFO: creating VkShaderModuleCreateInfo with: #<FFI::MemoryPointer address=0x4519d0 size=1281> 
at 4528592
ArgumentError: Invalid pointer value
    org/jruby/ext/ffi/Struct.java:293:in `[]='
    lib/structs.rb:49:in `block in new'
    org/jruby/RubyHash.java:1344:in `each'
    lib/structs.rb:48:in `new'
    lib/util/environment.rb:318:in `load_shader'
    lib/util/environment.rb:594:in `create_rendering_path'
    test/make_window.rb:30:in `<main>'

So it definitely knows how to handle that, possibly maybe.

If I instead change the struct's layout as follows, the Ruby integer address (no java conversions this time) takes just fine but we crash Java and get the same SPIR-V magic number failure output from the validation layer.

class Vulkan::VkShaderModuleCreateInfo < FFI::Struct
    layout sType:           :VkStructureType,
           pNext:           :pointer,
           flags:           :VkShaderModuleCreateFlags,
           codeSize:        :size_t,
           pCode:           :ulong
end

And our old friend:
java-stopped

@enebo enebo modified the milestone: JRuby 9.1.8.0, JRuby 9.1.7.0 Jan 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment