Marshal.load encoding error #456

Closed
fidothe opened this Issue Dec 19, 2012 · 13 comments

Projects

None yet

3 participants

@fidothe
fidothe commented Dec 19, 2012

When trying using the tactful_tokenizer gem, you get this error.

matt$ bundle exec irb
irb(main):001:0> require 'tactful_tokenizer'
=> true
irb(main):002:0> m = TactfulTokenizer::Model.new; nil
ArgumentError: invalid encoding in marshaling stream: f
    from org/jruby/RubyMarshal.java:148:in `load'
    from /Users/matt/.rbenv/versions/jruby-1.7.1/lib/ruby/gems/shared/gems/tactful_tokenizer-0.0.2/lib/tactful_tokenizer.rb:55:in `initialize'
    from org/jruby/RubyIO.java:1183:in `open'
    from /Users/matt/.rbenv/versions/jruby-1.7.1/lib/ruby/gems/shared/gems/tactful_tokenizer-0.0.2/lib/tactful_tokenizer.rb:54:in `initialize'
    from org/jruby/RubyArray.java:2360:in `map'
    from /Users/matt/.rbenv/versions/jruby-1.7.1/lib/ruby/gems/shared/gems/tactful_tokenizer-0.0.2/lib/tactful_tokenizer.rb:53:in `initialize'
    from (irb):52:in `evaluate'
    from org/jruby/RubyKernel.java:1066:in `eval'
    from org/jruby/RubyKernel.java:1392:in `loop'
    from org/jruby/RubyKernel.java:1174:in `catch'
    from org/jruby/RubyKernel.java:1174:in `catch'
    from /Users/matt/.rbenv/versions/jruby-1.7.1/bin/irb:13:in `(root)'

I've confirmed that it works fine in MRI 1.9.3-p327.

tactful_tokenizer: https://github.com/SlyShy/Tactful_Tokenizer (though it's unclear exactly which version was shipped as 0.0.2, so you're probably better off with the gem sources themselves)

@shepmaster
Contributor

There must be a discrepancy between the 1.8 and 1.9 mode Marshal:

$ JRUBY_OPTS=--1.9 irb 
> File.open('features.mar') {|f| Marshal.load(f) }
ArgumentError: invalid encoding in marshaling stream: f
$ JRUBY_OPTS=--1.8 irb 
> File.open('features.mar') {|f| Marshal.load(f) }
# Lots of data...

If you can run your code as 1.8 mode, that might be a quick workaround.

@shepmaster
Contributor

Opening the marshaled data in MRI 1.9.3p327 and marshaling it again produces a file that can be read by JRuby 1.7.0 in 1.9 mode. This suggests the problem is with the 1.8 MRI Marshal format in JRuby 1.9 mode.

I thought that Marshal data wasn't supported across different versions?

@shepmaster
Contributor

This isn't quite right:

I thought that Marshal data wasn't supported across different versions?

From the docs:

Marshaled data has major and minor version numbers stored along with the object information. In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number. [...] Marshal versioning is independent of Ruby’s version numbers.

@fidothe
fidothe commented Dec 19, 2012

And in this case, examining the Marshal major/minor version numbers are the same (from examining the first two bytes of features.mar and the first two bytes of Marshal.dump("text"). AFAICT JRuby should be able to read this dump.

@shepmaster
Contributor

The test for encoding when unmarshaling strings seems suspicious to me. It tests for count >= 1, but count has to be >= 1 in order for the loop to execute. Or am I going crazy?

I've changed that locally to be > 1 and am running the tests.

@shepmaster
Contributor

Here's a smaller reproduction:

In Ruby 1.9.1-p431, via irb -E US-ASCII:US-ASCII

x = {"a" => 1.0, "b" => 2.0}
File.open('191-hash.dmp', 'w') { |f| Marshal.dump(x, f) }

In JRuby 1.7.1 (1.9 mode)

File.open('191-hash.dmp') { |f| Marshal.load(f) }
# => ArgumentError: invalid encoding in marshaling stream: f
@shepmaster
Contributor

I had to go back to Ruby 1.9.1 because 1.9.2 introduced a shortcut for UTF-8 / US-ASCII encoding (E). The original marshal file goes out of its way to say US-ASCII.

@shepmaster
Contributor

You also need to have two items in the hash. I'm not super familiar with the Marshal format, but my guess from looking at the hex is that the second key doesn't specify the encoding to save space, and it's expected to just pick it up from the previous one. Pure speculation on my part though.

@shepmaster
Contributor

The existing tests pass with the change to make it count > 0, but beyond that I'm not sure if this is the right path to take.

@shepmaster
Contributor

OK, my failed tests led me to dig deeper. Here's what I think is happening now:

When MRI writes an encoding that is not US-ASCII / UTF-8, it writes it as a normal String. This means that it can use the LINK type to reference previously written strings. See the end of MRI's marshal.c w_encoding function.

JRuby does not treat the encoding equivalent to a marshaled String, instead it ignores the string type marker " and then reads/writes a set of bytes straight. See MarshalStream writeEncoding and UnmarshalStream defaultVariablesUnmarshal.

This means you can reproduce with a modern MRI, like ruby 1.9.3p327:

x = ['a', 'b'].map {|s| s.force_encoding('Shift_JIS')}
File.open('193-array', 'w') {|f| Marshal.dump(x, f)}

It's important that the two strings not be the same, otherwise they will be marked as total duplicates and will be handled correctly.

@shepmaster
Contributor

Both marshaling and unmarshaling are affected in the same way. This means JRuby cannot read certain MRI data, but MRI should be able to read JRuby data, although it will be bigger than necessary. I did not test this however.

@dekz dekz referenced this issue in vmware/rbvmomi Feb 20, 2013
Closed

Does not work with JRuby #6

@headius
Member
headius commented Jul 24, 2013

I tested this on master and it seems to work ok:

$ irb -E US-ASCII:US-ASCII
irb(main):001:0> x = {"a" => 1.0, "b" => 2.0}
=> {"a"=>1.0, "b"=>2.0}
irb(main):002:0> File.open('191-hash.dmp', 'w') { |f| Marshal.dump(x, f) }
=> #<File:191-hash.dmp (closed)>

So I'm going to optimistically mark this as fixed (unsure of release, so I'll go with 1.7.5).

@headius headius closed this Jul 24, 2013
@shepmaster
Contributor

IIRC, this was fixed by commit 82cda78, which doesn't seem to have any merge notes. Maybe it was cherry-picked? If so, should be fixed back to 1.7.3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment