New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Marshal.load encoding error #456

Closed
fidothe opened this Issue Dec 19, 2012 · 13 comments

Comments

Projects
None yet
3 participants
@fidothe

fidothe commented Dec 19, 2012

When trying using the tactful_tokenizer gem, you get this error.

matt$ bundle exec irb
irb(main):001:0> require 'tactful_tokenizer'
=> true
irb(main):002:0> m = TactfulTokenizer::Model.new; nil
ArgumentError: invalid encoding in marshaling stream: f
    from org/jruby/RubyMarshal.java:148:in `load'
    from /Users/matt/.rbenv/versions/jruby-1.7.1/lib/ruby/gems/shared/gems/tactful_tokenizer-0.0.2/lib/tactful_tokenizer.rb:55:in `initialize'
    from org/jruby/RubyIO.java:1183:in `open'
    from /Users/matt/.rbenv/versions/jruby-1.7.1/lib/ruby/gems/shared/gems/tactful_tokenizer-0.0.2/lib/tactful_tokenizer.rb:54:in `initialize'
    from org/jruby/RubyArray.java:2360:in `map'
    from /Users/matt/.rbenv/versions/jruby-1.7.1/lib/ruby/gems/shared/gems/tactful_tokenizer-0.0.2/lib/tactful_tokenizer.rb:53:in `initialize'
    from (irb):52:in `evaluate'
    from org/jruby/RubyKernel.java:1066:in `eval'
    from org/jruby/RubyKernel.java:1392:in `loop'
    from org/jruby/RubyKernel.java:1174:in `catch'
    from org/jruby/RubyKernel.java:1174:in `catch'
    from /Users/matt/.rbenv/versions/jruby-1.7.1/bin/irb:13:in `(root)'

I've confirmed that it works fine in MRI 1.9.3-p327.

tactful_tokenizer: https://github.com/SlyShy/Tactful_Tokenizer (though it's unclear exactly which version was shipped as 0.0.2, so you're probably better off with the gem sources themselves)

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 19, 2012

Contributor

There must be a discrepancy between the 1.8 and 1.9 mode Marshal:

$ JRUBY_OPTS=--1.9 irb 
> File.open('features.mar') {|f| Marshal.load(f) }
ArgumentError: invalid encoding in marshaling stream: f
$ JRUBY_OPTS=--1.8 irb 
> File.open('features.mar') {|f| Marshal.load(f) }
# Lots of data...

If you can run your code as 1.8 mode, that might be a quick workaround.

Contributor

shepmaster commented Dec 19, 2012

There must be a discrepancy between the 1.8 and 1.9 mode Marshal:

$ JRUBY_OPTS=--1.9 irb 
> File.open('features.mar') {|f| Marshal.load(f) }
ArgumentError: invalid encoding in marshaling stream: f
$ JRUBY_OPTS=--1.8 irb 
> File.open('features.mar') {|f| Marshal.load(f) }
# Lots of data...

If you can run your code as 1.8 mode, that might be a quick workaround.

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 19, 2012

Contributor

Opening the marshaled data in MRI 1.9.3p327 and marshaling it again produces a file that can be read by JRuby 1.7.0 in 1.9 mode. This suggests the problem is with the 1.8 MRI Marshal format in JRuby 1.9 mode.

I thought that Marshal data wasn't supported across different versions?

Contributor

shepmaster commented Dec 19, 2012

Opening the marshaled data in MRI 1.9.3p327 and marshaling it again produces a file that can be read by JRuby 1.7.0 in 1.9 mode. This suggests the problem is with the 1.8 MRI Marshal format in JRuby 1.9 mode.

I thought that Marshal data wasn't supported across different versions?

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 19, 2012

Contributor

This isn't quite right:

I thought that Marshal data wasn't supported across different versions?

From the docs:

Marshaled data has major and minor version numbers stored along with the object information. In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number. [...] Marshal versioning is independent of Ruby’s version numbers.

Contributor

shepmaster commented Dec 19, 2012

This isn't quite right:

I thought that Marshal data wasn't supported across different versions?

From the docs:

Marshaled data has major and minor version numbers stored along with the object information. In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number. [...] Marshal versioning is independent of Ruby’s version numbers.

@fidothe

This comment has been minimized.

Show comment
Hide comment
@fidothe

fidothe Dec 19, 2012

And in this case, examining the Marshal major/minor version numbers are the same (from examining the first two bytes of features.mar and the first two bytes of Marshal.dump("text"). AFAICT JRuby should be able to read this dump.

fidothe commented Dec 19, 2012

And in this case, examining the Marshal major/minor version numbers are the same (from examining the first two bytes of features.mar and the first two bytes of Marshal.dump("text"). AFAICT JRuby should be able to read this dump.

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 19, 2012

Contributor

The test for encoding when unmarshaling strings seems suspicious to me. It tests for count >= 1, but count has to be >= 1 in order for the loop to execute. Or am I going crazy?

I've changed that locally to be > 1 and am running the tests.

Contributor

shepmaster commented Dec 19, 2012

The test for encoding when unmarshaling strings seems suspicious to me. It tests for count >= 1, but count has to be >= 1 in order for the loop to execute. Or am I going crazy?

I've changed that locally to be > 1 and am running the tests.

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 19, 2012

Contributor

Here's a smaller reproduction:

In Ruby 1.9.1-p431, via irb -E US-ASCII:US-ASCII

x = {"a" => 1.0, "b" => 2.0}
File.open('191-hash.dmp', 'w') { |f| Marshal.dump(x, f) }

In JRuby 1.7.1 (1.9 mode)

File.open('191-hash.dmp') { |f| Marshal.load(f) }
# => ArgumentError: invalid encoding in marshaling stream: f
Contributor

shepmaster commented Dec 19, 2012

Here's a smaller reproduction:

In Ruby 1.9.1-p431, via irb -E US-ASCII:US-ASCII

x = {"a" => 1.0, "b" => 2.0}
File.open('191-hash.dmp', 'w') { |f| Marshal.dump(x, f) }

In JRuby 1.7.1 (1.9 mode)

File.open('191-hash.dmp') { |f| Marshal.load(f) }
# => ArgumentError: invalid encoding in marshaling stream: f
@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 19, 2012

Contributor

I had to go back to Ruby 1.9.1 because 1.9.2 introduced a shortcut for UTF-8 / US-ASCII encoding (E). The original marshal file goes out of its way to say US-ASCII.

Contributor

shepmaster commented Dec 19, 2012

I had to go back to Ruby 1.9.1 because 1.9.2 introduced a shortcut for UTF-8 / US-ASCII encoding (E). The original marshal file goes out of its way to say US-ASCII.

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 19, 2012

Contributor

You also need to have two items in the hash. I'm not super familiar with the Marshal format, but my guess from looking at the hex is that the second key doesn't specify the encoding to save space, and it's expected to just pick it up from the previous one. Pure speculation on my part though.

Contributor

shepmaster commented Dec 19, 2012

You also need to have two items in the hash. I'm not super familiar with the Marshal format, but my guess from looking at the hex is that the second key doesn't specify the encoding to save space, and it's expected to just pick it up from the previous one. Pure speculation on my part though.

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 19, 2012

Contributor

The existing tests pass with the change to make it count > 0, but beyond that I'm not sure if this is the right path to take.

Contributor

shepmaster commented Dec 19, 2012

The existing tests pass with the change to make it count > 0, but beyond that I'm not sure if this is the right path to take.

shepmaster added a commit to shepmaster/jruby that referenced this issue Dec 29, 2012

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 30, 2012

Contributor

OK, my failed tests led me to dig deeper. Here's what I think is happening now:

When MRI writes an encoding that is not US-ASCII / UTF-8, it writes it as a normal String. This means that it can use the LINK type to reference previously written strings. See the end of MRI's marshal.c w_encoding function.

JRuby does not treat the encoding equivalent to a marshaled String, instead it ignores the string type marker " and then reads/writes a set of bytes straight. See MarshalStream writeEncoding and UnmarshalStream defaultVariablesUnmarshal.

This means you can reproduce with a modern MRI, like ruby 1.9.3p327:

x = ['a', 'b'].map {|s| s.force_encoding('Shift_JIS')}
File.open('193-array', 'w') {|f| Marshal.dump(x, f)}

It's important that the two strings not be the same, otherwise they will be marked as total duplicates and will be handled correctly.

Contributor

shepmaster commented Dec 30, 2012

OK, my failed tests led me to dig deeper. Here's what I think is happening now:

When MRI writes an encoding that is not US-ASCII / UTF-8, it writes it as a normal String. This means that it can use the LINK type to reference previously written strings. See the end of MRI's marshal.c w_encoding function.

JRuby does not treat the encoding equivalent to a marshaled String, instead it ignores the string type marker " and then reads/writes a set of bytes straight. See MarshalStream writeEncoding and UnmarshalStream defaultVariablesUnmarshal.

This means you can reproduce with a modern MRI, like ruby 1.9.3p327:

x = ['a', 'b'].map {|s| s.force_encoding('Shift_JIS')}
File.open('193-array', 'w') {|f| Marshal.dump(x, f)}

It's important that the two strings not be the same, otherwise they will be marked as total duplicates and will be handled correctly.

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Dec 30, 2012

Contributor

Both marshaling and unmarshaling are affected in the same way. This means JRuby cannot read certain MRI data, but MRI should be able to read JRuby data, although it will be bigger than necessary. I did not test this however.

Contributor

shepmaster commented Dec 30, 2012

Both marshaling and unmarshaling are affected in the same way. This means JRuby cannot read certain MRI data, but MRI should be able to read JRuby data, although it will be bigger than necessary. I did not test this however.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jul 24, 2013

Member

I tested this on master and it seems to work ok:

$ irb -E US-ASCII:US-ASCII
irb(main):001:0> x = {"a" => 1.0, "b" => 2.0}
=> {"a"=>1.0, "b"=>2.0}
irb(main):002:0> File.open('191-hash.dmp', 'w') { |f| Marshal.dump(x, f) }
=> #<File:191-hash.dmp (closed)>

So I'm going to optimistically mark this as fixed (unsure of release, so I'll go with 1.7.5).

Member

headius commented Jul 24, 2013

I tested this on master and it seems to work ok:

$ irb -E US-ASCII:US-ASCII
irb(main):001:0> x = {"a" => 1.0, "b" => 2.0}
=> {"a"=>1.0, "b"=>2.0}
irb(main):002:0> File.open('191-hash.dmp', 'w') { |f| Marshal.dump(x, f) }
=> #<File:191-hash.dmp (closed)>

So I'm going to optimistically mark this as fixed (unsure of release, so I'll go with 1.7.5).

@headius headius closed this Jul 24, 2013

@shepmaster

This comment has been minimized.

Show comment
Hide comment
@shepmaster

shepmaster Jul 24, 2013

Contributor

IIRC, this was fixed by commit 82cda78, which doesn't seem to have any merge notes. Maybe it was cherry-picked? If so, should be fixed back to 1.7.3.

Contributor

shepmaster commented Jul 24, 2013

IIRC, this was fixed by commit 82cda78, which doesn't seem to have any merge notes. Maybe it was cherry-picked? If so, should be fixed back to 1.7.3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment