Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JRuby 9.2.0.0 string deduplication via String#-@ does not work the same as CRuby #5190

Closed
jeremyevans opened this Issue May 24, 2018 · 20 comments

Comments

Projects
None yet
3 participants
@jeremyevans
Copy link
Contributor

jeremyevans commented May 24, 2018

Environment

jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]

Also verified on:

jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [mswin32-x86_64]

Unlikely to be operating system or java version specific.

Expected Behavior

String deduplication via String#-@ does not work the same as CRuby 2.5 for non-literal strings:

$ ruby -e "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-openbsd]
true

Actual Behavior

$ jruby -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
false
@headius

This comment has been minimized.

Copy link
Member

headius commented May 24, 2018

Do you have any other environment that might be affecting this? I can't reproduce on MacOS at least:

$ jruby -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.1.0-SNAPSHOT (2.5.0) 2018-05-24 98abdb4 Java HotSpot(TM) 64-Bit Server VM 25.161-b12 on 1.8.0_161-b12 +jit [darwin-x86_64]
true
@headius

This comment has been minimized.

Copy link
Member

headius commented May 24, 2018

Our logic does appear to match MRI...if the string is already frozen it is just returned, otherwise it is frozen and deduplicated.

static VALUE
str_uminus(VALUE str)
{
    if (OBJ_FROZEN(str)) {
        return str;
    }
    else {
        return rb_fstring(str);
    }
}
@enebo

This comment has been minimized.

Copy link
Member

enebo commented May 24, 2018

I also do not see this on fedora:

./bin/jruby -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 Java HotSpot(TM) 64-Bit Server VM 25.172-b11 on 1.8.0_172-b11 +jit [linux-x86_64]
true
@jeremyevans

This comment has been minimized.

Copy link
Contributor Author

jeremyevans commented May 24, 2018

I don't think I have an environment that affects this, since I tested on both OpenBSD and on Windows and got the same results on both. Weird that it would work on Mac OS and Linux but not on Windows or OpenBSD. Can you test on Windows?

@enebo

This comment has been minimized.

Copy link
Member

enebo commented May 24, 2018

@jeremyevans WOW...It does not work on Windows (and Java 9...although I checked 8 and 10 on linux and they both work). WTF

@enebo

This comment has been minimized.

Copy link
Member

enebo commented May 24, 2018

@jeremyevans ok I take that back a little bit...It does work in a script but fails with -e. So perhaps this is some icky CMD parsing thing in that case. That would not explain why FreeBSD fails though. Also it seems like it should work in CMD command line too. This is weird.

@jeremyevans

This comment has been minimized.

Copy link
Contributor Author

jeremyevans commented May 24, 2018

I don't think it is limited to -e. I noticed the issue because Sequel's specs failed with JRuby 9.2.0.0, and -e was definitely not involved there.

I think in Sequel's specs, the code is a little different, something akin to:

a = :a.to_s.dup
b = :a.to_s.dup
p ((-a).object_id == (-b).object_id)

That returns true in ruby 2.5.1 and false in jruby 9.2.0.0 (both in Windows and OpenBSD):

$ ruby -v fstring_test.rb
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-openbsd]
true
$ jruby -v fstring_test.rb
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
false
@headius

This comment has been minimized.

Copy link
Member

headius commented May 25, 2018

There are two launchers for JRuby on non-unix platforms: the bash script and the native executable. On Windows there's only the native executable, so that makes me suspect a problem in there. There's another bug about improperly quoting (I believe) where I mention a Win32 call that lets you get at raw untweaked command line arguments.

Can you try both launchers on OpenBSD at least?

@jeremyevans

This comment has been minimized.

Copy link
Contributor Author

jeremyevans commented May 25, 2018

OpenBSD uses the jruby-launcher gem to build a native executable, as otherwise you can't use jruby in a shebang line (unlike Linux, on OpenBSD, the shebang line must refer to a native executable). But I can probably test the bash script manually tomorrow.

Note that as my previous comment shows, this is not (only) a command line parsing issue.

@jeremyevans

This comment has been minimized.

Copy link
Contributor Author

jeremyevans commented May 25, 2018

Same results with the bash script:

$ PATH=/usr/local/jdk-1.8.0/bin/:$PATH bin/jruby.bash -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
false

I've found the command line issue appears to be encoding related. If I force a UTF-8 encoding, it returns true:

$ LC_CTYPE=en_US.UTF-8 jruby -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
true

However, forcing a UTF-8 encoding does not fix the fstring_test.rb example above:

$ LC_CTYPE=en_US.UTF-8 jruby -v fstring_test.rb
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
false
@enebo

This comment has been minimized.

Copy link
Member

enebo commented May 25, 2018

@jeremyevans maybe try -X-Dfile.coding=UTF-8 and see if this changes anything. I still don't get it but on windows perhaps file.coding is window31j or something like that...I would still think these would dedup properly though as that encoding.

@jeremyevans

This comment has been minimized.

Copy link
Contributor Author

jeremyevans commented May 25, 2018

With that option I get jruby: invalid extended option -Dfile.coding=UTF-8 (-X will list valid options) followed by the usage output (both on OpenBSD and on Windows).

@enebo

This comment has been minimized.

Copy link
Member

enebo commented May 25, 2018

haha @jeremyevans sorry I made two mistakes: -J-Dfile.encoding=UTF-8.

@jeremyevans

This comment has been minimized.

Copy link
Contributor Author

jeremyevans commented May 25, 2018

@enebo That appears to make the same changes as LC_CTYPE=en_US.UTF-8, fixing the command line example, but still not having the fstring_test.rb example work:

$ jruby -J-Dfile.encoding=UTF-8 -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
true
$ jruby -J-Dfile.encoding=UTF-8 -v ../fstring_test.rb
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
false
@headius

This comment has been minimized.

Copy link
Member

headius commented May 25, 2018

I suspect the symbol example (fstring.rb) fails because the string associated with the symbol is getting interned with another encoding. Even if the string contents are the same, if the encodings differ we don't deup.

When I put some logging into our dedup logic, I see the following:

+ is UTF-8 and + is US-ASCII
- is UTF-8 and - is US-ASCII
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
method is ASCII-8BIT and method is UTF-8
 is ASCII-8BIT and  is UTF-8
a is UTF-8 and a is US-ASCII
a is UTF-8 and a is US-ASCII
false

And if I disable booting our Ruby kernel altogether with -Xdebug.parser, it works properly. So somewhere in the libraries that load at boot this particular string is getting deduped as UTF-8.

@headius

This comment has been minimized.

Copy link
Member

headius commented May 25, 2018

Confirmed the script version works if you use something other than 'a'.

All of the strings that report encoding mismatches above are 7-bit ASCII (and several are blank) so I'm not sure why they're getting in here as ASCII-8BIT and UTF-8. I did notice that RubyString.hashCode does not consider encoding if the strings are 7-bit and the encoding is ASCII-compatible, which allows a US-ASCII 'a' to retrieve the UTF-8 'a' from the cache.

We could modify the hashCode logic to always consider encoding, and then we'd have no conflicts; but we might have multiple copies of some 7-bit strings in there.

@headius

This comment has been minimized.

Copy link
Member

headius commented May 25, 2018

MRI does appear to allow multiple entries for the same bytes but different encoding to live in the fstring cache.

a1 = "a".force_encoding("ASCII-8BIT")
a2 = "a".force_encoding("UTF-8")
a1p = (-a1)
a2p = (-a2)
p a1p.object_id == (-a1.dup).object_id # true
p a2p.object_id == (-a2.dup).object_id # true
p a1p.encoding # ASCII-8BIT
p a2p.encoding # UTF-8
@headius

This comment has been minimized.

Copy link
Member

headius commented May 25, 2018

MRI does not consider the encoding for a 7-bit string's hash code, but for the fstring cache it does specify a comparison function that rejects mismatched encodings. So they'll go in the same bucket but both encodings are preserved. I'll see about making the same change in JRuby.

@headius

This comment has been minimized.

Copy link
Member

headius commented May 25, 2018

For reference, here's the comparison function MRI uses for fstring cache:

static int
fstring_cmp(VALUE a, VALUE b)
{
    long alen, blen;
    const char *aptr, *bptr;
    RSTRING_GETMEM(a, aptr, alen);
    RSTRING_GETMEM(b, bptr, blen);
    return (alen != blen ||
            ENCODING_GET(a) != ENCODING_GET(b) ||
            memcmp(aptr, bptr, alen) != 0);
}
@headius

This comment has been minimized.

Copy link
Member

headius commented May 25, 2018

Hmm, this may be a bit trickier than I'd hoped.

All the JDK collections, including the ConcurrentHashMap we're using, default to calling equals and there's no way to replace that without rewriting the collection. RubyString#equals does not consider encoding (in MRI, only if 7bit; in JRuby it seems to always be ignored if the bytes are the same).

So there's a bit more gymnastics required here.

headius added a commit to headius/jruby that referenced this issue May 26, 2018

Consider encoding when deduplicating strings. Fixes jruby#5190.
CRuby's fstring cache, used for frozen string deduplication, uses
slightly different equalitye logic than the default equality for
strings. Specifically, if two strings have the same 7bit ascii
bytes, but two different ascii-compatible strings, the strings are
still considered to be equal. But for the fstring cache, you can
register the same 7-bit string with different ascii-compatible
encodings and they both live in the cache.

In JRuby, we use a standard JDK collection, ConcurrentHashMap,
that always uses the standard equals() method that works like
normal String equality as described above. We are forced to use
a wrapper, both for storage and for lookup. This patch introduces
that wrapper, and also introduces a thread-local caching mechanism
to reduce the cost of looking up deduplicated strings in the
cache.

The additional overhead of the cache is:

* The wrapper object and indirecting through it.
* Constructing a wrapper object (only when the previous lookup
  added a new wrapper or this is the first lookup).
* Accessing a previously cached wrapper via a thread-local
  (inverse of the above conditions)

In the typical case, where the requested string has already been
deduplicated, the system should eventually get to a point where
there's no new entries being added, and the cached wrapper is used
every time. There may be more overhead at startup to create the
wrappers. There may be few calls to lookup a string that do not
trigger a new entry, since most language constructs (FrozenString
in IR for example) save the result, making the cache perhaps
unnecessary.

@enebo enebo closed this in #5198 Sep 19, 2018

@enebo enebo added this to the JRuby 9.2.1.0 milestone Nov 1, 2018

headius added a commit to headius/sequel that referenced this issue Mar 17, 2019

Remove JRuby check from excluded dedup spec.
This spec passes fine on JRuby since jruby/jruby#5190.

jeremyevans added a commit to jeremyevans/sequel that referenced this issue Mar 17, 2019

Remove JRuby check from excluded dedup spec.
This spec passes fine on JRuby since jruby/jruby#5190.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.