Skip to content

Strange encoding differences with UTF-8 strings #1474

Closed
jcoyne opened this Issue Feb 2, 2014 · 8 comments

3 participants

@jcoyne
jcoyne commented Feb 2, 2014
irb(main):010:0* test = "/fixtures/公共"
=> "/fixtures/公共"
irb(main):011:0> Regexp.escape(test).encoding
=> #<Encoding:UTF-8>
irb(main):013:0> test = "/fixtures/"
=> "/fixtures/"
irb(main):014:0> Regexp.escape("#{test}/公共").encoding
=> #<Encoding:US-ASCII>

I'm not sure why one is Encoding::UTF-8 and the other is Encoding:US-ASCII. I suspect this is causing a failure in this rails test: https://github.com/rails/rails/blob/35e56f6fa535288abf1de7fa70c2faed5e2d88ff/actionpack/test/dispatch/static_test.rb#L164

ArgumentError: regexp preprocess failed: invalid multibyte character
    from /Users/justin/workspace/rails/actionpack/lib/action_dispatch/middleware/static.rb:9:in `initialize'
    from /Users/justin/workspace/rails/actionpack/lib/action_dispatch/middleware/static.rb:52:in `initialize'
@enebo
JRuby Team member
enebo commented Feb 2, 2014

@jcoyne I believe this is the similar in nature to #1392 but oddly that one seems like it must be a parser issue whereas I am less sure in your example. What is your default external encoding in irb in that session?

@jcoyne
jcoyne commented Feb 2, 2014
Encoding.default_external
=> #<Encoding:UTF-8>
@enebo
JRuby Team member
enebo commented Feb 4, 2014

Looked at this shortly this afternoon and I am jotting a note or two. This does not fail on master. Master is 2.1 compat and the main obvious difference is default external is UTF-8 by default. regexp.quote is not really the issue here. This is basically DNode concatting these two strings together ends up ascii instead of utf-8. So this is probably same issue as 1392 more or less. With this program I cannot get it to work on 1.7 branch (even tried jruby --1.9 -Eutf-8 -Ku -X-C snippets/rrr.rb):

# -*- coding: utf-8 -*-
test = "/fixtures/"
p "#{test}/公共".encoding
p Regexp.quote("#{test}/公共").encoding
@jkl1337
jkl1337 commented Feb 5, 2014

For what is is worth, I bisected this to a6411a8, a fix for #1255.
There are some significant changes there, and I didn't have the time yet to trace it further.

@jkl1337
jkl1337 commented Feb 5, 2014

DNode.appendToString is one culprit. The isSameEncoding test ignores the code range, so the resultant dynamic string with an ASCII prefix ends up with 7-bit code range flag. Adding a test prevents this.
I also do not understand how CompoundString in the compiler is correct but I do not have a failing test case. I noticed that the minimal test case above is sensitive to "-X-C".

appendToString with the following seems to work.

        if (node instanceof StrNode) {
            StrNode strNode = (StrNode)node;
            if (!is19() || isSameEncoding(strNode) && strNode.getCodeRange() == string.getCodeRange()) {
                string.getByteList().append(strNode.getValue());
            } else {
                string.cat19(strNode.getValue(), strNode.getCodeRange());
            }
        } else if (node instanceof EvStrNode) {
@enebo
JRuby Team member
enebo commented Feb 5, 2014

Actually I wrote that up in a hurry to take notes and the notes make no sense. The encoding of the DNode is in fact UTF-8 to begin with so in this case it is quote converting to US-ASCII. :)

@enebo
JRuby Team member
enebo commented Feb 5, 2014

Oh and I guess I should have refreshed this page before writing comments...

@enebo
JRuby Team member
enebo commented Feb 5, 2014

I am applying @jkl1337 fix (Thanks!). It is probably not a perfect fix in the sense we do not have this same on master and it works but master it slowly growing apart and this does not seem to regress anything. fwiw, we keep getting mostly encoding bugs where the encoding is proper but we did not appropriate reset/clear the coderange. Coderange is definitely something we might want to reexamine for 9k.

This did not fix #1392 unfortunately.

@enebo enebo closed this Feb 5, 2014
@enebo enebo added this to the JRuby 1.7.11 milestone Feb 5, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.