Unicode characters are lost when embedding JRuby even if the calling code performs no conversion to byte[] #2403

Closed
trejkaz opened this Issue Jan 1, 2015 · 10 comments

Comments

Projects
None yet
2 participants
@trejkaz

trejkaz commented Jan 1, 2015

The following tests fail with -Dfile.encoding=windows-1252 but pass with -Dfile.encoding=UTF-8 :

import java.io.StringWriter;
import java.io.Writer;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jruby.embed.ScriptingContainer;
import org.junit.Test;

import static org.hamcrest.Matchers.*;
import static org.junit.Assert.*;

public class TestUnicodeCharacters {
    String orig = "\u6625\u304C\u6765\u305F\u3002";
    String scriptlet = "#encoding: UTF-8\n" +
                       "str = \"" + orig + "\"\n" +
                       "puts str\n" +
                       "str\n";
    Writer writer = new StringWriter();

    @Test
    public void testCharacterEncodingViaScriptEngine() throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = engine.getContext();
        context.setWriter(writer);
        String result = (String) engine.eval(scriptlet, context);
        checkValues(result);
    }

    @Test
    public void testCharacterEncodingViaScriptContainer() throws Exception {
        ScriptingContainer container = new ScriptingContainer();
        container.setWriter(writer);
        String result = (String) container.runScriptlet(scriptlet);
        checkValues(result);
    }

    private void checkValues(String returnedResult) {
        assertThat(returnedResult, is(equalTo(orig)));
        assertThat(writer.toString().trim(), is(equalTo(orig)));
    }
}

Most likely, the failure output you get will be confusing as well:

java.lang.AssertionError: 
Expected: is "?????"
     but: was "?????"

The "Expected" line is "?????" because Java is encoding the output as windows-1252.

The "but" line is "?????" because JRuby has encoded the strings to windows-1252 internally and then written and returned the question marks. I find it particularly odd that it would do this, both because the script is passed as a string directly from Java in the first place, but also because the script itself clearly says the strings are UTF-8.

This was JRUBY-4890 on the old tracker. The script had to be updated a bit to have a #encoding: UTF-8 directive because JRuby now complains if you omit it.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 24, 2015

Member

I added this test on JRuby 1.7 HEAD and master HEAD and it passed in both places. I also confirmed that file.encoding defaults to Cp1252. Am I doing something wrong?

Member

headius commented Feb 24, 2015

I added this test on JRuby 1.7 HEAD and master HEAD and it passed in both places. I also confirmed that file.encoding defaults to Cp1252. Am I doing something wrong?

@trejkaz

This comment has been minimized.

Show comment
Hide comment
@trejkaz

trejkaz Feb 25, 2015

Issue still occurs in 1.7.13 (what I had on my home computer when I got home), 1.7.19 (downloaded now) and 9.0.0.0.pre1 (downloaded now). I verified all three on OSX and specified the -Dfile.encoding system property on the command-line.

The full command-lines I used:

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-1.7.13.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-1.7.19.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-9.0.0.0.pre1.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Of course in all cases, removing the file.encoding parameter makes the test pass.

Admittedly I had to fix compilation of the test itself as well, so here it is for completeness:

import java.io.Writer;
import java.io.StringWriter;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jruby.embed.ScriptingContainer;
import org.junit.Test;

import static org.hamcrest.Matchers.*;
import static org.junit.Assert.*;

public class TestUnicodeCharacters {
    String orig = "\u6625\u304C\u6765\u305F\u3002";
    String scriptlet = "#encoding: UTF-8\n" +
                       "str = \"" + orig + "\"\n" +
                       "puts str\n" +
                       "str\n";
    Writer writer = new StringWriter();

    @Test
    public void testCharacterEncodingViaScriptEngine() throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = engine.getContext();
        context.setWriter(writer);
        String result = (String) engine.eval(scriptlet, context);
        checkValues(result);
    }

    @Test
    public void testCharacterEncodingViaScriptContainer() throws Exception {
        ScriptingContainer container = new ScriptingContainer();
        container.setWriter(writer);
        String result = (String) container.runScriptlet(scriptlet);
        checkValues(result);
    }

    private void checkValues(String returnedResult) {
        assertThat(returnedResult, is(equalTo(orig)));
        assertThat(writer.toString().trim(), is(equalTo(orig)));
    }
}

trejkaz commented Feb 25, 2015

Issue still occurs in 1.7.13 (what I had on my home computer when I got home), 1.7.19 (downloaded now) and 9.0.0.0.pre1 (downloaded now). I verified all three on OSX and specified the -Dfile.encoding system property on the command-line.

The full command-lines I used:

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-1.7.13.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-1.7.19.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-9.0.0.0.pre1.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Of course in all cases, removing the file.encoding parameter makes the test pass.

Admittedly I had to fix compilation of the test itself as well, so here it is for completeness:

import java.io.Writer;
import java.io.StringWriter;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jruby.embed.ScriptingContainer;
import org.junit.Test;

import static org.hamcrest.Matchers.*;
import static org.junit.Assert.*;

public class TestUnicodeCharacters {
    String orig = "\u6625\u304C\u6765\u305F\u3002";
    String scriptlet = "#encoding: UTF-8\n" +
                       "str = \"" + orig + "\"\n" +
                       "puts str\n" +
                       "str\n";
    Writer writer = new StringWriter();

    @Test
    public void testCharacterEncodingViaScriptEngine() throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = engine.getContext();
        context.setWriter(writer);
        String result = (String) engine.eval(scriptlet, context);
        checkValues(result);
    }

    @Test
    public void testCharacterEncodingViaScriptContainer() throws Exception {
        ScriptingContainer container = new ScriptingContainer();
        container.setWriter(writer);
        String result = (String) container.runScriptlet(scriptlet);
        checkValues(result);
    }

    private void checkValues(String returnedResult) {
        assertThat(returnedResult, is(equalTo(orig)));
        assertThat(writer.toString().trim(), is(equalTo(orig)));
    }
}
@trejkaz

This comment has been minimized.

Show comment
Hide comment
@trejkaz

trejkaz Feb 25, 2015

Managed to build trunk, same deal:

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:junit-4.12.jar:jruby-complete-9.0.0.0-SNAPSHOT.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters
JUnit version 4.12
.E.E
Time: 2.434
There were 2 failures:
1) testCharacterEncodingViaScriptContainer(TestUnicodeCharacters)
java.lang.AssertionError: 
Expected: is "?????"
     but: was "?????"
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
...

trejkaz commented Feb 25, 2015

Managed to build trunk, same deal:

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:junit-4.12.jar:jruby-complete-9.0.0.0-SNAPSHOT.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters
JUnit version 4.12
.E.E
Time: 2.434
There were 2 failures:
1) testCharacterEncodingViaScriptContainer(TestUnicodeCharacters)
java.lang.AssertionError: 
Expected: is "?????"
     but: was "?????"
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
...
@trejkaz

This comment has been minimized.

Show comment
Hide comment
@trejkaz

trejkaz Mar 14, 2016

Verifying that this not only still occurs on 9.0.5.0, but now it occurs for me on OSX under IDEA 15, so one of those or something else updated and now it occurs in a new situation. Maybe this means others can now confirm the issue, so perhaps the chance of getting a fix will increase?

Current content of our test case:

@Test
public void testUnicodeStringOutputDirectlyViaScriptEngine() throws Exception
{
    String orig = "\u30C6\u30B9\u30C8";

    ScriptEngine engine = new JRubyEngineFactory().getScriptEngine();

    ScriptContext context = new SimpleScriptContext();
    StringWriter writer = new StringWriter();
    context.setWriter(writer);

    Object returnedResult = engine.eval("#encoding: utf-8\n" +
                                        "str = \"" + orig + "\" ; puts str ; str\n", context);

    assertEquals("Wrong result returned", orig, returnedResult);
    assertEquals("Wrong result printed", orig, writer.toString().trim());
}

The failure looks like this:

org.junit.ComparisonFailure: Wrong result returned 
Expected :???
Actual   :???

The test is apparently running with -Dfile.encoding=US-ASCII so IDEA refuses to show the text. If I change it to UTF-8 then it passes, so I can't get sensible output which shows the expected text. But you can look at the test case and see that I'm not expecting question marks, at least.

If you comment out the "Wrong result returned" check, it fails on "Wrong result printed".

trejkaz commented Mar 14, 2016

Verifying that this not only still occurs on 9.0.5.0, but now it occurs for me on OSX under IDEA 15, so one of those or something else updated and now it occurs in a new situation. Maybe this means others can now confirm the issue, so perhaps the chance of getting a fix will increase?

Current content of our test case:

@Test
public void testUnicodeStringOutputDirectlyViaScriptEngine() throws Exception
{
    String orig = "\u30C6\u30B9\u30C8";

    ScriptEngine engine = new JRubyEngineFactory().getScriptEngine();

    ScriptContext context = new SimpleScriptContext();
    StringWriter writer = new StringWriter();
    context.setWriter(writer);

    Object returnedResult = engine.eval("#encoding: utf-8\n" +
                                        "str = \"" + orig + "\" ; puts str ; str\n", context);

    assertEquals("Wrong result returned", orig, returnedResult);
    assertEquals("Wrong result printed", orig, writer.toString().trim());
}

The failure looks like this:

org.junit.ComparisonFailure: Wrong result returned 
Expected :???
Actual   :???

The test is apparently running with -Dfile.encoding=US-ASCII so IDEA refuses to show the text. If I change it to UTF-8 then it passes, so I can't get sensible output which shows the expected text. But you can look at the test case and see that I'm not expecting question marks, at least.

If you comment out the "Wrong result returned" check, it fails on "Wrong result printed".

@trejkaz

This comment has been minimized.

Show comment
Hide comment
@trejkaz

trejkaz Mar 14, 2016

In the debugger, in EmbedRubyRuntimeAdapterImpl#runParser, I have:

* input = "#encoding: utf-8\nstr = "テスト" ; puts str ; str\n"
* ...
* node = {org.jruby.ast.RootNode@6656} "(RootNode 1, (BlockNode 1, (DAsgnNode:str 1, (StrNode 1)), (FCallNode:puts 1, (ArrayNode 1, (DVarNode:str 1)), null), (DVarNode:str 1)))"
    * scope = {org.jruby.runtime.scope.ManyVarsDynamicScope@6653} "Static Type[819314039]: block [str=null]\n  Static Type[650867604]: local []"
    * staticScope = {org.jruby.parser.StaticScope@6665} "StaticScope(EVAL):[str]"
    * bodyNode = {org.jruby.ast.BlockNode@6666} "(BlockNode 1, (DAsgnNode:str 1, (StrNode 1)), (FCallNode:puts 1, (ArrayNode 1, (DVarNode:str 1)), null), (DVarNode:str 1))"
        * list = {org.jruby.ast.Node[3]@6672} 
            * 0 = {org.jruby.ast.DAsgnNode@6675} "(DAsgnNode:str 1, (StrNode 1))"
                * name = "str"
                * location = 0
                * valueNode = {org.jruby.ast.StrNode@6692} "(StrNode 1)"
                    * value = {org.jruby.util.ByteList@6696} "???"
                        * bytes = {byte[15]@6699} 
                            * 0 = 63
                            * 1 = 63
                            * 2 = 63
                            * 3 = 0
                            * 4 = 0
                            * 5 = 0
                            * 6 = 0
                            * 7 = 0
                            * 8 = 0
                            * 9 = 0
                            * 10 = 0
                            * 11 = 0
                            * 12 = 0
                            * 13 = 0
                            * 14 = 0
                        * begin = 0
                        * realSize = 3
                        * encoding = {org.jcodings.specific.UTF8Encoding@6700} "UTF-8"
                        * hash = 0
                        * stringValue = "???"

So it has already been destroyed at parse-time, explaining why both the result and the output are wrong.

trejkaz commented Mar 14, 2016

In the debugger, in EmbedRubyRuntimeAdapterImpl#runParser, I have:

* input = "#encoding: utf-8\nstr = "テスト" ; puts str ; str\n"
* ...
* node = {org.jruby.ast.RootNode@6656} "(RootNode 1, (BlockNode 1, (DAsgnNode:str 1, (StrNode 1)), (FCallNode:puts 1, (ArrayNode 1, (DVarNode:str 1)), null), (DVarNode:str 1)))"
    * scope = {org.jruby.runtime.scope.ManyVarsDynamicScope@6653} "Static Type[819314039]: block [str=null]\n  Static Type[650867604]: local []"
    * staticScope = {org.jruby.parser.StaticScope@6665} "StaticScope(EVAL):[str]"
    * bodyNode = {org.jruby.ast.BlockNode@6666} "(BlockNode 1, (DAsgnNode:str 1, (StrNode 1)), (FCallNode:puts 1, (ArrayNode 1, (DVarNode:str 1)), null), (DVarNode:str 1))"
        * list = {org.jruby.ast.Node[3]@6672} 
            * 0 = {org.jruby.ast.DAsgnNode@6675} "(DAsgnNode:str 1, (StrNode 1))"
                * name = "str"
                * location = 0
                * valueNode = {org.jruby.ast.StrNode@6692} "(StrNode 1)"
                    * value = {org.jruby.util.ByteList@6696} "???"
                        * bytes = {byte[15]@6699} 
                            * 0 = 63
                            * 1 = 63
                            * 2 = 63
                            * 3 = 0
                            * 4 = 0
                            * 5 = 0
                            * 6 = 0
                            * 7 = 0
                            * 8 = 0
                            * 9 = 0
                            * 10 = 0
                            * 11 = 0
                            * 12 = 0
                            * 13 = 0
                            * 14 = 0
                        * begin = 0
                        * realSize = 3
                        * encoding = {org.jcodings.specific.UTF8Encoding@6700} "UTF-8"
                        * hash = 0
                        * stringValue = "???"

So it has already been destroyed at parse-time, explaining why both the result and the output are wrong.

@trejkaz

This comment has been minimized.

Show comment
Hide comment
@trejkaz

trejkaz Mar 14, 2016

Ruby.java line 2750 appears to be the culprit:

public Node parseEval(String content, String file, DynamicScope scope, int lineNumber) {
    addEvalParseToStats();
    return parser.parse(file, content.getBytes(), scope, new ParserConfiguration(this, lineNumber, false, false, config));
}

Call to String#getBytes() with no Charset parameter. Have you ever considered running the forbidden-apis tool over JRuby? It would have found this one. :)

Well, I can't figure out how this was supposed to work, but basically by the time getBytes() is called, the string is now "#encoding: utf-8\nstr = "???" ; puts str ; str\n", so its Unicode content was destroyed before the parser could even get to the line telling it not to do that.

trejkaz commented Mar 14, 2016

Ruby.java line 2750 appears to be the culprit:

public Node parseEval(String content, String file, DynamicScope scope, int lineNumber) {
    addEvalParseToStats();
    return parser.parse(file, content.getBytes(), scope, new ParserConfiguration(this, lineNumber, false, false, config));
}

Call to String#getBytes() with no Charset parameter. Have you ever considered running the forbidden-apis tool over JRuby? It would have found this one. :)

Well, I can't figure out how this was supposed to work, but basically by the time getBytes() is called, the string is now "#encoding: utf-8\nstr = "???" ; puts str ; str\n", so its Unicode content was destroyed before the parser could even get to the line telling it not to do that.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Mar 14, 2016

Member

@trejkaz You may be right about the culprit here. Sorry this one slipped through the cracks.

We do use file.encoding implicitly to encode Java strings into Ruby strings, assuming that's what the user would want. But I think perhaps that's incorrect logic; the user will probably want Strings coming from Java to get encoded the same way multibyte strings read in via IO get encoded, and that generally means UTF-8.

Member

headius commented Mar 14, 2016

@trejkaz You may be right about the culprit here. Sorry this one slipped through the cracks.

We do use file.encoding implicitly to encode Java strings into Ruby strings, assuming that's what the user would want. But I think perhaps that's incorrect logic; the user will probably want Strings coming from Java to get encoded the same way multibyte strings read in via IO get encoded, and that generally means UTF-8.

@headius headius added this to the JRuby 9.1.0.0 milestone Mar 14, 2016

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Mar 14, 2016

Member

I have a fix for the bytes going in, but the output being returned is still getting mangled.

Member

headius commented Mar 14, 2016

I have a fix for the bytes going in, but the output being returned is still getting mangled.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Mar 14, 2016

Member

Ok, two fixes for this one:

  • The getBytes @trejkaz pointed out was not encoding incoming Strings properly, and
  • The construction of a WriterOutputStream in org.jruby.embed.jsr223.Utils was not decoding output properly.

Fix coming. @trejkaz Can we incorporate your test case into our suite? If you have any more you'd like to add, we'd really appreciate it!

Member

headius commented Mar 14, 2016

Ok, two fixes for this one:

  • The getBytes @trejkaz pointed out was not encoding incoming Strings properly, and
  • The construction of a WriterOutputStream in org.jruby.embed.jsr223.Utils was not decoding output properly.

Fix coming. @trejkaz Can we incorporate your test case into our suite? If you have any more you'd like to add, we'd really appreciate it!

headius added a commit that referenced this issue Mar 14, 2016

Fix up how our 223 engine encodes input and decodes output.
* Incoming scripts should be decoded from String to byte[] using
  default internal encoding, rather than trusting JDK's
  file.encoding to be appropriate.
* Outgoing streams wrapping writers should decode strings based
  on current default internal encoding.

Fixes #2403

@headius headius closed this in af965d1 Mar 14, 2016

@headius headius self-assigned this Mar 14, 2016

@trejkaz

This comment has been minimized.

Show comment
Hide comment
@trejkaz

trejkaz Mar 14, 2016

Feel free to include that test in the JRuby suite. It will be good for us as it stops a regression creeping in.

As far as others... I just had a look through all our tests and there is nothing else about encoding in our suite, other than a second nearly identical test which I'll probably remove now that I have found it.

Thanks for the quick fix! :D

trejkaz commented Mar 14, 2016

Feel free to include that test in the JRuby suite. It will be good for us as it stops a regression creeping in.

As far as others... I just had a look through all our tests and there is nothing else about encoding in our suite, other than a second nearly identical test which I'll probably remove now that I have found it.

Thanks for the quick fix! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment