Encoding Problems on Windows when writing UTF-8 files in 1.9 mode #502

Closed
rurounijones opened this Issue Jan 18, 2013 · 20 comments

@rurounijones

The code in the following gist: https://gist.github.com/4545595 will fail on Windows.

JRuby 1.7.1, Windows 2008 R2, JDK 1.7 U11 in 1.9 mode
JRuby 1.7.2, Windows XP SP3, bundled JRE in 1.9 mode

The resulting file will be full of gibberish instead of properly formatted UTF-8 Characters. Running JRuby in 1.8 mode results in a correctly formatted UTF-8 file.

JRuby on Linux works in both 1.8 and 1.9 modes.
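
For readers without access to the gist, a minimal sketch of the kind of write involved might look like the following. The sample string, the file name, and the helper names `write_utf8` and `utf8_file?` are assumptions made for illustration; the gist's actual code is not reproduced in this thread, and nothing here establishes whether naming the encoding explicitly avoids the problem on the affected JRuby versions.

```ruby
# encoding: utf-8
require "tmpdir"

# Hypothetical sample text; the string used in the original gist is not shown here.
SAMPLE = "日本語 test 123"

# Write text with an explicit external encoding so the result does not rely on
# Encoding.default_external.
def write_utf8(path, text)
  File.open(path, "w:UTF-8") { |f| f.write(text) }
end

# Read the raw bytes back and report whether they form valid UTF-8.
def utf8_file?(path)
  File.binread(path).force_encoding(Encoding::UTF_8).valid_encoding?
end

path = File.join(Dir.tmpdir, "jtest_sketch.txt")
write_utf8(path, SAMPLE)
puts "bytes written: #{File.size(path)}, valid UTF-8: #{utf8_file?(path)}"
```

Comparing the byte count against `SAMPLE.bytesize` gives a quick signal of whether the bytes on disk match the string in memory.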

matthauck Jan 18, 2013

Contributor

I'm experiencing this as well: https://gist.github.com/4563666

enebo Jan 18, 2013

Member

Wow that is super basic (and seemingly not OS-specific) and it works on MacOS. I wonder if we have some weird conditional somewhere messing this up. One thing to try as input to this issue is setting -J-Dfile.encoding=UTF-8 and seeing if it is still garbage. We should not be using Java's locale encoding for any of this, but if that fixes the issue then we have it leaking into our IO logic somewhere.

matthauck Jan 19, 2013

Contributor

I actually was running it with -J-Dfile.encoding=utf-8.

-Matt

rurounijones Jan 21, 2013

Unfortunately I can confirm that running it with -J-Dfile.encoding=utf-8 makes no difference.

rurounijones Jan 24, 2013

I hate to be that guy, but is there any way of expediting this? It is a massive blocker for us because we have to use Windows servers and we have to write UTF-8 files. Unfortunately we cannot use 1.8 mode.

enebo Jan 24, 2013

Member

I will be looking at this today. Hopefully it won't be a very invasive fix since this works on MacOS (we do have some windows-only logic in places in IO).

rurounijones Jan 24, 2013

Thank you very, very much. If there is any problem in replicating this or you need more environment information, I will be around starting about 8 hours from now.

[EDIT] It is only now that I realise how dumb the above line is since your average work day lasts 8 hours and, due to timezones, you will probably be finished before I am back. Just trying to say that I will respond as quickly as I can if more information is needed.

enebo Jan 24, 2013

Member

I think I will need more help on this and will try again tomorrow morning when I get more feedback. I tried both of your scripts and the first thing I did was compare against MRI 1.9.3.

@matthauck - Here are my results: https://gist.github.com/4627873
At the top we see that MRI and JRuby are both doing the wrong thing in the same way. In the second output, I tell both Rubies to use UTF-8 as the default external encoding and then things work as expected. I took this script a step further and also made sure the input string matches the output string, and they agree. The only oddity here is that when printing to the console we see different characters. This gives me pause...BUT...

@rurounijones - When I run your script using -Eutf-8 and look at the output in Notepad (telling Notepad to load it as a UTF-8 document) I see the same Japanese testo 123.

This second test displaying proper text in Notepad makes me think we have some issue in printing to a console, but not with how we are reading and writing these characters in IO. I would not think the console issue could have anything to do with Rails not displaying things properly.

I guess if you could both make sure you run with -Eutf-8 (I am guessing MRI uses Windows-1252 or similar depending on geography) and see what problems we still get after doing that. Someone said they could not believe anyone has used JRuby for multi-byte characters on Windows before, and now I am starting to wonder if the same cannot also be said of MRI? If -Eutf-8 were required to do Rails on Windows I think it would be plastered all over the internet...
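
To compare what each Ruby falls back to when no encoding is named, it may help to print the defaults on each setup. The following is an illustrative sketch only: `encoding_report` is a hypothetical helper and the temp file name is invented, neither comes from the gists above.

```ruby
require "tmpdir"

# Hypothetical helper: collect the encodings Ruby falls back to when an IO
# call does not name one. -E (for example -Eutf-8) sets default_external.
def encoding_report
  internal = Encoding.default_internal
  {
    default_external: Encoding.default_external.name,
    default_internal: internal ? internal.name : nil,
    locale: Encoding.find("locale").name
  }
end

puts encoding_report.inspect

# A string read without an explicit encoding is tagged with default_external
# (or transcoded to default_internal when that is set).
path = File.join(Dir.tmpdir, "enc_report_sketch.txt")
File.binwrite(path, "abc")
puts "implicit read tagged as: #{File.read(path).encoding.name}"
```

Running the same file with and without -Eutf-8 on each interpreter should show whether the two start from the same defaults.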

rurounijones Jan 25, 2013

Hello enebo

I just re-ran the tests using ruby 1.9.3 and jruby again and here are the results:

Ruby 1.9.3:                        Passed
Ruby 1.9.3 /w -Eutf8:              Passed

JRuby 1.7.1 (1.9 mode):            Failed
JRuby 1.7.1 (1.9 mode) /w -Eutf-8: Failed

JRuby 1.7.1 (1.8 mode):            Passed
JRuby 1.7.1 (1.8 mode) /w -Eutf-8: Passed

JRuby 1.6.8 (1.9 mode):            Passed
JRuby 1.6.8 (1.9 mode) /w -Eutf-8: Passed

JRuby 1.6.8 (1.8 mode):            Passed
JRuby 1.6.8 (1.8 mode) /w -Eutf-8: Passed

"Passed" in this case means that the file opens correctly in Notepad++ and, after copying it to a Linux machine, "file jtest.txt" returns "jtest.txt: UTF-8 Unicode text, with no line terminators". For the failed files it just comes up as "jtest.txt: data".

When correct the file is 24 bytes.
When incorrect the file is 8 bytes. (Actual text displayed in Kate is ƹ���"W although this text gets messed up by GitHub.) Interestingly, when I open the file in the Kate editor on Kubuntu, its encoding auto-detection says the file encoding is ISO-8859-15, although how much we can trust this I do not know.
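
The same pass/fail check (byte count plus whether the bytes decode as UTF-8) could also be scripted rather than done by hand. This is a rough sketch under stated assumptions: `inspect_utf8_file` is a hypothetical helper, the file names are invented, and the "bad" bytes are a made-up invalid sequence, not the bytes the failing runs produced.

```ruby
# encoding: utf-8
require "tmpdir"

# Hypothetical helper: report the byte count of a file and whether its raw
# bytes are a valid UTF-8 sequence.
def inspect_utf8_file(path)
  raw = File.binread(path)
  {
    bytes: raw.bytesize,
    valid_utf8: raw.force_encoding(Encoding::UTF_8).valid_encoding?
  }
end

good = File.join(Dir.tmpdir, "good_sketch.txt")
bad  = File.join(Dir.tmpdir, "bad_sketch.txt")

# Three Japanese characters, three bytes each in UTF-8.
File.binwrite(good, "日本語")
# Invented invalid bytes for demonstration: a lead byte with no continuation
# bytes, an ASCII byte, and a stray continuation byte.
File.binwrite(bad, [0xE5, 0x2C, 0x9E].pack("C*"))

puts inspect_utf8_file(good).inspect
puts inspect_utf8_file(bad).inspect
```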

If you can give me an email address I can send you the broken file itself so you have a raw file to look at (I cannot easily post files on the web due to the company firewall).

Windows 2008 R2

MRI: ruby 1.9.3p362 (2012-12-25) [i386-mingw32]

JRuby 1.7.1:
1.9: jruby 1.7.1 (1.9.3p327) 2012-12-03 30a153b on Java HotSpot(TM) 64-Bit Server VM 1.7.0_10-b18 [Windows Server 2008 R2-amd64]
1.8: jruby 1.7.1 (ruby-1.8.7p370) 2012-12-03 30a153b on Java HotSpot(TM) 64-Bit Server VM 1.7.0_10-b18 [Windows Server 2008 R2-amd64]

JRuby 1.6.8:
1.9: jruby 1.6.8 (ruby-1.9.2-p312) (2012-09-18 1772b40) (Java HotSpot(TM) 64-Bit Server VM 1.7.0_10) [Windows Server 2008 R2-amd64-java]
1.8: jruby 1.6.8 (ruby-1.8.7-p357) (2012-09-18 1772b40) (Java HotSpot(TM) 64-Bit Server VM 1.7.0_10) [Windows Server 2008 R2-amd64-java]

Java:
java version "1.7.0_10"
Java(TM) SE Runtime Environment (build 1.7.0_10-b18)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

I also tried running Matt's code using your command-line flags and got failures for every test.

enebo Jan 25, 2013

Member

@rurounijones Two things...I noticed you are running JRuby 1.7.1 and not 1.7.2. I just tried Matt's file with 1.7.2 and it was still broken. However, it is working on master. So, could I ask you to try master of JRuby against your stuff? We must have fixed something since 1.7.2 was out....

rurounijones Jan 25, 2013

I will do so as soon as I get back to the office on Monday.

enebo Jan 25, 2013

Member

Thanks @rurounijones. I think you may find this is fixed on master. @trejkaz has an interesting repro (which involves two levels of charsets - Java and Writing IO) and it works with master: https://gist.github.com/4631108

matthauck Jan 25, 2013

Contributor

Yup. Confirmed it is fixed on master for mine.

rurounijones Jan 28, 2013

Confirmed that jruby-bin-1.7.3.dev snapshot released on 27-Jan-2013 works correctly for me.

It works with AND without the -Eutf-8 parameter.

BanzaiMan Feb 13, 2013

Member

Looks like the master fixes this.

BanzaiMan closed this Feb 13, 2013

rurounijones Feb 13, 2013

I was waiting for a 1.7.3 release candidate so I could check it there and then close this ticket. While I think about it: is there now a test for this in the code to prevent a regression? This is a pretty big showstopper for some of us and I would hate to see it re-emerge.

BanzaiMan Feb 13, 2013

Member

@rurounijones Good point. I don't have access to a Windows VM; if you can come up with a spec, that'll be most definitely appreciated.

BanzaiMan reopened this Feb 13, 2013

rurounijones Feb 13, 2013

We can probably make something using @matthauck's code. I do not have time to do so for a while though (assuming I have the skill to do it). What would be easier: leave this ticket open, or close this ticket and then create a new patch request with the spec?

BanzaiMan Feb 13, 2013

Member

Let's leave this one open, then.

enebo Feb 28, 2013

Member

I added a regression spec for this in 3a6dbf6. I only did a couple of the same sorts of tests Matt had and I tweaked to use UTF-16BE so all our supported platforms have the encoding discrepancy.
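
For anyone who wants to try the idea locally, a round trip through UTF-16BE might be sketched like this. It is not the spec from 3a6dbf6; the helper name `roundtrip_utf16be`, the sample text, and the temp file name are all assumptions for illustration.

```ruby
# encoding: utf-8
require "tmpdir"

# Hypothetical sample text, not taken from the actual spec.
TEXT = "日本語"

# Write text through an IO whose external encoding is UTF-16BE, then return
# both the raw bytes on disk and the text decoded back to UTF-8.
def roundtrip_utf16be(text)
  path = File.join(Dir.tmpdir, "utf16be_sketch.txt")
  File.open(path, "w:UTF-16BE") { |f| f.write(text) }
  raw = File.binread(path)
  decoded = File.open(path, "r:UTF-16BE:UTF-8") { |f| f.read }
  [raw, decoded]
ensure
  File.delete(path) if path && File.exist?(path)
end

raw, decoded = roundtrip_utf16be(TEXT)
expected = TEXT.encode("UTF-16BE").bytes.to_a
puts "bytes match: #{raw.bytes.to_a == expected}, text matches: #{decoded == TEXT}"
```

Because UTF-16BE differs from the default external encoding on every platform mentioned in this thread, a mismatch in either comparison would point at transcoding on write or read rather than at a platform default.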

enebo closed this Feb 28, 2013
