Dir.glob returns UTF-8 string with Windows-31J encoding #4693

Closed
jakago opened this Issue Jun 28, 2017 · 3 comments

jakago commented Jun 28, 2017

C:/blah/α.rb

# coding: utf-8

# 'α'.bytes => [206, 177]
path = 'C:/blah/α.rb'
p path.encoding
p path.bytes

Dir.glob(path) do |file|
  p file.encoding
  p file.bytes
end

Environment

jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 Java HotSpot(TM) Client VM 24.65-b04 on 1.7.0_65-b19 +jit [mswin32-x86]

Windows 7 Ultimate Service Pack 1 32-bit

Expected Behavior

C:\blah>ruby α.rb
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]

Actual Behavior

C:\blah>jruby α.rb
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]
#<Encoding:Windows-31J>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]

jakago commented Jun 29, 2017

Quick fix: run with -Eutf-8 so that the default external encoding is UTF-8.

C:\blah>jruby -Eutf-8 α.rb
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]

I checked the source code for Dir.glob.
core/src/main/java/org/jruby/RubyDir.java:229:

Encoding enc = runtime.getDefaultExternalEncoding();

On Japanese Windows 7, Encoding.default_external is Windows-31J (also known as cp932 or ms932). This code applies that encoding to the multi-byte file name built by Java, even though the name's bytes are UTF-8 encoded.
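
The mislabeling described above can be reproduced in plain Ruby without touching the file system. This is only an illustrative sketch, not the JRuby code path: it assumes the glob result carries the same UTF-8 bytes as the pattern and uses `force_encoding` to imitate the wrong tag. The variable names are made up for the example.

```ruby
# The file name as written in a UTF-8 source file.
name = 'α.rb'

# Imitate the bug: same bytes, but tagged with Windows-31J.
mislabeled = name.dup.force_encoding('Windows-31J')

p mislabeled.bytes == name.bytes   # the bytes are untouched
p mislabeled.encoding              # the tag is what changed
p mislabeled == name               # => false, the encodings are not comparable

# Read as Windows-31J, bytes 206 and 177 are two half-width katakana,
# so transcoding produces mojibake instead of "α".
p mislabeled.encode('UTF-8')       # => "ﾎｱ.rb"

# Relabeling (not transcoding) restores the original string.
relabeled = mislabeled.dup.force_encoding('UTF-8')
p relabeled == name                # => true
```

Because only the tag is wrong, `force_encoding('UTF-8')` on each result is enough to recover the name on an affected version; `encode` would make things worse.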

Contributor

ahorek commented Sep 2, 2017

Ruby (MRI) takes the encoding of each result from the input pattern. I tried to fix it, but this case, with an array of patterns in different encodings, still doesn't work:

Dir.glob(['/tmp'.force_encoding('utf-8'), '/tmp'.force_encoding('windows-1250')])
=> [utf-8, windows-1250]
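
The per-pattern behavior can be approximated in user code while a fix is pending. The helper below is hypothetical (it is not part of Ruby or JRuby, and it is not the patch discussed in this thread). It globs each pattern separately and relabels every match with that pattern's encoding, on the assumption that the returned bytes are already valid in that encoding, which holds for ASCII-only paths such as the temporary directory used here.

```ruby
require 'tmpdir'

# Hypothetical helper: glob each pattern on its own and tag every match
# with the encoding of the pattern that produced it.
# force_encoding only relabels the bytes; it does not transcode them.
def glob_with_pattern_encoding(patterns)
  Array(patterns).flat_map do |pattern|
    Dir.glob(pattern).map { |match| match.dup.force_encoding(pattern.encoding) }
  end
end

Dir.mktmpdir do |dir|
  patterns = [
    dir.dup.force_encoding('UTF-8'),
    dir.dup.force_encoding('Windows-1250')
  ]
  # One match per pattern, each carrying its pattern's encoding.
  p glob_with_pattern_encoding(patterns).map(&:encoding)
end
```

For non-ASCII paths, relabeling is only correct when the file system hands back bytes in the pattern's encoding, so treat this as a stopgap rather than a general solution.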

Member

headius commented Sep 7, 2017

Reproduced on Unix by forcing Windows-31J as external encoding:

[] ~/projects/jruby $ jruby -EWindows-31J α.rb
#<Encoding:UTF-8>
[206, 177, 46, 114, 98]
#<Encoding:Windows-31J>
[206, 177, 46, 114, 98]

```ruby
# coding: utf-8

# 'α'.bytes => [206, 177]
path = 'α.rb'
p path.encoding
p path.bytes

Dir.glob(path) do |file|
  p file.encoding
  p file.bytes
end
```

@ahorek's fix in #4773 does address the primary issue for this bug, but I'm working on a patch that fixes the Dir.glob([...]) case too.

@headius headius closed this in 49d1eb3 Sep 7, 2017

@headius headius added this to the JRuby 9.2.0.0 milestone Sep 7, 2017
