Dir.glob returns UTF-8 string with Windows-31J encoding #4693

Closed
jakago opened this issue Jun 28, 2017 · 3 comments
@jakago jakago commented Jun 28, 2017

C:/blah/α.rb

# coding: utf-8

# 'α'.bytes => [206, 177]
path = 'C:/blah/α.rb'
p path.encoding
p path.bytes

Dir.glob(path) do |file|
  p file.encoding
  p file.bytes
end

Environment

jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 Java HotSpot(TM) Client VM 24.65-b04 on 1.7.0_65-b19 +jit [mswin32-x86]

Windows 7 Ultimate Service Pack 1 32-bit

Expected Behavior

C:\blah>ruby α.rb
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]

Actual Behavior

C:\blah>jruby α.rb
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]
#<Encoding:Windows-31J>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]

@jakago jakago commented Jun 29, 2017

Quick fix: force the external encoding to UTF-8 with -E:

C:\blah>jruby -Eutf-8 α.rb
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]
#<Encoding:UTF-8>
[67, 58, 47, 98, 108, 97, 104, 47, 206, 177, 46, 114, 98]
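For scripts that cannot control the command line, a minimal sketch of the same idea from inside Ruby follows. It relies on the assumption, not confirmed anywhere in this thread, that setting `Encoding.default_external` in the script affects `Dir.glob` the same way the `-E` flag does. The variable name `previous` is only illustrative.

```ruby
# Assumption: changing the default external encoding in-script is equivalent
# to passing -Eutf-8 for this code path. This is unverified on JRuby.
previous = Encoding.default_external      # e.g. Windows-31J on Japanese Windows
Encoding.default_external = Encoding::UTF_8

p previous
p Encoding.default_external
```

This changes a process-wide default, so it is best done once at startup, before any file names are read.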

I checked the source code for Dir.glob.
core/src/main/java/org/jruby/RubyDir.java:229:

Encoding enc = runtime.getDefaultExternalEncoding();

On Japanese Windows 7, Encoding.default_external is Windows-31J (also known as CP932 or MS932). This code applies that encoding to the multi-byte file names built by Java, even though their bytes are actually UTF-8.
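The mislabeling described above can be reproduced in plain Ruby without touching the file system. This is only a sketch: it simulates the symptom with `force_encoding` and does not call any JRuby internals. The variable names are illustrative.

```ruby
# 'α.rb' written with an escape so the literal is UTF-8 regardless of source encoding.
utf8_name = "\u03B1.rb"

# Simulate the bug: keep the UTF-8 bytes, but tag them as Windows-31J.
mislabeled = utf8_name.dup.force_encoding(Encoding::Windows_31J)

p mislabeled.encoding                  # the label changed
p mislabeled.bytes == utf8_name.bytes  # the bytes did not
p mislabeled == utf8_name              # non-ASCII strings with different encodings are not equal

# Relabeling (not transcoding) recovers the original string.
relabeled = mislabeled.dup.force_encoding(Encoding::UTF_8)
p relabeled == utf8_name
```

Until a fixed release is available, relabeling each glob result with `force_encoding(Encoding::UTF_8)` in this way may serve as a per-call workaround, assuming the underlying bytes really are UTF-8.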


@ahorek ahorek commented Sep 2, 2017

Ruby takes the result encoding from each input pattern. I tried to fix it, but this case still doesn't work:

Dir.glob(['/tmp'.force_encoding('utf-8'), '/tmp'.force_encoding('windows-1250')])
=> [utf-8, windows-1250]
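A self-contained way to check the array case is sketched below. It uses a temporary directory rather than `/tmp` so that the path is guaranteed to exist, and `glob_encodings` is a hypothetical helper introduced only for this example. On MRI with no default internal encoding set, each result is expected to carry the encoding of the pattern that matched it.

```ruby
require 'tmpdir'

# Hypothetical helper: glob the given patterns and report each result's encoding.
def glob_encodings(patterns)
  Dir.glob(patterns).map(&:encoding)
end

Dir.mktmpdir do |dir|
  # Same ASCII-only path, tagged with two different encodings.
  patterns = [
    dir.dup.force_encoding(Encoding::UTF_8),
    dir.dup.force_encoding(Encoding::Windows_1250)
  ]
  p glob_encodings(patterns)
end
```

Running the same script under both `ruby` and `jruby` gives a quick comparison of the two implementations.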

@headius headius commented Sep 7, 2017

Reproduced on Unix by forcing Windows-31J as external encoding:

[] ~/projects/jruby $ jruby -EWindows-31J α.rb
#<Encoding:UTF-8>
[206, 177, 46, 114, 98]
#<Encoding:Windows-31J>
[206, 177, 46, 114, 98]

```ruby
# coding: utf-8

# 'α'.bytes => [206, 177]
path = 'α.rb'
p path.encoding
p path.bytes

Dir.glob(path) do |file|
  p file.encoding
  p file.bytes
end
```

@ahorek's fix in #4773 does address the primary issue for this bug, but I'm working on a patch that fixes the Dir.glob([...]) case too.

@headius headius closed this in 49d1eb3 Sep 7, 2017
headius added a commit that referenced this issue Sep 7, 2017
@headius headius added this to the JRuby 9.2.0.0 milestone Sep 7, 2017