Windows Traditional Chinese Edition: unknown encoding name - MS950 #5707

Closed
umairsair opened this issue Apr 24, 2019 · 34 comments

@umairsair

umairsair commented Apr 24, 2019

Environment

JRuby 9.2.6.0
Windows 7 Traditional Chinese Edition

I am getting the following error with JRuby:

org.jruby.exceptions.MainExitException: unknown encoding name - MS950

Quick analysis shows that we don't have an entry for MS950 in org.jcodings.EncodingList.

This problem was not present in very old JRuby versions. It seems that the following commits changed the behavior on this Windows edition.

1ceddaa

239c726

This problem is a blocker on this Windows edition.

It seems to me that MS950 is an alias of Big5. If this is correct, then we just need a one-line change in the EncodingList class.

umairsair changed the title from "Windows Traditional Chinese Editionunknown encoding name - MS950" to "Windows Traditional Chinese Edition: unknown encoding name - MS950" on Apr 24, 2019
@headius
Member

headius commented Apr 25, 2019

@umairsair Thanks for the analysis! I think you're right, we just need an alias. @lopex does that seem right to you?

@headius
Member

headius commented Apr 25, 2019

I see no such alias in CRuby. Does CRuby work properly in your environment?

@headius
Member

headius commented Apr 25, 2019

From the Wikipedia page, it sounds like MS950 is not exactly the same as Big5, but it's probably close enough to alias?

@umairsair
Author

Does CRuby work properly in your environment?

I haven't tried it extensively. I just tried simple commands like reading a file. I tried reading the file with encoding "MS950" and "x-windows-950" and it just gave me a warning.

warning: Unsupported encoding x-windows-950 ignored

Maybe we should do the same in JRuby: instead of throwing an exception, just use some default encoding.

From the Wikipedia page, it sounds like MS950 is not exactly the same as Big5, but it's probably close enough to alias?

Yes, it seems MS950 is a bit different, but I couldn't find any details of the actual difference. I share the opinion that it is close enough to alias. If we don't get the details on the encoding difference, then we should probably add it as an alias until someone comes up with a problem with it. WDYS?

@lopex
Member

lopex commented Apr 25, 2019

I think we should warn and use the default encoding then. Otherwise, maybe this could be raised as an MRI issue?

@headius
Member

headius commented Apr 25, 2019

@lopex That's a good point.

@umairsair Can you confirm whether CRuby/MRI has the same issue? If so, we probably should coordinate with them on a fix.

@umairsair
Author

I think we should warn and use default encoding then. Otherwise, maybe this could be raised as an MRI issue ?

IMO we should do both: instead of just failing, we should warn and move on, the same as CRuby, and also open an MRI issue to support the MS950 encoding.

I am using JRuby in Eclipse as a plugin dependency, and there is no way to get around this problem except setting Java's "file.encoding" property to some other encoding. That changes the complete Eclipse environment, which is not acceptable.

@umairsair Can you confirm whether CRuby/MRI has the same issue?

It doesn't support the MS950 encoding, but that is not blocking in any way as long as I am working with non-Chinese content on this Windows edition. Is there anything specific that you want me to try out with MRI?

@headius
Member

headius commented Apr 25, 2019

@umairsair It helps to know that it still works. What encoding does it end up choosing?

@headius
Member

headius commented Apr 25, 2019

@umairsair Can you show us the full backtrace for that exception please?

If we can determine what CRuby falls back on we can make this change fairly quickly.

@umairsair
Author

umairsair commented Apr 26, 2019

I read a file created with Chinese characters in it.

irb(main):034:0> s= File.read("c:/temp/temp.txt", :encoding => 'ms950')
(irb):34: warning: Unsupported encoding ms950 ignored
=> "檔案資料夾"
irb(main):035:0> s.encoding
=> #<Encoding:CP950>

Following is the backtrace. Exception is thrown from here.

Thread [Worker-7] (Suspended)	
	owns: LocalContext  (id=273)	
	Ruby.initCore() line: 1488	
	Ruby.bootstrap() line: 1340	
	Ruby.init() line: 1237	
	Ruby.newInstance(RubyInstanceConfig) line: 368	
	LocalContext.getRuntime() line: 117	
	SingleThreadLocalContextProvider.getRuntime() line: 62	
	EmbedRubyRuntimeAdapterImpl.runParser(Object, String, int...) line: 167	
	EmbedRubyRuntimeAdapterImpl.parse(String, int...) line: 94	
	ScriptingContainer.parse(String, int...) line: 1227	
	ScriptingContainer.runScriptlet(String) line: 1287	
        ..........

Update: removing unnecessary frame.

@headius
Member

headius commented Apr 26, 2019

Oh CP950!

@umairsair
Author

So do you have a possible solution to fix it? Anything else I can help you with?

@headius
Member

headius commented Apr 26, 2019

I'm discussing it on matrix now with @lopex. We could just add the alias, but our list of encodings is generated from CRuby. We'd rather figure out how they're falling back and why they warn but still apparently pick the right encoding.

@headius
Member

headius commented Apr 26, 2019

Oh one thing you might be able to do is force the JVM to use CP950 instead of MS950 by passing -J-Dfile.encoding=CP950 to JRuby. That's the property we look at to try to figure out the system encoding.
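
To check which value the JVM actually reports before and after passing that flag, a small standalone program can print the property. This is only a minimal sketch using the JDK: the class name FileEncodingProbe is made up for illustration, and the "UTF-8" default simply mirrors the default argument visible in the JRuby snippet quoted further down in this thread.

```java
import java.nio.charset.Charset;

final class FileEncodingProbe {
    /** Returns the file.encoding system property, or "UTF-8" when it is unset. */
    static String fileEncoding() {
        return System.getProperty("file.encoding", "UTF-8");
    }

    public static void main(String[] args) {
        // The raw property string is what a name-based lookup would see.
        System.out.println("file.encoding  = " + fileEncoding());
        // The charset the JVM itself resolved as its default, for comparison.
        System.out.println("defaultCharset = " + Charset.defaultCharset().name());
    }
}
```

Running it with and without -Dfile.encoding=... shows whether the override reaches the process. The printed values depend on the platform and JDK version.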

@headius
Member

headius commented Apr 26, 2019

@umairsair I just noticed something odd about your stack trace above: JRubyParser. There is no such class in JRuby...the only place such a class exists is in the external jruby-parser project, which I believe has not been updated in some time (@enebo knows better than I). I would not at all be surprised to find that it's having trouble with unknown encodings since it doesn't use the same mechanisms as JRuby proper to deal with them.

Please provide an example of how you're running JRuby to trigger this error. At this point we have been unable to reproduce your issue on any current version of JRuby, and that JRubyParser line in the stack trace is highly suspect.

@umairsair
Author

Oh one thing you might be able to do is force the JVM to use CP950 instead of MS950 by passing -J-Dfile.encoding=CP950 to JRuby.

I think we don't have a JRuby-specific property for file encoding. I see SafePropertyAccessor.getProperty(..), and it gets the property using System.getProperty(..). Kindly point me to the correct API if I am wrong.

@umairsair I just noticed something odd about your stack trace above: JRubyParser.

Sorry for causing confusion; it's my own class that just calls ScriptingContainer.runScriptlet(..).

Please provide an example of how you're running JRuby to trigger this error.

A very simple example.

new ScriptingContainer(LocalContextScope.SINGLETHREAD).runScriptlet("require 'FileUtils'")

At this point we have been unable to reproduce your issue on any current version of JRuby,

Can you please tell how you are trying to reproduce?

@headius
Member

headius commented Apr 30, 2019

Can you please tell how you are trying to reproduce?

Well so far I've just been trying to get JRuby to run with MS950 as the system encoding, but it doesn't seem to trigger any issues. I'm thinking this may be specific to the ScriptingContainer.runScriptlet(..) path at this point, so we'll try to use your example and force the same system encoding.

@umairsair
Author

On a non-Chinese edition of Windows 7, I am able to reproduce this issue by forcing Java's file.encoding to MS950. So I guess you will also be able to reproduce it.

@umairsair
Author

Hello @headius , @lopex ,

Have you been able to find an appropriate solution so far? What are the plans for this issue? It is a blocker for us. Kindly let me know if I can help in any way to fix this issue ASAP.

@headius
Member

headius commented May 14, 2019

@umairsair Sorry for the delay. I'm back to work this week and looking into this.

@headius
Member

headius commented May 14, 2019

So running the following code with file.encoding set to MS950 does not appear to reproduce this.

container = org.jruby.embed.ScriptingContainer.new(
    org.jruby.embed.LocalContextScope::SINGLETHREAD)
container.runScriptlet("require %{FileUtils}")

A workaround for you might be to set file.encoding to CP950.

@headius
Member

headius commented May 14, 2019

@umairsair Ok so I'm not sure we know how to proceed at this point. For all cases I have tested with MS950 as an encoding, we behave the same as CRuby. When I run your ScriptingContainer code with -Dfile.encoding=MS950 it appears to work just fine. And your stack trace containing JRubyParser is still very suspicious because we do not ship this class in JRuby 9.2.6.0.

At this point I have two suggestions for you:

  • Try forcing file.encoding JVM property to be CP950 and see if that helps.
  • If you still have trouble, push a GitHub repository with a reproducible example. I've only seen small snippets of code from you and this would go a lot faster if you could push a repository that I can clone and run to see the issue.

Sorry we have been unable to help you, but without a reproduction we would have to blindly guess at what's wrong.

@umairsair
Author

Are you running it from a Java application? From the snippet, it seems that you are running it from the JRuby terminal.

If you are unable to reproduce it using a Java application, I'll share a Java application that reproduces this issue.

Try forcing file.encoding JVM property to be CP950 and see if that helps.

This workaround will work, but as I mentioned earlier, we cannot enforce any encoding because it would change the whole Java environment.

@umairsair
Author

@headius ,

I have pushed a sample to the following location and added instructions to the README.

https://github.com/umairsair/jruby-issue-5707/tree/master

@headius
Member

headius commented May 15, 2019

Reproduced!

@headius
Member

headius commented May 15, 2019

Ok so now I see where it's happening and why it only affects windows:

if (Platform.IS_WINDOWS) {
    encoding = SafePropertyAccessor.getProperty("file.encoding", "UTF-8");
    Encoding filesystemEncoding = encodingService.loadEncoding(ByteList.create(encoding));
    if (filesystemEncoding == null) throw new MainExitException(1, "unknown encoding name - " + encoding);
    setDefaultFilesystemEncoding(filesystemEncoding);
} else {
    setDefaultFilesystemEncoding(getDefaultExternalEncoding());
}

@headius
Member

headius commented May 15, 2019

@umairsair Ok so my suggested workaround of setting file.encoding=CP950 appears to work ok, or at least it allows your example to function properly. So there's your workaround.

Short term fix in JRuby will be to simply fall back on default external, but I'm not certain this is the right fix just yet. cc @lopex @enebo
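
The shape of that short-term fallback can be sketched in isolation. This is not the code from the referenced commits. It is a hypothetical stand-alone illustration in which a small set of names stands in for the runtime's encoding table, the class and method names are invented, and the warning follows the "warn and use the default" suggestion made earlier in the thread rather than anything confirmed about the actual change.

```java
import java.util.Locale;
import java.util.Set;

final class FilesystemEncodingFallback {
    // Hypothetical stand-in for the runtime's table of known encoding names.
    private static final Set<String> KNOWN = Set.of("UTF-8", "CP950", "ASCII-8BIT");

    /** Returns the requested name if it is known, otherwise the supplied fallback. */
    static String resolve(String requested, String defaultExternal) {
        String normalized = requested.toUpperCase(Locale.ROOT);
        if (KNOWN.contains(normalized)) {
            return normalized;
        }
        // Warn instead of aborting startup, then fall back to the default external encoding.
        System.err.println("warning: unknown encoding name - " + requested
                + "; falling back to " + defaultExternal);
        return defaultExternal;
    }

    public static void main(String[] args) {
        System.out.println(resolve("CP950", "ASCII-8BIT")); // prints CP950
        System.out.println(resolve("MS950", "ASCII-8BIT")); // prints ASCII-8BIT
    }
}
```

The trade-off is that an unrecognized name no longer stops the runtime from booting, but the filesystem encoding silently becomes whatever the default external encoding is, which may not match the real code page.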

headius added a commit to headius/jruby that referenced this issue May 15, 2019
headius added a commit to headius/jruby that referenced this issue May 15, 2019
@umairsair
Author

Thanks @headius for the quick fix. In basic testing, I have verified the fix on Windows Traditional Chinese edition and it is working fine; ASCII-8BIT is the default external encoding.

I'll backport this fix and try to build JRuby (a quick guide to building only the JRuby jar would be helpful :)

BTW when is the 9.2.8.0 release expected?

@headius
Member

headius commented May 16, 2019

If you need the complete jar, run: ./mvnw -Pcomplete

The complete jar will be built into the lib/ dir.

9.2.8.0 could probably go any time but there's a large rework of load/require I'd hoped to finish. We will discuss today.

@headius
Member

headius commented May 16, 2019

I looked into how CRuby does this.

Basically the piece we're missing is the ability to get the exact code page number and then look up based on that. Both MS950 and CP950 are names for code page 950, so MRI never sees the "MS" part when picking the default filesystem encoding.

We could of course bind those methods via FFI but I'd rather have a consistent way to do this without a native dependency. That might be as simple as looking for encoding patterns of "MS####" and swapping them for "CP####".

@headius
Member

headius commented May 16, 2019

I pushed an update that will attempt to translate a code page name matching /^MS([0-9]+)$/ to the "CP" form before attempting to look up the encoding. In your case, this should allow it to successfully pick the CP950 code page rather than falling back on ASCII-8BIT.
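
As a rough illustration of that translation step (not the code that was pushed), the rename can be written with java.util.regex. The class and method names here are hypothetical, and the match is deliberately case-sensitive because the pattern quoted above has no case-insensitive flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CodePageNames {
    // Same pattern as quoted in the comment above: "MS" followed only by digits.
    private static final Pattern MS_CODEPAGE = Pattern.compile("^MS([0-9]+)$");

    /** Rewrites "MS<digits>" to "CP<digits>"; any other name is returned unchanged. */
    static String toCpForm(String name) {
        Matcher m = MS_CODEPAGE.matcher(name);
        return m.matches() ? "CP" + m.group(1) : name;
    }

    public static void main(String[] args) {
        System.out.println(toCpForm("MS950")); // prints CP950
        System.out.println(toCpForm("UTF-8")); // prints UTF-8
    }
}
```

Names that do not match, such as "x-windows-950" or a lowercase "ms950", pass through untouched in this sketch, so they would still need the fallback path if the lookup does not recognize them.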

@headius
Member

headius commented May 17, 2019

FWIW the logic for this is in localeinit.c in MRI:

int
Init_enc_set_filesystem_encoding(void)
{
    int idx;
#if NO_LOCALE_CHARMAP
    idx = ENCINDEX_US_ASCII;
#elif defined _WIN32
    char cp[SIZEOF_CP_NAME];
    const UINT codepage = ruby_w32_codepage[1] ? ruby_w32_codepage[1] :
        AreFileApisANSI() ? GetACP() : GetOEMCP();
    CP_FORMAT(cp, codepage);
    idx = rb_enc_find_index(cp);
    if (idx < 0) idx = ENCINDEX_ASCII;
#elif defined __CYGWIN__
    idx = ENCINDEX_UTF_8;
#else
    idx = rb_enc_to_index(rb_default_external_encoding());
#endif
    return idx;
}

@headius headius added this to the JRuby 9.2.8.0 milestone May 17, 2019
@headius
Member

headius commented May 17, 2019

I'm going to call this fixed with the merge of #5733. It will be in 9.2.8.0.

@umairsair
Author

Thanks a lot once again @headius ! I'll try it on my end.
