New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: NKF.nkf failing to convert certain characters #4921

Open
aabryant opened this Issue Dec 30, 2017 · 16 comments

Comments

Projects
None yet
4 participants
@aabryant

aabryant commented Dec 30, 2017

Environment

JRuby Version:
jruby 9.1.15.0 (2.3.3) 2017-12-07 929fde8 Java HotSpot(TM) 64-Bit Server VM 25.152-b16 on 1.8.0_152-b16 +jit [linux-x86_64]

OS:
Linux Home 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Expected Behavior

require 'nkf'
ex = [0x87, 0x5B].pack('C*').force_encoding('shift_jis') # 'Ⅷ' in SJIS
puts NKF.nkf('-X -Z -w', ex)

MRI Ruby

Under MRI Ruby, this snippet of code produces the string , or U+2167. This is what I would expect it to do.

require 'nkf'
puts NKF.nkf('-x -s -W', '')

Under MRI Ruby, this snippet of code will properly convert the UTF-8 string representation of to the SHIFT_JIS one.

Actual Behavior

JRuby

Under JRuby, the first snippet instead produces the following error:

Encoding::UndefinedConversionError: "\x87[" from Shift_JIS to UTF-8
from org/jruby/ext/nkf/RubyNKF.java:229:in `nkf'

The second snippet similarly fails to execute:

Encoding::UndefinedConversionError: "\xE2\x85\xA7" from UTF-8 to Shift_JIS
from org/jruby/ext/nkf/RubyNKF.java:229:in `nkf'

This issue seems to be common to all of the SHIFT_JIS representations of Roman numerals. I have yet to encounter it with any other character, though I regard it as likely that some others are affected.

@kares kares added the encoding label Jan 17, 2018

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jan 26, 2018

Member

Thank you for the detailed report!

NKF has indeed been a weak point for us, which we mostly ignored because of the move to the newer Encoding subsystem in Ruby 1.9. However we'd always like to see compatibility improved.

It may be as simple as duplicating what MRI uses for NKF, or there might be a pure-Ruby NKF out there build atop the Encoding subsystem.

Member

headius commented Jan 26, 2018

Thank you for the detailed report!

NKF has indeed been a weak point for us, which we mostly ignored because of the move to the newer Encoding subsystem in Ruby 1.9. However we'd always like to see compatibility improved.

It may be as simple as duplicating what MRI uses for NKF, or there might be a pure-Ruby NKF out there build atop the Encoding subsystem.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jan 26, 2018

Member

This looks like a promising library, should we chose to wrap an existing one: http://mariten.github.io/kanatools-java/en/kana-converter/

Member

headius commented Jan 26, 2018

This looks like a promising library, should we chose to wrap an existing one: http://mariten.github.io/kanatools-java/en/kana-converter/

@aabryant

This comment has been minimized.

Show comment
Hide comment
@aabryant

aabryant Jan 26, 2018

Thank you for responding! It reminded me to dig a little bit deeper into the situation myself, and I think that I've found the root of the issue. It lies not in NKF itself, but rather in JRuby's choice of definition for SHIFT_JIS.

As you may well know, SHIFT_JIS is a blanket name for a variety of encodings. MRI Ruby (like the majority of applications) appears to use MS932 (Microsoft CP932/Windows-31J) as its definition of choice (across all operating systems), while JRuby does not. If we change my example code to this:

require 'nkf'
ex = [0x87, 0x5B].pack('C*').force_encoding('cp932') # 'Ⅷ' in SJIS
puts NKF.nkf('-X -Z -w', ex)

It no longer errors and, in fact, returns the output that I would expect! The obvious conclusion is that the shift_jis encoding used by JRuby does not correspond to cp932. If JRuby simply uses Java's definition of SHIFT_JIS, it wouldn't surprise me at all if it corresponds to the original, unexpanded form of SHIFT_JIS that has fallen out of common use.

The really curious thing is that, as JRuby directly copied Ruby's list of Encoding aliases, I never would have found this at all if I had used the name SJIS instead of shift_jis, as SJIS is set as an alias of Windows-31J!

aabryant commented Jan 26, 2018

Thank you for responding! It reminded me to dig a little bit deeper into the situation myself, and I think that I've found the root of the issue. It lies not in NKF itself, but rather in JRuby's choice of definition for SHIFT_JIS.

As you may well know, SHIFT_JIS is a blanket name for a variety of encodings. MRI Ruby (like the majority of applications) appears to use MS932 (Microsoft CP932/Windows-31J) as its definition of choice (across all operating systems), while JRuby does not. If we change my example code to this:

require 'nkf'
ex = [0x87, 0x5B].pack('C*').force_encoding('cp932') # 'Ⅷ' in SJIS
puts NKF.nkf('-X -Z -w', ex)

It no longer errors and, in fact, returns the output that I would expect! The obvious conclusion is that the shift_jis encoding used by JRuby does not correspond to cp932. If JRuby simply uses Java's definition of SHIFT_JIS, it wouldn't surprise me at all if it corresponds to the original, unexpanded form of SHIFT_JIS that has fallen out of common use.

The really curious thing is that, as JRuby directly copied Ruby's list of Encoding aliases, I never would have found this at all if I had used the name SJIS instead of shift_jis, as SJIS is set as an alias of Windows-31J!

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jan 29, 2018

Member

I did a lot of poking around, and there seem to be precious few libraries for explicitly dealing with full and half-width kana and such things. Many links just lead back to NKF, which seems to have little use outside the Ruby bindings.

Member

headius commented Jan 29, 2018

I did a lot of poking around, and there seem to be precious few libraries for explicitly dealing with full and half-width kana and such things. Many links just lead back to NKF, which seems to have little use outside the Ruby bindings.

@aabryant

This comment has been minimized.

Show comment
Hide comment
@aabryant

aabryant Jan 29, 2018

I can't say I'm terribly surprised -- MRI Ruby itself seems to rely on a bundled copy of iconv for most of its conversions. While effective, that does present something of a challenge for a JVM-based implementation.

aabryant commented Jan 29, 2018

I can't say I'm terribly surprised -- MRI Ruby itself seems to rely on a bundled copy of iconv for most of its conversions. While effective, that does present something of a challenge for a JVM-based implementation.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jan 29, 2018

Member

@aabryant That's very interesting! I'm pulling in @lopex here since he did the original port of the encoding logic and has continued to help us maintain it.

I have not spent a lot of time in our NKF logic, but it does appear to be using the ported encoding subsystem. That makes me think that perhaps our version of that logic is improperly aliasing "shift_jis" to the wrong actual encoding. We have even run into Ruby compatibility tests that expected an encoding to be "Windows31J" instead of "Shift_JIS" and never quite knew why.

Member

headius commented Jan 29, 2018

@aabryant That's very interesting! I'm pulling in @lopex here since he did the original port of the encoding logic and has continued to help us maintain it.

I have not spent a lot of time in our NKF logic, but it does appear to be using the ported encoding subsystem. That makes me think that perhaps our version of that logic is improperly aliasing "shift_jis" to the wrong actual encoding. We have even run into Ruby compatibility tests that expected an encoding to be "Windows31J" instead of "Shift_JIS" and never quite knew why.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 26, 2018

Member

Ok, trying to get this off my plate.

So in jcodings EncodingList, I found this line:

        EncodingDB.declare("Shift_JIS", "SJIS");

But later in the same file I found this:

        EncodingDB.alias("SJIS", "Windows-31J");

This seems odd.

Member

headius commented Feb 26, 2018

Ok, trying to get this off my plate.

So in jcodings EncodingList, I found this line:

        EncodingDB.declare("Shift_JIS", "SJIS");

But later in the same file I found this:

        EncodingDB.alias("SJIS", "Windows-31J");

This seems odd.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 26, 2018

Member

If I modify the logic to use Windows-31J all the time for "Shift_JIS", your script passes.

Member

headius commented Feb 26, 2018

If I modify the logic to use Windows-31J all the time for "Shift_JIS", your script passes.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 26, 2018

Member

A few more discoveries.

  • In NKF's autodetection code (in CRuby) I see this logic:
static const char*
get_guessed_code(void)
{   
    if (input_codename && !*input_codename) {
        input_codename = "BINARY";
    } else {
        struct input_code *p = find_inputcode_byfunc(iconv);
        if (!input_codename) {
            input_codename = "ASCII";
        } else if (strcmp(input_codename, "Shift_JIS") == 0) {
            if (p->score & (SCORE_DEPEND|SCORE_CP932))
                input_codename = "CP932";
...

So if the NKF subsystem detects "Shift_JIS" is the encoding, and whatever "score" means matches up, it uses CP932 (Windows-31J) for the NKF processing.

  • Our encoding detection is broken for this example. In RubyNKF.guess there's logic that attempts to use the x-JISAutoDetect encoding from the JDK to detect which type of JIS it is. For this small input, however, it fails immediately. It seems like it needs more than one character to do the detection.

  • After failing to detect the encoding from the actual bytes, it falls back on using whatever JDK Charset corresponds to the name of the string's encoding, and this is where it falls into Java's definition of Shift_JIS (which must be the older one) rather than CP932.

Member

headius commented Feb 26, 2018

A few more discoveries.

  • In NKF's autodetection code (in CRuby) I see this logic:
static const char*
get_guessed_code(void)
{   
    if (input_codename && !*input_codename) {
        input_codename = "BINARY";
    } else {
        struct input_code *p = find_inputcode_byfunc(iconv);
        if (!input_codename) {
            input_codename = "ASCII";
        } else if (strcmp(input_codename, "Shift_JIS") == 0) {
            if (p->score & (SCORE_DEPEND|SCORE_CP932))
                input_codename = "CP932";
...

So if the NKF subsystem detects "Shift_JIS" is the encoding, and whatever "score" means matches up, it uses CP932 (Windows-31J) for the NKF processing.

  • Our encoding detection is broken for this example. In RubyNKF.guess there's logic that attempts to use the x-JISAutoDetect encoding from the JDK to detect which type of JIS it is. For this small input, however, it fails immediately. It seems like it needs more than one character to do the detection.

  • After failing to detect the encoding from the actual bytes, it falls back on using whatever JDK Charset corresponds to the name of the string's encoding, and this is where it falls into Java's definition of Shift_JIS (which must be the older one) rather than CP932.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 26, 2018

Member

There are a few other autodetection options listed here: https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream

I think at this point the quick fix would be to force the use of Windows-31J when the incoming encoding is actually Shift_JIS, since that appears to be what MRI is doing (subject to the "score" flags there).

Member

headius commented Feb 26, 2018

There are a few other autodetection options listed here: https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream

I think at this point the quick fix would be to force the use of Windows-31J when the incoming encoding is actually Shift_JIS, since that appears to be what MRI is doing (subject to the "score" flags there).

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 26, 2018

Member

Ok, this is interesting. The following script reproduces the same error in JRuby and CRuby:

require 'nkf'
ex = [0x87, 0x5B].pack('C*').force_encoding('Shift_JIS') # 'Ⅷ' in SJIS
puts ex.encode('UTF-8')

So what's happening in our NKF is that we're falling back on Shift_JIS (since our detection doesn't work), which does not have conversion paths to/from UTF-8. MRI attempts to detect the encoding, and upon detecting Shift_JIS it seems to use CP932 in some cases. When we force MRI to also use Shift_JIS, it errors like we do.

Another interesting bit: by virtue of its detection, MRI can handle these bytes as "BINARY" and still work.

Member

headius commented Feb 26, 2018

Ok, this is interesting. The following script reproduces the same error in JRuby and CRuby:

require 'nkf'
ex = [0x87, 0x5B].pack('C*').force_encoding('Shift_JIS') # 'Ⅷ' in SJIS
puts ex.encode('UTF-8')

So what's happening in our NKF is that we're falling back on Shift_JIS (since our detection doesn't work), which does not have conversion paths to/from UTF-8. MRI attempts to detect the encoding, and upon detecting Shift_JIS it seems to use CP932 in some cases. When we force MRI to also use Shift_JIS, it errors like we do.

Another interesting bit: by virtue of its detection, MRI can handle these bytes as "BINARY" and still work.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 26, 2018

Member

I have tried two of the top contenders for charset detection. Neither worked like I hoped. I think the problem is that they're too general and not focused on just JIS-related encodings.

First I tried the "juniversalchardet" project, which appears to have been forked all over the place. It detected the incoming bytes as Windows-1252.

I also tried ICU4J. It detected Big5.

I have not found a JIS-only detector yet. I have also not confirmed whether MRI successfully detects the content, but given that it works with BINARY encoding, I assume it does.

Member

headius commented Feb 26, 2018

I have tried two of the top contenders for charset detection. Neither worked like I hoped. I think the problem is that they're too general and not focused on just JIS-related encodings.

First I tried the "juniversalchardet" project, which appears to have been forked all over the place. It detected the incoming bytes as Windows-1252.

I also tried ICU4J. It detected Big5.

I have not found a JIS-only detector yet. I have also not confirmed whether MRI successfully detects the content, but given that it works with BINARY encoding, I assume it does.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 26, 2018

Member

The file command can't guess this encoding from two bytes either:

[] ~/projects/jruby $ jruby -e "File.write('jis.txt', [0x87, 0x5B].pack('C*').force_encoding('BINARY'))"

[] ~/projects/jruby $ file jis.txt
jis.txt: Non-ISO extended-ASCII text, with no line terminators
Member

headius commented Feb 26, 2018

The file command can't guess this encoding from two bytes either:

[] ~/projects/jruby $ jruby -e "File.write('jis.txt', [0x87, 0x5B].pack('C*').force_encoding('BINARY'))"

[] ~/projects/jruby $ file jis.txt
jis.txt: Non-ISO extended-ASCII text, with no line terminators
@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 26, 2018

Member

Another attempt using jgloss produced EUC-JP and a different error because this isn't valid in EUC-JP.

I'm starting to think that perhaps MRI can't detect it either, but because the default mode is to assume "JIS", it uses that.

Member

headius commented Feb 26, 2018

Another attempt using jgloss produced EUC-JP and a different error because this isn't valid in EUC-JP.

I'm starting to think that perhaps MRI can't detect it either, but because the default mode is to assume "JIS", it uses that.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Feb 26, 2018

Member

Pinging @nurse to see if perhaps they can give us some tips here!

Member

headius commented Feb 26, 2018

Pinging @nurse to see if perhaps they can give us some tips here!

@nurse

This comment has been minimized.

Show comment
Hide comment
@nurse

nurse Feb 27, 2018

Contributor

Japan defines character set for their use as JIS (Japanese Industrial Standards).
JIS X 0208 is one of them, and 'Ⅷ' is not in JIS X 0208.
Microsoft defines an extended character set of JIS X 0208 named Windows-31J (CP932), which includes 'Ⅷ'.

irb(main):001:0> 'Ⅷ'.encode("windows-31j")
=> "\x{875B}"
irb(main):002:0> 'Ⅷ'.encode("shift_jis")
Encoding::UndefinedConversionError: U+2167 from UTF-8 to Shift_JIS
	from (irb):2:in `encode'
	from (irb):2
	from /Users/naruse/.rbenv/versions/2.4.0/bin/irb:11:in `<main>'

nkf is a tool to read net news.
Therefore it has fallback conversions.
When nkf runs to convert Shift_JIS bu failed to convert a character, it tires to convert with Windows-31J table.

Your current solution, using Windows-31J table instead of Shift_JIS, has a problem.
It won't convert U+301C to Shift-JIS.
Because Shift_JIS has a conversion 0x8160 in Shift-JIS to U+301C, but Windows-31J converts 0x8160 into U+FF5E.
Using both of two table can emulate nkf's behavior.


Japanese used 3 encodings families, Shift-JIS, Japanese EUC, and ISO/2022.
ISO/2022 encodings use escape sequence, and easily autodetect.

Both of Shift-JIS and Japanese EUC are multibyte encodings
Just encodes a binary as Windows-31J and EUC-JP and assume the succeeded one can work usually.
But they may conflict.
For example both of "ネコ" in Shift-JIS and "蛤" in Japanese EUC are encoded as "\xC8\xBA".

But "ネコ" in Shift-JIS is alternative representation of "ネコ".
First one is half width katakana, and latter one is full width katakana, and usually they use full width katakana.
Therefore if nkf hits "\xC8\xBA", it guesses it as EUC.

The scoring idea is such one.

If it is implemented in Java, just assume a binary as some encodings and convert to Unicode string, and then get the category of the character.

Contributor

nurse commented Feb 27, 2018

Japan defines character set for their use as JIS (Japanese Industrial Standards).
JIS X 0208 is one of them, and 'Ⅷ' is not in JIS X 0208.
Microsoft defines an extended character set of JIS X 0208 named Windows-31J (CP932), which includes 'Ⅷ'.

irb(main):001:0> 'Ⅷ'.encode("windows-31j")
=> "\x{875B}"
irb(main):002:0> 'Ⅷ'.encode("shift_jis")
Encoding::UndefinedConversionError: U+2167 from UTF-8 to Shift_JIS
	from (irb):2:in `encode'
	from (irb):2
	from /Users/naruse/.rbenv/versions/2.4.0/bin/irb:11:in `<main>'

nkf is a tool to read net news.
Therefore it has fallback conversions.
When nkf runs to convert Shift_JIS bu failed to convert a character, it tires to convert with Windows-31J table.

Your current solution, using Windows-31J table instead of Shift_JIS, has a problem.
It won't convert U+301C to Shift-JIS.
Because Shift_JIS has a conversion 0x8160 in Shift-JIS to U+301C, but Windows-31J converts 0x8160 into U+FF5E.
Using both of two table can emulate nkf's behavior.


Japanese used 3 encodings families, Shift-JIS, Japanese EUC, and ISO/2022.
ISO/2022 encodings use escape sequence, and easily autodetect.

Both of Shift-JIS and Japanese EUC are multibyte encodings
Just encodes a binary as Windows-31J and EUC-JP and assume the succeeded one can work usually.
But they may conflict.
For example both of "ネコ" in Shift-JIS and "蛤" in Japanese EUC are encoded as "\xC8\xBA".

But "ネコ" in Shift-JIS is alternative representation of "ネコ".
First one is half width katakana, and latter one is full width katakana, and usually they use full width katakana.
Therefore if nkf hits "\xC8\xBA", it guesses it as EUC.

The scoring idea is such one.

If it is implemented in Java, just assume a binary as some encodings and convert to Unicode string, and then get the category of the character.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment