[DETECTION] CP932 containing half-width kana characters cannot be detected correctly. #357

Closed
kzrnm opened this issue Oct 5, 2023 · 1 comment · Fixed by #366
Labels
detection Related to the charset detection mechanism, chaos/mess/coherence

Comments

@kzrnm

kzrnm commented Oct 5, 2023

This issue concerns hankaku (half-width) kana: https://en.wikipedia.org/wiki/Half-width_kana
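
For readers unfamiliar with the byte layout, here is a minimal sketch (standard library only, sample strings chosen for illustration and not taken from the attached file) of how half-width and full-width katakana differ once encoded as CP932. Half-width kana occupy one byte each in the 0xA1-0xDF range, while full-width kana take two bytes.

```python
# Half-width katakana are single bytes in CP932; full-width katakana take two.
half = "ｱｲｳ"    # U+FF71..U+FF73, half-width (illustrative sample)
full = "アイウ"  # U+30A2, U+30A4, U+30A6, full-width (illustrative sample)

half_bytes = half.encode("cp932")
full_bytes = full.encode("cp932")

print(half_bytes.hex(" "))  # one byte per character
print(full_bytes.hex(" "))  # two bytes per character

# Every half-width kana byte sits in the single-byte range 0xA1-0xDF.
in_single_byte_range = all(0xA1 <= b <= 0xDF for b in half_bytes)
print(in_single_byte_range)
```

Those single high bytes are also valid in many single-byte code pages, which is presumably part of why such content is harder to classify, though the actual cause is whatever #366 addresses.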

Notice
I hereby announce that my raw input is not:

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file

https://github.com/competitive-verifier/competitive-verifier/blob/89102a878a9081f72bd3450065bcf7d9fd536a5f/examples/tests/encoding/cp932.txt

Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result in here.

normalizer -v ../competitive-verifier/examples/tests/encoding/cp932.txt 
2023-10-06 02:56:01,424 | Level 5 | override steps (5) and chunk_size (512) as content does not fit (485 byte(s) given) parameters.
2023-10-06 02:56:01,424 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0x89 in position 0: ordinal not in range(128)
2023-10-06 02:56:01,424 | Level 5 | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
2023-10-06 02:56:01,425 | Level 5 | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,425 | Level 5 | Code page big5hkscs does not fit given bytes sequence at ALL. 'big5hkscs' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,425 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 93.600000 %.
2023-10-06 02:56:01,426 | Level 5 | cp1006 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 521.200000 %.
2023-10-06 02:56:01,426 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,426 | Level 5 | cp1125 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 115.300000 %.
2023-10-06 02:56:01,426 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,426 | Level 5 | Code page cp1250 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x83 in position 2: character maps to <undefined>
2023-10-06 02:56:01,426 | Level 5 | cp1251 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 102.400000 %.
2023-10-06 02:56:01,427 | Level 5 | Code page cp1252 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x90 in position 26: character maps to <undefined>
2023-10-06 02:56:01,427 | Level 5 | Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x90 in position 26: character maps to <undefined>
2023-10-06 02:56:01,427 | Level 5 | Code page cp1254 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x90 in position 26: character maps to <undefined>
2023-10-06 02:56:01,427 | Level 5 | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x90 in position 26: character maps to <undefined>
2023-10-06 02:56:01,427 | Level 5 | cp1256 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 25.000000 %.
2023-10-06 02:56:01,427 | Level 5 | Code page cp1257 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x83 in position 2: character maps to <undefined>
2023-10-06 02:56:01,428 | Level 5 | Code page cp1258 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x90 in position 26: character maps to <undefined>
2023-10-06 02:56:01,428 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,428 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x76 in position 54: character maps to <undefined>
2023-10-06 02:56:01,428 | Level 5 | cp437 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 107.800000 %.
2023-10-06 02:56:01,428 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,428 | Level 5 | cp720 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 156.500000 %.
2023-10-06 02:56:01,429 | Level 5 | cp737 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 115.300000 %.
2023-10-06 02:56:01,429 | Level 5 | cp775 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 25.100000 %.
2023-10-06 02:56:01,429 | Level 5 | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,429 | Level 5 | cp852 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 104.400000 %.
2023-10-06 02:56:01,430 | Level 5 | cp855 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 127.500000 %.
2023-10-06 02:56:01,430 | Level 5 | Code page cp856 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xe1 in position 27: character maps to <undefined>
2023-10-06 02:56:01,430 | Level 5 | cp857 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 107.800000 %.
2023-10-06 02:56:01,430 | Level 5 | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,430 | Level 5 | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,431 | Level 5 | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,431 | Level 5 | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,431 | Level 5 | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,431 | Level 5 | Code page cp864 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa6 in position 325: character maps to <undefined>
2023-10-06 02:56:01,431 | Level 5 | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,431 | Level 5 | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,432 | Level 5 | Code page cp869 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x83 in position 2: character maps to <undefined>
2023-10-06 02:56:01,432 | Level 5 | Code page cp874 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x89 in position 0: character maps to <undefined>
2023-10-06 02:56:01,432 | Level 5 | cp875 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 75.200000 %.
2023-10-06 02:56:01,432 | Level 5 | Code page cp932 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:56:01,433 | Level 5 | cp932 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 48.600000 %.
2023-10-06 02:56:01,433 | Level 5 | Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0x83 in position 6: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page gb18030 does not fit given bytes sequence at ALL. 'gb18030' codec can't decode byte 0xc9 in position 247: illegal multibyte sequence
2023-10-06 02:56:01,435 | Level 5 | Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,435 | Level 5 | Code page gbk does not fit given bytes sequence at ALL. 'gbk' codec can't decode byte 0xc9 in position 247: illegal multibyte sequence
2023-10-06 02:56:01,435 | Level 5 | hp_roman8 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 484.800000 %.
2023-10-06 02:56:01,435 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,435 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,435 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,436 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,436 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,436 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,436 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,436 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,437 | Level 5 | iso8859_10 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 484.800000 %.
2023-10-06 02:56:01,437 | Level 5 | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 195: character maps to <undefined>
2023-10-06 02:56:01,437 | Level 5 | iso8859_13 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 484.800000 %.
2023-10-06 02:56:01,437 | Level 5 | iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,437 | Level 5 | iso8859_15 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,437 | Level 5 | iso8859_16 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 484.800000 %.
2023-10-06 02:56:01,438 | Level 5 | iso8859_2 is deemed too similar to code page iso8859_16 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,438 | Level 5 | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xae in position 264: character maps to <undefined>
2023-10-06 02:56:01,438 | Level 5 | iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,438 | Level 5 | iso8859_5 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 521.200000 %.
2023-10-06 02:56:01,438 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfa in position 122: character maps to <undefined>
2023-10-06 02:56:01,438 | Level 5 | Code page iso8859_7 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xae in position 264: character maps to <undefined>
2023-10-06 02:56:01,439 | Level 5 | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc4 in position 33: character maps to <undefined>
2023-10-06 02:56:01,439 | Level 5 | iso8859_9 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,439 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0x83 in position 2: illegal multibyte sequence
2023-10-06 02:56:01,439 | Level 5 | koi8_r was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 251.600000 %.
2023-10-06 02:56:01,439 | Level 5 | Code page koi8_t does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x8f in position 36: character maps to <undefined>
2023-10-06 02:56:01,439 | Level 5 | koi8_u was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 251.600000 %.
2023-10-06 02:56:01,440 | Level 5 | kz1048 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,440 | Level 5 | latin_1 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,440 | Level 5 | mac_cyrillic was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 115.300000 %.
2023-10-06 02:56:01,440 | Level 5 | mac_greek was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 243.600000 %.
2023-10-06 02:56:01,440 | Level 5 | mac_iceland was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 243.500000 %.
2023-10-06 02:56:01,441 | Level 5 | mac_latin2 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 200.100000 %.
2023-10-06 02:56:01,441 | Level 5 | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2023-10-06 02:56:01,441 | Level 5 | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2023-10-06 02:56:01,441 | Level 5 | ptcp154 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,441 | Level 5 | Code page shift_jis is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:56:01,441 | Level 5 | shift_jis was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 48.600000 %.
2023-10-06 02:56:01,442 | Level 5 | Code page shift_jis_2004 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:56:01,442 | Level 5 | shift_jis_2004 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 48.600000 %.
2023-10-06 02:56:01,442 | Level 5 | Code page shift_jisx0213 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:56:01,442 | Level 5 | shift_jisx0213 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 48.600000 %.
2023-10-06 02:56:01,442 | Level 5 | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 195: character maps to <undefined>
2023-10-06 02:56:01,442 | Level 5 | Encoding utf_16 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2023-10-06 02:56:01,442 | Level 5 | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 258-259: illegal encoding
2023-10-06 02:56:01,442 | Level 5 | Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 150-151: illegal UTF-16 surrogate
2023-10-06 02:56:01,442 | Level 5 | Encoding utf_32 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2023-10-06 02:56:01,443 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2023-10-06 02:56:01,443 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2023-10-06 02:56:01,443 | Level 5 | Encoding utf_7 won't be tested as-is because detection is unreliable without BOM/SIG.
2023-10-06 02:56:01,443 | DEBUG | Encoding detection: Unable to determine any suitable charset.
Unable to identify originating encoding for "../competitive-verifier/examples/tests/encoding/cp932.txt". Maybe try increasing maximum amount of chaos.
{
    "path": "/home/kzrnm/workspace/competitive-verifier/examples/tests/encoding/cp932.txt",
    "encoding": null,
    "encoding_aliases": [],
    "alternative_encodings": [],
    "language": "Unknown",
    "alphabets": [],
    "has_sig_or_bom": false,
    "chaos": 1.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding

CP932

Desktop (please complete the following information):

  • OS: Ubuntu (WSL on Windows11)
  • Python version : 3.10
  • Package version: 3.3

Additional context

If all the characters are kanji or full-width kana, as in the text below, charset_normalizer detects the encoding correctly.

https://en.wikipedia.org/wiki/Ame_ni_mo_makezu

雨ニモマケズ
風ニモマケズ
雪ニモ夏ノ暑サニモマケヌ
丈夫ナカラダヲモチ
慾ハナク
決シテ瞋ラズ
イツモシヅカニワラッテヰル
一日ニ玄米四合ト
味噌ト少シノ野菜ヲタベ
アラユルコトヲ
ジブンヲカンジョウニ入レズニ
ヨクミキキシワカリ
ソシテワスレズ
野原ノ松ノ林ノ蔭ノ
小サナ萓ブキノ小屋ニヰテ
東ニ病氣ノコドモアレバ
行ッテ看病シテヤリ
西ニツカレタ母アレバ
行ッテソノ稻ノ朿ヲ負ヒ
南ニ死ニサウナ人アレバ
行ッテコハガラナクテモイヽトイヒ
北ニケンクヮヤソショウガアレバ
ツマラナイカラヤメロトイヒ
ヒデリノトキハナミダヲナガシ
サムサノナツハオロオロアルキ
ミンナニデクノボートヨバレ
ホメラレモセズ
クニモサレズ
サウイフモノニ
ワタシハナリタイ
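
As a rough aid for telling the two cases apart, the sketch below measures how much of a decoded text is half-width. It relies only on the standard library; `halfwidth_ratio` is a hypothetical helper name and is not part of charset_normalizer. Note that East Asian Width class "H" also covers half-width Hangul and a few symbols, so this is an approximation rather than a kana-only test.

```python
import unicodedata


def halfwidth_ratio(text: str) -> float:
    """Fraction of non-whitespace characters whose East Asian Width is 'H'."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    half = sum(1 for c in chars if unicodedata.east_asian_width(c) == "H")
    return half / len(chars)


# First line of the poem above, in full-width kana (the case that works).
full_width_line = "雨ニモマケズ"
# The same line rewritten by hand with half-width kana (the failing case).
half_width_line = "雨ﾆﾓﾏｹｽﾞ"

print(halfwidth_ratio(full_width_line))
print(halfwidth_ratio(half_width_line))
```

A ratio well above zero suggests the text belongs to the half-width case reported in this issue.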
@kzrnm kzrnm added detection Related to the charset detection mechanism, chaos/mess/coherence help wanted Extra attention is needed labels Oct 5, 2023
@Ousret
Collaborator

Ousret commented Oct 19, 2023

Good catch, indeed.
Fixed in #366; the fix will be available soon.

@Ousret Ousret linked a pull request Oct 19, 2023 that will close this issue
@Ousret Ousret removed the help wanted Extra attention is needed label Oct 19, 2023