[DETECTION] EUC-KR files are not detected correctly. #356

Closed
kzrnm opened this issue Oct 5, 2023 · 1 comment · Fixed by #366
Labels
detection Related to the charset detection mechanism, chaos/mess/coherence

Comments

kzrnm commented Oct 5, 2023

EUC-KR files are not detected correctly by Charset-Normalizer 3.3.0, which reports big5hkscs for the file below. Charset-Normalizer 2.1.1 detected the same file correctly.

Notice
I hereby announce that my raw input is not:

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file

https://github.com/competitive-verifier/competitive-verifier/blob/bc30581761d4ae94f79f1daf8e9647dc2a7a67f0/examples/tests/encoding/EUC-KR.txt

Verbose output

$ normalizer --version
Charset-Normalizer 3.3.0 - Python 3.10.12 - Unicode 13.0.0 - SpeedUp ON
$ normalizer -v ../competitive-verifier/examples/tests/encoding/EUC-KR.txt
2023-10-06 02:33:51,996 | Level 5 | override steps (5) and chunk_size (512) as content does not fit (863 byte(s) given) parameters.
2023-10-06 02:33:51,996 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
2023-10-06 02:33:51,996 | Level 5 | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
2023-10-06 02:33:51,996 | Level 5 | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xc8 in position 623: illegal multibyte sequence
2023-10-06 02:33:51,996 | Level 5 | Code page big5hkscs is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:33:51,998 | Level 5 | big5hkscs passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:33:51,998 | Level 5 | big5hkscs should target any language(s) of ['Chinese']
2023-10-06 02:33:51,998 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 386.700000 %.
2023-10-06 02:33:51,999 | Level 5 | cp1006 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 87.700000 %.
2023-10-06 02:33:51,999 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:33:51,999 | Level 5 | cp1125 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 109.200000 %.
2023-10-06 02:33:51,999 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,000 | Level 5 | cp1250 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 61.400000 %.
2023-10-06 02:33:52,000 | Level 5 | cp1251 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 57.100000 %.
2023-10-06 02:33:52,000 | Level 5 | cp1252 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 66.600000 %.
2023-10-06 02:33:52,000 | Level 5 | Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd2 in position 315: character maps to <undefined>
2023-10-06 02:33:52,001 | Level 5 | cp1254 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 66.600000 %.
2023-10-06 02:33:52,001 | Level 5 | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 105: character maps to <undefined>
2023-10-06 02:33:52,001 | Level 5 | cp1256 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 32.300000 %.
2023-10-06 02:33:52,001 | Level 5 | Code page cp1257 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa5 in position 14: character maps to <undefined>
2023-10-06 02:33:52,001 | Level 5 | cp1258 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 67.600000 %.
2023-10-06 02:33:52,001 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,002 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xcd in position 5: character maps to <undefined>
2023-10-06 02:33:52,002 | Level 5 | cp437 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 109.200000 %.
2023-10-06 02:33:52,002 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,003 | Level 5 | cp720 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 109.200000 %.
2023-10-06 02:33:52,003 | Level 5 | cp737 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 109.200000 %.
2023-10-06 02:33:52,003 | Level 5 | cp775 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 96.900000 %.
2023-10-06 02:33:52,003 | Level 5 | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,004 | Level 5 | cp852 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 104.700000 %.
2023-10-06 02:33:52,004 | Level 5 | cp855 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 96.900000 %.
2023-10-06 02:33:52,004 | Level 5 | Code page cp856 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc7 in position 2: character maps to <undefined>
2023-10-06 02:33:52,004 | Level 5 | Code page cp857 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd5 in position 161: character maps to <undefined>
2023-10-06 02:33:52,005 | Level 5 | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,005 | Level 5 | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,005 | Level 5 | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,005 | Level 5 | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,005 | Level 5 | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,006 | Level 5 | Code page cp864 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa7 in position 97: character maps to <undefined>
2023-10-06 02:33:52,006 | Level 5 | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,006 | Level 5 | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,006 | Level 5 | cp869 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 93.800000 %.
2023-10-06 02:33:52,006 | Level 5 | Code page cp874 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 105: character maps to <undefined>
2023-10-06 02:33:52,007 | Level 5 | cp875 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 191.500000 %.
2023-10-06 02:33:52,007 | Level 5 | Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0xee in position 24: illegal multibyte sequence
2023-10-06 02:33:52,007 | Level 5 | Code page cp949 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:33:52,009 | Level 5 | cp949 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:33:52,009 | Level 5 | cp949 should target any language(s) of ['Korean']
2023-10-06 02:33:52,009 | Level 5 | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xc8 in position 623: illegal multibyte sequence
2023-10-06 02:33:52,009 | Level 5 | Code page euc_jis_2004 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:33:52,011 | Level 5 | euc_jis_2004 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:33:52,011 | Level 5 | euc_jis_2004 should target any language(s) of ['Japanese']
2023-10-06 02:33:52,011 | Level 5 | Code page euc_jisx0213 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:33:52,011 | Level 5 | euc_jisx0213 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:33:52,011 | Level 5 | euc_jisx0213 should target any language(s) of ['Japanese']
2023-10-06 02:33:52,012 | Level 5 | Code page euc_jp is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:33:52,012 | Level 5 | euc_jp passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:33:52,012 | Level 5 | euc_jp should target any language(s) of ['Japanese']
2023-10-06 02:33:52,012 | Level 5 | Code page euc_kr is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:33:52,012 | Level 5 | euc_kr passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:33:52,012 | Level 5 | euc_kr should target any language(s) of ['Korean']
2023-10-06 02:33:52,012 | Level 5 | Code page gb18030 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:33:52,013 | Level 5 | gb18030 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:33:52,014 | Level 5 | gb18030 should target any language(s) of ['Chinese']
2023-10-06 02:33:52,014 | Level 5 | Code page gb2312 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:33:52,014 | Level 5 | gb2312 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:33:52,014 | Level 5 | gb2312 should target any language(s) of ['Chinese']
2023-10-06 02:33:52,015 | Level 5 | Code page gbk is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:33:52,015 | Level 5 | gbk passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:33:52,015 | Level 5 | gbk should target any language(s) of ['Chinese']
2023-10-06 02:33:52,015 | Level 5 | hp_roman8 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 88.300000 %.
2023-10-06 02:33:52,015 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:33:52,016 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:33:52,016 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:33:52,016 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:33:52,016 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:33:52,016 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:33:52,016 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:33:52,016 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:33:52,017 | Level 5 | iso8859_10 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 55.700000 %.
2023-10-06 02:33:52,017 | Level 5 | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 105: character maps to <undefined>
2023-10-06 02:33:52,017 | Level 5 | iso8859_13 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 49.000000 %.
2023-10-06 02:33:52,017 | Level 5 | iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,018 | Level 5 | iso8859_15 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,018 | Level 5 | iso8859_16 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 63.400000 %.
2023-10-06 02:33:52,018 | Level 5 | iso8859_2 is deemed too similar to code page cp1250 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,018 | Level 5 | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa5 in position 14: character maps to <undefined>
2023-10-06 02:33:52,018 | Level 5 | iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,019 | Level 5 | iso8859_5 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 20.300000 %.
2023-10-06 02:33:52,019 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xb7 in position 9: character maps to <undefined>
2023-10-06 02:33:52,019 | Level 5 | Code page iso8859_7 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xae in position 181: character maps to <undefined>
2023-10-06 02:33:52,019 | Level 5 | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc4 in position 0: character maps to <undefined>
2023-10-06 02:33:52,019 | Level 5 | iso8859_9 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,019 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xc7 in position 7: illegal multibyte sequence
2023-10-06 02:33:52,020 | Level 5 | koi8_r was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 100.000000 %.
2023-10-06 02:33:52,020 | Level 5 | Code page koi8_t does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xbe in position 23: character maps to <undefined>
2023-10-06 02:33:52,020 | Level 5 | koi8_u was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 90.800000 %.
2023-10-06 02:33:52,020 | Level 5 | kz1048 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,020 | Level 5 | latin_1 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,021 | Level 5 | mac_cyrillic was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 40.000000 %.
2023-10-06 02:33:52,021 | Level 5 | mac_greek was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.600000 %.
2023-10-06 02:33:52,021 | Level 5 | mac_iceland was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 79.400000 %.
2023-10-06 02:33:52,021 | Level 5 | mac_latin2 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 133.000000 %.
2023-10-06 02:33:52,022 | Level 5 | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2023-10-06 02:33:52,022 | Level 5 | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2023-10-06 02:33:52,022 | Level 5 | ptcp154 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2023-10-06 02:33:52,023 | Level 5 | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xee in position 24: illegal multibyte sequence
2023-10-06 02:33:52,023 | Level 5 | Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0xee in position 24: illegal multibyte sequence
2023-10-06 02:33:52,023 | Level 5 | Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0xee in position 24: illegal multibyte sequence
2023-10-06 02:33:52,023 | Level 5 | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 105: character maps to <undefined>
2023-10-06 02:33:52,023 | Level 5 | Encoding utf_16 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2023-10-06 02:33:52,023 | Level 5 | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 166-167: illegal UTF-16 surrogate
2023-10-06 02:33:52,023 | Level 5 | Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 104-105: illegal UTF-16 surrogate
2023-10-06 02:33:52,023 | Level 5 | Encoding utf_32 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2023-10-06 02:33:52,024 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2023-10-06 02:33:52,024 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2023-10-06 02:33:52,024 | Level 5 | Encoding utf_7 won't be tested as-is because detection is unreliable without BOM/SIG.
2023-10-06 02:33:52,024 | DEBUG | Encoding detection: Found big5hkscs as plausible (best-candidate) for content. With 4 alternatives.
{
    "path": "/home/kzrnm/competitive-verifier/examples/tests/encoding/EUC-KR.txt",
    "encoding": "big5hkscs",
    "encoding_aliases": [
        "big5_hkscs",
        "hkscs"
    ],
    "alternative_encodings": [],
    "language": "Chinese",
    "alphabets": [
        "Basic Latin",
        "CJK Unified Ideographs",
        "Cyrillic",
        "Enclosed Alphanumerics",
        "Katakana",
        "Small Form Variants"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}
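
The log above shows several multi-byte code pages accepting the same bytes without a decode error, which is why more than one candidate survives the first filter. The following is a minimal standard-library sketch of that "does it decode at all" step only. It is not charset-normalizer's implementation; the helper name `decodable_codecs`, the candidate list, and the sample string are assumptions made for illustration, and the sample is not the file from this issue.

```python
# Illustrative sketch: which codecs can strictly decode a byte payload?
# This mirrors only the "does not fit given bytes sequence at ALL" filter
# seen in the log, using Python's built-in codecs.

# Hypothetical shortlist; the real tool iterates over many more code pages.
CANDIDATES = [
    "ascii", "utf_8", "big5", "big5hkscs", "cp949",
    "euc_jp", "euc_kr", "gb18030", "johab",
]


def decodable_codecs(payload: bytes, candidates=CANDIDATES) -> list[str]:
    """Return the candidate codecs that decode `payload` without error."""
    accepted = []
    for name in candidates:
        try:
            payload.decode(name)  # strict mode: raises on any invalid sequence
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


if __name__ == "__main__":
    # Assumed sample text, encoded as EUC-KR for the demonstration.
    sample = "한국어 문서".encode("euc_kr")
    print(decodable_codecs(sample))
```

A successful strict decode only says the bytes are structurally valid for that code page. Choosing between the surviving candidates still needs a language or coherence signal, which in the 3.3.0 output above is reported as 0.0.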

Expected encoding
EUC-KR (Korean). Charset-Normalizer 2.1.1 detected this file correctly; its verbose output follows for comparison.


$ normalizer --version
Charset-Normalizer 2.1.1 - Python 3.10.12 - Unicode 13.0.0
$ normalizer -v ../competitive-verifier/examples/tests/encoding/EUC-KR.txt
2023-10-06 02:37:12,629 | Level 5 | override steps (5) and chunk_size (512) as content does not fit (863 byte(s) given) parameters.
2023-10-06 02:37:12,629 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
2023-10-06 02:37:12,629 | Level 5 | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
2023-10-06 02:37:12,630 | Level 5 | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xc8 in position 623: illegal multibyte sequence
2023-10-06 02:37:12,630 | Level 5 | Code page big5hkscs is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:37:12,632 | Level 5 | big5hkscs passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:37:12,632 | Level 5 | big5hkscs should target any language(s) of ['Chinese', 'Classical Chinese']
2023-10-06 02:37:12,633 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 386.700000 %.
2023-10-06 02:37:12,633 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,633 | Level 5 | cp1125 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 109.200000 %.
2023-10-06 02:37:12,634 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,634 | Level 5 | cp1250 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 61.400000 %.
2023-10-06 02:37:12,634 | Level 5 | cp1251 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 57.100000 %.
2023-10-06 02:37:12,635 | Level 5 | cp1252 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 66.600000 %.
2023-10-06 02:37:12,635 | Level 5 | Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd2 in position 315: character maps to <undefined>
2023-10-06 02:37:12,635 | Level 5 | cp1254 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 66.600000 %.
2023-10-06 02:37:12,635 | Level 5 | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 105: character maps to <undefined>
2023-10-06 02:37:12,636 | Level 5 | cp1256 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 32.300000 %.
2023-10-06 02:37:12,636 | Level 5 | Code page cp1257 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa5 in position 14: character maps to <undefined>
2023-10-06 02:37:12,636 | Level 5 | cp1258 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 67.600000 %.
2023-10-06 02:37:12,636 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,637 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xcd in position 5: character maps to <undefined>
2023-10-06 02:37:12,637 | Level 5 | cp437 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 109.200000 %.
2023-10-06 02:37:12,637 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,638 | Level 5 | cp775 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 96.900000 %.
2023-10-06 02:37:12,638 | Level 5 | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,638 | Level 5 | cp852 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 104.700000 %.
2023-10-06 02:37:12,638 | Level 5 | cp855 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 96.900000 %.
2023-10-06 02:37:12,639 | Level 5 | Code page cp857 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd5 in position 161: character maps to <undefined>
2023-10-06 02:37:12,639 | Level 5 | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,639 | Level 5 | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,639 | Level 5 | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,639 | Level 5 | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,640 | Level 5 | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,640 | Level 5 | Code page cp864 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa7 in position 97: character maps to <undefined>
2023-10-06 02:37:12,640 | Level 5 | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,640 | Level 5 | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,640 | Level 5 | cp869 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 93.800000 %.
2023-10-06 02:37:12,641 | Level 5 | Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0xee in position 24: illegal multibyte sequence
2023-10-06 02:37:12,641 | Level 5 | Code page cp949 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:37:12,644 | Level 5 | cp949 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:37:12,644 | Level 5 | cp949 should target any language(s) of ['Korean']
2023-10-06 02:37:12,645 | Level 5 | We detected language [('Korean', 0.1644)] using cp949
2023-10-06 02:37:12,645 | Level 5 | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xc8 in position 623: illegal multibyte sequence
2023-10-06 02:37:12,645 | Level 5 | Code page euc_jis_2004 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:37:12,647 | Level 5 | euc_jis_2004 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:37:12,647 | Level 5 | euc_jis_2004 should target any language(s) of ['Japanese']
2023-10-06 02:37:12,648 | Level 5 | Code page euc_jisx0213 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:37:12,648 | Level 5 | euc_jisx0213 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:37:12,648 | Level 5 | euc_jisx0213 should target any language(s) of ['Japanese']
2023-10-06 02:37:12,648 | Level 5 | Code page euc_jp is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:37:12,648 | Level 5 | euc_jp passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:37:12,648 | Level 5 | euc_jp should target any language(s) of ['Japanese']
2023-10-06 02:37:12,648 | Level 5 | Code page euc_kr is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:37:12,649 | Level 5 | euc_kr passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:37:12,649 | Level 5 | euc_kr should target any language(s) of ['Korean']
2023-10-06 02:37:12,649 | Level 5 | We detected language [('Korean', 0.1644)] using euc_kr
2023-10-06 02:37:12,649 | Level 5 | Code page gb18030 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:37:12,651 | Level 5 | gb18030 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:37:12,651 | Level 5 | gb18030 should target any language(s) of ['Chinese', 'Classical Chinese']
2023-10-06 02:37:12,651 | Level 5 | Code page gb2312 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:37:12,652 | Level 5 | gb2312 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:37:12,652 | Level 5 | gb2312 should target any language(s) of ['Chinese', 'Classical Chinese']
2023-10-06 02:37:12,653 | Level 5 | Code page gbk is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:37:12,653 | Level 5 | gbk passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-06 02:37:12,653 | Level 5 | gbk should target any language(s) of ['Chinese', 'Classical Chinese']
2023-10-06 02:37:12,653 | Level 5 | hp_roman8 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 88.300000 %.
2023-10-06 02:37:12,653 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:37:12,654 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:37:12,654 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:37:12,654 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:37:12,654 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:37:12,654 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:37:12,654 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:37:12,655 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xc4 in position 0: illegal multibyte sequence
2023-10-06 02:37:12,655 | Level 5 | iso8859_10 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 55.700000 %.
2023-10-06 02:37:12,655 | Level 5 | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 105: character maps to <undefined>
2023-10-06 02:37:12,656 | Level 5 | iso8859_13 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 49.000000 %.
2023-10-06 02:37:12,656 | Level 5 | iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,656 | Level 5 | iso8859_15 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,656 | Level 5 | iso8859_16 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 63.400000 %.
2023-10-06 02:37:12,657 | Level 5 | iso8859_2 is deemed too similar to code page cp1250 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,657 | Level 5 | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa5 in position 14: character maps to <undefined>
2023-10-06 02:37:12,657 | Level 5 | iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,658 | Level 5 | iso8859_5 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 20.300000 %.
2023-10-06 02:37:12,658 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xb7 in position 9: character maps to <undefined>
2023-10-06 02:37:12,658 | Level 5 | Code page iso8859_7 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xae in position 181: character maps to <undefined>
2023-10-06 02:37:12,659 | Level 5 | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc4 in position 0: character maps to <undefined>
2023-10-06 02:37:12,659 | Level 5 | iso8859_9 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,659 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xc7 in position 7: illegal multibyte sequence
2023-10-06 02:37:12,660 | Level 5 | koi8_r was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 100.000000 %.
2023-10-06 02:37:12,660 | Level 5 | kz1048 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,660 | Level 5 | latin_1 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,661 | Level 5 | mac_cyrillic was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 40.000000 %.
2023-10-06 02:37:12,661 | Level 5 | mac_greek was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.600000 %.
2023-10-06 02:37:12,662 | Level 5 | mac_iceland was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 79.400000 %.
2023-10-06 02:37:12,662 | Level 5 | mac_latin2 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 133.000000 %.
2023-10-06 02:37:12,663 | Level 5 | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2023-10-06 02:37:12,663 | Level 5 | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2023-10-06 02:37:12,663 | Level 5 | ptcp154 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2023-10-06 02:37:12,663 | Level 5 | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xee in position 24: illegal multibyte sequence
2023-10-06 02:37:12,663 | Level 5 | Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0xee in position 24: illegal multibyte sequence
2023-10-06 02:37:12,664 | Level 5 | Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0xee in position 24: illegal multibyte sequence
2023-10-06 02:37:12,664 | Level 5 | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 105: character maps to <undefined>
2023-10-06 02:37:12,664 | Level 5 | Encoding utf_16 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2023-10-06 02:37:12,664 | Level 5 | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 166-167: illegal UTF-16 surrogate
2023-10-06 02:37:12,664 | Level 5 | Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 104-105: illegal UTF-16 surrogate
2023-10-06 02:37:12,664 | Level 5 | Encoding utf_32 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2023-10-06 02:37:12,664 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2023-10-06 02:37:12,664 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2023-10-06 02:37:12,664 | Level 5 | Code page utf_7 does not fit given bytes sequence at ALL. 'utf7' codec can't decode byte 0xc4 in position 0: unexpected special character
2023-10-06 02:37:12,664 | DEBUG | Encoding detection: Found cp949 as plausible (best-candidate) for content. With 4 alternatives.
{
    "path": "/home/kzrnm/workspace/competitive-verifier/examples/tests/encoding/EUC-KR.txt",
    "encoding": "cp949",
    "encoding_aliases": [
        "949",
        "ms949",
        "uhc"
    ],
    "alternative_encodings": [
        "euc_kr"
    ],
    "language": "Korean",
    "alphabets": [
        "Basic Latin",
        "Hangul Syllables",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 16.44,
    "unicode_path": null,
    "is_preferred": true
}
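
The result above reports `cp949` as the best candidate and lists `euc_kr` only as an alternative. A minimal sketch of why both are plausible for the same bytes, using only Python's built-in codecs (this is not charset-normalizer's detection logic, and `decodable_with` is a hypothetical helper written for illustration): cp949 is a superset of EUC-KR, so bytes that are valid EUC-KR also decode under cp949 to the same text.

```python
def decodable_with(data: bytes, encodings: list[str]) -> dict[str, str]:
    """Return {encoding: decoded_text} for every codec that strictly decodes `data`."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # This codec rejects the byte sequence, so it is not a candidate.
            continue
    return results


# Hypothetical sample: a short Korean string encoded as EUC-KR.
sample = "한국어".encode("euc_kr")
decoded = decodable_with(sample, ["ascii", "utf_8", "euc_kr", "cp949"])

print(sorted(decoded))                         # codecs that accepted the bytes
print(decoded["euc_kr"] == decoded["cp949"])   # whether both yield the same text
```

Because strict decoding alone cannot separate the two codecs for such input, a detector has to pick one by some preference rule, which is the choice this issue questions.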

Desktop (please complete the following information):

  • OS: Ubuntu (WSL on Windows11)
  • Python version: 3.10.12
  • Package version: 3.3.0

@kzrnm kzrnm added detection Related to the charset detection mechanism, chaos/mess/coherence help wanted Extra attention is needed labels Oct 5, 2023
@Ousret
Member

Ousret commented Oct 19, 2023

I was able to reproduce this and have proposed a patch that improves the situation.
It will be available in the next minor release.

The file will be kept in our data collection unless you object.

@Ousret Ousret linked a pull request Oct 19, 2023 that will close this issue
@Ousret Ousret removed the help wanted Extra attention is needed label Oct 19, 2023