[BUG] utf-8 misdetected as cp1256 #174

Closed
nijel opened this issue Mar 22, 2022 · 5 comments · Fixed by #175
Labels
detection Related to the charset detection mechanism, chaos/mess/coherence help wanted Extra attention is needed

Comments

@nijel
Contributor

nijel commented Mar 22, 2022

Describe the bug
The file is detected as cp1256 while it is actually utf-8.

To Reproduce
file.txt (the file is anonymized for privacy reasons)

Expected behavior
utf-8 should be detected.

Logs

$ normalizer /tmp/file.txt 
{
    "path": "/tmp/file.txt",
    "encoding": "cp1256",
    "encoding_aliases": [
        "1256",
        "windows_1256"
    ],
    "alternative_encodings": [],
    "language": "Farsi",
    "alphabets": [
        "Arabic",
        "Basic Latin",
        "Control character",
        "General Punctuation",
        "Latin Extended-A",
        "Latin Extended-B",
        "Latin-1 Supplement",
        "Letterlike Symbols"
    ],
    "has_sig_or_bom": false,
    "chaos": 2.32,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}

Desktop (please complete the following information):

  • OS: Linux
  • Python version 3.9.2
  • Package version 2.0.12

Additional context
chardet works fine on this file:

$ chardet /tmp/file.txt 
/tmp/file.txt: utf-8 with confidence 0.99
@nijel nijel added bug Something isn't working help wanted Extra attention is needed labels Mar 22, 2022
@Ousret
Collaborator

Ousret commented Mar 22, 2022

The file you provided is detected as utf_8 but I guess this is because you altered the file.

[~]$ normalizer --verbose file.txt 
2022-03-22 15:43:29,893 | WARNING | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xc4 in position 78: ordinal not in range(128)
2022-03-22 15:43:29,893 | INFO | Code page utf_8 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-03-22 15:43:29,902 | INFO | utf_8 passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-03-22 15:43:29,903 | INFO | utf_8 is most likely the one. Stopping the process.
{
    "path": "/home/ahmed/file.txt",
    "encoding": "utf_8",
    "encoding_aliases": [
        "u8",
        "utf",
        "utf8",
        "utf8_ucs2",
        "utf8_ucs4",
        "cp65001"
    ],
    "alternative_encodings": [],
    "language": "Unknown",
    "alphabets": [
        "Arabic",
        "Basic Latin",
        "CJK Unified Ideographs",
        "Control character",
        "Cyrillic",
        "Hebrew",
        "Latin Extended-A",
        "Latin-1 Supplement",
        "Mathematical Operators"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}

Reading the JSON output you've given, I suspect that the language detector finds a near-perfect match for Farsi even though the text is not Farsi, and so bypasses the huge "chaos": 2.32 value.

I cannot do anything without the original file; feel free to send it to me directly by email.

@Ousret Ousret added detection Related to the charset detection mechanism, chaos/mess/coherence and removed bug Something isn't working labels Mar 22, 2022
@nijel
Contributor Author

nijel commented Mar 23, 2022

Even if I download it from here, I get the same results. Here is the output with the verbose flag:

$ normalizer --verbose /tmp/file2.txt 
2022-03-23 08:43:52,780 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xc4 in position 78: ordinal not in range(128)
2022-03-23 08:43:52,780 | Level 5 | Code page utf_8 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-03-23 08:43:52,786 | Level 5 | utf_8 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 12.640000 %.
2022-03-23 08:43:52,786 | Level 5 | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,786 | Level 5 | Code page big5hkscs does not fit given bytes sequence at ALL. 'big5hkscs' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,787 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 714.550000 %.
2022-03-23 08:43:52,787 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,790 | Level 5 | cp1125 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 20.033000 %.
2022-03-23 08:43:52,790 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,790 | Level 5 | Code page cp1250 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 663: character maps to <undefined>
2022-03-23 08:43:52,791 | Level 5 | Code page cp1251 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x98 in position 1287: character maps to <undefined>
2022-03-23 08:43:52,791 | Level 5 | Code page cp1252 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 663: character maps to <undefined>
2022-03-23 08:43:52,791 | Level 5 | Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 663: character maps to <undefined>
2022-03-23 08:43:52,791 | Level 5 | Code page cp1254 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 663: character maps to <undefined>
2022-03-23 08:43:52,791 | Level 5 | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 663: character maps to <undefined>
2022-03-23 08:43:52,799 | Level 5 | cp1256 passed initial chaos probing. Mean measured chaos is 2.320000 %
2022-03-23 08:43:52,800 | Level 5 | cp1256 should target any language(s) of ['Farsi', 'Arabic']
2022-03-23 08:43:52,801 | Level 5 | Code page cp1257 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 663: character maps to <undefined>
2022-03-23 08:43:52,801 | Level 5 | Code page cp1258 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 663: character maps to <undefined>
2022-03-23 08:43:52,801 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,801 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x9a in position 1119: character maps to <undefined>
2022-03-23 08:43:52,804 | Level 5 | cp437 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 16.200000 %.
2022-03-23 08:43:52,805 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,807 | Level 5 | cp775 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 16.200000 %.
2022-03-23 08:43:52,807 | Level 5 | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,808 | Level 5 | cp852 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 16.200000 %.
2022-03-23 08:43:52,811 | Level 5 | cp855 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 20.033000 %.
2022-03-23 08:43:52,813 | Level 5 | cp857 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 16.200000 %.
2022-03-23 08:43:52,813 | Level 5 | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,814 | Level 5 | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,814 | Level 5 | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,814 | Level 5 | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,814 | Level 5 | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,815 | Level 5 | Code page cp864 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x9c in position 1432: character maps to <undefined>
2022-03-23 08:43:52,815 | Level 5 | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,815 | Level 5 | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,815 | Level 5 | Code page cp869 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x84 in position 79: character maps to <undefined>
2022-03-23 08:43:52,816 | Level 5 | Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0x86 in position 129: illegal multibyte sequence
2022-03-23 08:43:52,816 | Level 5 | Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0xe2 in position 1255: illegal multibyte sequence
2022-03-23 08:43:52,816 | Level 5 | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,817 | Level 5 | Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,817 | Level 5 | Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,817 | Level 5 | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,817 | Level 5 | Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,818 | Level 5 | Code page gb18030 does not fit given bytes sequence at ALL. 'gb18030' codec can't decode byte 0xa4 in position 1257: illegal multibyte sequence
2022-03-23 08:43:52,818 | Level 5 | Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,818 | Level 5 | Code page gbk does not fit given bytes sequence at ALL. 'gbk' codec can't decode byte 0xa4 in position 1257: illegal multibyte sequence
2022-03-23 08:43:52,820 | Level 5 | hp_roman8 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 9.267000 %.
2022-03-23 08:43:52,821 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,821 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,821 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,821 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,822 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,822 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,822 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,822 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xc4 in position 78: illegal multibyte sequence
2022-03-23 08:43:52,825 | Level 5 | iso8859_10 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 9.267000 %.
2022-03-23 08:43:52,825 | Level 5 | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xdb in position 1535: character maps to <undefined>
2022-03-23 08:43:52,825 | Level 5 | iso8859_13 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 9.267000 %.
2022-03-23 08:43:52,826 | Level 5 | iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,826 | Level 5 | iso8859_15 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,828 | Level 5 | iso8859_16 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 9.533000 %.
2022-03-23 08:43:52,829 | Level 5 | iso8859_2 is deemed too similar to code page iso8859_16 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,829 | Level 5 | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc3 in position 104: character maps to <undefined>
2022-03-23 08:43:52,829 | Level 5 | iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,831 | Level 5 | iso8859_5 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 9.267000 %.
2022-03-23 08:43:52,832 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa9 in position 817: character maps to <undefined>
2022-03-23 08:43:52,834 | Level 5 | iso8859_7 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 9.267000 %.
2022-03-23 08:43:52,834 | Level 5 | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc4 in position 78: character maps to <undefined>
2022-03-23 08:43:52,834 | Level 5 | iso8859_9 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,835 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xe2 in position 1255: illegal multibyte sequence
2022-03-23 08:43:52,837 | Level 5 | koi8_r was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 20.033000 %.
2022-03-23 08:43:52,838 | Level 5 | Code page kz1048 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x98 in position 1287: character maps to <undefined>
2022-03-23 08:43:52,838 | Level 5 | latin_1 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2022-03-23 08:43:52,840 | Level 5 | mac_cyrillic was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 11.367000 %.
2022-03-23 08:43:52,843 | Level 5 | mac_greek was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 9.600000 %.
2022-03-23 08:43:52,845 | Level 5 | mac_iceland was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 11.233000 %.
2022-03-23 08:43:52,849 | Level 5 | mac_latin2 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 13.350000 %.
2022-03-23 08:43:52,849 | Level 5 | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2022-03-23 08:43:52,850 | Level 5 | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2022-03-23 08:43:52,856 | Level 5 | ptcp154 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 5.660000 %.
2022-03-23 08:43:52,856 | Level 5 | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0x86 in position 129: illegal multibyte sequence
2022-03-23 08:43:52,856 | Level 5 | Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0x86 in position 129: illegal multibyte sequence
2022-03-23 08:43:52,857 | Level 5 | Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0x86 in position 129: illegal multibyte sequence
2022-03-23 08:43:52,857 | Level 5 | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa0 in position 1378: character maps to <undefined>
2022-03-23 08:43:52,857 | Level 5 | Encoding utf_16 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2022-03-23 08:43:52,857 | Level 5 | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode byte 0x0a in position 8160: truncated data
2022-03-23 08:43:52,857 | Level 5 | Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 1526-1527: illegal UTF-16 surrogate
2022-03-23 08:43:52,857 | Level 5 | Encoding utf_32 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2022-03-23 08:43:52,858 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2022-03-23 08:43:52,858 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2022-03-23 08:43:52,858 | Level 5 | Code page utf_7 does not fit given bytes sequence at ALL. 'utf7' codec can't decode byte 0xc4 in position 78: unexpected special character
2022-03-23 08:43:52,858 | DEBUG | Encoding detection: Found cp1256 as plausible (best-candidate) for content. With 0 alternatives.
{
    "path": "/tmp/file2.txt",
    "encoding": "cp1256",
    "encoding_aliases": [
        "1256",
        "windows_1256"
    ],
    "alternative_encodings": [],
    "language": "Farsi",
    "alphabets": [
        "Arabic",
        "Basic Latin",
        "Control character",
        "General Punctuation",
        "Latin Extended-A",
        "Latin Extended-B",
        "Latin-1 Supplement",
        "Letterlike Symbols"
    ],
    "has_sig_or_bom": false,
    "chaos": 2.32,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}

The online service at https://charsetnormalizerweb-ousret.vercel.app/ also detects it as cp1256.

The zipped file can be downloaded here: https://tmp.cihar.com/file.zip
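
The log above fits how single-byte code pages behave: cp1250, cp1251 and cp1252 are rejected as soon as they meet a byte they leave undefined (0x81 or 0x98), while cp1256 decodes those same bytes without complaint. A minimal sketch using only Python's standard codecs, assuming the offending bytes come from Polish letters such as Ł (0xC5 0x81) and Ę (0xC4 0x98) encoded as UTF-8 (the sample letters are an assumption, not taken from the original file):

```python
from typing import Optional

# UTF-8 bytes of two Polish letters (assumed sample, not the original file).
SAMPLES = {"Ł": "Ł".encode("utf-8"), "Ę": "Ę".encode("utf-8")}


def try_decode(raw: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the code page rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


for letter, raw in SAMPLES.items():
    for encoding in ("cp1250", "cp1251", "cp1252", "cp1256", "utf_8"):
        # None means "does not fit"; anything else is a candidate to score.
        print(letter, raw.hex(), encoding, repr(try_decode(raw, encoding)))
```

Because cp1256 never fails outright on these bytes, it can only be ruled out by the chaos and coherence scoring, which is where this file goes wrong.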

@nijel
Contributor Author

nijel commented Mar 23, 2022

After looking deeper at it, the problem is probably that chunk_size aligns badly with the placement of multi-byte utf-8 characters. For example, with chunk_size=500 the file is detected correctly:

{
    "path": "/tmp/file.txt",
    "encoding": "utf_8",
    "encoding_aliases": [
        "u8",
        "utf",
        "utf8",
        "utf8_ucs2",
        "utf8_ucs4",
        "cp65001"
    ],
    "alternative_encodings": [],
    "language": "Unknown",
    "alphabets": [
        "Arabic",
        "Basic Latin",
        "CJK Unified Ideographs",
        "Control character",
        "Cyrillic",
        "Hebrew",
        "Latin Extended-A",
        "Latin-1 Supplement",
        "Mathematical Operators"
    ],
    "has_sig_or_bom": false,
    "chaos": 10.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}
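
For anyone who wants to repeat this experiment from Python rather than from the command line, here is a rough sketch. It assumes charset_normalizer is installed and relies on from_bytes accepting a chunk_size keyword; detect_with_chunk_size is a hypothetical helper name, and the default of 512 used below is an assumption about the library's default. The import is guarded so the snippet still runs (returning None) without the package.

```python
from typing import Optional

try:
    from charset_normalizer import from_bytes
except ImportError:  # package not installed: the helper below returns None
    from_bytes = None


def detect_with_chunk_size(payload: bytes, chunk_size: int = 512) -> Optional[str]:
    """Hypothetical helper: best-guess encoding for payload, or None."""
    if from_bytes is None:
        return None
    best = from_bytes(payload, chunk_size=chunk_size).best()
    return best.encoding if best is not None else None


if __name__ == "__main__":
    with open("/tmp/file.txt", "rb") as handle:  # path taken from the report
        data = handle.read()
    for size in (512, 500):
        print(size, detect_with_chunk_size(data, chunk_size=size))
```

Comparing the two sizes on the same payload is a quick way to see whether a result depends on where the chunks happen to be cut.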

The issue seems to be that the "multi-byte bad cutting detector and adjustment" only fixes errors at the beginning of a chunk, not at the end.
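
To illustrate the kind of cut involved, here is a minimal standalone sketch (not the library's code; decode_chunk is a hypothetical helper). A fixed-size byte slice can start or end in the middle of a two-byte character. The sketch skips orphaned continuation bytes at the start and lets an incremental decoder hold back an incomplete sequence at the end.

```python
import codecs


def decode_chunk(payload: bytes, start: int, size: int) -> str:
    """Decode payload[start:start + size] as UTF-8, tolerating a cut at either edge."""
    chunk = payload[start:start + size]
    # Continuation bytes look like 0b10xxxxxx; skip any left over from a
    # character that was cut at the start of the chunk.
    skip = 0
    while skip < len(chunk) and (chunk[skip] & 0xC0) == 0x80:
        skip += 1
    # With final=False the decoder buffers an incomplete trailing sequence
    # instead of raising, so a cut at the end is simply dropped.
    decoder = codecs.getincrementaldecoder("utf-8")()
    return decoder.decode(chunk[skip:], final=False)


payload = "XXŁĄXXXŃ".encode("utf-8")  # Ł, Ą and Ń take two bytes each

try:
    payload[3:10].decode("utf-8")  # starts inside Ł and ends inside Ń
except UnicodeDecodeError as error:
    print("naive slice fails:", error)

print(decode_chunk(payload, 3, 7))
```

Bytes that are invalid in the middle of a chunk still raise, which is what a detector wants: only the edges are forgiven.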

nijel added a commit to nijel/charset_normalizer that referenced this issue Mar 24, 2022
This avoids issues with detecting string boundaries while improving
performance (avoids multiple decoding of the buffer).

Fixes jawah#174
nijel added a commit to nijel/charset_normalizer that referenced this issue Mar 24, 2022
This avoids issues with detecting string boundaries while improving
performance (avoids multiple decoding of the sequence).

Fixes jawah#174
@Ousret
Collaborator

Ousret commented Mar 24, 2022

My bad, the file was accidentally modified in flight during the download (on my side).

This file seems to be particularly challenging for a charset-detector.

Chunk extraction

Here are some of the extracted chunks:

First:

"xxxxxx";"xxxxxx"
"XX-X00-00";"XXXXXX XXXXXXXXXXXXXXX XXXX"
"XX-X00-00";"XXXĄXXXXXX"
"XX-X00-00";"XXÓXXX"
"XX-X00-00";"XXXĆ"
"XX-X00-00";"XX"
"XX-X00-00";"XXXX. XXXXXX."
"XX-X00-00";"XXXXXX XX"
"XX-X00-00";"XXX"
"XX-X00-00";"XXX XXXXX"
"XX-X00-00";"XXXXXXXXXXXXXX"
"XX-X00-00";"XXXXXXXXXXXX"
"XX-X00-00";"XXX XXX"
"XX-X00-00";"XXXXXX XX"
"XX-X00-00";"XXXXXXXXXĆ"
"XX-X00-00";"XXXXX"
"XX-X00-00";"XXXXXXXXXX X XXXĄXXXXXX"
"XX-X00-00";"XXXXX XXXXX XXXXXX XXX"
"XX-X00-00";"XXXXXXXX: X00.00"

Fourth one:

00";"XXX:"
"XX-X00-00";"XXXXXXXXX XXŁĄXXXŃ:"
"XX-X00-00";"XXXŁX XXXXX:"
"XX-X00-00";"Xxxxx xxxxxxxxxxxx:"
"XX-X00-00";"XXXXXX"
"XX-X00-00";"XXXXXXX"
"XX-X00-00";"XXXXXXXX"
"XX-X00-00";"XXXXŁXXXX XXXXŃXXXXX"
"XX-X00-00";"XŻXX XŁXŚXXXXXX XXXXX XXX."
"XX-X00-00";"XX"
"XX-X00-00";"XXXXXX"
"XX-X00-00";"XXXXXXXXXX XXXX XXX"
"XX-X00-00";"XXXXXX XXX:"
"XX-X00-00";"XXXX XXX:"
"XX-X00-00";"XXXXXXXXŹ XXX:"
"XX-X00-00";"XXXXXX"
"XX-X00-00";"XXXXXX XX"
"XX-X00-00";"XŁĄXX"
"XX-X00-00";"XXXX:

Lastly:

-00";"XXXXX XXXX"
"X-X00-00";"XXXXXXXXX"
"X-X00-00";"XXXXXXXŹ XXXXX X XXXXŚXXX XXXX"
"X-X00-00";"XXXXX XXXXXXX"
"X-X00-00";"XXXXXX"
"X-X00-00";"XXŁĄXXXXX XXXXXXXX"
"X-X00-00";"XXXXXXX X"
"X-X00-00";"XXX XXXXXXXXX"
"X-X00-00";"XXXXXXX XXXXŃXXXXX"
"X-X00-00";"XXXXX XXXĘXX"
"X-X00-00";"XXXXX XXX XXXXXXXX"
"X-X00-00";"XXXXXXXX XX"
"X-X00-00";"XXXXXXXXXX XXXXXXXX"
"X-X00-00";"XXXXŁXŚXXXX XXXXX"
"X-X00-00";"XXXXXĘ XXXXXĆ XX XXXXĘ"
"X-X00-00";"XXXXXXXX XXXXX XXX XXXXX"
"X-X00-00";"XXXXXX

The immediate observation is that there isn't much to observe in it, language-wise.

Mess-detector

The first pass immediately triggers the SuperWeirdWordPlugin on the fourth and last chunks.

# First
<class 'charset_normalizer.md.SuperWeirdWordPlugin'> 0.056910569105691054
# Fourth
<class 'charset_normalizer.md.SuperWeirdWordPlugin'> 0.3076923076923077
# Last
<class 'charset_normalizer.md.SuperWeirdWordPlugin'> 0.26666666666666666

And the language detection fails to find any suitable match.

Here are the "words" that are considered too suspicious. And I have to agree with it.

  • BAD XXXĆ
  • BAD XXXXXXXXXĆ
  • BAD XXŁĄXXXŃ
  • BAD XXXXXXXŹ

So, now we have more material to assess what is going on.

Erratum: I can see that you've already taken the time to find a solution; I will look at it.

@nijel
Contributor Author

nijel commented Mar 24, 2022

My PR is merely a workaround for short sequences (it reuses decoded_payload and splits that instead of decoding it again). I've also added a more real-world test file to the PR.
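
A minimal sketch of the idea behind that workaround, under stated assumptions: this is not the PR's code, and chunks_from_bytes and chunks_from_decoded are hypothetical names. Slicing raw bytes and decoding each slice can split a character, while decoding once and slicing the resulting string cannot.

```python
from typing import List


def chunks_from_bytes(payload: bytes, encoding: str, size: int) -> List[str]:
    """Slice the raw bytes first, then decode each slice (lossy at boundaries)."""
    return [
        payload[i:i + size].decode(encoding, errors="replace")
        for i in range(0, len(payload), size)
    ]


def chunks_from_decoded(payload: bytes, encoding: str, size: int) -> List[str]:
    """Decode once, then slice the string, so no character is ever split."""
    decoded_payload = payload.decode(encoding)
    return [
        decoded_payload[i:i + size]
        for i in range(0, len(decoded_payload), size)
    ]


# One anonymized row in the style of the reported file (assumed sample).
payload = '"XX-X00-00";"XXŁĄXXXŃ:"\r\n'.encode("utf-8")

print(chunks_from_bytes(payload, "utf-8", 16))    # boundary falls inside Ł
print(chunks_from_decoded(payload, "utf-8", 16))  # characters stay intact
```

The second variant also decodes the payload only once, which matches the performance note in the commit message.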

Ousret added a commit that referenced this issue Jun 18, 2022
* Re-use decoded buffer for short texts

This avoids issues with detecting string boundaries while improving
performance (avoids multiple decoding of the sequence).

Fixes #174

* 🔖 Bump version to 2.1.0.dev0

* 🐛 Workaround a potential bug in Python isspace table character

 Bug discovered in Python: Zero Width No-Break Space, located in Arabic Presentation Forms-B (Unicode 1.1), is not acknowledged as a space.


Co-authored-by: TAHRI Ahmed R <Ousret@users.noreply.github.com>
Co-authored-by: Ahmed TAHRI <ahmed.tahri@cloudnursery.dev>
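
The isspace remark in the commit message above can be checked with the standard library alone. The sketch below shows that U+FEFF (Zero Width No-Break Space) is not reported as whitespace by str.isspace, and adds a hypothetical helper, is_space_like, that treats it as a separator. The helper is an illustration, not the function used in the library.

```python
import unicodedata

ZWNBSP = "\ufeff"  # Zero Width No-Break Space, in Arabic Presentation Forms-B


def is_space_like(character: str) -> bool:
    """Hypothetical helper: whitespace according to Python, plus U+FEFF."""
    return character.isspace() or character == ZWNBSP


# U+FEFF is a format character (category Cf), so isspace() does not match it.
print(unicodedata.name(ZWNBSP), unicodedata.category(ZWNBSP), ZWNBSP.isspace())
print(is_space_like(ZWNBSP))
```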