[BUG] encoding is not detected when the input contains multiple encodings #405

milahu · 2023-12-30T14:51:34Z

encoding is not detected when the input contains multiple encodings

i tried to fix this by calling charset_normalizer.from_bytes
with custom threshold, steps, and chunk_size values, but no luck

To Reproduce

import charset_normalizer
with open("Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt", "rb") as f:
    file_content = f.read()
# best() returns None when no candidate encoding survives detection
content_encoding = charset_normalizer.from_bytes(file_content).best()
assert content_encoding is not None  # fails for this file

input file

Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt

note: the input file is broken

$ iconv -f Big5 -t utf8 "Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt" >/dev/null 
iconv: illegal input sequence at position 4726

this is because opensubtitles.org (which is run by idiots)
inserted its advertisement with utf8 encoding (cue 72)
into the file with big5 encoding (cue 71)

71
00:04:06,110 --> 00:04:09,960
���ȧڭ̭��{���O���N�~�j�z�o

72
00:04:11,000 --> 00:04:17,074
想在此处添加您的广告信息？立即联系 www.OpenSubtitles.org
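
The iconv failure above can be reproduced from Python with the standard library alone. The following is a minimal sketch, not part of the original report: `first_decode_error` is a hypothetical helper and the sample bytes are made up to mimic the file (a Big5 line followed by a UTF-8 line). It relies only on the `start` attribute of `UnicodeDecodeError`.

```python
def first_decode_error(data: bytes, encoding: str):
    """Return the byte offset of the first invalid sequence, or None if all decodes."""
    try:
        data.decode(encoding)  # strict decoding raises on the first bad sequence
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Hypothetical sample: one Big5 line followed by one UTF-8 line.
big5_line = "我們自己的倉庫醫生".encode("big5")
utf8_line = "想在此处添加您的广告信息".encode("utf-8")
sample = big5_line + b"\n" + utf8_line

# The offset points at the first byte of the UTF-8 line.
print(first_decode_error(sample, "big5"))
```

The reported offset plays the same role as the position printed by iconv: everything before it decodes as Big5, and the foreign bytes start there.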

so to fix this case
i will have to split the subtitle into "textparts" using pysubs2
and run charset_normalizer.from_bytes on each textpart
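
A rough sketch of the per-part idea, under simplifying assumptions: it splits on blank lines instead of parsing with pysubs2, and it tries a fixed list of candidate encodings instead of calling charset_normalizer.from_bytes on each part. `decode_parts` and the sample bytes are hypothetical. UTF-8 is tried first because Big5 text rarely happens to be valid UTF-8, while the reverse is less certain.

```python
def decode_parts(data: bytes, candidates=("utf-8", "big5")):
    """Decode each blank-line-separated part with the first candidate that fits."""
    decoded = []
    for part in data.replace(b"\r\n", b"\n").split(b"\n\n"):
        for encoding in candidates:
            try:
                decoded.append((encoding, part.decode(encoding)))
                break
            except UnicodeDecodeError:
                continue
        else:
            # No candidate decodes this part strictly: keep it, marking bad bytes.
            decoded.append((None, part.decode(candidates[0], errors="replace")))
    return decoded


# Hypothetical two-cue sample mimicking the broken file.
cue_71 = b"71\n00:04:06,110 --> 00:04:09,960\n" + "我們自己的倉庫醫生".encode("big5")
cue_72 = b"72\n00:04:11,000 --> 00:04:17,074\n" + "想在此处添加您的广告信息".encode("utf-8")
parts = decode_parts(cue_71 + b"\n\n" + cue_72)

for encoding, text in parts:
    print(encoding, text.splitlines()[-1])
```

Parts that contain only ASCII decode under the first candidate, which is harmless because ASCII is a subset of both encodings.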

or split the input text by lines, and use a bisection search
(a logarithmic number of decode attempts)
to find the positions where the encoding changes
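
One way the line-based search could look, as a sketch only: `first_bad_line` is a hypothetical helper that bisects over line prefixes to find the first line that breaks strict decoding. It assumes the newline byte never occurs inside a multi-byte character, which holds for Big5 and UTF-8.

```python
def first_bad_line(lines, encoding):
    """Return the index of the first line that breaks strict decoding, or None."""

    def prefix_ok(count):
        try:
            b"\n".join(lines[:count]).decode(encoding)
            return True
        except UnicodeDecodeError:
            return False

    if prefix_ok(len(lines)):
        return None
    lo, hi = 0, len(lines)  # invariant: first lo lines decode, first hi lines do not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(mid):
            lo = mid
        else:
            hi = mid
    return hi - 1


# Hypothetical sample: a UTF-8 line injected at index 5 of a Big5 file.
big5 = "我們自己的倉庫醫生".encode("big5")
utf8 = "想在此处添加您的广告信息".encode("utf-8")
lines = [big5] * 5 + [utf8] + [big5] * 2

print(first_bad_line(lines, "big5"))
```

To find every switch, the search would have to be repeated on the remaining lines after each hit. Note that each probe decodes a whole prefix, so only the number of decode calls is logarithmic; a plain line-by-line scan may be just as fast in practice.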

probably this (detect multiple encodings) is out of scope for charset_normalizer

Expected behavior

at least the "main" encoding should be detected
and the other encoding should be tolerated as noise

here: detect big5 encoding, tolerate utf8 encoding
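
Until something like this exists in a library, the expected behavior can be approximated in user code. The sketch below is not how charset_normalizer works: `guess_main_encoding` is a hypothetical helper that lets each non-ASCII line vote for every candidate that decodes it strictly, and the candidate list is an assumption. The minority encoding is then tolerated as noise by decoding with errors="replace".

```python
def guess_main_encoding(data: bytes, candidates=("utf-8", "big5")):
    """Pick the candidate that strictly decodes the most non-ASCII lines."""
    # Pure-ASCII lines decode under every candidate, so they carry no signal.
    lines = [line for line in data.splitlines() if not line.isascii()]
    scores = {encoding: 0 for encoding in candidates}
    for line in lines:
        for encoding in candidates:
            try:
                line.decode(encoding)
            except UnicodeDecodeError:
                continue
            scores[encoding] += 1
    best = max(candidates, key=scores.get)
    return best, scores


# Hypothetical sample: three Big5 lines and one injected UTF-8 line.
sample = b"\n".join([
    "我們自己的倉庫醫生".encode("big5"),
    "第十三號倉庫".encode("big5"),
    "我們自己的倉庫醫生".encode("big5"),
    "想在此处添加您的广告信息".encode("utf-8"),
])

best, scores = guess_main_encoding(sample)
text = sample.decode(best, errors="replace")  # minority lines become U+FFFD noise
print(best, scores)
```

Voting per line is crude: short lines can be valid in several encodings at once, so a real detector would also need the language and coherence checks that charset_normalizer applies.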

1
00:00:00,200 --> 00:00:02,290
<i><font color=dc0808>"第十三號倉庫"</font> 前情提要</i>

2
00:00:03,440 --> 00:00:05,290
我們自己的倉庫醫生 嗯?

env

>>> charset_normalizer.__version__
'3.0.1'

Ousret · 2024-01-02T03:58:46Z

> probably this (detect multiple encodings) is out of scope for charset_normalizer

Your impression is right. Hybrid files are out of scope, as doing "detection" on them would require substantial effort.

milahu added the bug and help wanted labels Dec 30, 2023

Ousret closed this as not planned Jan 2, 2024

Ousret removed the help wanted label Feb 20, 2024

milahu mentioned this issue Mar 10, 2024

Better handling of files with unknown character encoding tkarabela/pysubs2#43

Open

milahu mentioned this issue Jun 10, 2024

[XDCC] 'utf-8' codec can't decode byte pyload/pyload#4290

Open
