Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] encoding is not detected when the input contains multiple encodings #405

Closed
milahu opened this issue Dec 30, 2023 · 1 comment
Closed
Labels
bug Something isn't working

Comments

@milahu
Copy link

milahu commented Dec 30, 2023

encoding is not detected when the input contains multiple encodings

i tried to fix this by calling charset_normalizer.from_bytes
with custom threshold steps chunk_size values, but no luck

To Reproduce

import charset_normalizer
with open("Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt", "rb") as f:
  file_content = f.read()
content_encoding = charset_normalizer.from_bytes(file_content).best()
assert content_encoding != None

input file

Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt

note: the input file is broken

$ iconv -f Big5 -t utf8 "Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt" >/dev/null 
iconv: illegal input sequence at position 4726

this is because opensubtitles.org (which is run by idiots)
inserted its advertisment with utf8 encoding (72)
into the file with big5 encoding (71)

71
00:04:06,110 --> 00:04:09,960
���ȧڭ̭��{���O���N�~�j�z�o

72
00:04:11,000 --> 00:04:17,074
想在此处添加您的广告信息?立即联系 www.OpenSubtitles.org

so to fix this case
i will have to split the subtitle into "textparts" using pysubs2
and run charset_normalizer.from_bytes on each textpart

or split the input text by lines, and use a logarithmic algorithm
to find the positions of the different encodings

probably this (detect multiple encodings) is out of scope for charset_normalizer

Expected behavior

at least the "main" encoding should be detected
and the other encoding should be tolerated as noise

here: detect big5 encoding, tolerate utf8 encoding

1
00:00:00,200 --> 00:00:02,290
<i><font color=dc0808>"第十三號倉庫"</font> 前情提要</i>

2
00:00:03,440 --> 00:00:05,290
我們自己的倉庫醫生 嗯?

env

>>> charset_normalizer.__version__
'3.0.1'
@milahu milahu added bug Something isn't working help wanted Extra attention is needed labels Dec 30, 2023
@Ousret
Copy link
Member

Ousret commented Jan 2, 2024

probably this (detect multiple encodings) is out of scope for charset_normalizer

Your impression is right. Hybrid file are out of scope as doing "detection" on them would require a substantial effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants