encoding is not detected when the input contains multiple encodings
I tried to fix this by calling `charset_normalizer.from_bytes` with custom `threshold`, `steps`, and `chunk_size` values, but no luck.
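Roughly the kind of tuning that was attempted; the parameter values below are illustrative placeholders, not the exact ones tried (`steps`, `chunk_size`, and `threshold` are real `from_bytes` keyword arguments):

```python
# Sketch of the tuning attempts; values are illustrative only.
import charset_normalizer

with open("Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt", "rb") as f:
    data = f.read()

result = charset_normalizer.from_bytes(
    data,
    steps=10,        # probe more chunks of the payload
    chunk_size=256,  # use smaller chunks
    threshold=0.4,   # tolerate more "mess" per chunk
).best()
print(result)        # per the report, this still comes back empty-handed
```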
To Reproduce

```python
import charset_normalizer

with open("Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt", "rb") as f:
    file_content = f.read()

content_encoding = charset_normalizer.from_bytes(file_content).best()
assert content_encoding is not None
```
Input file: `Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt`

Note: the input file is broken:
```
$ iconv -f Big5 -t utf8 "Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt" >/dev/null
iconv: illegal input sequence at position 4726
```
This is because opensubtitles.org (which is run by idiots) inserted its advertisement with UTF-8 encoding (cue 72) into a file with Big5 encoding (cue 71):
```
71
00:04:06,110 --> 00:04:09,960
���ȧڭ̭��{���O���N�~�j�z�o

72
00:04:11,000 --> 00:04:17,074
想在此处添加您的广告信息?立即联系 www.OpenSubtitles.org
```
So to fix this case, I will have to split the subtitle into "text parts" using pysubs2 and run `charset_normalizer.from_bytes` on each part; a sketch of that idea follows.
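A rough sketch of the per-chunk idea; it splits the raw bytes on blank-line cue boundaries rather than going through pysubs2 (which expects already-decoded text), so treat it as an approximation of the approach described above:

```python
# Rough sketch (not the exact approach): split the raw SRT bytes on
# blank-line cue boundaries and detect each chunk's encoding separately.
import re
import charset_normalizer

with open("Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt", "rb") as f:
    raw = f.read()

for chunk in re.split(rb"\r?\n\r?\n", raw):
    if not chunk.strip():
        continue
    best = charset_normalizer.from_bytes(chunk).best()
    print(best.encoding if best else None, chunk[:40])
```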
Alternatively, I could split the input text by lines and use a logarithmic number of decode attempts (a bisection) to find the positions where the encoding changes; see the sketch below.
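A sketch of the bisection idea, under the assumption that the file starts in the dominant encoding and the foreign block begins at the first line whose prefix no longer decodes; `first_bad_line` is a hypothetical helper, not part of `charset_normalizer`:

```python
# Hypothetical helper: find the index of the first line at which decoding
# the prefix as `encoding` starts to fail, using O(log n) decode attempts.
def first_bad_line(lines: list[bytes], encoding: str = "big5") -> int:
    lo, hi = 0, len(lines)
    while lo < hi:
        mid = (lo + hi) // 2
        try:
            b"\n".join(lines[: mid + 1]).decode(encoding)
            lo = mid + 1   # prefix still decodes cleanly; the break is later
        except UnicodeDecodeError:
            hi = mid       # the break is at or before this line
    return lo              # == len(lines) if the whole file decodes

# Example use on the raw bytes read earlier:
# boundary = first_bad_line(raw.splitlines())
```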
Detecting multiple encodings is probably out of scope for `charset_normalizer`.
Expected behavior

At least the "main" encoding should be detected, and the other encoding should be tolerated as noise. Here: detect the Big5 encoding, tolerate the UTF-8 encoding:
```
1
00:00:00,200 --> 00:00:02,290
<i><font color=dc0808>"第十三號倉庫"</font> 前情提要</i>

2
00:00:03,440 --> 00:00:05,290
我們自己的倉庫醫生 嗯?
```
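As a caller-side stopgap (not something `charset_normalizer` provides), once the dominant encoding is known or guessed, the foreign block can be degraded to replacement characters:

```python
# Manual workaround sketch: decode with the dominant encoding and let the
# UTF-8 advertisement block turn into U+FFFD replacement characters.
with open("Warehouse.13.S03E03.720p.HDTV.X264-DIMENSION.cht.srt", "rb") as f:
    text = f.read().decode("big5", errors="replace")
```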
env

```python
>>> charset_normalizer.__version__
'3.0.1'
```
Your impression is right. Hybrid files are out of scope, as doing "detection" on them would require a substantial effort.