Wrong character encoding #3

Closed
Sanyy opened this issue Feb 24, 2020 · 8 comments


Sanyy commented Feb 24, 2020

Hi,

"Strange" characters in Hungarian subtitles: these characters are not shown correctly (only some kind of weird placeholder is displayed):

é: U+00E9
á: U+00E1
ő: U+0151
ö: U+00F6
ó: U+00F3
ű: U+0171
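
For context, a placeholder of this kind is what typically appears when bytes from a legacy single-byte encoding are decoded as UTF-8. Below is a minimal Python sketch of that effect. It assumes, purely for illustration, that the subtitle bytes are ISO-8859-2 (a legacy encoding that contains all six characters); the file's actual encoding is not established at this point in the thread.

```python
# Assumption: the text is stored in a single-byte legacy encoding.
# ISO-8859-2 is chosen only because it can represent all six characters.
text = "éáőöóű"
raw = text.encode("iso-8859-2")

# None of these bytes forms a valid UTF-8 sequence, so with
# errors="replace" each one is turned into U+FFFD, the replacement
# character that is usually drawn as a "weird placeholder" glyph.
decoded = raw.decode("utf-8", errors="replace")
print([f"U+{ord(ch):04X}" for ch in decoded])
```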

Owner

gorhill commented Feb 24, 2020

I can't do anything unless you tell me everything I need to reproduce the issue on my side, without my having to search for anything. Provide details, and the URL of the subtitle file that is causing the issue.

Author

Sanyy commented Feb 24, 2020

Movie file:
"local"

URL to subtitle:
https://www.feliratok.info/index.php?action=letolt&fnev=The.Walking.Dead.S10E09.WEBRip.x264-ION10.srt&felirat=1582554037

Played in Firefox; the same file shows no issue in VLC.

Note: I thought it was because of Firefox, but when I tried to test with Iridium the add-on could not catch the video.
-> My fault: with Iridium I had to enable file access for the add-on, but the "weird" characters are shown there too.

Owner

gorhill commented Feb 24, 2020

Thanks for the information; I can reproduce, I will investigate.

Owner

gorhill commented Feb 24, 2020

The file encoding is ISO-8859-1 while I believe the browser uses the encoding of the web page -- likely UTF-8 -- to decode the text file. If I convert the file to UTF-8 (using a simple text editor), the captions render fine.

I'm not sure what the solution will be for now -- detecting the character encoding and converting to UTF-8 if needed is not a trivial feature, and I do not have much time to undertake this right now. As a workaround, maybe see if your text editor can convert to UTF-8.
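
To illustrate why detection is not trivial, here is a minimal Python sketch of one common heuristic: try strict UTF-8 first and fall back to a guessed legacy encoding. The function name `decode_subtitles` and the `fallback` parameter are hypothetical, and this is not necessarily how CCaptioner handles the problem.

```python
def decode_subtitles(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode subtitle bytes as UTF-8, else with a guessed legacy encoding."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8,
        # which makes UTF-8 comparatively easy to rule in or out.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The hard part: ISO-8859-1 accepts every byte, so this never fails,
        # but it silently yields the wrong letters if the guess is wrong.
        return raw.decode(fallback)


print(decode_subtitles("é".encode("utf-8")))  # valid UTF-8 input
print(decode_subtitles(b"\xe9"))              # legacy single-byte input
```

The fallback is the weak point: distinguishing one single-byte code page from another generally needs statistical or language-based guessing rather than a simple validity check.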

Owner

gorhill commented Feb 24, 2020

Reference, https://en.wikipedia.org/wiki/SubRip#Text_encoding:

SubRip's default output encoding is configured as Windows-1252. However, output options are also given for many Windows code pages as well as Unicode encodings, such as UTF-8 and UTF-16, with or without Byte Order Mark (BOM). Therefore, there's no de facto character encoding standard for .srt files, which means that any SubRip file parser must attempt to use Charset detection. Unicode Byte Order Marks (BOM) are typically used to aid detection.
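
The BOM check mentioned in that passage can be sketched as follows. This is a minimal Python illustration using the BOM constants from the standard `codecs` module; the helper name `sniff_bom` is hypothetical. A file without a BOM yields `None` and still needs some other form of detection.

```python
import codecs
from typing import Optional

# Longer BOMs are checked first, because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw: bytes) -> Optional[str]:
    """Return a Python codec name if raw starts with a known BOM, else None."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            # These codecs consume the BOM themselves when decoding.
            return codec
    return None


sample = codecs.BOM_UTF16_LE + "é".encode("utf-16-le")
print(sniff_bom(sample), sample.decode(sniff_bom(sample)))
```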

Author

Sanyy commented Feb 24, 2020

I contacted the guy "who did this". He might pay attention to this in the future.

Until then, I can convert it myself:
iconv -f ISO-8859-1 -t UTF-8 The.Walking.Dead.S10E09.WEBRip.x264-ION10.srt -o The.Walking.Dead.S10E09.WEBRip.x264-ION10-UTF-8.srt
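
One thing worth checking before converting: ő (U+0151) and ű (U+0171) are not part of ISO-8859-1, so a file containing them is probably in a different code page, such as ISO-8859-2 or Windows-1250. The Python sketch below shows how the choice of source encoding changes the result. It assumes ISO-8859-2 purely for illustration; the file's real encoding is not confirmed in this thread.

```python
# Assumption: 0xF5 and 0xFB are the bytes ISO-8859-2 uses for ő and ű.
raw = bytes([0xF5, 0xFB])

as_latin1 = raw.decode("iso-8859-1")  # what "-f ISO-8859-1" would assume
as_latin2 = raw.decode("iso-8859-2")
print(as_latin1, as_latin2)

# ISO-8859-1 has no code point for ő at all, so encoding it fails.
try:
    "ő".encode("iso-8859-1")
    latin1_has_o_double_acute = True
except UnicodeEncodeError:
    latin1_has_o_double_acute = False
print(latin1_has_o_double_acute)
```

If the converted subtitles show õ and û where ő and ű are expected, `iconv -f ISO-8859-2` may be the variant to try.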

Thank you very much gorhill!

Sanyy closed this as completed Feb 24, 2020
gorhill reopened this Feb 24, 2020
Owner

gorhill commented Feb 24, 2020

Best to keep this open. Though I don't see the solution as trivial, ideally CCaptioner should seamlessly deal with encodings other than UTF-8.

gorhill added a commit that referenced this issue Feb 26, 2020
Owner

gorhill commented Feb 26, 2020

Fixed in 1.1.0.

gorhill closed this as completed Feb 26, 2020