[Request] Detecting encoding of subtitle files automatically #908
Comments
This is already implemented. Though it requires that mpv is compiled with libenca. Also, the detection isn't too reliable. Do you happen to know what mplayerx uses for detection? |
It seems mplayerx uses UniversalDetector (https://github.com/siuying/UniversalDetector), which is an Objective-C wrapper of uchardet (https://code.google.com/p/uchardet/), to determine the encoding of the sub. uchardet itself is based on C++ implementation of the universal charset detection library by Mozilla (http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/). |
Did you check whether your mpv is compiled against ENCA? Are the ENCA results worse (in your case)? |
I am using ChrisK's Mac version. It seems it is not compiled with ENCA support, so I cannot test which one is better. I currently use "sub-codepage = utf8:gb18030" in my config file so that subtitles encoded in GB18030 would have higher priority than UTF-8; however, I also have a few subtitles encoded in JIS, which I manually converted to UTF-8. In an ideal world, people should all start using Unicode... |
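For readers unfamiliar with a setting like "utf8:gb18030", here is a minimal Python sketch of one plausible reading of it: try strict UTF-8 first, and only if the bytes are not valid UTF-8 fall back to the named codepage. That reading is an assumption made for illustration, not a description of mpv's actual code, and the function name `decode_with_fallback` is hypothetical.

```python
from typing import Tuple


def decode_with_fallback(data: bytes, fallback: str = "gb18030") -> Tuple[str, str]:
    """Return (text, encoding_used) for raw subtitle bytes.

    Assumption: valid UTF-8 wins; anything else is treated as `fallback`.
    """
    try:
        # Strict decoding raises on the first byte sequence that is not UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD instead of raising again.
        return data.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    raw = "字幕".encode("gb18030")  # stands in for the bytes of a .srt file
    text, used = decode_with_fallback(raw)
    print(used, text)
```

A fixed fallback like this only covers one legacy encoding at a time, which is why a file in a third encoding (JIS in the comment above) still needs manual conversion.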
Well then, I'd say @ChrisK2 should make a build with ENCA enabled, so we can see if it's sufficient. If not, we can think about uchardet support. |
Is there a runtime way to know whether or not the mpv binary is compiled with ENCA enabled? Also, I read in the ENCA package description on Fedora: « Currently, it has support for Belarussian, Bulgarian, Croatian, Czech, Estonian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese and some multibyte encodings (mostly variants of Unicode) ». On the other hand, uchardet seems to support quite a few encodings, and I imagine that the Mozilla Firefox algorithm should not be too crappy if it is meant to be used in every country in the world. I myself regularly read Korean subtitles, and sometimes Japanese subtitles. It is annoying that each time they are not written in UTF-8, I have to get the command line out to specify an encoding. This has been bothering me for years (in mplayer as well, and in other video players under Linux; most FLOSS software is not very "encoding-friendly" to non-Westerners). So could we reopen the current feature request? :-) |
Hello again, I think my previous message answered the "info-needed", didn't it? |
Once implemented, this could indeed become a selling point for mpv. Web browsers do character encoding auto-detection all the time. Hardly any video player does that. |
And they misdetect it all the time too. |
Based on my own experience (which may not be true in general), I feel that it has improved greatly in recent years. There used to be some issues in detecting Chinese vs. Japanese pages in browsers, but that rarely happens anymore. I am not quite sure about other languages though. |
But is that work included in uchardet? (And which one of the apparently numerous forks?) |
Well, it can't be worse than the current situation. Right now, I get 100% detection failure on Japanese and Korean subtitles as soon as they don't use UTF-8. Also by checking further, I see that On the other hand, I tested the chardetect utility, which is a re-implementation of the Mozilla auto-detection code in Python, and it gave me 100% good answers on a dozen subtitle files (some using UTF-8, others EUC-KR or CP949) with a 0.99 confidence. That said, I have no preference. If you know another encoding detection lib which would perform even better, I'd be thrilled! :-) |
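As a rough illustration of why detection needs more than "does it decode without errors", the sketch below lists which candidate encodings accept a given byte string. It uses only the Python standard library and is not how chardetect, uchardet, or ENCA work internally; the helper name `encodings_that_decode` and the sample text are hypothetical.

```python
from typing import Iterable, List


def encodings_that_decode(data: bytes, candidates: Iterable[str]) -> List[str]:
    """Return the candidates under which `data` decodes without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)  # strict mode: raises on invalid sequences
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


if __name__ == "__main__":
    sample = "한국어 자막".encode("euc_kr")  # Korean text in a legacy encoding
    print(encodings_that_decode(sample, ["utf-8", "euc_kr", "cp949"]))
```

Strict UTF-8 rejects the sample, but both EUC-KR and its superset CP949 accept it, so a validity check alone cannot pick a single answer. Statistical detectors narrow this down by scoring how plausible the decoded text is, which is where a confidence figure such as 0.99 comes from.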
The current version of uchardet (https://github.com/BYVoid/uchardet) no longer
fails to detect at least one of my Shift_JIS samples, so I guess it actually
did get better than it was a few years ago.
Also, seems like all the relevant Linux distros have packages for it now.
|
If it really helps, I'm not opposed. |
Please test it! |
Hi, I compiled it, and it worked great for a bunch of Korean subtitles which would normally end up as garbled text (called "UTF-8-BROKEN" encoding in your code) with the enca detection. P.S.: I had to comment out some mp_dbg() calls which crashed my build, but I don't think this was related. Cf. #2186. |
One of the front-ends of mplayer on OSX, MplayerX (http://mplayerx.org/), has this function. It detects the encoding of subtitle files automatically (even if it is an external file) and displays them correctly without users having to convert the encoding manually. I hope mpv could also implement such a function (maybe with an on/off option). It would be really helpful, since most SRT subtitle files that you can find on the Internet are still encoded in the default encoding of their languages (e.g. JIS for Japanese, GB2312 for Chinese).