
[Request] Detecting encoding of subtitle files automatically #908

Closed
hkenneth opened this issue Jul 3, 2014 · 16 comments
hkenneth commented Jul 3, 2014

One of the front-ends of mplayer on OS X, MPlayerX (http://mplayerx.org/), has this feature. It detects the encoding of subtitle files automatically (even for external files) and displays them correctly, without users having to convert the encoding manually. I hope mpv could also implement such a function (maybe with an on/off option). It would be really helpful, since most SRT subtitle files you can find on the Internet are still encoded in the default encoding of their language (e.g. JIS for Japanese, GB2312 for Chinese).

ghost commented Jul 3, 2014

This is already implemented, though it requires that mpv be compiled with libenca. Also, the detection isn't very reliable. Do you happen to know what MPlayerX uses for detection?

hkenneth commented Jul 3, 2014

It seems MPlayerX uses UniversalDetector (https://github.com/siuying/UniversalDetector), an Objective-C wrapper around uchardet (https://code.google.com/p/uchardet/), to determine the encoding of the subtitles. uchardet itself is based on the C++ implementation of Mozilla's universal charset detection library (http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/).

ghost commented Jul 5, 2014

Did you check whether your mpv is compiled against ENCA? Are the ENCA results worse (in your case)?

hkenneth commented Jul 5, 2014

I am using ChrisK's Mac build. It seems it is not compiled with ENCA support, so I cannot test which one is better. I currently use "sub-codepage = utf8:gb18030" in my config file so that subtitles encoded in GB18030 get higher priority than UTF-8; however, I also have a few subtitles encoded in JIS which I manually converted to UTF-8. In an ideal world, everyone would just use Unicode...
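The fallback behavior behind a codepage priority setting like the one above can be sketched as follows (a stdlib-only illustration, not mpv's actual implementation; the function name and default codec list are hypothetical):

```python
def decode_with_priority(data: bytes, codecs=("utf-8", "gb18030")):
    """Try each codec in priority order; return (text, codec) for the
    first one that decodes the byte stream without errors."""
    for codec in codecs:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Last resort: decode permissively so playback never fails outright.
    return data.decode(codecs[-1], errors="replace"), codecs[-1]

text, used = decode_with_priority("你好".encode("gb18030"))
```

Since valid UTF-8 is rarely also valid in a legacy multibyte encoding, trying the stricter codec first tends to pick the right one; GB18030 bytes that are not valid UTF-8 fall through to the second attempt.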

ghost commented Jul 6, 2014

Well then, I'd say @ChrisK2 should make a build with ENCA enabled, so we can see if it's sufficient.

If not, we can think about uchardet support.

@ghost ghost added the meta:info-needed label Jul 18, 2015
@ghost ghost closed this as completed Jul 18, 2015
Jehan commented Jul 23, 2015

Is there a way to tell at runtime whether the mpv binary was compiled with ENCA?
I am using the RPM Fusion build for Fedora 22. I have no idea where to find the spec file used to build the mpv RPM, but when I try to uninstall ENCA, it requires uninstalling mpv as well, so I assume the RPM Fusion build is compiled with ENCA.

Also, I read in the ENCA package description on Fedora: «Currently, it has support for Belarussian, Bulgarian, Croatian, Czech, Estonian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese and some multibyte encodings (mostly variants of Unicode) independent on the language.»
That seems quite limited (particularly on the Asian side, apart from Chinese), covering mostly Latin and Cyrillic languages. Moreover, from the last release post: «enca is in maintenance mode only and I have no intentions to write new features»

On the other hand, uchardet seems to support quite a few encodings, and I imagine Mozilla Firefox's algorithm can't be too bad if it is meant to work in every country in the world.

I regularly read Korean subtitles, and sometimes Japanese ones. It is annoying that, whenever they are not in UTF-8, I have to open a terminal to specify an encoding. This has bothered me for years (in mplayer as well, and in other video players under Linux; most FLOSS software is not very "encoding-friendly" to non-Westerners).
Of course, in a perfect world everybody would just switch to UTF-8, but in the current flawed world a lot of people still write subtitles in other encodings (in Asian countries at least).

So could we reopen the current feature request? :-)
Thanks!

Jehan commented Jul 30, 2015

Hello again,

I think my previous message answered the "info-needed" label, didn't it?
Could this issue be reopened to add uchardet support (or any other library that works better than ENCA)?

@ghost ghost reopened this Jul 30, 2015
zhen-huan-hu commented
Once implemented, this could indeed become a selling point for mpv. Web browsers do character-encoding auto-detection all the time; hardly any video player does.

ghost commented Jul 30, 2015

Web browsers do character encoding auto-detection all the time.

And they misdetect it all the time too.

zhen-huan-hu commented
Based on my own experience (which may not generalize), I feel it has improved greatly in recent years. There used to be issues distinguishing Chinese from Japanese pages in browsers, but that rarely happens any more. I am not sure about other languages, though.

ghost commented Jul 30, 2015

I feel it has improved greatly in recent years.

But is that work included in uchardet? (And which one of the apparently numerous forks?)

Jehan commented Jul 30, 2015

And they misdetect it all the time too.

Well, it can't be worse than the current situation. Right now I get a 100% detection failure rate on Japanese and Korean subtitles as soon as they aren't UTF-8.

Also, looking further, `enca --list languages` gives a list of languages that confirms the website. So it is in fact "normal" that detection fails every time: Japanese and Korean are simply not supported.

On the other hand, I tested the chardetect utility, a Python re-implementation of Mozilla's auto-detection code, and it gave the correct answer on a dozen subtitle files (some UTF-8, others EUC-KR or CP949) with 0.99 confidence.
It may not be perfect for other languages, and I won't pretend my small set of subtitles is representative of the whole world, but it sure looks better in my particular case.
Assuming uchardet (based on the same algorithm) behaves the same, it would be a big improvement over the present situation.

That said, I have no preference. If you know of another encoding-detection library that performs even better, I'd be thrilled! :-)
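The detect-then-decode flow described above looks roughly like this (a toy stdlib-only sketch that mimics the shape of chardet's `detect()` result; the real library uses statistical models rather than this simple BOM check and trial decoding, and the confidence values here are made up):

```python
def toy_detect(data: bytes) -> dict:
    """Very rough stand-in for a detector like chardet.detect():
    check for a UTF-8 BOM, then trial-decode candidate codecs in order."""
    if data.startswith(b"\xef\xbb\xbf"):
        return {"encoding": "utf-8-sig", "confidence": 1.0}
    for codec, confidence in (("utf-8", 0.99), ("euc-kr", 0.5), ("cp949", 0.5)):
        try:
            data.decode(codec)
            return {"encoding": codec, "confidence": confidence}
        except UnicodeDecodeError:
            continue
    return {"encoding": None, "confidence": 0.0}

result = toy_detect("안녕하세요".encode("euc-kr"))
```

Note that ordering matters: CP949 is a superset of EUC-KR, so EUC-KR bytes also decode as CP949, and a real detector has to score candidates statistically rather than stop at the first successful decode.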

mia-0 commented Jul 30, 2015 via email

ghost commented Jul 30, 2015

If it really helps, I'm not opposed.

@ghost ghost closed this as completed in a74914a Aug 1, 2015
ghost commented Aug 1, 2015

Please test it!

Jehan commented Aug 2, 2015

Hi,

I compiled it, and it worked great on a bunch of Korean subtitles that previously ended up as garbled text (what your code calls the "UTF-8-BROKEN" encoding) with the enca detection.
I can't speak for everybody else, but I sure hope it soon becomes the default encoding detection. In particular, I want to be able to just drag and drop files, without wondering whether they are UTF-8 or not, and without opening a terminal to play videos.

P.S.: I had to comment out some mp_dbg() calls which crashed my build, but I don't think this was related. Cf. #2186.

This issue was closed.