
[Request] Detecting encoding of subtitle files automatically #908

Closed
hkenneth opened this issue Jul 3, 2014 · 16 comments
hkenneth commented Jul 3, 2014

One of the front-ends of mplayer on OS X, MPlayerX (http://mplayerx.org/), has this feature. It detects the encoding of subtitle files automatically (even for external files) and displays them correctly, without users having to convert the encoding manually. I hope mpv could also implement such a function (maybe with an on/off option). It would be really helpful, since most SRT subtitle files you can find on the Internet are still encoded in the default encoding of their language (e.g. JIS for Japanese, GB2312 for Chinese).

ghost commented Jul 3, 2014

This is already implemented, though it requires that mpv be compiled with libenca. Also, the detection isn't very reliable. Do you happen to know what MPlayerX uses for detection?

hkenneth commented Jul 3, 2014

It seems MPlayerX uses UniversalDetector (https://github.com/siuying/UniversalDetector), an Objective-C wrapper around uchardet (https://code.google.com/p/uchardet/), to determine the encoding of the subtitles. uchardet itself is based on the C++ implementation of Mozilla's universal charset detection library (http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/).

ghost commented Jul 5, 2014

Did you check whether your mpv is compiled against ENCA? Are the ENCA results worse (in your case)?

hkenneth commented Jul 5, 2014

I am using ChrisK's Mac build. It seems it is not compiled with ENCA support, so I cannot test which one is better. I currently use "sub-codepage = utf8:gb18030" in my config file so that subtitles encoded in GB18030 get higher priority than UTF-8; however, I also have a few subtitles encoded in JIS which I manually converted to UTF-8. In an ideal world, everyone would just use Unicode...
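The fallback behavior behind a codepage priority setting like the one above can be sketched as follows (a stdlib-only illustration, not mpv's actual implementation; the function name and default codec list are hypothetical):

```python
def decode_with_priority(data: bytes, codecs=("utf-8", "gb18030")):
    """Try each codec in priority order; return (text, codec) for the
    first one that decodes the byte stream without errors."""
    for codec in codecs:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Last resort: decode permissively so playback never fails outright.
    return data.decode(codecs[-1], errors="replace"), codecs[-1]

text, used = decode_with_priority("你好".encode("gb18030"))
```

Since valid UTF-8 is rarely also valid in a legacy multibyte encoding, trying the stricter codec first tends to pick the right one; GB18030 bytes that are not valid UTF-8 fall through to the second attempt.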

ghost commented Jul 6, 2014

Well then, I'd say @ChrisK2 should make a build with ENCA enabled, so we can see if it's sufficient.

If not, we can think about uchardet support.

@ghost ghost added the meta:info-needed label Jul 18, 2015
@ghost ghost closed this as completed Jul 18, 2015
Jehan commented Jul 23, 2015

Is there a way to tell at runtime whether the mpv binary was compiled with ENCA?
I am using the RPM Fusion build for Fedora 22. I have no idea where to find the spec file used to build the mpv RPM, but when I try to uninstall ENCA, it requires uninstalling mpv as well, so I assume the RPM Fusion build is compiled with ENCA.

Also, I read in the ENCA package description on Fedora: «Currently, it has support for Belarussian, Bulgarian, Croatian, Czech, Estonian, Latvian, Lithuanian, Polish, Russian, Slovak, Slovene, Ukrainian, Chinese and some multibyte encodings (mostly variants of Unicode) independent on the language.»
That seems quite limited (particularly on the Asian side, apart from Chinese), covering mostly Latin and Cyrillic languages. Moreover, from the last release post: «enca is in maintenance mode only and I have no intentions to write new features»

On the other hand, uchardet seems to support quite a few encodings, and I imagine Mozilla Firefox's algorithm can't be too bad if it is meant to work in every country in the world.

I regularly read Korean subtitles, and sometimes Japanese ones. It is annoying that, whenever they are not in UTF-8, I have to open a terminal to specify an encoding. This has bothered me for years (in mplayer as well, and in other video players under Linux; most FLOSS software is not very "encoding-friendly" to non-Westerners).
Of course, in a perfect world everybody would just switch to UTF-8, but in the current flawed world a lot of people still write subtitles in other encodings (in Asian countries at least).

So could we reopen the current feature request? :-)
Thanks!

Jehan commented Jul 30, 2015

Hello again,

I think my previous message answered the "info-needed" label, didn't it?
Could this issue be reopened to add uchardet support (or any other library that works better than ENCA)?

@ghost ghost reopened this Jul 30, 2015
zhen-huan-hu commented
Once implemented, this could indeed become a selling point for mpv. Web browsers do character-encoding auto-detection all the time; hardly any video player does.

ghost commented Jul 30, 2015

Web browsers do character encoding auto-detection all the time.

And they misdetect it all the time too.

zhen-huan-hu commented
Based on my own experience (which may not generalize), I feel it has improved greatly in recent years. There used to be issues distinguishing Chinese from Japanese pages in browsers, but that rarely happens any more. I am not sure about other languages, though.

ghost commented Jul 30, 2015

I feel it has improved greatly in recent years.

But is that work included in uchardet? (And which one of the apparently numerous forks?)

Jehan commented Jul 30, 2015

And they misdetect it all the time too.

Well, it can't be worse than the current situation. Right now I get a 100% detection failure rate on Japanese and Korean subtitles as soon as they aren't UTF-8.

Also, looking further, `enca --list languages` gives a list of languages that confirms the website. So it is in fact "normal" that detection fails every time: Japanese and Korean are simply not supported.

On the other hand, I tested the chardetect utility, a Python re-implementation of Mozilla's auto-detection code, and it gave the correct answer on a dozen subtitle files (some UTF-8, others EUC-KR or CP949) with 0.99 confidence.
It may not be perfect for other languages, and I won't pretend my small set of subtitles is representative of the whole world, but it sure looks better in my particular case.
Assuming uchardet (based on the same algorithm) behaves the same, it would be a big improvement over the present situation.

That said, I have no preference. If you know of another encoding-detection library that performs even better, I'd be thrilled! :-)
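The detect-then-decode flow described above looks roughly like this (a toy stdlib-only sketch that mimics the shape of chardet's `detect()` result; the real library uses statistical models rather than this simple BOM check and trial decoding, and the confidence values here are made up):

```python
def toy_detect(data: bytes) -> dict:
    """Very rough stand-in for a detector like chardet.detect():
    check for a UTF-8 BOM, then trial-decode candidate codecs in order."""
    if data.startswith(b"\xef\xbb\xbf"):
        return {"encoding": "utf-8-sig", "confidence": 1.0}
    for codec, confidence in (("utf-8", 0.99), ("euc-kr", 0.5), ("cp949", 0.5)):
        try:
            data.decode(codec)
            return {"encoding": codec, "confidence": confidence}
        except UnicodeDecodeError:
            continue
    return {"encoding": None, "confidence": 0.0}

result = toy_detect("안녕하세요".encode("euc-kr"))
```

Note that ordering matters: CP949 is a superset of EUC-KR, so EUC-KR bytes also decode as CP949, and a real detector has to score candidates statistically rather than stop at the first successful decode.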

mia-0 commented Jul 30, 2015 via email

ghost commented Jul 30, 2015

If it really helps, I'm not opposed.

@ghost ghost closed this as completed in a74914a Aug 1, 2015
ghost commented Aug 1, 2015

Please test it!

Jehan commented Aug 2, 2015

Hi,

I compiled it, and it worked great on a bunch of Korean subtitles that previously ended up as garbled text (what your code calls the "UTF-8-BROKEN" encoding) with the enca detection.
I can't speak for everybody else, but I sure hope it soon becomes the default encoding detection. In particular, I want to be able to just drag and drop files, without wondering whether they are UTF-8 or not, and without opening a terminal to play videos.

P.S.: I had to comment out some mp_dbg() calls which crashed my build, but I don't think this was related. Cf. #2186.

This issue was closed.