Changed default encoding detection behavior to try uchardet if ENCA failed #2193

Jehan · 2015-08-04T12:22:47Z

Hi,

I understand that ENCA should still stay the default detection algorithm for the time being (even with uchardet built now in #908).
And actually checking the list of encoding supported by ENCA, even though has much more, there still seems to be some not supported by uchardet (for instance I can't find CP1257/Windows-1257 in uchardet list, unless it has other naming).

But I propose this improvement with the default detection: if ENCA fails, then uchardet is tried.
I tried with some Korean subtitle and it worked great: ENCA not being able to recognize the encoding, it returned NULL and uchardet took the relay, and got the right encoding! :-)

ghost · 2015-08-04T12:47:37Z

Also, no need to separate the code and manpage commits. They belong together.

Jehan · 2015-08-04T13:43:06Z

Commits merged.

As for the comment about replacing enca by uchardet, I won't mind for my own current use cases. But comparing the lists of supported encoding by both libs, it seems that ENCA might still have some encoding not supported by uchardet, in particular CP1257, KOI8-UNI, KOI8-U (just checked a few lines).

I say "might" though because of the encoding name jungle, with encoding having many names, other being compatible with older ones (like GBK is not in uchardet list, but according to Wikipedia GB18030 is a newer standard compatible with GBK), etc. So the lib may complete each other and it is possible that just replacing enca by uchardet, some European users would see regressions.

This is why I went for both ENCA, then uchardet instead, which would cover the union of all supported languages.

Jehan · 2015-08-04T13:44:23Z

Oh well, I see that you removed your comment about replacing ENCA by uchardet.

ghost · 2015-08-04T13:47:21Z

No, I didn't. Github does this when removing commits from a PR when force-pushing to it.

ghost · 2015-08-04T13:53:23Z

I don't like this idea of using ENCA and making uchardet a fallback. This also essentially disables uchardet by default if ENCA isn't compiled.

Jehan · 2015-08-04T14:01:57Z

Github does this when removing commits from a PR when force-pushing to it.

Oh I see. That's confusing for the discussion.

Checking further uchardet, it also seems to miss ISO-8859-2, ISO-8859-4, etc. According to Wikipedia, that covers a whole bunch of Eastern and North Europe languages. It really look like both lib complete each others.

I don't like this idea of using ENCA and making uchardet a fallback.

Well we could do the opposite, I don't mind.

This also essentially disables uchardet by default if ENCA isn't compiled.

No it doesn't. Just to be sure, I even tested right now by rebuilding after disabling enca. You can still add --sub-codepage=uchardet to the command line (and it works), and the default behavior becomes just checking directly though uchardet.

So what do you want to do?

ghost · 2015-08-04T14:06:41Z

Checking further uchardet, it also seems to miss ISO-8859-2, ISO-8859-4, etc. According to Wikipedia, that covers a whole bunch of Eastern and North Europe languages. It really look like both lib complete each others.

Did you test this?

You can still add --sub-codepage=uchardet to the command line (and it works),

Yes, but not with the defaults.

zhen-huan-hu · 2015-08-04T14:12:24Z

Just wanna comment that GB18030 (also called CP54936 by Microsoft, including 27484 Chinese characters) indeed is a superset of GBK (also called CP936 by Microsoft, including 21886 characters) and GB2312 (7445 characters) and compatible with both. So subtitles encoded in GBK or GB2312 (not very likely) can both be decoded properly under the definition of GB18030.

Jehan · 2015-08-04T14:16:57Z

Did you test this?

No because I don't have subtitles from any of these countries in any of these encodings. I will see if I can build one quickly by grabing some text over Wikipedia.

Yes, but not with the defaults.

It did work. I disabled enca, then ran mpv with a EUC-KR subtitle and no command line option, and uchardet detected the encoding.

ghost · 2015-08-04T14:21:01Z

It did work.

OK, I see now.

Jehan · 2015-08-04T15:21:08Z

Ok I created and tested a few files with some Hungarian in ISO-8859-2 and Estonian in both ISO-8859-4 and CP1257. Well ENCA failed. It has never been able to detect ISO-8859-2 without the hint language; and any of the Estonian files, this time even with hint language, it would not detect the right encoding (if any at all).
Also I confirmed that uchardet does not know about them as well, by the way.

Now thinking back about it, that's probably normal. All these encoding are on 8-bit, which make them quite difficult (if sometimes possible at all) to differentiate.

And the doc actually says:

The special language none can be shortened to __, it contains no 8bit encodings, so only multibyte encodings are detected.

So that makes me think that indeed ENCA seems useless as a default.
The only points which stop me are:
1/ I haven't tested all encoding from the list. It may still be useful somewhere.
2/ ENCA manual also says:

It tries to determine your language and preferred charset from locale settings, which might not be what you want.

So that makes me think that someone with an Hungarian locale for instance would maybe get the right encoding with one's locale being used as language hint.
Note that I haven't tested the later though since I really don't want to build an Hungarian (or any other than I need) locale.

Jehan · 2015-08-04T15:28:51Z

Also I don't think I'll do much more tests. I already spent a little too much for what I wanted.
I just wanted to avoid regressions for others.

But after all these tests, it seems that ENCA may not be that useful without a language hint (though once again I haven't tested all cases and especially whether it really uses the user's locale as a hint or not, if none was specified). So if you still feel like dropping ENCA as a default, I will do the change. :-)

ghost · 2015-08-04T15:29:18Z

So if you still feel like dropping ENCA as a default, I will do the change. :-)

Yep.

If mpv is not built with uchardet, "enca" is still the fallback default encoding detection.

Jehan · 2015-08-04T15:40:19Z

Done.

ghost · 2015-08-04T16:05:57Z

Thanks. Merged as e7897df.

Jehan force-pushed the master branch from 248c478 to e870b9e Compare August 4, 2015 13:29

charset_conv: "auto" encoding detection now uses uchardet.

b11383a

If mpv is not built with uchardet, "enca" is still the fallback default encoding detection.

Jehan force-pushed the master branch from e870b9e to b11383a Compare August 4, 2015 15:40

ghost closed this Aug 4, 2015

xLn2 mentioned this pull request Feb 9, 2024

unwanted character rendering of subtitles with incorrectly defined character set clsid2/mpc-hc#2548

Closed

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changed default encoding detection behavior to try uchardet if ENCA failed #2193

Changed default encoding detection behavior to try uchardet if ENCA failed #2193

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

Jehan commented Aug 4, 2015

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

ghost commented Aug 4, 2015

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

zhen-huan-hu commented Aug 4, 2015

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

Jehan commented Aug 4, 2015

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

Changed default encoding detection behavior to try uchardet if ENCA failed #2193

Changed default encoding detection behavior to try uchardet if ENCA failed #2193

Conversation

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

Jehan commented Aug 4, 2015

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

ghost commented Aug 4, 2015

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

zhen-huan-hu commented Aug 4, 2015

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

Jehan commented Aug 4, 2015

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015

Jehan commented Aug 4, 2015

ghost commented Aug 4, 2015