Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed default encoding detection behavior to try uchardet if ENCA failed #2193

Closed
wants to merge 1 commit into from

Conversation

Jehan
Copy link

@Jehan Jehan commented Aug 4, 2015

Hi,

I understand that ENCA should still stay the default detection algorithm for the time being (even with uchardet built now in #908).
And actually checking the list of encoding supported by ENCA, even though has much more, there still seems to be some not supported by uchardet (for instance I can't find CP1257/Windows-1257 in uchardet list, unless it has other naming).

But I propose this improvement with the default detection: if ENCA fails, then uchardet is tried.
I tried with some Korean subtitle and it worked great: ENCA not being able to recognize the encoding, it returned NULL and uchardet took the relay, and got the right encoding! :-)

@ghost
Copy link

ghost commented Aug 4, 2015

Also, no need to separate the code and manpage commits. They belong together.

@Jehan
Copy link
Author

Jehan commented Aug 4, 2015

Commits merged.

As for the comment about replacing enca by uchardet, I won't mind for my own current use cases. But comparing the lists of supported encoding by both libs, it seems that ENCA might still have some encoding not supported by uchardet, in particular CP1257, KOI8-UNI, KOI8-U (just checked a few lines).

I say "might" though because of the encoding name jungle, with encoding having many names, other being compatible with older ones (like GBK is not in uchardet list, but according to Wikipedia GB18030 is a newer standard compatible with GBK), etc. So the lib may complete each other and it is possible that just replacing enca by uchardet, some European users would see regressions.

This is why I went for both ENCA, then uchardet instead, which would cover the union of all supported languages.

@Jehan
Copy link
Author

Jehan commented Aug 4, 2015

Oh well, I see that you removed your comment about replacing ENCA by uchardet.

@ghost
Copy link

ghost commented Aug 4, 2015

No, I didn't. Github does this when removing commits from a PR when force-pushing to it.

@ghost
Copy link

ghost commented Aug 4, 2015

I don't like this idea of using ENCA and making uchardet a fallback. This also essentially disables uchardet by default if ENCA isn't compiled.

@Jehan
Copy link
Author

Jehan commented Aug 4, 2015

Github does this when removing commits from a PR when force-pushing to it.

Oh I see. That's confusing for the discussion.

Checking further uchardet, it also seems to miss ISO-8859-2, ISO-8859-4, etc. According to Wikipedia, that covers a whole bunch of Eastern and North Europe languages. It really look like both lib complete each others.

I don't like this idea of using ENCA and making uchardet a fallback.

Well we could do the opposite, I don't mind.

This also essentially disables uchardet by default if ENCA isn't compiled.

No it doesn't. Just to be sure, I even tested right now by rebuilding after disabling enca. You can still add --sub-codepage=uchardet to the command line (and it works), and the default behavior becomes just checking directly though uchardet.

So what do you want to do?

@ghost
Copy link

ghost commented Aug 4, 2015

Checking further uchardet, it also seems to miss ISO-8859-2, ISO-8859-4, etc. According to Wikipedia, that covers a whole bunch of Eastern and North Europe languages. It really look like both lib complete each others.

Did you test this?

You can still add --sub-codepage=uchardet to the command line (and it works),

Yes, but not with the defaults.

@zhen-huan-hu
Copy link

Just wanna comment that GB18030 (also called CP54936 by Microsoft, including 27484 Chinese characters) indeed is a superset of GBK (also called CP936 by Microsoft, including 21886 characters) and GB2312 (7445 characters) and compatible with both. So subtitles encoded in GBK or GB2312 (not very likely) can both be decoded properly under the definition of GB18030.

@Jehan
Copy link
Author

Jehan commented Aug 4, 2015

Did you test this?

No because I don't have subtitles from any of these countries in any of these encodings. I will see if I can build one quickly by grabing some text over Wikipedia.

Yes, but not with the defaults.

It did work. I disabled enca, then ran mpv with a EUC-KR subtitle and no command line option, and uchardet detected the encoding.

@ghost
Copy link

ghost commented Aug 4, 2015

It did work.

OK, I see now.

@Jehan
Copy link
Author

Jehan commented Aug 4, 2015

Ok I created and tested a few files with some Hungarian in ISO-8859-2 and Estonian in both ISO-8859-4 and CP1257. Well ENCA failed. It has never been able to detect ISO-8859-2 without the hint language; and any of the Estonian files, this time even with hint language, it would not detect the right encoding (if any at all).
Also I confirmed that uchardet does not know about them as well, by the way.

Now thinking back about it, that's probably normal. All these encoding are on 8-bit, which make them quite difficult (if sometimes possible at all) to differentiate.

And the doc actually says:

The special language none can be shortened to __, it contains no 8bit encodings, so only multibyte encodings are detected.

So that makes me think that indeed ENCA seems useless as a default.
The only points which stop me are:
1/ I haven't tested all encoding from the list. It may still be useful somewhere.
2/ ENCA manual also says:

It tries to determine your language and preferred charset from locale settings, which might not be what you want.

So that makes me think that someone with an Hungarian locale for instance would maybe get the right encoding with one's locale being used as language hint.
Note that I haven't tested the later though since I really don't want to build an Hungarian (or any other than I need) locale.

@Jehan
Copy link
Author

Jehan commented Aug 4, 2015

Also I don't think I'll do much more tests. I already spent a little too much for what I wanted.
I just wanted to avoid regressions for others.

But after all these tests, it seems that ENCA may not be that useful without a language hint (though once again I haven't tested all cases and especially whether it really uses the user's locale as a hint or not, if none was specified). So if you still feel like dropping ENCA as a default, I will do the change. :-)

@ghost
Copy link

ghost commented Aug 4, 2015

So if you still feel like dropping ENCA as a default, I will do the change. :-)

Yep.

If mpv is not built with uchardet, "enca" is still the fallback default
encoding detection.
@Jehan
Copy link
Author

Jehan commented Aug 4, 2015

Done.

@ghost
Copy link

ghost commented Aug 4, 2015

Thanks. Merged as e7897df.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants