-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed default encoding detection behavior to try uchardet if ENCA failed #2193
Conversation
Also, no need to separate the code and manpage commits. They belong together. |
Commits merged. As for the comment about replacing enca by uchardet, I won't mind for my own current use cases. But comparing the lists of supported encoding by both libs, it seems that ENCA might still have some encoding not supported by uchardet, in particular CP1257, KOI8-UNI, KOI8-U (just checked a few lines). I say "might" though because of the encoding name jungle, with encoding having many names, other being compatible with older ones (like GBK is not in uchardet list, but according to Wikipedia GB18030 is a newer standard compatible with GBK), etc. So the lib may complete each other and it is possible that just replacing enca by uchardet, some European users would see regressions. This is why I went for both ENCA, then uchardet instead, which would cover the union of all supported languages. |
Oh well, I see that you removed your comment about replacing ENCA by uchardet. |
No, I didn't. Github does this when removing commits from a PR when force-pushing to it. |
I don't like this idea of using ENCA and making uchardet a fallback. This also essentially disables uchardet by default if ENCA isn't compiled. |
Oh I see. That's confusing for the discussion. Checking further uchardet, it also seems to miss ISO-8859-2, ISO-8859-4, etc. According to Wikipedia, that covers a whole bunch of Eastern and North Europe languages. It really look like both lib complete each others.
Well we could do the opposite, I don't mind.
No it doesn't. Just to be sure, I even tested right now by rebuilding after disabling enca. You can still add --sub-codepage=uchardet to the command line (and it works), and the default behavior becomes just checking directly though uchardet. So what do you want to do? |
Did you test this?
Yes, but not with the defaults. |
Just wanna comment that GB18030 (also called CP54936 by Microsoft, including 27484 Chinese characters) indeed is a superset of GBK (also called CP936 by Microsoft, including 21886 characters) and GB2312 (7445 characters) and compatible with both. So subtitles encoded in GBK or GB2312 (not very likely) can both be decoded properly under the definition of GB18030. |
No because I don't have subtitles from any of these countries in any of these encodings. I will see if I can build one quickly by grabing some text over Wikipedia.
It did work. I disabled enca, then ran mpv with a EUC-KR subtitle and no command line option, and uchardet detected the encoding. |
OK, I see now. |
Ok I created and tested a few files with some Hungarian in ISO-8859-2 and Estonian in both ISO-8859-4 and CP1257. Well ENCA failed. It has never been able to detect ISO-8859-2 without the hint language; and any of the Estonian files, this time even with hint language, it would not detect the right encoding (if any at all). Now thinking back about it, that's probably normal. All these encoding are on 8-bit, which make them quite difficult (if sometimes possible at all) to differentiate. And the doc actually says:
So that makes me think that indeed ENCA seems useless as a default.
So that makes me think that someone with an Hungarian locale for instance would maybe get the right encoding with one's locale being used as language hint. |
Also I don't think I'll do much more tests. I already spent a little too much for what I wanted. But after all these tests, it seems that ENCA may not be that useful without a language hint (though once again I haven't tested all cases and especially whether it really uses the user's locale as a hint or not, if none was specified). So if you still feel like dropping ENCA as a default, I will do the change. :-) |
Yep. |
If mpv is not built with uchardet, "enca" is still the fallback default encoding detection.
Done. |
Thanks. Merged as e7897df. |
Hi,
I understand that ENCA should still stay the default detection algorithm for the time being (even with uchardet built now in #908).
And actually checking the list of encoding supported by ENCA, even though has much more, there still seems to be some not supported by uchardet (for instance I can't find CP1257/Windows-1257 in uchardet list, unless it has other naming).
But I propose this improvement with the default detection: if ENCA fails, then uchardet is tried.
I tried with some Korean subtitle and it worked great: ENCA not being able to recognize the encoding, it returned NULL and uchardet took the relay, and got the right encoding! :-)