stream_libarchive: workaround various types of locale braindeath

Fix that libarchive fails to return filenames for UTF-8/UTF-16 entries.
The reason is that it uses locales and all that garbage, and mpv does
not set a locale.

Both C locales and wchar_t are shitfucked retarded legacy braindeath. If
the C/POSIX standard committee had actually competent members, these
would have been deprecated or removed long ago. (I mean, they managed to
remove gets().) To justify this emotional outbreak potentially insulting
to unknown persons, I will write a lot of text. Those not comfortable
with toxic language should pretend this is a religious text.

C locales are supposed to be a way to support certain languages and
cultures easier. One example is character codepages. Back when UTF-8
was not invented yet, there were only 255 possible characters, which is
not enough for anything but English and some European languages. So they
decided to make the meaning of a character dependent on the current
codepage. The locale (LC_CTYPE specifically) determines what character
encoding is currently used.

Of course nowadays, this is legacy nonsense. Everything uses UTF-8 for
"char", and what doesn't is broken and terrible anyway. But the old ways
stayed with us, and the stupidity of it as well.

C locales were utterly moronic even when they were invented. The locale
(via setlocale()) is global state, and global state is not a reasonable
way to do anything. It will break libraries, or well modularized code.
(The latter would be forced to strictly guard all entrypoints with
set/restore locale calls, assuming a single threaded world.)

On top of that, setting a locale randomly changes the semantics of a
bunch of standard functions. If a function respects locale, you suddenly
can't rely on it to behave the same on all systems. Some behavior can
come as a surprise, and of course it will be dependent on the region of
the user (it doesn't help that most software is US-centric, and the US
locale is almost like the C locale, i.e. almost what you expect).

Idiotically, locales were not just used to define the current character
encoding, but the concept was used for a whole lot of things, e.g.
whether numbers should use "," or "." as decimal separator. The latter
issue is actually much worse, because it breaks basic string conversion
or parsing of numbers for the purpose of interacting with file formats
and such.
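
As a small illustration of the decimal separator issue, here is a sketch
that asks the C library which separator is active right now and formats
a number with it. The helper names are made up for this example. In the
default "C" locale the result is "1.5"; after a
setlocale(LC_NUMERIC, ...) call with a locale that uses a comma, the
same snprintf() call would produce "1,5" instead.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: returns the decimal separator that printf() and
 * strtod() use under the currently active locale. */
static const char *current_decimal_point(void)
{
    return localeconv()->decimal_point;
}

/* Hypothetical helper: formats v with one fractional digit. The output
 * depends on LC_NUMERIC, i.e. on global state that any code in the
 * process may have changed. */
static void format_number(char *buf, size_t size, double v)
{
    snprintf(buf, size, "%.1f", v);
}
```

Nothing in format_number() mentions locales, which is exactly why this
kind of breakage is easy to miss when writing file formats or shaders.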

Much can be said about how retarded locales are, even beyond what I just
wrote, or will write below. They are so hilariously misdesigned and
insufficient, I can't even fathom how this shit was _standardized_. (In
any case, that meant everyone was forced to implement it.) Many C
functions can't even do it correctly. For example, the character set
encoding can be a multibyte encoding (not just UTF-8, but awful garbage
like Shift JIS (sometimes called SHIT JIZZ)), yet functions like
toupper() can return only 1 byte. Or just take the fact that the locale
API tries to define standard paper sizes (LC_PAPER) or telephone number
formatting (LC_TELEPHONE). Who the fuck uses this, or would ever use
this?

But the badness doesn't stop here. At some point, they invented threads.
And they put absolutely no thought into how threads should interact with
locales. So they kept locales as global state. Because obviously, you
want to be able to change the semantics of basic string processing
functions _while_ they're running, right? (Any thread can call
setlocale() at any time, and it's supposed to change the locale of all
other threads.)

At this point, how the fuck are you supposed to do anything correctly?
You can't even temporarily switch the locale with setlocale(), because
it would asynchronously fuckup the other threads. All you can do is to
enforce a convention not to set anything but the C locale (this is what
mpv does), or to duplicate standard functions using code that doesn't
query locale (this is what e.g. libass does, a close dependency of mpv).

Imagine they had done this for certain other things. Like errno, with
all the brokenness of the locale API. This simply wouldn't have worked,
shit would just have been too broken. So they didn't. But locales give a
delicious sweet spot of brokenness, where things are broken enough to
cause neverending pain, but not broken enough that enough effort would
have been spent to fix it completely.

On that note, standard C11 actually can't stringify an error value. It
does define strerror(), but it's not thread safe, even though C11
supports threads. The idiots could just have defined it to be thread
safe. Even if your libc is horrible enough that it can't return string
literals, it could just use some thread local buffer. Because C11 does
define thread local variables. But hey, why care about details, if you
can just create a shitty standard?

(POSIX defines strerror_r(), which "solves" this problem, while still
not making strerror() thread safe.)

Anyway, back to threads. The interaction of locales and threads makes no
sense. Why would you make locales process global? Who even wanted it to
work this way? Who decided that it should keep working this way, despite
being so broken (and certainly causing implementation difficulties in
libc)? Was it just a fucked up psychopath?

Several decades later, the moronic standard committees noticed that this
was (still is) kind of a bad situation. Instead of fixing the situation,
they added more garbage on top of it. (Probably for the sake of
"compatibility"). Now there is a set of new functions, which allow you
to override the locale for the current thread. This means you can
temporarily override and restore the locale on all entrypoints of your
code (like you could with setlocale(), before threads were invented).

And of course not all operating systems or libcs implement this. For
example, I'm pretty sure Microsoft doesn't. (Microsoft got to fuck it up
as usual, and only provides _configthreadlocale(). This is shitfucked on
its own, because it's GLOBAL STATE to configure that GLOBAL STATE should
not be GLOBAL STATE, i.e. completely broken garbage, because it requires
agreement over all modules/libraries what behavior should be used. I
mean, sure, making setlocale() affect only the current thread would have
been the reasonable behavior. Making this behavior configurable isn't,
because you can't rely on what behavior is active.)

POSIX showed some minor decency by at least introducing some variations
of standard functions, which have a locale argument (e.g. toupper_l()).
You just pass the locale which you want to be used, and don't have to do
the set locale/call function/restore locale nonsense. But OF COURSE they
fucked this up too. In no less than 2 ways:

- There is no statically available handle for the C locale, so you have
  to initialize and store it somewhere, which makes it harder to make
  utility functions safe, that call locale-affected standard functions
  and expect C semantics. The easy solution, using pthread_once() and a
  global variable with the created locale, will not be easily accepted
  by pedantic assholes, because they'll worry about allocation failure,
  or leaking the locale when using this in library code (and then
  unloading the library). Or you could have complicated library
  init/uninit functions, which bring a big load of their own mess.
  Same for automagic DLL constructors/destructors.
- Not all functions have a variant that takes a locale argument, and
  they missed even some important ones, like snprintf() or strtod() WHAT
  THE FUCK WHAT THE FUCK WHAT THE FUCK WHAT THE FUCK WHAT THE FUCK WHAT
  THE FUCK WHAT THE FUCK WHAT THE FUCK WHAT THE FUCK
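
A minimal sketch of the "easy solution" from the first point, assuming a
POSIX.1-2008 system (glibc or musl style headers; build with -pthread
where the libc needs it). The names c_locale_once, c_locale and
c_toupper are hypothetical. The locale object is created once and never
freed, which is precisely the leak discussed above.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <locale.h>
#include <pthread.h>

static pthread_once_t c_locale_once = PTHREAD_ONCE_INIT;
static locale_t c_locale; /* stays (locale_t)0 if creation failed */

static void init_c_locale(void)
{
    /* "C" always exists, so this can only fail on allocation failure. */
    c_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
}

/* toupper() with C semantics, no matter what the global or per-thread
 * locale is currently set to. */
static int c_toupper(int ch)
{
    pthread_once(&c_locale_once, init_c_locale);
    if (!c_locale) /* fallback: hand-written ASCII-only mapping */
        return (ch >= 'a' && ch <= 'z') ? ch - 'a' + 'A' : ch;
    return toupper_l(ch, c_locale);
}
```

The same trick does not help for snprintf() or strtod(), since POSIX
defines no variants of them that take a locale argument.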

I would like to know why it took so long to standardize a half-assed
solution, that, apart from being conceptually half-assed, is even
incomplete and insufficient. The obvious way to fix this would have
been:

- deprecate the entire locale API and their use, and make it a NOP
- make UTF-8 the standard character type
- make the C locale behavior the default
- add new APIs that explicitly take locale objects
- provide an emulation layer, that can be used to transparently build
  legacy code without breaking them

But this wouldn't have been "compatible", and the apparently incompetent
standard committees would have never accepted this. As if anyone
actually used this legacy garbage, except other legacy garbage. Oh yeah,
and let's care a lot about legacy compatibility, and let's not care at
all about modern code that either has to suffer from this, or subtly
breaks when the wrong locales are active.

Last but not least, the UTF-8 locale name is apparently not even
standardized. At the moment I'm trying to use "C.UTF-8", which is
apparently glibc _and_ Debian specific. Got to use every opportunity to
make correct usage of UTF-8 harder. What luck that this commit is only
for some optional relatively obscure mpv feature.

Why is the C locale not UTF-8? Why did POSIX not standardize an UTF-8
locale? Well, according to something I heard a few years ago, they're
considering disallowing UTF-8 as locale, because UTF-8 would violate
certain invariants expected by C or POSIX. (But I'm not sure if I
remember this correctly - probably better not to rage about it.)

Now, on to libarchive.

libarchive intentionally uses the locale API and all the broken crap
around it to "convert" UTF-8 or UTF-16 (as contained in reasonably sane
archive formats) to "char*". This is a good start!

Since glibc does not think that the C locale uses UTF-8, this fails for
mpv. So trying to use archive_entry_pathname() to get the archive entry
name fails if the name contains non-ASCII characters.

Maybe use archive_entry_pathname_utf8()? Surely that should return
UTF-8, since its name seems to indicate that it returns UTF-8. But of
fucking course it doesn't! libarchive's horribly convoluted code (that
is full of locale API usage and other legacy shit, as well as ifdefs and
OS specific code, including Windows and fucking Cygwin) somehow fucks up
and fails if the locale is not set to UTF-8. I made a PR fixing this in
libarchive almost 2 years ago, but it was ignored.

So, would archive_entry_pathname_w() as fallback work? No, why would it?
Of course this _also_ involves shitfucked code that calls shitfucked
standard functions (or OS specific ifdeffed shitfuck). The truth is that
at least glibc changes the meaning of wchar_t depending on the locale.
Unlike most people think, wchar_t is not standardized to be an UTF
variant (or even unicode) - it's an encoding that uses basic units that
can be larger than 8 bit. It's an implementation defined thing. Windows
defines it to 2 bytes and UTF-16, and glibc defines it to 4 bytes and
UTF-32, but only if an UTF-8 locale is set (apparently).

Yes. Every libarchive function dealing with strings has 3 variants:
plain, _utf8, and _w. And none of these work if the locale is not set.
I cannot fathom why they even have a wchar_t variant, because it's
redundant and fucking useless for any modern code.

Writing a UTF-16 to UTF-8 conversion routine is maybe 3 pages of code,
or a few lines if you use iconv. But libarchive uses all this glorious
bullshit, and ends up with 3 not working API functions, and with over
4000 lines of its own string abstraction code with gratuitous amounts of
ifdefs and OS dependent code that breaks in a fairly common use case.
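
For scale, a hand-rolled UTF-16 to UTF-8 conversion can be sketched as
below. This is an illustration only, not code from mpv or libarchive.
The function name and its error convention are assumptions, the input
is taken as native-endian code units, and unpaired surrogates are
replaced with U+FFFD.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts n UTF-16 code units to NUL-terminated UTF-8 in out (capacity
 * cap bytes). Returns the number of bytes written, not counting the
 * NUL, or -1 if the output does not fit. */
static long utf16_to_utf8(const uint16_t *in, size_t n, char *out, size_t cap)
{
    size_t o = 0;
    if (cap == 0)
        return -1;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* valid surrogate pair: combine into one code point */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* lone surrogate */
        }
        size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + len + 1 > cap)
            return -1;
        if (len == 1) {
            out[o++] = (char)cp;
        } else if (len == 2) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (len == 3) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return (long)o;
}
```

No locale, no wchar_t, no ifdefs. An iconv based version would be
shorter still.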

So what we do is:

- Use the idiotic POSIX 2008 API (uselocale() etc.) (Too bad for users
  who try to build this on a system that doesn't have these - hopefully
  none are left in 2017. But if there are, torturing them with obscure
  build errors is probably justified. Might be bad for Windows though,
  which is a very popular platform except on phones.)
- Use the "C.UTF-8" locale, which is probably not 100% standards
  compliant, but works on my system, so it's fine.
- Guard every libarchive call with uselocale() + restoring the locale.
- Be lazy and skip some libarchive calls. Look forward to the unlikely
  and astonishingly stupid bugs this could produce.
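
The guard pattern from the list above can be sketched generically as
follows, assuming a POSIX.1-2008 system. call_with_locale, guarded_fn
and is_current_locale are hypothetical names; the actual commit simply
repeats the uselocale() pair inline around each libarchive call.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>

/* Hypothetical stand-in for any locale-sensitive library call. */
typedef int (*guarded_fn)(void *arg);

/* Runs fn(arg) with loc installed as the locale of the calling thread
 * only, then restores whatever was active before. If the first
 * uselocale() fails and returns (locale_t)0, the second call is just a
 * harmless query, so nothing gets corrupted. */
static int call_with_locale(locale_t loc, guarded_fn fn, void *arg)
{
    locale_t old = uselocale(loc);
    int r = fn(arg);
    uselocale(old);
    return r;
}

/* Example callback: reports whether the locale handle passed via arg is
 * the one currently active on this thread. */
static int is_current_locale(void *arg)
{
    return uselocale((locale_t)0) == *(locale_t *)arg;
}
```

Other threads never see the switch, which is the whole point of using
uselocale() instead of setlocale().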

We could also just set a C UTF-8 locale in main (since that would have no
known negative effects on the rest of the code), but this won't work for
libmpv.

We assume that uselocale() never fails. In an unexplainable stroke of
luck, POSIX made the semantics of uselocale() nice enough that user code
can ignore failures without introducing crash or security bugs, even if
there should be an implementation fucked up enough where it's actually
possible that uselocale() fails even with valid input.

With all this shitty ugliness added, it finally works, without fucking
up other parts of the player. This is still less bad than that time when
libquvi fucked up OpenGL rendering, because calling a libquvi function
would load some proxy abstraction library, which in turn loaded a KDE
plugin (even if KDE was not used), which in turn called setlocale()
because Qt does this, and consequently made the mpv GLSL shader
generation code emit "," instead of "." for numbers, and of course only
for users who had that KDE plugin installed, and lived in a part of the
world where "." is not used as decimal separator.

All in all, I believe this proves that software developers as a whole
and as a culture produce worse results than drug addicted butt fucked
monkeys randomly hacking on typewriters while inhaling the fumes of a
radioactive dumpster fire fueled by Chinese plastic toys for children
and Elton John/Justin Bieber crossover CDs for all eternity.
wm4
wm4 committed Nov 12, 2017
1 parent 9b5d062 commit 1e70e82baa9193f6f027338b0fab0f5078971fbe
Showing with 36 additions and 4 deletions.
  1. +33 −4 stream/stream_libarchive.c
  2. +3 −0 stream/stream_libarchive.h
@@ -150,6 +150,8 @@ static bool mp_archive_check_fatal(struct mp_archive *mpa, int r)
void mp_archive_free(struct mp_archive *mpa)
{
mp_archive_close(mpa);
+ if (mpa && mpa->locale)
+ freelocale(mpa->locale);
talloc_free(mpa);
}
@@ -229,14 +231,20 @@ static bool add_volume(struct mp_log *log, struct mp_archive *mpa,
vol->mpa = mpa;
vol->src = src;
vol->url = talloc_strdup(vol, url);
- return archive_read_append_callback_data(mpa->arch, vol) == ARCHIVE_OK;
+ locale_t oldlocale = uselocale(mpa->locale);
+ bool res = archive_read_append_callback_data(mpa->arch, vol) == ARCHIVE_OK;
+ uselocale(oldlocale);
+ return res;
}
struct mp_archive *mp_archive_new(struct mp_log *log, struct stream *src,
int flags)
{
struct mp_archive *mpa = talloc_zero(NULL, struct mp_archive);
mpa->log = log;
+ mpa->locale = newlocale(LC_ALL_MASK, "C.UTF-8", (locale_t)0);
+ if (!mpa->locale)
+ goto err;
mpa->arch = archive_read_new();
mpa->primary_src = src;
if (!mpa->arch)
@@ -256,6 +264,8 @@ struct mp_archive *mp_archive_new(struct mp_log *log, struct stream *src,
}
talloc_free(volumes);
+ locale_t oldlocale = uselocale(mpa->locale);
+
archive_read_support_format_7zip(mpa->arch);
archive_read_support_format_iso9660(mpa->arch);
archive_read_support_format_rar(mpa->arch);
@@ -275,7 +285,11 @@ struct mp_archive *mp_archive_new(struct mp_log *log, struct stream *src,
archive_read_set_close_callback(mpa->arch, close_cb);
if (mpa->primary_src->seekable)
archive_read_set_seek_callback(mpa->arch, seek_cb);
- if (archive_read_open1(mpa->arch) < ARCHIVE_OK)
+ bool fail = archive_read_open1(mpa->arch) < ARCHIVE_OK;
+
+ uselocale(oldlocale);
+
+ if (fail)
goto err;
return mpa;
@@ -295,6 +309,9 @@ bool mp_archive_next_entry(struct mp_archive *mpa)
if (!mpa->arch)
return false;
+ locale_t oldlocale = uselocale(mpa->locale);
+ bool success = false;
+
while (!mp_cancel_test(mpa->primary_src->cancel)) {
struct archive_entry *entry;
int r = archive_read_next_header(mpa->arch, &entry);
@@ -319,10 +336,13 @@ bool mp_archive_next_entry(struct mp_archive *mpa)
mpa->entry = entry;
mpa->entry_filename = talloc_strdup(mpa, fn);
mpa->entry_num += 1;
- return true;
+ success = true;
+ break;
}
- return false;
+ uselocale(oldlocale);
+
+ return success;
}
struct priv {
@@ -344,9 +364,11 @@ static int reopen_archive(stream_t *s)
struct mp_archive *mpa = p->mpa;
while (mp_archive_next_entry(mpa)) {
if (strcmp(p->entry_name, mpa->entry_filename) == 0) {
+ locale_t oldlocale = uselocale(mpa->locale);
p->entry_size = -1;
if (archive_entry_size_is_set(mpa->entry))
p->entry_size = archive_entry_size(mpa->entry);
+ uselocale(oldlocale);
return STREAM_OK;
}
}
@@ -362,6 +384,7 @@ static int archive_entry_fill_buffer(stream_t *s, char *buffer, int max_len)
struct priv *p = s->priv;
if (!p->mpa)
return 0;
+ locale_t oldlocale = uselocale(p->mpa->locale);
int r = archive_read_data(p->mpa->arch, buffer, max_len);
if (r < 0) {
MP_ERR(s, "%s\n", archive_error_string(p->mpa->arch));
@@ -370,6 +393,7 @@ static int archive_entry_fill_buffer(stream_t *s, char *buffer, int max_len)
p->mpa = NULL;
}
}
+ uselocale(oldlocale);
return r;
}
@@ -378,7 +402,9 @@ static int archive_entry_seek(stream_t *s, int64_t newpos)
struct priv *p = s->priv;
if (!p->mpa)
return -1;
+ locale_t oldlocale = uselocale(p->mpa->locale);
int r = archive_seek_data(p->mpa->arch, newpos, SEEK_SET);
+ uselocale(oldlocale);
if (r >= 0)
return 1;
if (mp_archive_check_fatal(p->mpa, r)) {
@@ -404,15 +430,18 @@ static int archive_entry_seek(stream_t *s, int64_t newpos)
return -1;
int size = MPMIN(newpos - s->pos, sizeof(buffer));
+ oldlocale = uselocale(p->mpa->locale);
r = archive_read_data(p->mpa->arch, buffer, size);
if (r < 0) {
MP_ERR(s, "%s\n", archive_error_string(p->mpa->arch));
+ uselocale(oldlocale);
if (mp_archive_check_fatal(p->mpa, r)) {
mp_archive_free(p->mpa);
p->mpa = NULL;
}
return -1;
}
+ uselocale(oldlocale);
s->pos += r;
}
}
@@ -1,6 +1,9 @@
+#include <locale.h>
+
struct mp_log;
struct mp_archive {
+ locale_t locale;
struct mp_log *log;
struct archive *arch;
struct stream *primary_src;

28 comments on commit 1e70e82

garoto Nov 12, 2017

slowclap.gif

1DC Nov 12, 2017

Is this some kind of record?

lachs0r Nov 12, 2017

Member

Forgot to mention the LADSPA filters invading programs via libasound and breaking mpv’s OpenGL renderer. For some reason many LADSPA filters call setlocale()…
The fun part is that this can happen even if you don’t use them directly because the ALSA LADSPA plugin would scan the LADSPA path for plugins, some of which do this as soon as they’re loaded.

Qix- Nov 12, 2017

I love you @wm4.

lu-zero Nov 12, 2017

Contributor

I wonder when you'll fork libarchive and make it 1/4 of the size by removing this whole mess.

wm4 Nov 12, 2017

Contributor

Never. Anyway, I gave up hoping libarchive would fix this properly.

mahkoh Nov 12, 2017

You left out my favorite fun fact:

#include <stdio.h>
#include <locale.h>
#include <string.h>

#define p(f) printf(#f "(퍼, 흐) = %d\n", f("퍼", "흐"))
 
int main(void) {
	setlocale(LC_COLLATE, "en_US.UTF-8");
 
	p(strcmp); // strcmp(퍼, 흐) = -1
	p(strcoll); // strcoll(퍼, 흐) = 0
}

strcoll is strcmp that respects the locale. This will fuck up your library if you try to sort titles in the user's locale.

This only happens in the glibc implementation because POSIX forgot to specify that strings should have a total order in all locales. https://sourceware.org/bugzilla/show_bug.cgi?id=18927

haasn Nov 12, 2017

Member

Couldn't have possibly been put better. Seriously, fuck C.

I eagerly await the day an alternative starts existing.

Stargateur Nov 12, 2017

@haasn Rust ?

By the way, "and lived in a part of the world where "." is not used as decimal separator.", https://en.wikipedia.org/wiki/Decimal_mark#/media/File:DecimalSeparator.svg. I think comma win ^^ but this is not important.

infinity0 Nov 12, 2017

Epic shit man. One minor correction:

Windows defines it to 2 bytes and UTF-16

It's UCS-2 not UTF-16, i.e. UTF-16 without surrogate pairs, and the APIs dealing with it don't enforce UTF-16 validity. http://unicode.org/faq/utf_bom.html#utf16-11

The More You Know.

Fuck C and POSIX.

retep998 Nov 12, 2017

Windows uses WTF-16, not UCS-2 nor UTF-16. It is a superset of UTF-16, which supports surrogate pairs just fine, but it also allows lone surrogates. As long as you stick to the WTF-16 part of Windows, you have a single consistent encoding and everything is fine. You can even convert it losslessly to and from WTF-8 which is a superset of UTF-8. Rust handles this with OsStr/Path and for the most part things work out really well.

But yeah, try to work with any C library that ventures outside this wide world into the land of narrow encoding and everything falls apart. The system encoding on Windows is never UTF-8. Even if you set the console code page to UTF-8, if you try to read a multibyte character sequence it will fail. You have to stick to WTF-16 or else badness will ensue.

infinity0 Nov 13, 2017

I don't want to get technical but I did waste too much time on this bullshit topic so I might as well correct mistakes that I spot. WTF-16 is a neologism invented here, UCS-2 is the older term and it's not correct to say WTF-16 is "not UCS-2". Windows wchar_t wide strings "supports" surrogate pairs in the sense that it leaves them alone and it doesn't mess with them, but it doesn't interpret them as codepoints - you have to explicitly convert them to a "multibyte string" for that. Windows low-level system APIs only use wide strings (i.e. wchar_t, UCS-2, WTF-16, whatever). The confusion is furthered by the fact that lots of online docs call this "UTF-16" including Wikipedia. The Microsoft term is "wide string" and this refers to wchar_t 16-bit string, uninterpreted and unencoded, with a (implicit) 1-to-1 mapping between 16-bit chars and Unicode codepoints.

edit: actually even some Microsoft docs call this "UTF-16" but this is wrong at least in the context of the low-level system APIs, since they don't perform decoding nor validation. For example wcslen returns the length of the string in 16-bit units, without trying to decode any surrogate pairs.

richfelker Nov 13, 2017

This is going to fail on most systems, pretty much anything but musl or recent (last couple years?) glibc, I think:

mpa->locale = newlocale(LC_ALL_MASK, "C.UTF-8", (locale_t)0);

I might suggest instead something like:

mpa->locale = newlocale(LC_ALL_MASK - LC_CTYPE_MASK, "C", (locale_t)0);

This will make a locale object that's "C" (guaranteed to exist; won't fail except for possible OOM) in all categories but LC_CTYPE, and matches the default locale (determined by environment or system default) in LC_CTYPE. This should be fine unless you're depending on not having any wacky locale-specific case mappings. If that's a problem, you could instead call newlocale again on the result to try replacing LC_CTYPE with various known "benign" UTF-8 locales like "C.UTF-8", "en_US.UTF-8", etc. that might exist on the system.

FYI, musl has a special case for newlocale where LC_CTYPE is C.UTF-8 and everything else is C; it's statically allocated and thus can't fail even if there's no memory.
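
The second suggestion (replacing LC_CTYPE with a known UTF-8 locale) might be sketched like this. The function name and the candidate list are assumptions for illustration, and unlike the snippet above the base object here is plain "C" in every category. The loop also assumes that a failed newlocale() leaves `base` usable, which appears to hold on glibc and musl but is worth checking on other systems.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns a locale object that is "C" everywhere,
 * with LC_CTYPE swapped for the first UTF-8 locale the system provides.
 * Falls back to plain "C" if none exists. Returns (locale_t)0 only if
 * even "C" cannot be created (out of memory). The caller owns the
 * result and releases it with freelocale(). */
static locale_t make_utf8_ctype_locale(void)
{
    static const char *const candidates[] = { "C.UTF-8", "en_US.UTF-8" };
    locale_t base = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (!base)
        return (locale_t)0;
    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        /* On success the returned handle replaces base; on failure the
         * code assumes base is untouched and tries the next name. */
        locale_t loc = newlocale(LC_CTYPE_MASK, candidates[i], base);
        if (loc)
            return loc;
    }
    return base;
}
```

With the plain "C" fallback, non-ASCII entry names would presumably still fail to convert, so a caller may want to log which case it ended up in.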

heftig Nov 13, 2017

GlibC upstream still does not have a C.UTF-8 locale, so this will break on Arch Linux and other distros that use vanilla GlibC.

Last time I checked, this locale was not accepted because it has a broken collation that sorts all characters > 0xFFFF between 0x0 and 0x1.

smcameron Nov 13, 2017

You can sort of accomplish this:

  • deprecate the entire locale API and their use, and make it a NOP
  • make the C locale behavior the default

by overriding setlocale like so: https://github.com/smcameron/space-nerds-in-space/blob/3337de7428cb79ab84c82561d1b1dcd3af10a6dc/c-is-the-locale.c

Takes care of the multithreaded case, and the case that whatever stupid libraries that keep calling setlocale all the time (gtk, for example) always get "C" no matter what they ask for.

krytarowski Nov 13, 2017

uselocale() is absent in NetBSD... time to implement it.

CounterPillow Nov 13, 2017

Contributor

@Stargateur
autistic screeching

Fr0sT-Brutal Nov 13, 2017

Very emotional but why locales are blamed? Most of troubles come from their improper usage.

  • One must NEVER change a locale in application. It is user's choice. Just leave it to him.
  • One must ALWAYS use invariant locale for conversion of data that intended to be processed by a software.
  • Let the data being displayed to user use current system locale. Nobody except Americans will be glad to see that stupid MM/DD/YYYY date format. Really.

CounterPillow Nov 13, 2017

Contributor

Very emotional but why locales are blamed? Most of troubles come from their improper usage.

because locales and the libc in general make improper usage very widespread, which means that even code you do not control will affect global state, which is precisely what the problem here is.

Boy all these "experts" from reddit sure are embarrassing themselves here.

haasn Nov 13, 2017

Member

Very emotional but why locales are blamed?

Because there's no way to do this:

One must ALWAYS use invariant locale for conversion of data that intended to be processed by a software.

In a sane way.

Also, in this particular case, “Most of troubles come from their improper usage.” might be true but that doesn't stop libarchive being a pile of shit.

rossy Nov 13, 2017

Member

@infinity0 It's more complicated than that. A lot of Windows APIs really do use UTF-16 strings. For example, if you call the Unicode version of SetWindowText with a UTF-16 string that contains supplementary plane characters (and you have the proper fonts installed), you should see them in the window's title bar.

That a number of APIs are unaware of supplementary plane characters or can't accept surrogate pairs by design (like IsCharUpper) is just due to the fact that a lot of these APIs were designed before Unicode 2.0, and even afterwards, not all developers knew that Unicode was no longer a 16-bit encoding. Still, to say Windows strings aren't supposed to be interpreted as UTF-16 is misleading, since they often are.

The filesystem APIs don't validate UTF-16 strings, but I don't think that matters much. Linux does not require filenames to be encoded in the system codepage either. Linux treats filenames as an uninterpreted sequence of 8-bit code units, and Windows treats filenames as an uninterpreted sequence of 16-bit code units, which is good in a way, because programs don't need to have complicated logic to know what is a legal filename.

When filenames are displayed to the user, Linux will (usually) interpret them in the current codepage (often UTF-8), and Windows will interpret them as UTF-16. Yes, good Windows filesystem code should allow round-tripping of filenames with ill-formed UTF-16, but it's the same on Linux, where good filesystem code should allow round-tripping of filenames with ill-formed UTF-8. (Rust's WTF-8 encoding is a good way to do this in cross-platform code, but mpv does not use it at the moment.)

And yes, wcslen returns the number of code units, not codepoints in the string, but that's not uncommon in programming languages. You don't often need to know the length of a string in code points, which are not the same thing as user-perceived characters, but you often need to know the length of a string in code units for memory allocation or low-level string manipulation.
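
To make the difference between code units and code points concrete, here is a small sketch. It uses `uint16_t` rather than `wchar_t` so that it behaves the same on every platform, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit code units before the terminating 0
 * (what wcslen() reports when wchar_t is 16 bits wide). */
size_t utf16_code_units(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of code points: a high surrogate followed by a low
 * surrogate counts as one; an unpaired surrogate also counts
 * as one on its own. */
size_t utf16_code_points(const uint16_t *s)
{
    size_t count = 0;
    for (size_t i = 0; s[i] != 0; i++) {
        int is_high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        /* s[i + 1] is at worst the terminator, so this read is safe */
        int next_is_low = s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (is_high && next_is_low)
            i++;            /* skip the low half of the pair */
        count++;
    }
    return count;
}
```

For the string `{ 'a', 0xD852, 0xDF62, 'b', 0 }` the first function yields 4 and the second yields 3. The first number is the one needed when sizing a buffer.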

jchv Nov 14, 2017

@rossy
I think at this point we're down to semantics. The point is that the Windows APIs accept UCS-2 and treat it as UTF-16 for display purposes. Windows has dedicated UCS-2 "Wide" APIs, but Linux really only has one API for anything that deals with locale (and anything that doesn't.) Treating filenames as raw bytes works well in this system because the encoding can actually be different on disk. If I'm not mistaken, some kind of truly masochistic person could have Shift-JIS-encoded filenames right next to UTF-8, and it'd work right so long as your locale is set properly when accessing the files.

But in my opinion there is eventually a difference worth noting. What happens when the data comes back? If the SetWindowTextW API were truly UTF-16, you'd expect that the invalid surrogate pairs would be replaced with Unicode replacement characters. After all, this HAS to be done to even display the title - which Windows does do for display purposes. Alas, if I do... (please excuse the crappiness)

   wchar_t brokenSurrogatePairs[5] = { 0xD852, 0xDF62, 0xDF62, 0xD852, 0 };
   wchar_t returnData[5] = { 0 };
   SetWindowTextW(hWnd, brokenSurrogatePairs);
   GetWindowTextW(hWnd, returnData, 5);

Here 0xD852, 0xDF62 forms the (valid) codepoint 𤭢 (U+24B62) and the remaining 2 words are the same two surrogates in reverse order. Windows renders it more or less how you might expect:

image

But returnData yields a (unsurprising?) result:
image

So it treats it as a bag of words, like you said.

This may be for purely legacy reasons, or maybe they just feel it's better to treat it as a bag of words. But the same invariant does not hold for the ANSI/locale-based APIs, which can lose information. A lot of arguments could be made here. But as far as I can tell, this API's behavior exhibits absolutely zero UTF-16 awareness; the end result on screen will display valid UTF-16, but the API itself doesn't do anything, meaning that both on the input and output side, invalid UTF-16 can be accepted and emitted.

To me the API itself is not UTF-16. It is cool that Windows will display it like UTF-16, but UTF-16 is a variable-length encoding and Windows doesn't treat it the same way it would treat other variable-length encodings. To me this is no more a UTF-16 API than fopen on Linux is a UTF-8 API.

I realize this is all horridly pedantic, but after keeping all of my frustrations about locale and text encoding in operating systems bottled up, it feels relieving to have expressed all of these thoughts. Rant over.

Footnote: Basically to be completely clear, what I'd really expect here is for Windows to error out if you pass in invalid UTF-16, and therefore it would never accept nor emit invalid UTF-16. That's what a 'UTF-16' API is to me. Otherwise it's just an array of 16-bit words that eventually gets treated as UTF-16.
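
The kind of check such a strict API would need before accepting or returning a string can be sketched as follows. This is an illustration only, not something Windows does, and `utf16_is_well_formed` is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the n code units in s are well-formed UTF-16,
 * i.e. every high surrogate is immediately followed by a low
 * surrogate and no low surrogate appears on its own. */
bool utf16_is_well_formed(const uint16_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {          /* high surrogate */
            if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return false;                      /* unpaired high */
            i++;                                   /* consume the pair */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;                          /* stray low surrogate */
        }
    }
    return true;
}
```

Applied to the array from the snippet above, the first two words pass, while the full four-word sequence is rejected at the stray 0xDF62.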

infinity0 Nov 14, 2017

@jchv Right, that is exactly the situation as I understand it.

Essentially, it's an API which exposes u16[], you always pass in a u16[], and for some fraction of the provided function calls (generally UI-related or locale-related stuff) they effectively do an internal UTF-16 decode/validation. It's simply not useful to call the u16[] "UTF-16", the better way of describing it is "u16[] but certain API calls transparently decode it as UTF-16 if they really have to".

kilobyte Nov 21, 2017

You're using way too kind words, and miss a lot of retardness. For example:

  • wchar_t is 16 bits on Windows (and in languages Sun inflicted upon us), forcing us to rewrite everything ourselves
  • iswfoo() functions don't work in glibc in the C/POSIX locale despite working in everything else, even en_US.ISO-8859-1 (the latter knows about characters past U+00FF despite having no non-wide representation)
  • same for wcwidth() and friends
    • collation files for C.UTF-8 in glibc are massive despite collation being same as C (ie, compare either bytes or codepoints without any tables, it's equivalent for all legal strings)
  • in Turkic locales, stricmp('i', 'I') fails
  • the whole concept of "title case", for the snowflake of ǲ (plus accents) which isn't even used by any languages with this digraph ('D' 'z' works well) and some supposedly Ancient Greek combinations where capitals are "title case" with no actual "capital" (Ancient Greek had no lowercase, much less title case)

A great post, though.

krytarowski Nov 22, 2017

wchar_t is not just Windows, but also NetBSD and FreeBSD.

kilobyte Nov 22, 2017

@krytarowski: just checked, NetBSD and FreeBSD have sizeof(wchar_t) = 4. It's only Windows that's a special snowflake that violates the C standard:

wchar_t
which is an integer type whose range of values can represent distinct codes for all
members of the largest extended character set specified among the supported locales

But it takes just a single snowflake that you can't ignore...
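
A quick way to see which camp a platform falls into is to compare `WCHAR_MAX` with the largest Unicode code point. This is a sketch that treats Unicode as the "largest extended character set", which is an assumption about the supported locales, and the function name is hypothetical.

```c
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>

/* Returns 1 if a single wchar_t can hold every Unicode code point
 * (up to U+10FFFF), 0 otherwise. */
int wchar_t_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

With a 4-byte `wchar_t` this returns 1; with a 16-bit `wchar_t` it returns 0, and anything outside the Basic Multilingual Plane needs two units.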

krytarowski Nov 22, 2017

Ah right.. I mixed something up.

z5122495 Apr 13, 2018

You're just wrong.
