Operating systems

write an intro for all OS?

Windows

Since Windows 2000, Windows offers a nice Unicode API and supports non-BMP characters <bmp>. It uses Unicode strings <str> implemented as :cwchar_t* strings (LPWSTR). :cwchar_t is 16 bits long on Windows and so it uses UTF-16 <utf16>: non-BMP <bmp> characters are stored as two :cwchar_t (a surrogate pair <surrogates>), and the length of a string is the number of UTF-16 units and not the number of characters.
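The surrogate pair arithmetic and the difference between UTF-16 units and characters can be sketched in portable C. This is a minimal illustration, not Windows API code: the function names `utf16_encode` and `utf16_count_chars` are hypothetical, and `uint16_t` stands in for the 16-bit :cwchar_t of Windows.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16. Writes 1 or 2 units to out and returns
 * the number of units written, or 0 if cp cannot be encoded (a surrogate
 * code point, or a value above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* BMP character: a single UTF-16 unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Non-BMP character: split the 20 remaining bits into a surrogate pair */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}

/* Count the characters (code points) stored in len UTF-16 units:
 * a well-formed surrogate pair counts as one character. */
size_t utf16_count_chars(const uint16_t *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++; /* skip the low surrogate of the pair */
        count++;
    }
    return count;
}
```

For example, U+1F600 is encoded as the two units 0xD83D 0xDE00, so a string holding "A" followed by U+1F600 has a length of 3 units but contains 2 characters.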

Windows 95, 98 and Me also had Unicode strings, but were limited to BMP characters <bmp>: they used UCS-2 <ucs2> instead of UTF-16.

And Windows CE?

Code pages

A Windows application has two encodings, called code pages (abbreviated "cp"): ANSI and OEM code pages. The ANSI code page, :cCP_ACP, is used for the ANSI version of the Windows API <win_api> to decode byte strings <bytes> to character strings <str> and has a number between 874 and 1258. The OEM code page or "IBM PC" code page, :cCP_OEMCP, comes from MS-DOS, is used for the Windows console <win_console>, contains glyphs to create text interfaces (draw boxes) and has a number between 437 and 874. Example of a French setup: ANSI is cp1252 and OEM is cp850.

There are code page constants:

:cCP_ACP: Windows ANSI code page

:cCP_MACCP: Macintosh code page

:cCP_OEMCP: OEM code page of the current process

:cCP_SYMBOL (42): Symbol code page

:cCP_THREAD_ACP: ANSI code page of the current thread

:cCP_UTF7 (65000): UTF-7 <utf7>

:cCP_UTF8 (65001): UTF-8 <utf8>

Functions.

Wikipedia article: Windows code page.

Encode and decode functions

Encode and decode functions of <windows.h>.

Note

:cMultiByteToWideChar and :cWideCharToMultiByte functions are similar to :cmbstowcs and :cwcstombs functions.

Document NormalizeString()

Document the replacement character?

Windows API: ANSI and wide versions

Windows has two versions of each function of its API: the ANSI version using byte strings <bytes> (A suffix) and the ANSI code page <codepage>, and the wide version (W suffix) using character strings <str>. There are also functions without suffix using :cTCHAR* strings: if the C <c> define :c_UNICODE is defined, :cTCHAR is replaced by :cwchar_t and the Unicode functions are used; otherwise :cTCHAR is replaced by :cchar and the ANSI functions are used. Example:

:cCreateFileA(): bytes version, uses byte strings <bytes> encoded to the ANSI code page

:cCreateFileW(): Unicode version, uses wide character strings <str>

:cCreateFile(): :cTCHAR version depending on the :c_UNICODE define

Always prefer the Unicode version to avoid encoding/decoding errors, and use the W suffix directly to avoid compilation issues.

Note

There is a third version of the API: the MBCS API (multibyte character string). Use the TCHAR functions and define :c_MBCS to use the MBCS functions. For example, :c_tcsrev is replaced by :c_mbsrev if :c_MBCS is defined, by :c_wcsrev if :c_UNICODE is defined, or by :c_strrev otherwise.

Windows string types

LPSTR (LPCSTR): byte string <bytes>, :cchar* (:cconst char*)

LPWSTR (LPCWSTR): wide character string <str>, :cwchar_t* (:cconst wchar_t*)

LPTSTR (LPCTSTR): byte or wide character string depending on the :c_UNICODE define, :cTCHAR* (:cconst TCHAR*)

Filenames

Windows stores filenames as Unicode in the filesystem. Filesystem wide character POSIX-like API:

POSIX functions, like :cfopen(), use the ANSI code page <codepage> to encode/decode strings.

Windows console

Console functions.

document ReadConsoleW()?

To improve the Unicode support <support> of the console, set the console font to a TrueType font (e.g. "Lucida Console") and use the wide character API.

If the console is unable to render a character, it tries to use a character with a similar glyph <translit>. For example, with OEM code page <codepage> 850, Ł (U+0141) is replaced by L (U+0041). If no replacement character can be found, "?" (U+003F) is displayed instead.

In a console (cmd.exe), the chcp command can be used to display or to change the OEM code page <codepage> (and console code page). Changing the console code page is not a good idea because the ANSI API of the console still expects characters encoded to the previous console code page.

See also: Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT? (Michael S. Kaplan, 2008) and the Python bug report #1602: windows console doesn't print or input Unicode.

Note

Setting the console code page <codepage> to cp65001 (UTF-8) doesn't improve Unicode support; it does the opposite: non-ASCII characters are not rendered correctly and typing non-ASCII characters (e.g. using the keyboard) doesn't work correctly, especially with raster fonts.

File mode

:c_setmode and :c_wsopen are special functions to set the encoding of a file:

:c_O_U8TEXT: UTF-8 without BOM <bom>

:c_O_U16TEXT: UTF-16 <utf16> without BOM

:c_O_WTEXT: UTF-16 with BOM

:cfopen can use these modes using ccs= in the file mode:

ccs=UNICODE: :c_O_WTEXT

ccs=UTF-8: :c_O_U8TEXT

ccs=UTF-16LE: :c_O_U16TEXT

Consequences on TTY and pipes?

Mac OS X

Mac OS X uses UTF-8 for the filenames. If a filename is an invalid UTF-8 byte string, Mac OS X returns an error <strict>. The filenames are decomposed <normalization> to an incompatible variant of the Normal Form D (NFD). Extract of the Technical Q&A QA1173: "For example, HFS Plus uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed."
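The three ranges quoted from QA1173 can be written as a small predicate. This is only a sketch of the range test: the function name `hfsplus_skips_decomposition` is hypothetical, and the code does not perform any normalization itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if the code point lies in one of the ranges that, according
 * to the QA1173 extract, HFS Plus leaves undecomposed. */
bool hfsplus_skips_decomposition(uint32_t cp)
{
    return (cp >= 0x2000 && cp <= 0x2FFF)
        || (cp >= 0xF900 && cp <= 0xFAFF)
        || (cp >= 0x2F800 && cp <= 0x2FAFF);
}
```

A standard NFD implementation would decompose some characters in these ranges, which is why a filename normalized with a generic Unicode library may not match the name stored by the filesystem.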

Locales

To support different languages and encodings, UNIX and BSD operating systems have "locales". Locales are process-wide: if a thread or a library changes the locale, the whole process is impacted.

Locale categories

Locale categories:

:cLC_COLLATE: compare and sort strings

:cLC_CTYPE: decode byte strings <bytes> and encode character strings <str>

:cLC_MESSAGES: language of messages

:cLC_MONETARY: monetary formatting

:cLC_NUMERIC: number formatting (e.g. thousands separator)

:cLC_TIME: time and date formatting

:cLC_ALL is a special category: if you set a locale using this category, it sets the locale for all categories.

Each category has its own environment variable with the same name. For example, LC_MESSAGES=C displays error messages in English. To get the value of a locale category, the LC_ALL, LC_xxx (e.g. LC_CTYPE) and LANG environment variables are checked in that order, and the first non-empty variable is used. If all variables are unset or empty, the C locale is used as a fallback.
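The lookup order of the environment variables can be sketched as follows. The helper names `pick_locale` and `locale_from_env` are hypothetical; a real C library performs this lookup internally when setlocale() is called with an empty locale name, and may apply additional rules.

```c
#include <stddef.h>
#include <stdlib.h>

/* Pick the first non-empty value among LC_ALL, the category variable
 * and LANG (in that order); fall back to the C locale. */
const char *pick_locale(const char *lc_all, const char *lc_category,
                        const char *lang)
{
    const char *candidates[3] = { lc_all, lc_category, lang };
    for (size_t i = 0; i < 3; i++) {
        if (candidates[i] != NULL && candidates[i][0] != '\0')
            return candidates[i];
    }
    return "C";
}

/* Same lookup, reading the process environment.
 * category_var is the variable name of the category, e.g. "LC_CTYPE". */
const char *locale_from_env(const char *category_var)
{
    return pick_locale(getenv("LC_ALL"), getenv(category_var), getenv("LANG"));
}
```

With LANG=fr_FR.UTF-8 and LC_CTYPE=en_US.UTF-8, `locale_from_env("LC_CTYPE")` returns "en_US.UTF-8"; setting LC_ALL to a non-empty value overrides both.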

Note

The gettext library reads LANGUAGE, LC_ALL and LANG environment variables (and some others) to get the user language. The LANGUAGE variable is specific to gettext and is not related to locales.

The C locale

When a program starts, it does not directly get the user locale: it uses the default locale, which is called the "C" locale or the "POSIX" locale. It is also used if no locale environment variable is set. For :cLC_CTYPE, the C locale usually means ASCII, but not always (see the locale encoding section). For :cLC_MESSAGES, the C locale means to speak the original language of the program, which is usually English.
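A program can check whether it is still running with the default locale by querying setlocale() with a NULL locale name, which returns the current setting without changing it. The helper name `uses_c_locale` below is hypothetical, and the sketch assumes the C library reports the default locale under the name "C" or "POSIX".

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Return true if the given category currently uses the C/POSIX locale. */
bool uses_c_locale(int category)
{
    /* A NULL locale name only queries the current locale */
    const char *name = setlocale(category, NULL);
    return name != NULL
        && (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0);
}
```

Called at the very beginning of main(), `uses_c_locale(LC_CTYPE)` is expected to be true; it stays true until the program calls something like `setlocale(LC_ALL, "")` with a user locale other than C.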

Locale encoding

For Unicode, the most important locale category is LC_CTYPE: it is used to set the "locale encoding".

To get the locale encoding:

Copy the current locale: setlocale(LC_CTYPE, NULL)

Set the current locale encoding to the user preference: setlocale(LC_CTYPE, "")

Use nl_langinfo(CODESET) if available

or setlocale(LC_CTYPE, NULL)
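The steps above can be sketched as a single function. This is one possible implementation under stated assumptions: nl_langinfo(CODESET) is assumed to be available (the fallback on setlocale(LC_CTYPE, NULL) is not implemented), and the function name `get_locale_encoding` is hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'ed copy of the name of the user's locale encoding,
 * or NULL on failure. The LC_CTYPE locale is restored before returning. */
char *get_locale_encoding(void)
{
    char *encoding = NULL;

    /* Copy the current locale name: the string returned by setlocale()
     * may be overwritten by a later setlocale() call */
    const char *current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return NULL;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return NULL;
    strcpy(saved, current);

    /* Switch to the user preference. If the preference is invalid,
     * setlocale() fails and the current locale is left unchanged. */
    setlocale(LC_CTYPE, "");

    /* Read the encoding name and copy it before restoring the locale */
    const char *codeset = nl_langinfo(CODESET);
    if (codeset != NULL && codeset[0] != '\0') {
        encoding = malloc(strlen(codeset) + 1);
        if (encoding != NULL)
            strcpy(encoding, codeset);
    }

    /* Restore the previous locale */
    setlocale(LC_CTYPE, saved);
    free(saved);
    return encoding;
}
```

The caller owns the returned string and must release it with free(). Because locales are process-wide, this function should not be called while other threads depend on the current locale.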

write a full example in C

For the C locale, nl_langinfo(CODESET) returns ASCII, or an alias to this encoding (e.g. "US-ASCII" or "646"). But on FreeBSD, Solaris and Mac OS X <osx>, codec functions (e.g. :cmbstowcs) use ISO-8859-1 even if nl_langinfo(CODESET) announces ASCII encoding. AIX uses ISO-8859-1 for the C locale (and nl_langinfo(CODESET) returns "ISO8859-1").

Locale functions

<locale.h> functions.

setlocale(category, "") sets the locale of the category to the user preference

<langinfo.h> functions.

<stdlib.h> functions.

mbstowcs() and wcstombs() are strict <strict> and don't support error handlers <errors>.
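Because mbstowcs() is strict, the only way to handle an undecodable byte string is to check its return value. The wrapper below is a minimal sketch: the name `decode_locale` is hypothetical, and calling mbstowcs() with a NULL destination to get the required length is assumed to be supported (it is a POSIX extension to ISO C).

```c
#include <stdlib.h>
#include <wchar.h>

/* Decode a byte string from the current locale encoding. Returns a
 * malloc'ed wide character string, or NULL if the byte string contains
 * an invalid sequence or if memory allocation fails. */
wchar_t *decode_locale(const char *bytes)
{
    /* First pass: compute the length, (size_t)-1 signals a decoding error */
    size_t len = mbstowcs(NULL, bytes, 0);
    if (len == (size_t)-1)
        return NULL;

    wchar_t *wide = malloc((len + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;

    /* Second pass: decode, including the terminating null character */
    if (mbstowcs(wide, bytes, len + 1) == (size_t)-1) {
        free(wide);
        return NULL;
    }
    return wide;
}
```

There is no way to ask mbstowcs() to replace or skip invalid bytes: an application that needs such an error handler has to decode character by character, for example with mbrtowc(), and implement the recovery itself.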

Note

"mbs" stands for "multibyte string" (byte string) and "wcs" stands for "wide character string".

On Windows, the "locale encodings" are the ANSI and OEM code pages <codepage>. A Windows program uses the user's preferred code pages at startup, whereas a program starts with the C locale on UNIX.

Filesystems (filenames)

CD-ROM and DVD

CD-ROM uses the ISO 9660 filesystem which stores filenames as byte strings <bytes>. This filesystem is very restrictive: only A-Z, 0-9, _ and "." are allowed. Microsoft developed the Joliet extension, which stores filenames as UCS-2 <ucs2>, up to 64 characters (BMP <bmp> only). It was first supported by Windows 95. Today, all operating systems are able to read it.
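The ISO 9660 character restriction described above can be checked with a short function. This sketch only tests the character set listed in this section (A-Z, 0-9, _ and "."); it does not check length limits or other rules of the standard, and the name `iso9660_name_chars_ok` is hypothetical.

```c
#include <stdbool.h>

/* Return true if name is non-empty and contains only the characters
 * A-Z, 0-9, '_' and '.'. */
bool iso9660_name_chars_ok(const char *name)
{
    if (*name == '\0')
        return false;
    for (const char *p = name; *p != '\0'; p++) {
        char c = *p;
        bool ok = (c >= 'A' && c <= 'Z')
               || (c >= '0' && c <= '9')
               || c == '_' || c == '.';
        if (!ok)
            return false;
    }
    return true;
}
```

"README.TXT" passes the check, whereas "readme.txt" and any name containing a non-ASCII byte are rejected.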

UDF (Universal Disk Format) is the filesystem of DVD: it stores filenames as character strings.

UDF encoding?

Microsoft: FAT and NTFS filesystems

MS-DOS uses the FAT filesystems (FAT 12, FAT 16, FAT 32): filenames are stored as byte strings <bytes>. Filenames are limited to 8+3 characters (8 for the name, 3 for the extension) and displayed differently depending on the code page <codepage> (mojibake issue <mojibake>).
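The mojibake issue comes from decoding the same stored byte with different code pages. The sketch below illustrates it with a deliberately tiny decoder: `decode_byte` and `codepage_t` are hypothetical names, and only ASCII plus the two bytes 0x82 and 0xE9 are covered, which is enough to show the effect with cp850 and cp1252.

```c
#include <stdint.h>

typedef enum { CP850, CP1252 } codepage_t;

/* Decode a single byte to a Unicode code point. Only ASCII and the bytes
 * 0x82 and 0xE9 are handled; other bytes yield U+FFFD in this sketch. */
uint32_t decode_byte(codepage_t cp, unsigned char byte)
{
    if (byte < 0x80)
        return byte; /* both code pages are ASCII compatible */
    switch (byte) {
    case 0x82:
        /* cp850: U+00E9 (e with acute), cp1252: U+201A (low quote) */
        return cp == CP850 ? 0x00E9 : 0x201A;
    case 0xE9:
        /* cp850: U+00DA (U with acute), cp1252: U+00E9 (e with acute) */
        return cp == CP850 ? 0x00DA : 0x00E9;
    default:
        return 0xFFFD; /* not covered by this partial table */
    }
}
```

A filename containing "é" stored as byte 0x82 under cp850 is displayed with a low quotation mark when the bytes are interpreted as cp1252, while ASCII-only filenames look the same with both code pages.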

Microsoft extended its FAT filesystem in Windows 95: the Virtual FAT (VFAT) supports "long filenames", which are stored as UCS-2 <ucs2>, up to 255 characters (BMP only). Starting with Windows 2000, non-BMP characters <bmp> can be used: UTF-16 <utf16> replaces UCS-2 and the limit is now 255 UTF-16 units.

The NTFS filesystem stores filenames using UTF-16 encoding.

Apple: HFS and HFS+ filesystems

HFS stores filenames as byte strings.

HFS+ stores filenames as UTF-16 <utf16>: the maximum length is 255 UTF-16 units.

Others

JFS and ZFS also use Unicode.

The ext family (ext2, ext3, ext4) stores filenames as byte strings.

Linux: mount options (FAT, NFSv3)

USB keys, camera, memory cards

Network filesystems like NFS (NFS4 supports Unicode?)
