Windows has offered a nice Unicode API since Windows 2000, and it supports :ref:`non-BMP characters <bmp>`. It uses :ref:`Unicode strings <str>` implemented as :c:type:`wchar_t*` strings (LPWSTR). :c:type:`wchar_t` is 16 bits wide on Windows, so these strings are encoded as :ref:`UTF-16 <utf16>`: :ref:`non-BMP <bmp>` characters are stored as two :c:type:`wchar_t` units (a :ref:`surrogate pair <surrogates>`), and the length of a string is the number of UTF-16 units, not the number of characters.
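The surrogate pair arithmetic can be sketched in portable C. This is a minimal illustration, not part of the Windows API: the function name ``utf16_encode`` is hypothetical, and ``uint16_t`` stands in for the 16-bit :c:type:`wchar_t` of Windows so that the code behaves the same on platforms where :c:type:`wchar_t` is 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16 into out
 * (room for 2 units). Returns the number of 16-bit units written
 * (1 or 2), or 0 if cp is a surrogate or is greater than U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* BMP character: a single unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* non-BMP character: split the 20 remaining bits in two halves */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

With this sketch, U+1F600 is encoded as the two units 0xD83D and 0xDE00: one character, but a string length of 2.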
A Windows application has two encodings, called code pages (abbreviated "cp"): ANSI and OEM code pages. The ANSI code page, :c:macro:`CP_ACP`, is used for the ANSI version of the :ref:`Windows API <win_api>` to decode :ref:`byte strings <bytes>` to :ref:`character strings <str>` and has a number between 874 and 1258. The OEM code page or "IBM PC" code page, :c:macro:`CP_OEMCP`, comes from MS-DOS, is used for the :ref:`Windows console <win_console>`, contains glyphs to create text interfaces (draw boxes) and has a number between 437 and 874. Example of a French setup: ANSI is :ref:`cp1252` and OEM is cp850.
There are code page constants:
- :c:macro:`CP_ACP`: Windows ANSI code page
- :c:macro:`CP_MACCP`: Macintosh code page
- :c:macro:`CP_OEMCP`: OEM code page
- :c:macro:`CP_SYMBOL` (42): Symbol code page
- :c:macro:`CP_THREAD_ACP`: ANSI code page of the current thread
- :c:macro:`CP_UTF7` (65000): :ref:`UTF-7 <utf7>`
- :c:macro:`CP_UTF8` (65001): :ref:`UTF-8 <utf8>`
Encode and decode functions
Encode and decode functions of
Windows API: ANSI and wide versions
Windows has two versions of each function of its API: the ANSI version using
:ref:`byte strings <bytes>` (``A`` suffix) and the :ref:`ANSI code page
<codepage>`, and the wide version (``W`` suffix) using :ref:`character strings
<str>`. There are also functions without a suffix using :c:type:`TCHAR*` strings:
if the :ref:`C <c>` macro :c:macro:`_UNICODE` is defined, :c:type:`TCHAR` is
replaced by :c:type:`wchar_t` and the Unicode functions are used; otherwise
:c:type:`TCHAR` is replaced by :c:type:`char` and the ANSI functions are used.
Always prefer the Unicode version to avoid encoding and decoding errors, and use
the ``W`` suffix explicitly to avoid compilation issues.
There is a third version of the API: the MBCS API (multibyte character string). Use the TCHAR functions and define :c:macro:`_MBCS` to use the MBCS functions. For example, :c:func:`_tcsrev` is replaced by :c:func:`_mbsrev` if :c:macro:`_MBCS` is defined, by :c:func:`_wcsrev` if :c:macro:`_UNICODE` is defined, or by :c:func:`_strrev` otherwise.
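The compile-time selection described above can be imitated in portable C. The sketch below does not use the real ``<tchar.h>`` header: the names ``MY_UNICODE``, ``my_tchar``, ``MY_T``, ``my_tcslen`` and ``my_name_length`` are all hypothetical and only illustrate how one source file can be built against either byte strings or wide character strings.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical generic-text layer, modelled on the TCHAR idea:
 * one macro chosen at compile time selects the character type,
 * the literal prefix and the matching string function. */
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(s) L##s        /* wide literal: L"..." */
#define my_tcslen wcslen
#else
typedef char my_tchar;
#define MY_T(s) s           /* byte literal: "..." */
#define my_tcslen strlen
#endif

/* Code written against the generic names compiles in both modes. */
size_t my_name_length(const my_tchar *name)
{
    assert(name != NULL);
    return my_tcslen(name);
}
```

The drawback of this design is that the meaning of the code depends on a build flag, which is one reason to call the wide functions explicitly instead.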
Windows string types
- LPSTR (LPCSTR): :ref:`byte string <bytes>`, :c:type:`char*` (:c:type:`const char*`)
- LPWSTR (LPCWSTR): :ref:`wide character string <str>`, :c:type:`wchar_t*` (:c:type:`const wchar_t*`)
- LPTSTR (LPCTSTR): byte or wide character string depending on whether :c:macro:`_UNICODE` is defined, :c:type:`TCHAR*` (:c:type:`const TCHAR*`)
Windows stores filenames as Unicode in the filesystem. Filesystem wide character POSIX-like API:
To improve the :ref:`Unicode support <support>` of the console, set the console font to a TrueType font (e.g. "Lucida Console") and use the wide character API.
If the console is unable to render a character, it tries to use a :ref:`character with a similar glyph <translit>`. For example, with OEM :ref:`code page <codepage>` 850, Ł (U+0141) is replaced by L (U+004C). If no replacement character can be found, "?" (U+003F) is displayed instead.
In a console, the ``chcp`` command can be used to display or to
change the :ref:`OEM code page <codepage>` (and console code page). Changing the
console code page is not a good idea because the ANSI API of the console still
expects characters encoded in the previous console code page.
Setting the console :ref:`code page <codepage>` to cp65001 (:ref:`UTF-8`) does not improve Unicode support. On the contrary, non-ASCII characters are not rendered correctly, and typing non-ASCII characters (e.g. using the keyboard) does not work correctly, especially with raster fonts.
:c:func:`fopen` can use these modes using ``ccs=`` in the file mode:
Mac OS X
Mac OS X uses :ref:`UTF-8` for the filenames. If a filename is an invalid UTF-8 byte string, Mac OS X :ref:`returns an error <strict>`. The filenames are :ref:`decomposed <normalization>` to an incompatible variant of the Normal Form D (NFD). Extract of the Technical Q&A QA1173: "For example, HFS Plus uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed."
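One practical consequence of the decomposition is that a filename may come back from the filesystem as different bytes than the ones used to create it. The sketch below only compares two UTF-8 spellings of the same text and does not touch any filesystem; the identifiers ``cafe_nfc``, ``cafe_nfd`` and ``same_filename_bytes`` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* "café" with a precomposed é (Normal Form C):
 * U+00E9 is encoded in UTF-8 as C3 A9. */
const char cafe_nfc[] = "caf\xC3\xA9";

/* "café" with a decomposed é (Normal Form D):
 * U+0065 U+0301 is encoded in UTF-8 as 65 CC 81. */
const char cafe_nfd[] = "cafe\xCC\x81";

/* Byte-wise comparison, as done by strcmp() on filenames:
 * returns 1 if both names are the same bytes, 0 otherwise. */
int same_filename_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Both strings display the same text, but ``same_filename_bytes(cafe_nfc, cafe_nfd)`` returns 0. A program that compares names read from a directory with names it created should normalize both sides first.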
To support different languages and encodings, UNIX and BSD operating systems have "locales". Locales are process-wide: if a thread or a library changes the locale, the whole process is affected.
Locale categories:
- :c:macro:`LC_COLLATE`: compare and sort strings
- :c:macro:`LC_CTYPE`: decode :ref:`byte strings <bytes>` and encode :ref:`character strings <str>`
- :c:macro:`LC_MESSAGES`: language of messages
- :c:macro:`LC_MONETARY`: monetary formatting
- :c:macro:`LC_NUMERIC`: number formatting (e.g. thousands separator)
- :c:macro:`LC_TIME`: time and date formatting
:c:macro:`LC_ALL` is a special category: if you set a locale using this category, it sets the locale for all categories.
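The effect of :c:macro:`LC_ALL` can be checked with :c:func:`setlocale`, which queries a category when its second argument is ``NULL``. This is a minimal sketch; the function name ``lc_all_sets_every_category`` is hypothetical, and :c:macro:`LC_NUMERIC` is an arbitrary choice of category to inspect.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Set every category to the "C" locale through LC_ALL, then query a
 * single category. Returns 1 if LC_NUMERIC reports "C", 0 otherwise. */
int lc_all_sets_every_category(void)
{
    const char *name;

    if (setlocale(LC_ALL, "C") == NULL)
        return 0;
    /* A NULL locale argument queries the category without changing it. */
    name = setlocale(LC_NUMERIC, NULL);
    return name != NULL && strcmp(name, "C") == 0;
}
```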
Each category has its own environment variable with the same name. For
example, ``LC_MESSAGES=C`` displays error messages in English. To get the
value of a locale category, the ``LC_ALL``, ``LC_xxx`` (e.g. ``LC_CTYPE``) and
``LANG`` environment variables are checked in that order: the first non-empty
variable is used. If all variables are unset, the C locale is used as a fallback.
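The lookup order can be written down as a small function. This is only a sketch of the rule, assuming the usual POSIX precedence (``LC_ALL``, then the variable of the category, then ``LANG``); the C library performs this lookup itself in ``setlocale(category, "")``, and the name ``locale_from_env`` is hypothetical.

```c
/* setenv() and unsetenv(), used when trying this function, are POSIX. */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the locale name selected by the environment for one category.
 * category_var is the variable of the category, e.g. "LC_CTYPE". */
const char *locale_from_env(const char *category_var)
{
    const char *vars[3];
    size_t i;

    assert(category_var != NULL);
    vars[0] = "LC_ALL";
    vars[1] = category_var;
    vars[2] = "LANG";
    for (i = 0; i < 3; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && value[0] != '\0')
            return value;   /* the first non-empty variable wins */
    }
    return "C";             /* nothing is set: fall back to the C locale */
}
```

For example, with ``LC_ALL`` empty, ``LC_CTYPE=fr_FR.UTF-8`` and ``LANG=en_US.UTF-8``, the function returns ``fr_FR.UTF-8`` for ``"LC_CTYPE"`` and ``en_US.UTF-8`` for a category whose own variable is unset.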
The gettext library reads the ``LANGUAGE``, ``LC_ALL`` and ``LANG`` environment
variables (and some others) to get the user language. The ``LANGUAGE``
variable is specific to gettext and is not related to locales.
The C locale
When a program starts, it does not directly use the user locale: it uses the default locale, called the "C" locale or the "POSIX" locale. This locale is also used if no locale environment variable is set. For :c:macro:`LC_CTYPE`, the C locale usually means :ref:`ASCII`, but not always (see the locale encoding section). For :c:macro:`LC_MESSAGES`, the C locale means using the original language of the program, which is usually English.
For Unicode, the most important locale category is :c:macro:`LC_CTYPE`: it is used to
set the "locale encoding".
To get the locale encoding:
- Copy the current locale: ``setlocale(LC_CTYPE, NULL)``
- Set the current locale encoding to the user preference: ``setlocale(LC_CTYPE, "")``
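These steps can be put together as follows. This is a sketch under assumptions rather than a reference implementation: the function name ``get_locale_encoding`` is hypothetical, the encoding name is read with ``nl_langinfo(CODESET)`` (POSIX), and restoring the previous locale at the end is an addition made here so that the caller's locale is left untouched.

```c
/* nl_langinfo() is POSIX, not ISO C. */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Copy the name of the user's preferred encoding into buf.
 * Returns 0 on success, -1 on failure. The LC_CTYPE locale of the
 * process is restored before returning. */
int get_locale_encoding(char *buf, size_t size)
{
    const char *current, *codeset;
    char *saved;
    int result = -1;

    assert(buf != NULL);

    /* 1. Copy the current locale name: the string returned by
     *    setlocale() may be overwritten by the next call. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    /* 2. Switch LC_CTYPE to the user preference (environment). */
    if (setlocale(LC_CTYPE, "") != NULL) {
        /* 3. Ask for the encoding name of that locale. */
        codeset = nl_langinfo(CODESET);
        if (codeset[0] != '\0' && strlen(codeset) < size) {
            strcpy(buf, codeset);
            result = 0;
        }
    }

    /* 4. Restore the previous locale (assumption: the caller does
     *    not want its locale changed as a side effect). */
    setlocale(LC_CTYPE, saved);
    free(saved);
    return result;
}
```

The returned name depends on the platform and on the environment: it is typically something like ``UTF-8`` for a UTF-8 locale, and an alias of ASCII for the C locale.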
For the C locale, ``nl_langinfo(CODESET)`` returns :ref:`ASCII`, or an alias
of this encoding (e.g. "US-ASCII" or "646"). But on FreeBSD, Solaris and
:ref:`Mac OS X <osx>`, codec functions (e.g. :c:func:`mbstowcs`) use
:ref:`ISO-8859-1` even if ``nl_langinfo(CODESET)`` announces the ASCII encoding.
AIX uses :ref:`ISO-8859-1` for the C locale.
"mbs" stands for "multibyte string" (byte string) and "wcs" stands for "wide character string".
On Windows, the "locale encodings" are the :ref:`ANSI and OEM code pages <codepage>`. A Windows program uses the user's preferred code pages at startup, whereas a UNIX program starts with the C locale.
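On UNIX, decoding with the locale encoding goes through :c:func:`mbstowcs`, which uses the current :c:macro:`LC_CTYPE` locale. The wrapper below is a minimal sketch (the name ``decode_locale`` is hypothetical); it only adds a guaranteed terminator and passes the error value of :c:func:`mbstowcs` through.

```c
#include <assert.h>
#include <locale.h>   /* setlocale(), for the caller */
#include <stdlib.h>   /* mbstowcs() */
#include <wchar.h>

/* Decode a byte string to a wide character string using the encoding
 * of the current LC_CTYPE locale. out has room for out_len wide
 * characters. Returns the number of wide characters written, not
 * counting the terminator, or (size_t)-1 if a byte sequence cannot
 * be decoded in that encoding. */
size_t decode_locale(const char *bytes, wchar_t *out, size_t out_len)
{
    size_t n;

    assert(bytes != NULL && out != NULL && out_len > 0);
    n = mbstowcs(out, bytes, out_len - 1);
    if (n == (size_t)-1)
        return n;
    /* mbstowcs() does not write a terminator if it fills the buffer. */
    out[n] = L'\0';
    return n;
}
```

After ``setlocale(LC_CTYPE, "C")``, decoding ``"abc"`` yields three wide characters on every platform. Decoding the single byte ``"\xE9"`` is the interesting case: depending on the platform, it either fails or is decoded as U+00E9, which is the ASCII versus ISO-8859-1 difference described above.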
CD-ROM and DVD
CD-ROM uses the ISO 9660 filesystem which stores filenames as :ref:`byte strings <bytes>`. This filesystem is very restrictive: only A-Z, 0-9, _ and "." are allowed. Microsoft developed the Joliet extension, which stores filenames as :ref:`UCS-2 <ucs2>`, up to 64 characters (:ref:`BMP <bmp>` only). It was first supported by Windows 95. Today, all operating systems are able to read it.
UDF (Universal Disk Format) is the filesystem of DVD: it stores filenames as character strings.
Microsoft: FAT and NTFS filesystems
MS-DOS uses the FAT filesystems (FAT 12, FAT 16, FAT 32): filenames are stored as :ref:`byte strings <bytes>`. Filenames are limited to 8+3 characters (8 for the name, 3 for the extension) and displayed differently depending on the :ref:`code page <codepage>` (:ref:`mojibake issue <mojibake>`).
Microsoft extended its FAT filesystem in Windows 95: the Virtual FAT (VFAT) supports "long filenames", which are stored as :ref:`UCS-2 <ucs2>`, up to 255 characters (BMP only). Starting with Windows 2000, :ref:`non-BMP characters <bmp>` can be used: :ref:`UTF-16 <utf16>` replaces UCS-2 and the limit is now 255 UTF-16 units.
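Because the limit is counted in UTF-16 units, a name made of non-BMP characters reaches it with half as many characters. The sketch below counts units for an array of code points; the names ``MAX_NAME_UNITS``, ``utf16_length`` and ``name_fits`` are hypothetical, and the code assumes that every element is a valid code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NAME_UNITS 255   /* limit quoted above for long filenames */

/* Number of UTF-16 units needed for an array of code points:
 * one unit for a BMP character, two (a surrogate pair) otherwise. */
size_t utf16_length(const uint32_t *code_points, size_t count)
{
    size_t i, units = 0;

    assert(code_points != NULL || count == 0);
    for (i = 0; i < count; i++)
        units += (code_points[i] >= 0x10000) ? 2 : 1;
    return units;
}

/* Returns 1 if the name fits within the 255-unit limit, 0 otherwise. */
int name_fits(const uint32_t *code_points, size_t count)
{
    return utf16_length(code_points, count) <= MAX_NAME_UNITS;
}
```

A name of 255 BMP characters fits, whereas a name of 128 non-BMP characters needs 256 units and does not.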
The NTFS filesystem stores filenames using UTF-16 encoding.
Apple: HFS and HFS+ filesystems
HFS stores filenames as byte strings.
HFS+ stores filenames as :ref:`UTF-16 <utf16>`: the maximum length is 255 UTF-16 units.
JFS and ZFS also use Unicode.
The ext family (ext2, ext3, ext4) stores filenames as byte strings.