Merge pull request #2905 from bfredl/utf8
Only allow encoding=utf-8 and simplify multibyte code
bfredl committed Nov 5, 2016
2 parents 32d9c19 + 4ab3fe8 commit 9147331
Showing 21 changed files with 261 additions and 1,062 deletions.
66 changes: 22 additions & 44 deletions runtime/doc/eval.txt
@@ -1029,8 +1029,8 @@ A string constant accepts these special characters:
\x. byte specified with one hex number (must be followed by non-hex char)
\X.. same as \x..
\X. same as \x.
\u.... character specified with up to 4 hex numbers, stored according to the
current value of 'encoding' (e.g., "\u02a4")
\u.... character specified with up to 4 hex numbers, stored as UTF-8
(e.g., "\u02a4")
\U.... same as \u but allows up to 8 hex numbers.
\b backspace <BS>
\e escape <Esc>
@@ -1045,8 +1045,7 @@ A string constant accepts these special characters:
utf-8 character, use \uxxxx as mentioned above.

Note that "\xff" is stored as the byte 255, which may be invalid in some
encodings. Use "\u00ff" to store character 255 according to the current value
of 'encoding'.
encodings. Use "\u00ff" to store character 255 correctly as UTF-8.

Note that "\000" and "\x00" force the end of the string.
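As an aside (not part of the help text), the distinction between "\xff" and "\u00ff" described above can be checked in Python, whose strings are likewise Unicode-based:

```python
# The help text's point: "\u00ff" is the character U+00FF, stored as the
# two-byte UTF-8 sequence 0xc3 0xbf, while "\xff" is the single raw byte
# 255, which on its own is not valid UTF-8.
ch = "\u00ff"                        # the character ÿ
assert ch.encode("utf-8") == b"\xc3\xbf"

try:
    bytes([0xFF]).decode("utf-8")    # a lone 0xff byte cannot be decoded
except UnicodeDecodeError:
    print("0xff alone is invalid UTF-8")
```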

@@ -2532,8 +2531,6 @@ byteidxcomp({expr}, {nr}) *byteidxcomp()*
< The first and third echo result in 3 ('e' plus composing
character is 3 bytes), the second echo results in 1 ('e' is
one byte).
Only works different from byteidx() when 'encoding' is set to
a Unicode encoding.
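The byte counts in the byteidxcomp() example above can be confirmed in Python (an illustration, not part of the help text):

```python
# 'e' followed by the combining acute accent U+0301 occupies 3 bytes in
# UTF-8: one byte for 'e', two bytes for the composing character.
s = "e\u0301"
assert len("e".encode("utf-8")) == 1
assert len(s.encode("utf-8")) == 3
```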

call({func}, {arglist} [, {dict}]) *call()* *E699*
Call function {func} with the items in |List| {arglist} as
@@ -2568,11 +2565,11 @@ char2nr({expr}[, {utf8}]) *char2nr()*
Return number value of the first char in {expr}. Examples: >
char2nr(" ") returns 32
char2nr("ABC") returns 65
< When {utf8} is omitted or zero, the current 'encoding' is used.
Example for "utf-8": >
char2nr("á") returns 225
char2nr("á"[0]) returns 195
< With {utf8} set to 1, always treat as utf-8 characters.
< Non-ASCII characters are always treated as UTF-8 characters.
{utf8} has no effect, and exists only for
backwards-compatibility.
A combining character is a separate character.
|nr2char()| does the opposite.
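For comparison, the char2nr() examples above have direct Python equivalents (illustrative only; ord() plays the role of char2nr()):

```python
assert ord(" ") == 32
assert ord("ABC"[0]) == 65
assert ord("á") == 225                   # code point U+00E1
assert "á".encode("utf-8")[0] == 195     # first UTF-8 byte, 0xc3
```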

@@ -4225,11 +4222,7 @@ iconv({expr}, {from}, {to}) *iconv()*
Most conversions require Vim to be compiled with the |+iconv|
feature. Otherwise only UTF-8 to latin1 conversion and back
can be done.
This can be used to display messages with special characters,
no matter what 'encoding' is set to. Write the message in
UTF-8 and use: >
echo iconv(utf8_str, "utf-8", &enc)
< Note that Vim uses UTF-8 for all Unicode encodings, conversion
Note that Vim uses UTF-8 for all Unicode encodings, conversion
from/to UCS-2 is automatically changed to use UTF-8. You
cannot use UCS-2 in a string anyway, because of the NUL bytes.
{only available when compiled with the |+multi_byte| feature}
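Outside Vim, the kind of conversion iconv() performs can be sketched in Python with decode/encode; the helper below is a hypothetical analogue, not Vim's implementation:

```python
def iconv(data: bytes, frm: str, to: str) -> bytes:
    # Re-encode bytes from one charset to another, going through
    # Unicode in between (roughly what Vim's iconv() does to a string).
    return data.decode(frm).encode(to)

assert iconv(b"caf\xe9", "latin1", "utf-8") == b"caf\xc3\xa9"
assert iconv(b"caf\xc3\xa9", "utf-8", "latin1") == b"caf\xe9"
```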
@@ -4513,43 +4506,30 @@ join({list} [, {sep}]) *join()*
json_decode({expr}) *json_decode()*
Convert {expr} from JSON object. Accepts |readfile()|-style
list as the input, as well as regular string. May output any
Vim value. When 'encoding' is not UTF-8 string is converted
from UTF-8 to 'encoding', failing conversion fails
json_decode(). In the following cases it will output
Vim value. In the following cases it will output
|msgpack-special-dict|:
1. Dictionary contains duplicate key.
2. Dictionary contains empty key.
3. String contains NUL byte. Two special dictionaries: for
dictionary and for string will be emitted in case string
with NUL byte was a dictionary key.

Note: function treats its input as UTF-8 always regardless of
'encoding' value. This is needed because JSON source is
supposed to be external (e.g. |readfile()|) and JSON standard
allows only a few encodings, of which UTF-8 is recommended and
the only one required to be supported. Non-UTF-8 characters
are an error.
Note: the function always treats its input as UTF-8. The JSON
standard allows only a few encodings, of which UTF-8 is
recommended and the only one required to be supported.
Non-UTF-8 characters are an error.

json_encode({expr}) *json_encode()*
Convert {expr} into a JSON string. Accepts
|msgpack-special-dict| as the input. Converts from 'encoding'
to UTF-8 when encoding strings. Will not convert |Funcref|s,
|msgpack-special-dict| as the input. Will not convert |Funcref|s,
mappings with non-string keys (can be created as
|msgpack-special-dict|), values with self-referencing
containers, strings which contain non-UTF-8 characters,
pseudo-UTF-8 strings which contain codepoints reserved for
surrogate pairs (such strings are not valid UTF-8 strings).
When converting 'encoding' is taken into account, if it is not
"utf-8", then conversion is performed before encoding strings.
Non-printable characters are converted into "\u1234" escapes
or special escapes like "\t"; others are dumped as-is.

Note: all characters above U+0079 are considered non-printable
when 'encoding' is not UTF-8. This function always outputs
UTF-8 strings as required by the standard thus when 'encoding'
is not unicode resulting string will look incorrect if
"\u1234" notation is not used.
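The escaping behavior described above matches Python's json module with its default ASCII-only output, which can serve as a quick illustration (not Vim's implementation):

```python
import json

# With ASCII-only output, non-ASCII characters become "\u1234" escapes
# and control characters become short escapes like "\t".
assert json.dumps("\t") == '"\\t"'
assert json.dumps("é") == '"\\u00e9"'
assert json.loads('"\\u00e9"') == "é"   # decoding restores the character
```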

keys({dict}) *keys()*
Return a |List| with all the keys of {dict}. The |List| is in
arbitrary order.
@@ -4651,9 +4631,9 @@ line2byte({lnum}) *line2byte()*
Return the byte count from the start of the buffer for line
{lnum}. This includes the end-of-line character, depending on
the 'fileformat' option for the current buffer. The first
line returns 1. 'encoding' matters, 'fileencoding' is ignored.
This can also be used to get the byte count for the line just
below the last line: >
line returns 1. UTF-8 encoding is used, 'fileencoding' is
ignored. This can also be used to get the byte count for the
line just below the last line: >
line2byte(line("$") + 1)
< This is the buffer size plus one. If 'fileencoding' is empty
it is the file size plus one.
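The rule can be sketched as a small hypothetical Python function, assuming the unix 'fileformat' (one NL byte per end-of-line):

```python
def line2byte(lines, lnum):
    # 1-based byte count from the start of the buffer for line lnum,
    # counting one NL end-of-line byte after each preceding line.
    return 1 + sum(len(l.encode("utf-8")) + 1 for l in lines[:lnum - 1])

buf = ["ab", "c"]                  # buffer contents "ab\nc\n", 5 bytes
assert line2byte(buf, 1) == 1      # the first line returns 1
assert line2byte(buf, 3) == 6      # one past the last line: size plus one
```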
@@ -5172,10 +5152,10 @@ nr2char({expr}[, {utf8}]) *nr2char()*
value {expr}. Examples: >
nr2char(64) returns "@"
nr2char(32) returns " "
< When {utf8} is omitted or zero, the current 'encoding' is used.
Example for "utf-8": >
< Example for "utf-8": >
nr2char(300) returns I with bow character
< With {utf8} set to 1, always return utf-8 characters.
< UTF-8 encoding is always used, {utf8} option has no effect,
and exists only for backwards-compatibility.
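The nr2char() examples above correspond to Python's chr() (an illustration, not part of the help text):

```python
assert chr(64) == "@"
assert chr(32) == " "
assert chr(300) == "\u012c"   # U+012C, the I-with-bow (breve) character
```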
Note that a NUL character in the file is specified with
nr2char(10), because NULs are represented with newline
characters. nr2char(0) is a real NUL and terminates the
@@ -5417,7 +5397,7 @@ py3eval({expr}) *py3eval()*
converted to Vim data structures.
Numbers and strings are returned as they are (strings are
copied though, Unicode strings are additionally converted to
'encoding').
UTF-8).
Lists are represented as Vim |List| type.
Dictionaries are represented as Vim |Dictionary| type with
keys converted to strings.
@@ -5467,8 +5447,7 @@ readfile({fname} [, {binary} [, {max}]])
Otherwise:
- CR characters that appear before a NL are removed.
- Whether the last line ends in a NL or not does not matter.
- When 'encoding' is Unicode any UTF-8 byte order mark is
removed from the text.
- Any UTF-8 byte order mark is removed from the text.
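The BOM-stripping behavior can be illustrated with Python's "utf-8-sig" codec (illustrative; not Vim's implementation):

```python
raw = b"\xef\xbb\xbfline one"                   # file starting with a BOM
assert raw.decode("utf-8-sig") == "line one"    # BOM removed from the text
assert raw.decode("utf-8") == "\ufeffline one"  # a plain decode keeps it
```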
When {max} is given this specifies the maximum number of lines
to be read. Useful if you only want to check the first ten
lines of a file: >
@@ -6621,8 +6600,7 @@ string({expr}) Return {expr} converted to a String. If {expr} is a Number,
for infinite and NaN floating-point values representations
which use |str2float()|. Strings are also dumped literally,
only single quote is escaped, which does not allow using YAML
for parsing back binary strings (including text when
'encoding' is not UTF-8). |eval()| should always work for
for parsing back binary strings. |eval()| should always work for
strings and floats though and this is the only official
method, use |msgpackdump()| or |json_encode()| if you need to
share data with other applications.
88 changes: 34 additions & 54 deletions runtime/doc/mbyte.txt
@@ -70,29 +70,24 @@ See |mbyte-locale| for details.

ENCODING

If your locale works properly, Vim will try to set the 'encoding' option
accordingly. If this doesn't work you can overrule its value: >
Nvim always uses UTF-8 internally. Thus the 'encoding' option is always set
to "utf-8" and cannot be changed.

:set encoding=utf-8
All the text that is used inside Vim will be in UTF-8. Not only the text in
the buffers, but also in registers, variables, etc.

See |encoding-values| for a list of acceptable values.

The result is that all the text that is used inside Vim will be in this
encoding. Not only the text in the buffers, but also in registers, variables,
etc. 'encoding' is read-only after startup because changing it would make the
existing text invalid.

You can edit files in another encoding than what 'encoding' is set to. Vim
You can edit files in different encodings than UTF-8. Nvim
will convert the file when you read it and convert it back when you write it.
See 'fileencoding', 'fileencodings' and |++enc|.


DISPLAY AND FONTS

If you are working in a terminal (emulator) you must make sure it accepts the
same encoding as which Vim is working with.
If you are working in a terminal (emulator) you must make sure it accepts
UTF-8, the encoding which Vim is working with. Otherwise only ASCII can
be displayed and edited correctly.

For the GUI you must select fonts that work with the current 'encoding'. This
For the GUI you must select fonts that work with UTF-8. This
is the difficult part. It depends on the system you are using, the locale and
a few other things. See the chapters on fonts: |mbyte-fonts-X11| for
X-Windows and |mbyte-fonts-MSwin| for MS-Windows.
@@ -216,10 +211,9 @@ You could make a small shell script for this.
==============================================================================
3. Encoding *mbyte-encoding*

Vim uses the 'encoding' option to specify how characters are identified and
encoded when they are used inside Vim. This applies to all the places where
text is used, including buffers (files loaded into memory), registers and
variables.
In Nvim UTF-8 is always used internally to encode characters.
This applies to all the places where text is used, including buffers (files
loaded into memory), registers and variables.

*charset* *codeset*
Charset is another name for encoding. There are subtle differences, but these
@@ -240,7 +234,7 @@ matter what language is used. Thus you might see the right text even when the
encoding was set wrong.

*encoding-names*
Vim can use many different character encodings. There are three major groups:
Vim can edit files in different character encodings. There are three major groups:

1 8bit Single-byte encodings, 256 different characters. Mostly used
in USA and Europe. Example: ISO-8859-1 (Latin1). All
@@ -255,11 +249,10 @@ u Unicode Universal encoding, can replace all others. ISO 10646.
Millions of different characters. Example: UTF-8. The
relation between bytes and screen cells is complex.

Other encodings cannot be used by Vim internally. But files in other
Only UTF-8 is used by Vim internally. But files in other
encodings can be edited by using conversion, see 'fileencoding'.
Note that all encodings must use ASCII for the characters up to 128.

Supported 'encoding' values are: *encoding-values*
Recognized 'fileencoding' values include: *encoding-values*
1 latin1 8-bit characters (ISO 8859-1, also used for cp1252)
1 iso-8859-n ISO_8859 variant (n = 2 to 15)
1 koi8-r Russian
@@ -311,11 +304,11 @@ u ucs-4 32 bit UCS-4 encoded Unicode (ISO/IEC 10646-1)
u ucs-4le like ucs-4, little endian

The {name} can be any encoding name that your system supports. It is passed
to iconv() to convert between the encoding of the file and the current locale.
to iconv() to convert between UTF-8 and the encoding of the file.
For MS-Windows "cp{number}" means using codepage {number}.
Examples: >
:set encoding=8bit-cp1252
:set encoding=2byte-cp932
:set fileencoding=8bit-cp1252
:set fileencoding=2byte-cp932
The MS-Windows codepage 1252 is very similar to latin1. For practical reasons
the same encoding is used and it's called latin1. 'isprint' can be used to
@@ -337,8 +330,7 @@ u ucs-2be same as ucs-2 (big endian)
u ucs-4be same as ucs-4 (big endian)
u utf-32 same as ucs-4
u utf-32le same as ucs-4le
default stands for the default value of 'encoding', depends on the
environment
default the encoding of the current locale.

For the UCS codes the byte order matters. This is tricky, use UTF-8 whenever
you can. The default is to use big-endian (most significant byte comes
@@ -363,13 +355,12 @@ or when conversion is not possible:
CONVERSION *charset-conversion*

Vim will automatically convert from one to another encoding in several places:
- When reading a file and 'fileencoding' is different from 'encoding'
- When writing a file and 'fileencoding' is different from 'encoding'
- When reading a file and 'fileencoding' is different from "utf-8"
- When writing a file and 'fileencoding' is different from "utf-8"
- When displaying messages and the encoding used for LC_MESSAGES differs from
'encoding' (requires a gettext version that supports this).
"utf-8" (requires a gettext version that supports this).
- When reading a Vim script where |:scriptencoding| is different from
'encoding'.
- When reading or writing a |shada| file.
"utf-8".
Most of these require the |+iconv| feature. Conversion for reading and
writing files may also be specified with the 'charconvert' option.

@@ -408,11 +399,11 @@ Useful utilities for converting the charset:


*mbyte-conversion*
When reading and writing files in an encoding different from 'encoding',
When reading and writing files in an encoding different from "utf-8",
conversion needs to be done. These conversions are supported:
- All conversions between Latin-1 (ISO-8859-1), UTF-8, UCS-2 and UCS-4 are
handled internally.
- For MS-Windows, when 'encoding' is a Unicode encoding, conversion from and
- For MS-Windows, conversion from and
to any codepage should work.
- Conversion specified with 'charconvert'
- Conversion with the iconv library, if it is available.
@@ -468,8 +459,6 @@ and you will have a working UTF-8 terminal emulator. Try both >
with the demo text that comes with ucs-fonts.tar.gz in order to see
whether there are any problems with UTF-8 in your xterm.

For Vim you may need to set 'encoding' to "utf-8".

==============================================================================
5. Fonts on X11 *mbyte-fonts-X11*

@@ -864,11 +853,11 @@ between two keyboard settings.
The value of the 'keymap' option specifies a keymap file to use. The name of
this file is one of these two:

keymap/{keymap}_{encoding}.vim
keymap/{keymap}_utf-8.vim
keymap/{keymap}.vim

Here {keymap} is the value of the 'keymap' option and {encoding} of the
'encoding' option. The file name with the {encoding} included is tried first.
Here {keymap} is the value of the 'keymap' option.
The file name with "utf-8" included is tried first.

'runtimepath' is used to find these files. To see an overview of all
available keymap files, use this: >
@@ -950,7 +939,7 @@ this is unusual. But you can use various ways to specify the character: >
A <char-0141> octal value
x <Space> special key name
The characters are assumed to be encoded for the current value of 'encoding'.
The characters are assumed to be encoded in UTF-8.
It's possible to use ":scriptencoding" when all characters are given
literally. That doesn't work when using the <char-> construct, because the
conversion is done on the keymap file, not on the resulting character.
@@ -1170,21 +1159,13 @@ Useful commands:
message is truncated, use ":messages").
- "g8" shows the bytes used in a UTF-8 character, also the composing
characters, as hex numbers.
- ":set encoding=utf-8 fileencodings=" forces using UTF-8 for all files. The
default is to use the current locale for 'encoding' and set 'fileencodings'
to automatically detect the encoding of a file.
- ":set fileencodings=" forces using UTF-8 for all files. The
default is to automatically detect the encoding of a file.


STARTING VIM

If your current locale is in an utf-8 encoding, Vim will automatically start
in utf-8 mode.

If you are using another locale: >
set encoding=utf-8
You might also want to select the font used for the menus. Unfortunately this
You might want to select the font used for the menus. Unfortunately this
doesn't always work. See the system specific remarks below, and 'langmenu'.


@@ -1245,10 +1226,9 @@ not everybody is able to type a composing character.
These options are relevant for editing multi-byte files. Check the help in
options.txt for detailed information.

'encoding' Encoding used for the keyboard and display. It is also the
default encoding for files.
'encoding' Internal text encoding, always "utf-8".

'fileencoding' Encoding of a file. When it's different from 'encoding'
'fileencoding' Encoding of a file. When it's different from "utf-8"
conversion is done when reading or writing the file.

'fileencodings' List of possible encodings of a file. When opening a file
