# Chapter 4. Text versus Bytes

|Unicode code point range (hex)|UTF-8 byte sequence (binary)|
|-------------------|-----------------|
|000000-00007F|0xxxxxxx|
|000080-0007FF|110xxxxx 10xxxxxx|
|000800-00FFFF|1110xxxx 10xxxxxx 10xxxxxx|
|010000-10FFFF|11110xxx 10xxxxxx 10xxxxxx 10xxxxxx|
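
As a quick check of the table above, the following sketch encodes one character from each code point range and prints the resulting UTF-8 bytes in binary. The sample characters and the helper name `utf8_bits` are illustrative choices, not part of the original notes.

```python
# One sample character per row of the table (illustrative choices).
samples = ['A', 'é', '€', '😀']


def utf8_bits(char):
    """Return the UTF-8 encoding of char as space-separated 8-bit groups."""
    return ' '.join(format(byte, '08b') for byte in char.encode('utf-8'))


for char in samples:
    # Code point, number of UTF-8 bytes, and the bit pattern of each byte.
    print(f'U+{ord(char):06X}', len(char.encode('utf-8')), utf8_bits(char), sep='\t')
```

The leading bits of the first byte (`0`, `110`, `1110`, `11110`) announce the length of the sequence, and every continuation byte starts with `10`.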

Computers use bytes. Humans speak text.

Python 3 introduced a sharp distinction between strings of human text and sequences of raw bytes. This chapter deals with Unicode strings, binary sequences, and the encodings used to convert between them.

In this chapter, we will visit the following topics:

* Characters, code points, and byte representations
* Unique features of binary sequences: bytes, bytearray, and memoryview
* Codecs for full Unicode and legacy character sets
* Avoiding and dealing with encoding errors
* The default encoding trap and standard I/O issues
* Safe Unicode text comparisons with normalization
* Utility functions for normalization, case folding, and brute-force diacritic removal
* Proper sorting of Unicode text with locale and the PyUCA library
* Character metadata in the Unicode database
* Dual-mode APIs that handle str and bytes

## Chapter Issues

**The concept of "string" is simple enough: a string is a sequence of characters. The problem lies in the definition of "character".**

**In 2015, the best definition of "character" we have is a Unicode character. The items you get out of a Python 3 str are Unicode characters, just like the items of a unicode object in Python 2, and not the raw bytes you get from a Python 2 str.**

**Converting from code points to bytes is *encoding*; converting from bytes to code points is *decoding*.**

In [2]:
s = 'café'

In [3]:
len(s)

4

In [5]:
b = s.encode('utf8')
b

b'caf\xc3\xa9'

In [6]:
len(b)

5
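
The session above only encodes. To complete the round trip, a minimal sketch reusing the same `'café'` sample decodes the bytes back to `str`; the variable name `restored` is introduced here for illustration.

```python
s = 'café'
b = s.encode('utf8')         # str -> bytes (encoding)
restored = b.decode('utf8')  # bytes -> str (decoding)

# The decoded text has 4 characters again and equals the original.
print(restored, len(restored), restored == s)
```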

> If you need a memory aid to help distinguish `.decode()` from `.encode()`, convince yourself that byte sequences can be cryptic machine code dumps while Unicode str objects are "human" text. Therefore, it makes sense that we decode bytes to str to get human-readable text, and we encode str to bytes for storage or transmission.

## Byte Essentials

The new binary sequence types are unlike the Python 2 `str` in many regards. The first thing to know is that there are two basic built-in types for binary sequences: **the immutable `bytes` type introduced in Python 3 and the mutable `bytearray`, added in Python 2.6**.

Each item in `bytes` or `bytearray` is an integer from 0 to 255, and not a one-character string like in the Python2 `str`.

In [13]:
# bytes can be built from a str, given an encoding
cafe = bytes('café', encoding='utf8')
cafe

b'caf\xc3\xa9'

In [14]:
# Each item is an integer in range(256)
cafe[0]

99

In [15]:
# Slices of bytes are also bytes, even slices of a single byte.
cafe[:1]

b'c'

In [18]:
cafe_arr = bytearray(cafe)
cafe_arr

bytearray(b'caf\xc3\xa9')

In [21]:
# slice of bytearray is also a bytearray.
cafe_arr[-1:]

bytearray(b'\xa9')

In [23]:
cafe_arr[:1]

bytearray(b'c')

> The fact that `my_bytes[0]` retrieves an int but `my_bytes[:1]` returns a bytes object of length 1 should not be surprising.

Both `bytes` and `bytearray` support every `str` method except those that do formatting (`format`, `format_map`) and a few others that depend on Unicode data, including `casefold`, `isdecimal`, and `encode`.

In addition, the regular expression functions in the `re` module also work on binary sequences, if the regex is compiled from a binary sequence instead of a str.
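
A short sketch of both points follows: `str`-like methods applied to a bytes object, and a regular expression compiled from a bytes pattern. The sample value `data` is made up for this illustration.

```python
import re

data = b'caf\xc3\xa9 1729'

# Most str methods have bytes counterparts that return bytes.
print(data.upper())            # only ASCII letters are affected
print(data.split())
print(data.startswith(b'caf'))

# A pattern compiled from bytes searches bytes.
digits = re.findall(rb'\d+', data)
print(digits)

# Mixing a str pattern with bytes input is an error.
try:
    re.findall(r'\d+', data)
except TypeError as exc:
    print('TypeError:', exc)
```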

Binary sequences have a class method that str doesn't have, called `fromhex`, which builds a binary sequence by parsing pairs of hex digits optionally separated by spaces:

In [26]:
bytes.fromhex('31 4B CE A9')

b'1K\xce\xa9'

### Structs and Memory Views

The `struct` module provides functions to parse packed bytes into a tuple of fields of different types and to perform the opposite conversion, from a tuple into packed bytes. `struct` is used with bytes, bytearray, and memoryview objects.

The `memoryview` class does not let you create or store byte sequences, but provides shared memory access to slices of data from other binary sequences, packed arrays, and buffers such as Python Imaging Library (PIL) images, without copying the bytes.
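
To make this concrete, here is a minimal sketch that packs a hand-built, GIF-style 10-byte header with `struct` and reads it back through a `memoryview`. The field values (555, 230, 640) and the layout string are assumptions chosen for the example, not data from a real file.

```python
import struct

# '<3s3sHH': little-endian, two 3-byte strings, two unsigned 16-bit integers.
fmt = '<3s3sHH'
header = struct.pack(fmt, b'GIF', b'89a', 555, 230)
print(header, len(header))

# memoryview slices share memory with the source; no bytes are copied.
view = memoryview(header)
fields = struct.unpack(fmt, view[:10])
print(fields)

# Writing through a memoryview of a mutable bytearray changes the original.
buf = bytearray(header)
mv = memoryview(buf)
struct.pack_into('<H', mv, 6, 640)   # overwrite the width field at offset 6
updated = struct.unpack(fmt, buf)
print(updated)
```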

## Basic Encoders/Decoders

In [1]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'


The encodings were chosen as a representative sample:

*latin1 a.k.a. iso8859_1*

    Important because it is the basis for other encodings, such as *cp1252* and Unicode itself.

*cp1252*

    A latin1 superset by Microsoft, adding useful symbols like curly quotes. Some Windows apps call it "ANSI", but it was never a real ANSI standard.

*gb2312*

    Legacy standard to encode the simplified Chinese ideographs used in mainland China; one of several widely deployed multibyte encodings for Asian languages.

*utf-8*

    The most common 8-bit encoding on the Web, backward-compatible with ASCII (pure ASCII text is valid UTF-8).

*utf-16le*

    One form of the UTF-16 encoding scheme; all UTF-16 encodings support code points beyond U+FFFF through escape sequences called "surrogate pairs".
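
The loop earlier in this section covers only three codecs. As a rough sketch, the helper below (the name `try_encode` and the sample characters are assumptions for illustration) tries each codec from the list on a few characters and reports `None` where the codec has no representation for the character.

```python
samples = ['A', 'é', '€', '中']
codecs_to_try = ['latin_1', 'cp1252', 'gb2312', 'utf_8', 'utf_16le']


def try_encode(char, codec):
    """Return char encoded with codec, or None if the codec cannot represent it."""
    try:
        return char.encode(codec)
    except UnicodeEncodeError:
        return None


for char in samples:
    for codec in codecs_to_try:
        print(repr(char), codec, try_encode(char, codec), sep='\t')
```

Only the UTF encodings succeed for every sample; the legacy codecs each cover a different subset.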

## Understanding Encode/Decode Problems

Although there is a generic UnicodeError exception, the error reported is almost always more specific: either a UnicodeEncodeError (when converting str to binary sequences) or UnicodeDecodeError (when reading binary sequences into str). Loading Python modules may also generate a SyntaxError when the source encoding is unexpected.

> The first thing to note when you get a Unicode error is the exact type of the exception. Is it a UnicodeEncodeError, a UnicodeDecodeError, or some other error that mentions an encoding problem?

### Coping with UnicodeEncodeError


Most non-UTF codecs handle only a small subset of the Unicode characters. When converting text to bytes, if a character is not defined in the target encoding, UnicodeEncodeError will be raised, unless special handling is provided by passing an `errors` argument to the encoding method or function.

In [3]:
city = 'São Paulo'
city.encode('utf-8')

b'S\xc3\xa3o Paulo'

In [12]:
# The "utf-?" encodings handle any str
city.encode('utf-16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [13]:
city.encode('iso8859-1')

b'S\xe3o Paulo'

In [14]:
# 'cp437' can't encode  the 'ã'. The default error handler 'strict'
# raises UnicodeEncodeError
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [15]:
# The errors='ignore' handler silently skips characters that cannot
# be encoded; this is usually a very bad idea.
city.encode('cp437', errors='ignore')

b'So Paulo'

In [16]:
# When encoding, errors='replace' substitutes unencodable characters
# with '?'; data is lost, but users will know something is amiss.
city.encode('cp437', errors='replace')

b'S?o Paulo'

In [19]:
# 'xmlcharrefreplace' replaces unencodable characters with an XML character reference.
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'

### Coping with UnicodeDecodeError

Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8 or UTF-16. If you assume one of these encodings while converting a binary sequence to text, you will get a UnicodeDecodeError when unexpected bytes are found.

In [20]:
octets = b'Montr\xe9al'

In [21]:
octets.decode('cp1252')

'Montréal'

In [22]:
octets.decode('iso8859-7')

'Montrιal'

In [23]:
octets.decode('koi8-r')

'MontrИal'

In [24]:
octets.decode('utf8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [25]:
octets.decode('utf8', errors='replace')

'Montr�al'

In [26]:
octets.decode('utf8', errors='ignore')

'Montral'

### SyntaxError When Loading Modules with Unexpected Encoding

UTF-8 is the default source encoding for Python 3, just as ASCII was the default for Python 2. If you load a .py module containing non-UTF-8 data and no encoding declaration, you get a message like this:

    SyntaxError: non UTF-8 code starting with ... ...

In [27]:
# coding: cp1252
print('Olá, Mundo!')

Olá, Mundo!


> Now that Python 3 source code is no longer limited to ASCII and defaults to the excellent UTF-8 encoding, the best "fix" for source code in legacy encodings like cp1252 is to convert it to UTF-8.
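
One way to do such a conversion is sketched below. The function name `convert_to_utf8` and the file name `ola.py` are hypothetical, and the sketch assumes you already know the legacy encoding of the file.

```python
import os
import tempfile


def convert_to_utf8(path, source_encoding):
    """Rewrite the text file at path: decode with source_encoding, encode as UTF-8."""
    # newline='' keeps the original line endings untouched on every platform.
    with open(path, encoding=source_encoding, newline='') as fp:
        text = fp.read()
    with open(path, 'w', encoding='utf-8', newline='') as fp:
        fp.write(text)


# Demonstration on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'ola.py')
    with open(path, 'wb') as fp:
        fp.write("print('Olá, Mundo!')\n".encode('cp1252'))
    convert_to_utf8(path, 'cp1252')
    with open(path, 'rb') as fp:
        converted = fp.read()

print(converted)
```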

### How to Discover the Encoding of a Binary Sequence

How do you find the encoding of a byte sequence? Short answer: you can't. You must be told.

Some communication protocols and file formats, like HTTP and XML, contain headers that tell us how the content is encoded. You can be sure that some byte streams are not ASCII because they contain byte values over 127, and the way UTF-8 and UTF-16 are built also limits the possible byte sequences.

However, considering that human languages also have their rules and restrictions, once you assume that a stream of bytes is human plain text it may be possible to sniff out its encoding using heuristics and statistics.
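
The simplest possible heuristic is to try a few candidate encodings in order and keep the first one that decodes without error. The sketch below is only that naive approach (the function `guess_decode` and its candidate list are assumptions for illustration); real detectors also use statistics, and a successful decode does not prove the guess is right, since many single-byte codecs accept almost any byte.

```python
def guess_decode(data, candidates=('ascii', 'utf_8', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes data without error."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


print(guess_decode(b'Montreal'))
print(guess_decode('Montréal'.encode('utf_8')))
print(guess_decode(b'Montr\xe9al'))
```

Strict codecs such as ASCII and UTF-8 go first because they reject most foreign byte sequences; permissive single-byte codecs go last.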

### BOM: A Useful Gremlin

In [30]:
u16 = 'El Niño'.encode('utf_16')

In [31]:
u16

b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

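The first two bytes, `b'\xff\xfe'`, are a BOM (byte-order mark); here they denote the little-endian byte ordering of the machine where the encoding was performed.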
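
To see what the BOM is for, the sketch below compares the generic `utf_16` codec with the explicit `utf_16le` and `utf_16be` variants, which never emit a BOM, and then lets the `utf_16` decoder pick the byte order from the BOM.

```python
import codecs

text = 'El Niño'
u16 = text.encode('utf_16')      # BOM followed by the platform's native byte order
u16le = text.encode('utf_16le')  # explicit little-endian, no BOM
u16be = text.encode('utf_16be')  # explicit big-endian, no BOM

# The first two bytes of the generic encoding are one of the two UTF-16 BOMs.
print(u16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(list(u16le))
print(list(u16be))

# The 'utf_16' decoder consumes the BOM and uses it to choose the byte order.
print(u16.decode('utf_16'))
print((codecs.BOM_UTF16_BE + u16be).decode('utf_16'))
```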

## Handling Text Files

**The best practice for handling text is the "Unicode sandwich". This means that bytes should be decoded to str as early as possible on input (e.g., when opening a file for reading). The "meat" of the sandwich is the business logic of your program, where text handling is done on str objects. You should never be encoding or decoding in the middle of other processing. On output, the str are encoded to bytes as late as possible. Most web frameworks work like that, and we rarely touch bytes when using them. In Django, your views should output Unicode str; Django itself takes care of encoding the response to bytes, using UTF-8 by default.**

![](https://processon.com/chart_image/5f004e947d9c084420496af3.png)

**Python 3 makes it easier to follow the advice of the Unicode sandwich, because the open built-in does the necessary decoding when reading and encoding when writing files in text mode.**

In [33]:
fp = open('cafe.txt', 'w', encoding='utf-8')

In [34]:
fp

<_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf-8'>

In [35]:
fp.write('café')

4

In [36]:
fp.close()

In [38]:
import os
# The file holds 5 bytes because UTF-8 encodes 'é' as two bytes.
# A TextIOWrapper object also has an encoding attribute that you can inspect
# (see fp2.encoding below): UTF-8 in this session.
os.stat('cafe.txt').st_size

5

In [39]:
fp2 = open('cafe.txt')

In [40]:
fp2

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='UTF-8'>

In [41]:
fp2.encoding

'UTF-8'

In [42]:
fp2.read()

'café'

In [43]:
fp3 = open('cafe.txt', 'rb')

In [44]:
fp3

<_io.BufferedReader name='cafe.txt'>

In [45]:
fp3.read()

b'caf\xc3\xa9'

> Do not open text files in binary mode unless you need to analyze the file contents to determine the encoding.
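
The session above happens to work because the default encoding on that machine is UTF-8. The sketch below shows what can go wrong when the encoding used for reading differs from the one used for writing; it passes `encoding` explicitly in both calls and writes to a temporary directory so it does not depend on any platform default.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cafe.txt')
    with open(path, 'w', encoding='utf_8') as fp:
        fp.write('café')
    # Wrong codec on input: each UTF-8 byte of 'é' is read as a separate character.
    with open(path, encoding='cp1252') as fp:
        garbled = fp.read()
    # Matching codec on input: the text round-trips.
    with open(path, encoding='utf_8') as fp:
        correct = fp.read()

print(garbled, correct)
```

Passing `encoding` explicitly whenever a file is opened in text mode avoids this class of bug.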

### Encoding Defaults

Several settings affect the encoding defaults for I/O in Python.

In [51]:
import sys, locale

expressions = """
    locale.getpreferredencoding()
    type(my_file) 
    my_file.encoding
    sys.stdout.isatty()
    sys.stdout.encoding
    sys.stdin.isatty()
    sys.stdin.encoding 
    sys.stderr.isatty() 
    sys.stderr.encoding 
    sys.getdefaultencoding() 
    sys.getfilesystemencoding()
"""

In [50]:
my_file = open('dummy', 'w')

for expression in expressions.split():
    value = eval(expression)
    print(expression.rjust(30) , '->', repr(value))

 locale.getpreferredencoding() -> 'UTF-8'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'UTF-8'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'UTF-8'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


Note that there are four different encodings:

* If you omit the encoding argument when opening a file, the default is given by `locale.getpreferredencoding()`.

* The encoding of `sys.stdout`/`stdin`/`stderr` is given by the `PYTHONIOENCODING` environment variable if present; otherwise it is either inherited from the console or defined by `locale.getpreferredencoding()` if the output/input is redirected to/from a file.

* `sys.getdefaultencoding()` is used internally by Python to convert binary data to/from str; this happens less often in Python 3.

* `sys.getfilesystemencoding()` is used to encode/decode filenames (not file contents). It is used when `open()` gets a str argument for the filename; if the filename is given as a bytes argument, it is passed unchanged to the OS API.


In [52]:
locale.getpreferredencoding()

'UTF-8'

> On GNU/Linux and OSX all of these encodings are set to UTF-8 by default, and have been for several years, so I/O handles all Unicode characters.

**To summarize, the most important encoding setting is the one returned by `locale.getpreferredencoding()`: it is the default for opening text files and for sys.stdout/stdin/stderr when they are redirected to files.**

In [54]:
help(locale.getpreferredencoding)

Help on function getpreferredencoding in module locale:

getpreferredencoding(do_setlocale=True)
    Return the charset that the user is likely using,
    according to the system configuration.



## Normalize Unicode for Saner Comparisons

String comparisons are complicated by the fact that Unicode has combining characters.


In [56]:
s1 = 'café'
s2 = 'cafe\u0301'
s1, s2

('café', 'café')

In [57]:
len(s1), len(s2)

(4, 5)

In [58]:
s1 == s2

False

The code point U+0301 is the COMBINING ACUTE ACCENT. Using it after "e" renders "é". In the Unicode standard, sequences like 'é' and 'e\u0301' are called "canonical equivalents" and applications are supposed to treat them as the same. But Python sees two different sequences of code points, and considers them not equal.

The solution is to use Unicode normalization, provided by the `unicodedata.normalize` function.

Normalization Form C (NFC) composes the code points to produce the shortest equivalent string, while NFD decomposes, expanding composed characters into base characters and separate combining characters.

In [59]:
from unicodedata import normalize

In [60]:
s1 = 'café'
s2 = 'cafe\u0301'
s1, s2

('café', 'café')

In [61]:
len(s1), len(s2)

(4, 5)

In [62]:
len(normalize('NFC', s1)), len(normalize('NFC', s2))

(4, 4)

In [63]:
len(normalize('NFD', s1)), len(normalize('NFD', s2))

(5, 5)

In [64]:
normalize('NFD', s1) == normalize('NFD', s2)

True
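
Building on `normalize`, the helpers below sketch the kind of utility functions mentioned in the chapter overview: normalized comparison, case-insensitive comparison with case folding, and brute-force diacritic removal. The function names are illustrative, and `shave_marks` is deliberately crude: it drops every combining mark, whatever the script.

```python
from unicodedata import combining, normalize


def nfc_equal(str1, str2):
    """Compare two strings after NFC normalization."""
    return normalize('NFC', str1) == normalize('NFC', str2)


def fold_equal(str1, str2):
    """Compare two strings after NFC normalization and case folding."""
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())


def shave_marks(txt):
    """Remove all combining marks (diacritics) from txt."""
    decomposed = normalize('NFD', txt)
    shaved = ''.join(c for c in decomposed if not combining(c))
    return normalize('NFC', shaved)


print(nfc_equal('café', 'cafe\u0301'))
print(fold_equal('Straße', 'strasse'))
print(shave_marks('São Paulo café'))
```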

## Sorting with the Unicode Collation Algorithm

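The Unicode Collation Algorithm (UCA) defines how to sort Unicode text. The `pyuca` library is a pure-Python implementation of it; its `Collator.sort_key` method can be passed as the `key` argument to `sorted`: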

In [66]:
import pyuca

In [67]:
coll = pyuca.Collator()

In [68]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']

In [69]:
sorted_fruits = sorted(fruits, key=coll.sort_key)

In [70]:
sorted_fruits

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']
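
For comparison, here is what the built-in `sorted` does with the same list when no collation key is given. It compares code points, so the accented letters (all above U+007F) sort after every unaccented ASCII letter.

```python
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']

# Plain code point comparison: 'ç' and 'á' sort after 'z'.
by_code_point = sorted(fruits)
print(by_code_point)
```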

## The Unicode Database

The Unicode standard provides an entire database that includes not only the table mapping code points to character names, but also metadata about the individual characters and how they are related. For example, the Unicode database records whether a character is printable, is a letter, is a decimal digit, or is some other numeric symbol.

In [73]:
import unicodedata
import re
re_digit = re.compile(r'\d')

sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'
for char in sample:
    print('U+%04x' % ord(char),
          char.center(6),
          're_dig' if re_digit.match(char) else '-',
          'isdig' if char.isdigit() else '-',
          'isnum' if char.isnumeric() else '-',
          format(unicodedata.numeric(char), '5.2f'),
          unicodedata.name(char),
          sep='\t')


U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00bc	  ¼   	-	-	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00b2	  ²   	-	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136b	  ፫   	-	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216b	  Ⅻ   	-	-	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	-	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	-	-	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	-	-	isnum	 6.00	CIRCLED IDEOGRAPH SIX
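
Beyond numeric values, the `unicodedata` module also exposes each character's name and general category. A small sketch follows; the helper `describe` and the sample characters are chosen here for illustration.

```python
import unicodedata


def describe(char):
    """Return (name, general category, numeric value or None) for one character."""
    return (
        unicodedata.name(char, '<unnamed>'),
        unicodedata.category(char),
        unicodedata.numeric(char, None),
    )


for char in 'Aé३¼ ':
    print(repr(char), *describe(char), sep='\t')
```

The two-letter category codes come from the Unicode standard: `Lu` is an uppercase letter, `Nd` a decimal digit, `No` another kind of number, and `Zs` a space separator.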


## Dual-Mode str and bytes APIs

The standard library has functions that accept str or bytes arguments and behave differently depending on the type. Some examples are in the re and os modules.

### str Versus bytes in Regular Expression

**If you build a regular expression with `bytes`, patterns such as `\d` and `\w` only match ASCII characters; if these patterns are given as `str`, they match Unicode digits or letters beyond ASCII.**

In [74]:
import re

In [77]:
re_numbers_str = re.compile(r'\d+')
re_words_str = re.compile(r'\w+')
re_numbers_bytes = re.compile(rb'\d+')
re_words_bytes = re.compile(rb'\w+')

In [80]:
text_str = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef" " as 1729 = 13 + 123 = 93 + 103.")
text_bytes = text_str.encode('utf_8')

In [83]:
print('Text', repr(text_str), sep='\n ')

print('Numbers')
print(' str :', re_numbers_str.findall(text_str))
print(' bytes :', re_numbers_bytes.findall(text_bytes))

print('Words')
print(' str :', re_words_str.findall(text_str))
# The str pattern r'\w+' matches the Tamil and ASCII digits.
print(' byte :', re_words_bytes.findall(text_bytes))
# The bytes pattern rb'\w+' matches only the ASCII bytes for letters
# and digits.

Text
 'Ramanujan saw ௧௭௨௯ as 1729 = 13 + 123 = 93 + 103.'
Numbers
 str : ['௧௭௨௯', '1729', '13', '123', '93', '103']
 bytes : [b'1729', b'13', b'123', b'93', b'103']
Words
 str : ['Ramanujan', 'saw', '௧௭௨௯', 'as', '1729', '13', '123', '93', '103']
 byte : [b'Ramanujan', b'saw', b'as', b'1729', b'13', b'123', b'93', b'103']


### str Versus bytes on os Functions

All `os` module functions that accept filenames or pathnames take arguments as `str` or `bytes`. If one such function is called with a `str` argument, the argument will be automatically converted using the codec named by `sys.getfilesystemencoding()`, and the OS response will be decoded with the same codec.
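
The sketch below illustrates this with `os.listdir`, which returns `str` names for a `str` path and `bytes` names for a `bytes` path. It creates a throwaway file in a temporary directory (the file name is arbitrary) and uses `os.fsencode` and `os.fsdecode` to convert between the two forms with the filesystem encoding.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'digits-of-π.txt'), 'w', encoding='utf_8') as fp:
        fp.write('3.14\n')
    as_str = os.listdir(tmp)                  # str argument -> str names
    as_bytes = os.listdir(os.fsencode(tmp))   # bytes argument -> bytes names

print(as_str)
print(as_bytes)

# fsdecode applies the filesystem encoding to turn the bytes name back into str.
print(os.fsdecode(as_bytes[0]) == as_str[0])
```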

## Chapter Summary

After a brief overview of the binary sequence data types (bytes, bytearray, and memoryview), we jumped into encoding and decoding, with a sampling of important codecs, followed by approaches to prevent or deal with the infamous UnicodeEncodeError, UnicodeDecodeError, and the SyntaxError caused by wrong encoding in Python source files.
