# Chapter 4. Unicode Text Versus Bytes

"Humans use text. Computers speak bytes."

This chapter deals with Unicode strings, binary sequences, and the encodings used to convert between them.

In this chapter, we will visit the following topics:

- Characters, code points, and byte representations

- Unique features of binary sequences: bytes, bytearray, and memoryview

- Encodings for full Unicode and legacy character sets

- Avoiding and dealing with encoding errors

- Best practices when handling text files

- The default encoding trap and standard I/O issues

- Safe Unicode text comparisons with normalization

- Utility functions for normalization, case folding, and brute-force diacritic removal

- Proper sorting of Unicode text with locale and the pyuca library

- Character metadata in the Unicode database

- Dual-mode APIs that handle str and bytes

## I. Character Issues
A string is a sequence of characters.
The items we get out of a Python 3 `str` are Unicode characters.

In [1]:
s = 'café'
print(len(s))
b = s.encode('utf8')
print(len(b))
print(b)

print(b.decode('utf8'))

4
5
b'caf\xc3\xa9'
café
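
The difference between the two lengths above comes from the encoding step: each character is one code point, but UTF-8 may need more than one byte for it. A minimal sketch that lists both views side by side (the `rows` name is only for illustration):

```python
s = 'caf\u00e9'  # 'café', written with an escape so the 'é' is a single code point

# For each character: the character, its code point, and its UTF-8 bytes.
rows = [(char, f'U+{ord(char):04X}', char.encode('utf_8')) for char in s]

for char, code_point, encoded in rows:
    print(char, code_point, encoded)
```

Only the last character needs two bytes, which is why `len(b)` is one more than `len(s)`.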


## II. Byte Essentials

The Python documentation sometimes uses the generic term “byte string” to refer to both `bytes` (immutable) and `bytearray` (mutable).

In [2]:
cafe = bytes('café', encoding='utf_8')
print(cafe)
print(cafe[:1])

b'caf\xc3\xa9'
b'c'


In [3]:
cafe_arr = bytearray(cafe)
print(cafe_arr)
print(cafe_arr[-1:])

bytearray(b'caf\xc3\xa9')
bytearray(b'\xa9')
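
Two details are easy to miss in the cells above: indexing a binary sequence yields an `int`, while slicing yields a sequence of the same type, and only `bytearray` accepts item assignment. A short sketch of both points:

```python
cafe = bytes('caf\u00e9', encoding='utf_8')

first_item = cafe[0]    # indexing yields an int in range(256)
first_slice = cafe[:1]  # slicing yields a bytes object, even for length 1

cafe_arr = bytearray(cafe)
cafe_arr[0] = ord('C')  # allowed: bytearray is mutable

try:
    cafe[0] = ord('C')  # bytes is immutable, so this raises TypeError
except TypeError as exc:
    error = str(exc)

print(first_item, first_slice, cafe_arr, error, sep='\n')
```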


In [11]:
# Example 4-3. Initializing bytes from the raw data of an array
import array
numbers = array.array('h', [-2, -1, 0, 1, 2]) # Typecode 'h' creates an array of short integers (16 bits).
octets = bytes(numbers) # octets holds a copy of the bytes that make up numbers.
octets # These are the 10 bytes that represent the 5 short integers

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'
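
The exact bytes shown above depend on the byte order of the machine that runs the cell (the output is from a little-endian one). The sketch below, which assumes nothing beyond the standard library, checks that the copy can be turned back into the same array and that a `memoryview` exposes the same bytes without copying:

```python
import array
import sys

numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)  # a copy of the raw bytes

# Rebuild an array of shorts from the raw bytes.
restored = array.array('h')
restored.frombytes(octets)

# View the same memory as unsigned bytes, without copying.
byte_view = memoryview(numbers).cast('B')

print(sys.byteorder, numbers.itemsize, len(octets))
print(restored == numbers, byte_view.tolist() == list(octets))
```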

## III. Basic Encoders/Decoders

In [12]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'Huy Niño'.encode(codec), sep='\t')

latin_1	b'Huy Ni\xf1o'
utf_8	b'Huy Ni\xc3\xb1o'
utf_16	b'\xff\xfeH\x00u\x00y\x00 \x00N\x00i\x00\xf1\x00o\x00'
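
The same text produces byte sequences of different lengths depending on the codec, and each one decodes back to the original only with the codec that produced it. A small sketch (the `sizes` dictionary is just an illustrative name):

```python
text = 'Huy Ni\u00f1o'  # 'Huy Niño'

sizes = {}
for codec in ['latin_1', 'utf_8', 'utf_16']:
    encoded = text.encode(codec)
    # Round trip: decoding with the same codec restores the text.
    assert encoded.decode(codec) == text
    sizes[codec] = len(encoded)

print(sizes)
```

The `utf_16` result is two bytes longer than twice the character count because of the byte-order mark at the start.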


## IV. Understanding Encode/Decode Problems

Although there is a generic `UnicodeError` exception, the error reported by Python is usually more specific: either a `UnicodeEncodeError` (when converting `str` to binary sequences) or a `UnicodeDecodeError` (when reading binary sequences into `str`).

### 1. Coping with UnicodeEncodeError

In [13]:
city = 'São Paulo'
city.encode('utf_8')

b'S\xc3\xa3o Paulo'

In [14]:
city.encode('utf_16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [15]:
# cp437 can’t encode the 'ã' (“a” with tilde). The default error handler—'strict'—raises UnicodeEncodeError.
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [16]:
print(city.encode('cp437', errors='ignore'))
print(city.encode('cp437', errors='replace'))
print(city.encode('cp437', errors='xmlcharrefreplace'))

b'So Paulo'
b'S?o Paulo'
b'S&#227;o Paulo'


ASCII is a common subset of all the encodings that I know about, so encoding should always work if the text is made exclusively of ASCII characters. Python 3.7 added a new boolean method `str.isascii()` to check whether your Unicode text is 100% pure ASCII. If it is, you should be able to encode it to bytes in any encoding without raising `UnicodeEncodeError`.
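
One way to use `str.isascii()` is as a cheap check before encoding to a legacy codec. The sketch below is only an illustration: it encodes pure-ASCII text strictly and falls back to `errors='replace'` otherwise. The `results` dictionary and the choice of fallback are assumptions, not a recommendation.

```python
results = {}
for text in ['Sao Paulo', 'S\u00e3o Paulo']:
    if text.isascii():
        # Pure ASCII: the strict handler cannot fail for cp437.
        results[text] = text.encode('cp437')
    else:
        # Non-ASCII: characters missing from cp437 become '?'.
        results[text] = text.encode('cp437', errors='replace')

for text, encoded in results.items():
    print(repr(text), '->', encoded)
```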

### 2. Coping with UnicodeDecodeError

Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8 or UTF-16; therefore, when you assume one of these encodings while converting a binary sequence to text, you will get a UnicodeDecodeError if unexpected bytes are found.

In [17]:
# Example 4-6. Decoding from bytes to str: success and error handling

octets = b'Montr\xe9al'
print(octets.decode('cp1252'))
print(octets.decode('koi8_r'))

Montréal
MontrИal


In [20]:
# The 'utf_8' codec detects that octets is not valid UTF-8, and raises UnicodeDecodeError.
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [19]:
octets.decode('utf_8', errors='replace')

'Montr�al'
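
Besides `'replace'`, the decoder accepts other standard error handlers, and the exception itself reports where decoding failed. A brief sketch comparing them on the same bytes:

```python
octets = b'Montr\xe9al'

replaced = octets.decode('utf_8', errors='replace')          # U+FFFD marks the bad byte
ignored = octets.decode('utf_8', errors='ignore')            # the bad byte is silently dropped
escaped = octets.decode('utf_8', errors='backslashreplace')  # the bad byte becomes a \xNN escape

try:
    octets.decode('utf_8')
except UnicodeDecodeError as exc:
    failure = (exc.start, exc.reason)  # offset of the offending byte and why it failed

print(replaced, ignored, escaped, failure, sep='\n')
```

Silently dropping bytes with `'ignore'` loses data, so it is usually the least safe of the three.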

### 3. SyntaxError When Loading Modules with Unexpected Encoding

UTF-8 is the default source encoding for Python 3, just as ASCII was the default for Python 2. If you load a .py module containing non-UTF-8 data and no encoding declaration, you get a message like this:
```
SyntaxError: Non-UTF-8 code starting with '\xe1' in file ola.py on line
  1, but no encoding declared; see https://python.org/dev/peps/pep-0263/
  for details
```

Because `UTF-8` is widely deployed in `GNU/Linux` and `macOS` systems, a likely scenario is opening a .py file created on `Windows` with `cp1252`.

To fix this problem, add a magic coding comment at the top of the file:

In [23]:
# coding: cp1252

print('Olá, Mundo!')

Olá, Mundo!


Suppose you have a text file, be it source code or poetry, but you don’t know its encoding. How do you detect the actual encoding? Answers in the next section.

### 4. How to Discover the Encoding of a Byte Sequence

The `chardetect` command-line utility, which ships with the `chardet` package, guesses the encoding of a file and reports a confidence score:

```
$ chardetect 04-text-byte.asciidoc
04-text-byte.asciidoc: utf-8 with confidence 0.99
```
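
When no detection library is available, a much cruder approach is to try a short list of candidate encodings and keep the first one that decodes without error. The sketch below is not how `chardet` works; `first_decodable` and its default candidate list are hypothetical names chosen for this example, and a successful decode does not prove the guess is right.

```python
def first_decodable(data, candidates=('ascii', 'utf_8', 'cp1252')):
    """Return the first candidate codec that decodes data without error, or None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate rejects the bytes; try the next one
        return name
    return None


print(first_decodable(b'plain text'))
print(first_decodable('caf\u00e9'.encode('utf_8')))
print(first_decodable(b'Montr\xe9al'))
```

Order matters: strict codecs such as ASCII and UTF-8 go first, because single-byte codecs like `cp1252` accept almost any byte sequence and would mask them.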

### 5. BOM: A Useful Gremlin

You may have noticed a couple of extra bytes at the beginning of a UTF-16 encoded sequence:

In [24]:
u16 = 'El Niño'.encode('utf_16')
u16

b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

The bytes are `b'\xff\xfe'`. That is a BOM (byte-order mark) denoting the little-endian byte ordering of the Intel CPU where the encoding was performed.

In [27]:
# On a little-endian machine, the least significant byte of each code point comes first:
# the letter 'E', code point U+0045 (decimal 69), is encoded at offsets 2 and 3 as 69 and 0.
# The BOM is b'\xff\xfe' (decimal 255, 254).
list(u16)

[255, 254, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

On a big-endian CPU, the encoding would be reversed; 'E' would be encoded as 0 and 69.
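
The two byte orders can be compared on any machine by converting the code point to bytes explicitly. A minimal sketch using `int.to_bytes`, valid for characters such as 'E' that fit in a single 16-bit unit:

```python
import sys

code_point = ord('E')  # 69, i.e. U+0045

little = code_point.to_bytes(2, byteorder='little')  # least significant byte first
big = code_point.to_bytes(2, byteorder='big')        # most significant byte first

print(sys.byteorder)             # the native byte order of this machine
print(list(little), list(big))
```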

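UTF-16 has two variants with an explicit byte order:
- UTF-16LE, explicitly little-endian (as an illustration of this byte order, the 32-bit integer 123456789 is stored as 15 cd 5b 07)
- UTF-16BE, explicitly big-endian (the same integer is stored as 07 5b cd 15)

For background on byte order, see https://viblo.asia/p/little-endian-vs-big-endian-E375z0pWZGW

If you use either variant, a BOM is not generated: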

In [28]:
u16le = 'El Niño'.encode('utf_16le')
list(u16le)

[69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

In [29]:
u16be = 'El Niño'.encode('utf_16be')
list(u16be)

[0, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111]
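
The BOM also matters when decoding. The sketch below builds a little-endian sequence with an explicit BOM (so the result does not depend on the machine) and shows that the generic `utf_16` codec consumes the BOM, while `utf_16le` keeps it as a U+FEFF character. The `utf_8_sig` codec behaves similarly for UTF-8 files that start with a BOM.

```python
import codecs

text = 'El Ni\u00f1o'  # 'El Niño'

# Little-endian UTF-16 with an explicit BOM in front.
u16 = codecs.BOM_UTF16_LE + text.encode('utf_16le')

bom_consumed = u16.decode('utf_16')  # the codec reads the BOM and drops it
bom_kept = u16.decode('utf_16le')    # the BOM survives as '\ufeff'

# UTF-8 with a BOM, as written by some Windows applications.
u8sig = text.encode('utf_8_sig')
sig_stripped = u8sig.decode('utf_8_sig')  # BOM removed
sig_kept = u8sig.decode('utf_8')          # BOM kept as '\ufeff'

print(repr(bom_consumed), repr(bom_kept), repr(sig_stripped), repr(sig_kept), sep='\n')
```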

## V. Handling Text Files in Python 3

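The best practice for handling text I/O is the “Unicode sandwich” (Figure 4-2). This means that bytes should be decoded to str as early as possible on input (e.g., when opening a file for reading). The program then works on str only, and text is encoded back to bytes as late as possible on output.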

![img.png](4-8.png)

In [31]:
# Example 4-8. A platform encoding issue (if you try this on your machine, you may or may not see the problem)

open('cafe.txt', 'w', encoding='utf_8').write('café')

open('cafe.txt').read()

# On recent GNU/Linux or macOS this works because the default encoding is UTF-8.
# On Windows 10, where the default encoding is typically cp1252, the read returns 'cafÃ©'.

'café'
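
The platform dependency goes away when the encoding is passed explicitly on both write and read. The sketch below works in a temporary directory (an assumption made so it leaves no files behind) and also reproduces the garbled result on purpose by reading the UTF-8 file as `cp1252`:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'cafe.txt'

    # Write with an explicit encoding; write() returns the number of characters.
    with open(path, 'w', encoding='utf_8') as fp:
        chars_written = fp.write('caf\u00e9')

    # Read with the same encoding: the text round-trips on any platform.
    with open(path, encoding='utf_8') as fp:
        good = fp.read()

    # Read with the wrong encoding: each UTF-8 byte becomes its own character.
    with open(path, encoding='cp1252') as fp:
        mojibake = fp.read()

    raw = path.read_bytes()  # the 5 bytes actually stored on disk

print(chars_written, repr(good), repr(mojibake), raw)
```

Using `with` blocks also closes each file promptly, which the one-line `open(...).write(...)` form leaves to the garbage collector.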

### Beware of Encoding Defaults


In [32]:
# On Mac/Linux
import locale
import sys

expressions = """
        locale.getpreferredencoding()
        type(my_file)
        my_file.encoding
        sys.stdout.isatty()
        sys.stdout.encoding
        sys.stdin.isatty()
        sys.stdin.encoding
        sys.stderr.isatty()
        sys.stderr.encoding
        sys.getdefaultencoding()
        sys.getfilesystemencoding()
    """

my_file = open('dummy', 'w')

for expression in expressions.split():
    value = eval(expression)
    print(f'{expression:>30} -> {value!r}')

 locale.getpreferredencoding() -> 'UTF-8'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'UTF-8'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'utf-8'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


On Windows 10:

```
locale.getpreferredencoding() -> 'cp1252'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'cp1252'
           sys.stdout.isatty() -> True
           sys.stdout.encoding -> 'utf-8'
            sys.stdin.isatty() -> True
            sys.stdin.encoding -> 'utf-8'
           sys.stderr.isatty() -> True
           sys.stderr.encoding -> 'utf-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'
```

## Normalizing Unicode for Reliable Comparisons

This section is pending; I skipped it for lack of interest.