
Python 3 introduced a sharp distinction between strings of human text and sequences of raw bytes.

Main topics:
 - Unicode strings
 - binary sequences
 - encodings used to convert between them

The Unicode standard explicitly separates the identity of characters from specific byte representations:
 * The identity of a character (its code point) is a number from 0 to 1,114,111, shown in the Unicode standard as 4 to 6 hex digits with a "U+" prefix.
 * The actual bytes that represent a character depend on the *encoding* in use.
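
As a small illustration of this separation (the variable names are illustrative, not from the original notes), the same code point can be printed in `U+` notation and then encoded to different bytes depending on the codec:

```python
# One character, one code point, several possible byte representations.
char = 'é'
code_point = ord(char)            # the character's identity: an integer
label = f'U+{code_point:04X}'     # conventional notation, at least 4 hex digits

print(label)
for codec in ('latin_1', 'utf_8', 'utf_16le'):
    # The bytes depend on the encoding, not on the character's identity.
    print(codec, char.encode(codec), sep='\t')
```

The code point stays `U+00E9` throughout; only the byte representation changes from one codec to the next.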

In [None]:
s = 'café' # str café has 4 unicode characters
len(s)

4

In [None]:
b = s.encode('utf8') # Encode str to bytes using UTF-8 encoding
b

b'caf\xc3\xa9'

In [None]:
len(b)

5

In [None]:
b.decode('utf8')

'café'

## Byte Essentials
 1. There are 2 basic built-in types for binary sequences: the immutable `bytes` type and the mutable `bytearray`.
 2. Each item in `bytes` or `bytearray` is an integer from 0 to 255, not a one-character string as in the Python 2 `str`.
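
A minimal sketch of the mutability difference, assuming nothing beyond the built-in types (the names `data` and `buf` are illustrative):

```python
data = bytes([99, 97, 102])   # immutable: b'caf'
buf = bytearray(data)         # mutable copy of the same bytes

buf[0] = 67                   # item assignment takes an int in range(256)
buf.extend(b'e')              # bytearray can also grow in place

try:
    data[0] = 67              # bytes does not support item assignment
except TypeError as exc:
    error = str(exc)

print(data, buf, error, sep='\n')
```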

In [None]:
cafe = bytes('café', encoding='utf_8')
cafe

b'caf\xc3\xa9'

In [None]:
cafe[0] # each item is an integer in range(256)

99

In [None]:
cafe[:1] # slices of bytes are also bytes

b'c'

In [None]:
cafe_arr = bytearray(cafe)
cafe_arr # no literal syntax for bytearray

bytearray(b'caf\xc3\xa9')

In [None]:
cafe_arr[-1:] # slices of bytearray are also bytearray

bytearray(b'\xa9')

Although binary sequences are really sequences of integers, their literal notation reflects the fact that ASCII text is often embedded in them.
 - For bytes with decimal codes 32 to 126 (from space to `~`), the ASCII character itself is used
 - For bytes corresponding to tab, newline, carriage return, and `\` the escape sequences `\t`, `\n`, `\r`, `\\` are used.
 - If both string delimiters `'` and `"` appear in the byte sequence, the whole sequence is delimited by `'`, and any `'` inside is escaped as `\'`.
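
The display rules above can be checked with `repr`. This is only a sketch with made-up sample values. The third sample also shows a rule not listed above: every other byte value is displayed as a `\x` hexadecimal escape.

```python
samples = [
    bytes([72, 105, 126]),     # printable ASCII is shown as-is
    bytes([9, 10, 13, 92]),    # tab, newline, carriage return, backslash
    bytes([0, 233, 255]),      # any other byte uses a \x hex escape
    b'it\'s "quoted"',         # both quote characters are present
]
for sample in samples:
    print(repr(sample))
```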

In [None]:
test = "Hi there, \'test for encoding''"
bytes(test, 'utf_8')

b"Hi there, 'test for encoding''"

### Note
Both `bytes` and `bytearray` support every `str` method except those that do formatting and those that depend on Unicode data. In addition, the regular expression functions in the `re` module also work on binary sequences.
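
As a rough illustration of that note (the sample data is invented), familiar `str`-style methods and `re` functions work on `bytes`, provided that the arguments and patterns are binary sequences too:

```python
import re

raw = b'Name: caf\xc3\xa9; Size: 42'

print(raw.upper())               # only ASCII letters are changed
print(raw.startswith(b'Name'))   # arguments must be bytes, not str
print(raw.split(b'; '))

# Regular expressions work too, as long as the pattern is also bytes.
digits = re.findall(rb'\d+', raw)
print(digits)
```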

Binary sequences have a class method that `str` doesn't have, called `fromhex`, which builds a binary sequence by parsing pairs of hex digits optionally separated by spaces.

Other ways of building `bytes` or `bytearray` are to call the constructor with:
 1. An iterable providing items with values from 0 to 255.
 2. An object that implements the buffer protocol (e.g. `bytes`, `bytearray`, `memoryview`, `array.array`); the bytes are copied from the source object to the newly created binary sequence.
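
A short sketch of both constructor forms (the variable names are illustrative). Mutating the source afterwards shows that the buffer-protocol form really copies the bytes:

```python
from_iterable = bytes(range(65, 70))      # items must be ints in range(256)

source = bytearray(b'abc')
from_buffer = bytes(memoryview(source))   # copied through the buffer protocol

source[0] = ord('z')                      # changing the source afterwards...
print(from_iterable, from_buffer, source) # ...does not affect the copy
```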

In [None]:
bytes.fromhex('31 4B CE A9')

b'1K\xce\xa9'

In [None]:
import array
numbers = array.array('h', [-2, -1, 0, 1, 2]) # Typecode 'h' creates an array of short integers (16 bits = 2 bytes)
octets = bytes(numbers)
octets

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'
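
The exact bytes shown above depend on the machine's byte order (little-endian in this output). As a sketch, the same result can be rebuilt one item at a time with `int.to_bytes`, using `sys.byteorder` and the array's `itemsize`:

```python
import array
import sys

numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)

# Rebuild the same bytes item by item, using the machine's native byte order.
manual = b''.join(
    n.to_bytes(numbers.itemsize, sys.byteorder, signed=True) for n in numbers
)
print(octets == manual)

# With an explicit little-endian order, -2 as a signed 16-bit integer:
print((-2).to_bytes(2, 'little', signed=True))
```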

## Basic Encoders / Decoders
The Python distribution bundles more than 100 codecs for text-to-byte conversion and vice versa.

In [None]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
  print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'


In [None]:
print("气".encode('latin1'))

UnicodeEncodeError: 'latin-1' codec can't encode character '\u6c14' in position 0: ordinal not in range(256)

In [1]:
city = 'São Paulo'
city.encode('utf-8')

b'S\xc3\xa3o Paulo'

In [2]:
city.encode('utf-16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [4]:
city.encode('iso8859_1')

b'S\xe3o Paulo'

In [5]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [6]:
city.encode('cp437', errors='ignore')

b'So Paulo'

In [7]:
city.encode('cp437', errors='replace')

b'S?o Paulo'

The `xmlcharrefreplace` error handler replaces unencodable characters with an XML character reference. If you can't use UTF and you can't afford to lose data, this is the only option.

In [8]:
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'
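
Because the character reference keeps the code point, the original text can be recovered later. A minimal sketch, assuming the receiver knows the bytes are `cp437` and that the text contains no literal `&#...;` sequences of its own (`html.unescape` would convert those too):

```python
import html

city = 'São Paulo'
encoded = city.encode('cp437', errors='xmlcharrefreplace')

# Decode the bytes, then turn character references back into characters.
restored = html.unescape(encoded.decode('cp437'))
print(encoded, restored)
```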

In [9]:
city.isascii()

False

In [11]:
"hello!@#$%^*".isascii()

True

The following examples show how using the wrong codec may produce gremlins (garbled characters) or a `UnicodeDecodeError`:

In [12]:
octets = b'Montr\xe9al' # encoded as latin1
octets.decode('cp1252') # works as intended because cp1252 is a superset of latin1

'Montréal'

In [13]:
octets.decode('iso8859_7') # intended for Greek so it was misinterpreted

'Montrιal'

In [15]:
octets.decode('koi8_r') # intended for Russian so it was misinterpreted

'MontrИal'

In [16]:
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [17]:
octets.decode('utf_8', errors='replace')

'Montr�al'
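
One common way to cope with such errors is to try a few candidate codecs in order. The helper below is a hypothetical sketch (the name `decode_with_fallback` and the candidate list are assumptions, not part of the original notes), and it only guesses: a wrong codec that happens to decode without error still produces gremlins.

```python
def decode_with_fallback(data, encodings=('utf_8', 'cp1252')):
    """Return (text, codec_used), trying each candidate codec in order."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return data.decode('utf_8', errors='replace'), 'utf_8 (with replacement)'

print(decode_with_fallback(b'Montr\xe9al'))
print(decode_with_fallback('Montréal'.encode('utf_8')))
```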

 - UTF-8 is the default source encoding for Python 3
 - ASCII is the default source encoding for Python 2

In [18]:
# coding: cp1252

print('Olá Mundo')

Olá Mundo


## Q. How do you find the encoding of a byte sequence?
In short, you can't. You must be told.
For example, protocols and formats such as HTTP and XML contain headers that explicitly state how the content is encoded.
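
As a sketch of "being told", the charset can be read from an HTTP-style `Content-Type` header value with the standard library. The helper name `charset_from_content_type`, the sample header, and the UTF-8 default are assumptions made for illustration:

```python
from email.message import Message

def charset_from_content_type(header_value, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset returns the charset in lower case, or the default.
    return msg.get_content_charset(default)

body = b'Montr\xe9al'
charset = charset_from_content_type('text/html; charset=ISO-8859-1')
print(charset, body.decode(charset))
```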

In [20]:
' '.encode('utf-16')

b'\xff\xfe \x00'

In [21]:
u16 = 'El Niño'.encode('utf-16')
u16


b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

The first two bytes are `b'\xff\xfe'`. This is a BOM (byte-order mark), here denoting the little-endian byte ordering of the Intel CPU where the encoding was performed.

In [22]:
list(u16)

[255, 254, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

In [23]:
u16le = 'El Niño'.encode('utf-16le') # explicitly little endian
list(u16le) # no BOM is generated when the byte order is explicit

[69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

In [24]:
u16be = 'El Niño'.encode('utf-16be') # explicitly big endian
list(u16be)

[0, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111]
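
On the decoding side, a sketch of how the BOM is handled, using the `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` constants. The BOM is added by hand here so that the result does not depend on the machine's byte order:

```python
import codecs

text = 'El Niño'
le_with_bom = codecs.BOM_UTF16_LE + text.encode('utf-16le')
be_with_bom = codecs.BOM_UTF16_BE + text.encode('utf-16be')

# The generic 'utf-16' codec reads the BOM, picks the byte order, and drops it.
print(le_with_bom.decode('utf-16'), be_with_bom.decode('utf-16'))

# The explicit variants do not filter the BOM: it survives as U+FEFF.
print(repr(le_with_bom.decode('utf-16le')))
```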