Python 3 introduced a sharp distinction between strings of human text and sequences of raw bytes. 

### Character Issues
The concept of “string” is simple enough: a string is a sequence of characters. The problem lies in the definition of “character.” 
In 2015, the best definition of “character” we have is a Unicode character. 

The identity of a character—its code point—is a number from 0 to 1,114,111 (base 10), shown in the Unicode standard as 4 to 6 hexadecimal digits with a “U+” prefix.

The actual bytes that represent a character depend on the encoding in use. An encoding is an algorithm that converts code points to byte sequences and vice versa.
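As a minimal sketch of the difference between a code point and its encoded bytes (the variable names are illustrative and not part of the original notes):

```python
# The code point identifies the character; the bytes depend on the encoding.
char = 'é'
code_point = ord(char)                 # integer identity of the character
label = f'U+{code_point:04X}'          # conventional "U+" notation

utf8_bytes = char.encode('utf_8')      # two bytes in UTF-8
latin1_bytes = char.encode('latin_1')  # one byte in Latin-1

print(label, utf8_bytes, latin1_bytes)
# prints: U+00E9 b'\xc3\xa9' b'\xe9'
```

The same code point yields different byte sequences, which is why bytes cannot be interpreted without knowing the encoding.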


The string 'café' has four Unicode characters.

In [4]:
s = 'café'
len(s)

4

Encode the str to bytes using UTF-8. The result is displayed as a bytes literal, which starts with `b`.

In [5]:
b = s.encode('utf8')
b

b'caf\xc3\xa9'

In [6]:
len(b)

5

The bytes object `b` has five bytes (the code point for “é” is encoded as two bytes in UTF-8).
Decoding the bytes back to str using UTF-8:

In [8]:
b.decode('utf8')

'café'

### Byte Essentials

The new binary sequence types are unlike the Python 2 str in many regards. The first thing to know is that there are two basic built-in types for binary sequences: the immutable bytes type introduced in Python 3 and the mutable bytearray, added in Python 2.6.

Each item in bytes or bytearray is an integer from 0 to 255, and not a one-character string like in the Python 2 str. However, a slice of a binary sequence always produces a binary sequence of the same type, including slices of length 1.


In [9]:
cafe = bytes('café', encoding='utf_8')
cafe

b'caf\xc3\xa9'

Build bytes from a str by specifying an encoding.

In [10]:
cafe[0]

99

Each item is an integer in `range(256)`.

In [11]:
cafe[:1]

b'c'

Slices of bytes are also bytes, even slices of a single byte.

In [12]:
cafe_arr = bytearray(cafe)
cafe_arr

bytearray(b'caf\xc3\xa9')

There is no literal syntax for bytearray: they are shown as `bytearray()` with a bytes literal as argument.

In [13]:
cafe_arr[-1:]

bytearray(b'\xa9')

A slice of a bytearray is also a bytearray.

Although binary sequences are really sequences of integers, their literal notation reflects the fact that ASCII text is often embedded in them. Therefore, three different displays are used, depending on each byte value:
1. For bytes in the printable ASCII range—from space to ~—the ASCII character itself is used.
2. For bytes corresponding to tab, newline, carriage return, and `\`, the escape sequences `\t`, `\n`, `\r`, and `\\` are used.
3. For every other byte value, a hexadecimal escape sequence is used (e.g., \x00 is the null byte).
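A small sketch of the three display rules above, using a hypothetical `sample` value that holds bytes from each category:

```python
# Printable ASCII (A, ~), escaped bytes (tab, newline, CR, backslash),
# and two bytes that fall back to hexadecimal escapes (0x00, 0xFF).
sample = bytes([0x41, 0x7E, 0x09, 0x0A, 0x0D, 0x5C, 0x00, 0xFF])
shown = repr(sample)
print(shown)
# prints: b'A~\t\n\r\\\x00\xff'
```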

Binary sequences have a class method that str doesn’t have, called fromhex, which builds a binary sequence by parsing pairs of hex digits optionally separated by spaces:

In [14]:
bytes.fromhex('31 4B CE A9')

b'1K\xce\xa9'
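The reverse direction is available through the `hex()` method of binary sequences (Python 3.5 and later). A short round-trip sketch, with illustrative variable names:

```python
raw = bytes.fromhex('31 4B CE A9')
hex_text = raw.hex()                   # lowercase hex digits, no separators
round_trip = bytes.fromhex(hex_text)   # parsing the digits gives the same bytes
print(hex_text, round_trip == raw)
# prints: 314bcea9 True
```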

The other ways of building bytes or bytearray instances are calling their constructors with:
1. A str and an encoding keyword argument.
2. An iterable providing items with values from 0 to 255.
3. An object that implements the buffer protocol (e.g., `bytes`, `bytearray`, `memoryview`, `array.array`); this copies the bytes from the source object to the newly created binary sequence. See the example below:

In [16]:
import array
numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)
octets

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'

Typecode 'h' creates an array of short integers (16 bits). `octets` holds a copy of the bytes that make up `numbers`. The output shows the 10 bytes that represent the 5 short integers.
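The three constructor forms can be compared side by side. This is only a sketch: `from_text`, `from_ints`, and `first` are illustrative names. Note that the byte order of the array data depends on the machine; the output shown above comes from a little-endian one.

```python
import array
import sys

from_text = bytes('café', encoding='utf_8')    # 1. a str plus an encoding
from_ints = bytes([99, 97, 102, 195, 169])     # 2. an iterable of ints in range(256)

numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)                        # 3. buffer protocol: the data is copied
numbers[0] = 7                                 # changing the source afterwards...

# ...does not affect the copy: the first item decoded from octets is still -2.
first = int.from_bytes(octets[:numbers.itemsize], sys.byteorder, signed=True)

print(from_text == from_ints, len(octets), first)
```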

### Structs and Memory Views
The struct module provides functions to parse packed bytes into a tuple of fields of different types, and to perform the opposite conversion, from a tuple into packed bytes.
`struct` is used with `bytes`, `bytearray`, and `memoryview` objects.

The `memoryview` class does not let you create or store byte sequences, but provides shared memory access to slices of data from other binary sequences, packed arrays, and buffers such as Python Imaging Library (PIL) images, without copying the bytes.

In [17]:
import struct
fmt = '<3s3sHH'

**struct format**: < little-endian; 3s3s two sequences of 3 bytes; HH two 16-bit integers.

In [18]:
with open('assets/giphy.gif', 'rb') as fp:
    img = memoryview(fp.read())

Created memoryview from file contents in memory.

In [19]:
header = img[:10]

Created another memoryview by slicing the first one; no bytes are copied.

In [20]:
bytes(header)

b'GIF89a,\x01\xe1\x00'

Convert the header (the first 10 bytes of the original memoryview) to bytes, for display only.

In [21]:
struct.unpack(fmt, header)

(b'GIF', b'89a', 300, 225)

Unpacked the memoryview into a tuple of fields: type, version, width, and height.

In [22]:
del header 
del img

Deletes references to release the memory associated with the memoryview instances.
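The same steps can be tried without an image file. The sketch below builds a hypothetical 10-byte header in memory with `struct.pack`, then shows that a memoryview slice shares memory with the underlying buffer instead of copying it:

```python
import struct

fmt = '<3s3sHH'   # same layout as above: little-endian, 3 bytes, 3 bytes, two uint16

# Hypothetical header built in memory, so no file is needed.
buf = bytearray(struct.pack(fmt, b'GIF', b'89a', 300, 225))

view = memoryview(buf)
header = view[:struct.calcsize(fmt)]    # slicing a memoryview does not copy

fields = struct.unpack(fmt, header)     # (b'GIF', b'89a', 300, 225)

# Writing through the view changes the underlying bytearray.
header[3:6] = b'87a'
version_after = bytes(buf[3:6])         # b'87a'

# Explicitly release the views instead of waiting for garbage collection.
header.release()
view.release()
```

Using a `bytearray` as the buffer is a choice made for this sketch: a memoryview over immutable `bytes`, as in the file example, is read-only.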

### Basic Encoders/Decoders

The Python distribution bundles more than 100 codecs (encoder/decoder) for text to byte conversion and vice versa. Each codec has a name, like '`utf_8`', and often aliases, such as '`utf8`', '`utf-8`', and '`U8`', which you can use as the encoding argument in functions like `open()`, `str.encode()`, `bytes.decode()`, and so on.
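One way to check that the aliases name the same codec is `codecs.lookup()`, which returns a `CodecInfo` object with a normalized `name`. A minimal sketch:

```python
import codecs

aliases = ['utf_8', 'utf8', 'utf-8', 'U8']

# Every alias should resolve to the same codec name...
canonical = {codecs.lookup(name).name for name in aliases}

# ...and therefore produce the same bytes.
encoded = {'café'.encode(name) for name in aliases}

print(canonical, encoded)
```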

In [23]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Nino'.encode(codec), sep='\t')

latin_1	b'El Nino'
utf_8	b'El Nino'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00n\x00o\x00'


`latin1` a.k.a. `iso8859_1`
Important because it is the basis for other encodings, such as cp1252 and Unicode itself (note how the latin1 byte values appear in the cp1252 bytes and even in the code points).

`cp1252`
A latin1 superset by Microsoft, adding useful symbols like curly quotes and the € (euro); some Windows apps call it “ANSI,” but it was never a real ANSI standard.

`cp437`
The original character set of the IBM PC, with box drawing characters. Incompatible with latin1, which appeared later.

`gb2312`
Legacy standard to encode the simplified Chinese ideographs used in mainland
China; one of several widely deployed multibyte encodings for Asian languages.

`utf-8`
The most common 8-bit encoding on the Web, by far; backward-compatible with
ASCII (pure ASCII text is valid UTF-8).

`utf-16le`
One form of the UTF-16 16-bit encoding scheme; all UTF-16 encodings support
code points beyond U+FFFF through escape sequences called “surrogate pairs.”
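A few of the relationships described above can be checked directly. The sample characters are chosen only for illustration:

```python
char = 'é'
latin1 = char.encode('latin1')     # b'\xe9': the byte value equals the code point U+00E9
cp1252 = char.encode('cp1252')     # b'\xe9': cp1252 agrees with latin1 here
cp437 = char.encode('cp437')       # a different byte: cp437 is incompatible with latin1
euro = '€'.encode('cp1252')        # one of the symbols cp1252 adds; latin1 cannot encode it

# Pure ASCII text is valid UTF-8.
same_as_ascii = 'abc'.encode('utf-8') == 'abc'.encode('ascii')

# U+1F600 lies beyond U+FFFF, so UTF-16 needs a surrogate pair (two 16-bit units).
emoji = '\U0001F600'
utf16 = emoji.encode('utf-16le')
print(len(utf16))
# prints: 4
```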

### Understanding Encode/Decode Problems
Although there is a generic UnicodeError exception, the error reported is almost always more specific: either a `UnicodeEncodeError` (when converting str to binary sequences) or a `UnicodeDecodeError` (when reading binary sequences into str). 
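Both specific exceptions are subclasses of `UnicodeError`, so a handler can be as specific or as generic as needed. A sketch with a hypothetical `classify` helper:

```python
def classify(action):
    """Run action() and report which Unicode error, if any, it raised."""
    try:
        action()
    except UnicodeEncodeError:
        return 'encode error'
    except UnicodeDecodeError:
        return 'decode error'
    except UnicodeError:
        return 'other Unicode error'
    return 'ok'

encode_result = classify(lambda: 'é'.encode('ascii'))     # str -> bytes fails
decode_result = classify(lambda: b'\xe9'.decode('ascii'))  # bytes -> str fails
print(encode_result, decode_result, sep=', ')
# prints: encode error, decode error
```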

#### Coping with UnicodeEncodeError
Most non-UTF codecs handle only a small subset of the Unicode characters. When converting text to bytes, if a character is not defined in the target encoding, `UnicodeEncodeError` is raised unless special handling is provided through the `errors` argument of the encoding method or function.

In [24]:
city = 'São Paulo'
city.encode('utf_8')

b'S\xc3\xa3o Paulo'

The `utf_?` encodings (`utf_8`, `utf_16`, and so on) can handle any str.

In [25]:
city.encode('utf_16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [26]:
city.encode('iso8859_1')

b'S\xe3o Paulo'

'`iso8859_1`' also works for the 'São Paulo' str.

In [27]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

'`cp437`' can't encode 'ã' (“a” with tilde). The default error handler 'strict' raises UnicodeEncodeError.

In [28]:
city.encode('cp437', errors='ignore')

b'So Paulo'

The `errors='ignore'` handler silently skips characters that cannot be encoded;
this is usually a very bad idea.

In [30]:
city.encode('cp437', errors='replace')

b'S?o Paulo'

When encoding, `errors='replace'` substitutes unencodable characters with '`?`';
data is lost, but users will know something is amiss.

In [31]:
city.encode('cp437', errors='xmlcharrefreplace') 

b'S&#227;o Paulo'

'`xmlcharrefreplace`' replaces unencodable characters with an XML character reference.
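The handlers can be compared in one pass. In the sketch below, `'backslashreplace'` is an extra standard handler not covered above, and the attributes `start`, `end`, and `object` of the exception are used to locate the offending character:

```python
city = 'São Paulo'

handlers = ['ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace']
results = {h: city.encode('cp437', errors=h) for h in handlers}

# With the default 'strict' handler, the exception says where the problem is.
try:
    city.encode('cp437')
except UnicodeEncodeError as exc:
    bad_char = exc.object[exc.start:exc.end]   # the character that failed
    bad_pos = exc.start                        # its index in the str

for name, encoded in results.items():
    print(name, encoded, sep='\t')
```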

#### Coping with UnicodeDecodeError
Not every byte holds a valid ASCII character, and not every byte sequence is valid UTF-8 or UTF-16; therefore, when you assume one of these encodings while converting a binary sequence to text, you will get a UnicodeDecodeError if unexpected bytes are found.

On the other hand, many legacy 8-bit encodings like '`iso8859_1`' and '`koi8_r`' are able to decode any stream of bytes, including random noise, without generating errors, and '`cp1252`' accepts all but a handful of byte values that Python leaves undefined. Therefore, if your program assumes the wrong 8-bit encoding, it will silently decode garbage.

In [32]:
octets = b'Montr\xe9al'
octets.decode('cp1252')

'Montréal'

These bytes are the characters for “Montréal” encoded as `latin1`; '\xe9' is the byte for “é”.
Decoding with '`cp1252`' (Windows 1252) works because it is a proper superset
of latin1.

In [33]:
octets.decode('iso8859_7')

'Montrιal'

ISO-8859-7 is intended for Greek, so the '\xe9' byte is misinterpreted as “ι”, and no error is raised.

In [34]:
octets.decode('koi8_r')

'MontrИal'

KOI8-R is for Russian. Now '\xe9' stands for the Cyrillic letter “И”.

In [35]:
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

The '`utf_8`' codec detects that octets is not valid UTF-8, and raises `UnicodeDecodeError`.

In [36]:
octets.decode('utf_8', errors='replace')

'Montr�al'

Using '`replace`' error handling, the `\xe9` is replaced by “�” (code point
U+FFFD), the official Unicode REPLACEMENT CHARACTER intended to represent
unknown characters.
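
When the encoding of incoming bytes is uncertain, one common (and imperfect) approach is to try a strict codec first and fall back to a permissive one. The `decode_with_fallback` helper below is hypothetical, and the default order of encodings is an assumption for this sketch. Because an 8-bit fallback accepts almost any input, the result is a guess that may be wrong.

```python
def decode_with_fallback(data, encodings=('utf_8', 'cp1252')):
    """Hypothetical helper: try each encoding in order.

    Returns a (text, encoding_used) tuple.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and mark damage with U+FFFD.
    return data.decode(encodings[0], errors='replace'), encodings[0]

text, used = decode_with_fallback(b'Montr\xe9al')
print(text, used)
# prints: Montréal cp1252
```

UTF-8 is tried first because it rejects most byte sequences that were not produced by a UTF-8 encoder, which makes a successful decode a reasonably strong signal.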