# Book Meeting - Fluent Python
## Chapter 4 - Text versus Bytes
### "Humans use text. Computers speak bytes", Esther Nam and Travis Fischer

# What is a string?

## Sequence of characters!?

# But... What is a character?

## In Python 3, we can define it as a single Unicode character

# Identity of a character
## Code point: number from 0 to 1,114,111 (base 10)
### Shown in the Unicode standard as 4 to 6 hex digits with a "U+" prefix. Ex: U+10FFFF
### About 12% of code points are already assigned in Unicode 12.1 (the standard used in Python 3.8)
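
To make the idea of a code point concrete, here is a minimal sketch using the built-ins `ord()` and `chr()`. The helper `code_point_label` is a hypothetical name introduced only for this illustration.

```python
def code_point_label(char: str) -> str:
    """Return the 'U+XXXX' label of one character (hypothetical helper)."""
    # ord() gives the code point as an int; format it as at least 4 hex digits.
    return f"U+{ord(char):04X}"


for char in "Añ€💩":
    print(char, ord(char), code_point_label(char))

# chr() goes the other way: from a code point back to a character.
last = chr(0x10FFFF)  # the largest valid code point, 1,114,111 in base 10
print(ord(last), code_point_label(last))
```

Passing anything above `0x10FFFF` to `chr()` raises `ValueError`, which matches the upper bound stated above.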

# Encoding/Decoding

## The actual bytes that represent a character depend on the encoding in use
### Encoding: code points -> byte sequences
### Decoding: byte sequences -> code points
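
A small round trip may help here. This is only a sketch, with UTF-8 chosen as an example codec: `str.encode` turns code points into bytes and `bytes.decode` turns them back.

```python
text = "café"

# Encoding: str (code points) -> bytes
data = text.encode("utf_8")
print(type(data).__name__, data, len(data))  # 4 characters become 5 bytes

# Decoding: bytes -> str (code points)
restored = data.decode("utf_8")
print(type(restored).__name__, restored == text, len(restored))
```

The length changes because "é" takes two bytes in UTF-8, while the other three characters take one byte each.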

# Encodings may produce different byte sequences for the same character

In [83]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
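
As a follow-up sketch on the output above: the three byte strings differ in length, the `utf_16` result starts with a byte order mark (BOM), and all three decode back to the same text. `codecs.BOM_UTF16` is the standard library constant for the BOM in the machine's native byte order.

```python
import codecs

text = "El Niño"
encoded = {codec: text.encode(codec) for codec in ("latin_1", "utf_8", "utf_16")}

for codec, data in encoded.items():
    # Same 7 characters, different number of bytes per codec.
    print(codec, len(data), sep="\t")

# The generic 'utf_16' codec prepends a BOM in native byte order.
print(encoded["utf_16"].startswith(codecs.BOM_UTF16))

# Decoding with the matching codec always recovers the original text.
print(all(data.decode(codec) == text for codec, data in encoded.items()))
```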


# Another example

![Encoding table](images/encoding_table.png)

# Common problems

In [84]:
with open("cafe.txt", "w", encoding="utf_8") as fp:
    fp.write('café')

In [85]:
with open("cafe.txt", "r", encoding="cp1252") as fp:
    print(fp.read())

cafÃ©


In [86]:
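# No encoding argument: open() falls back to the locale's preferred encoding (UTF-8 on this machine)
with open("cafe.txt", "r") as fp: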
    print(fp.read())

café


In [87]:
with open("cafe.txt", "rb") as fp:
    print(fp.read())

b'caf\xc3\xa9'


In [88]:
with open("cafe.txt", mode="rb") as fp:
    print(fp.read())

b'caf\xc3\xa9'


In [89]:
with open("cafe.txt", mode="w", encoding="cp1252") as fp:
    fp.write("café")
with open("cafe.txt", mode="rb") as fp:
    print(fp.read())

b'caf\xe9'


In [90]:
print("\xe9".encode("utf_8"))

b'\xc3\xa9'


In [91]:
print("\xc3\xa9".encode("cp1252"))

b'\xc3\xa9'
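
The file-based cells above can be reproduced in memory. The sketch below shows both failure modes: UTF-8 bytes read as cp1252 decode silently into garbage, while cp1252 bytes read as UTF-8 raise an error. The variable names are illustrative only.

```python
original = "café"

# UTF-8 bytes decoded with the wrong codec: no error, just wrong text.
utf8_bytes = original.encode("utf_8")   # b'caf\xc3\xa9'
garbled = utf8_bytes.decode("cp1252")   # 0xC3 -> 'Ã', 0xA9 -> '©'
print(garbled)

# Reversing the two steps repairs the text, if we know which codecs were involved.
repaired = garbled.encode("cp1252").decode("utf_8")
print(repaired == original)

# The opposite mistake fails loudly: b'\xe9' alone is not valid UTF-8.
cp1252_bytes = original.encode("cp1252")  # b'caf\xe9'
try:
    cp1252_bytes.decode("utf_8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("UnicodeDecodeError:", exc.reason)
```

The silent case is the more dangerous one, since nothing signals that the text is already corrupted.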


# ftfy library
## Heuristic to guess the encoding and fix the text

In [92]:
from ftfy import fix_text

text_to_fix = "cafÃ©"

print(f"Text before: {text_to_fix}")

print(f"Text after: {fix_text(text_to_fix)}")

Text before: cafÃ©
Text after: café


In [93]:
from unicodedata import name, normalize

char_symbol = '\U0001F4A9'


In [94]:
print(name(char_symbol))

PILE OF POO


In [95]:
print(normalize('NFC', char_symbol))

💩
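
Normalizing the emoji above leaves it unchanged, so the effect of `normalize` is easier to see on a character that has both a composed and a decomposed form. A minimal sketch, assuming "é" as the example:

```python
from unicodedata import normalize

composed = "caf\u00e9"     # 'é' as one code point: U+00E9
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Both render as 'café', but they are different sequences of code points.
print(len(composed), len(decomposed))
print(composed == decomposed)

# NFC composes to the shortest equivalent form; NFD decomposes.
print(normalize("NFC", decomposed) == composed)
print(normalize("NFD", composed) == decomposed)
```

Normalizing both sides to the same form before comparing avoids false mismatches between strings that look identical.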


# The Unicode Sandwich

![The Unicode Sandwich](images/unicode_sandwich.png)
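
One way to read the diagram in code: decode bytes as early as possible on input, work only with `str` in the middle, and encode as late as possible on output. The sketch below assumes UTF-8 on both sides and uses a temporary directory so that it does not touch real files; the processing step (`upper()`) is just a placeholder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cafe.txt"
    path.write_bytes("café".encode("utf_8"))  # bytes arriving from the outside world

    # Bottom slice of bread: decode on input, with an explicit encoding.
    with open(path, "r", encoding="utf_8") as fp:
        text = fp.read()

    # Filling: the program logic handles str only, never bytes.
    shouted = text.upper()

    # Top slice of bread: encode on output, again with an explicit encoding.
    with open(path, "w", encoding="utf_8") as fp:
        fp.write(shouted)

    result = path.read_bytes()

print(result)
```

Passing `encoding=` explicitly on both sides avoids relying on the platform default, which is what produced the garbled output in the "Common problems" cells.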