
Python 3 introduced a sharp distinction between strings of human text and sequences of raw bytes.

Main topics:
 - Unicode strings
 - binary sequences
 - encodings used to convert between them

The Unicode standard explicitly separates the identity of characters from specific byte representations:
 * The identity of a character, its *code point*, is a number from 0 to 1,114,111, shown in the Unicode standard as 4 to 6 hex digits with a "U+" prefix (U+0000 to U+10FFFF).
 * The actual bytes that represent a character depend on the *encoding* in use.
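
As a small illustration of the two points above, the sketch below prints the code point of each character in `'café'` in `U+` notation, followed by the bytes the same character becomes under three encodings. The helper name `code_point` is made up for this example.

```python
def code_point(char):
    """Format a character's code point in the conventional U+XXXX notation."""
    return f"U+{ord(char):04X}"

for char in 'caf\u00e9':  # 'café', with é written as an escape
    # The identity (code point) is fixed; the bytes depend on the encoding.
    print(char, code_point(char),
          char.encode('utf_8'), char.encode('latin_1'), char.encode('utf_16le'),
          sep='\t')
```

Only the last character differs in length between `utf_8` and `latin_1`, because the first three are ASCII.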

In [None]:
s = 'café' # str café has 4 unicode characters
len(s)

4

In [None]:
b = s.encode('utf8') # Encode str to bytes using UTF-8 encoding
b

b'caf\xc3\xa9'

In [None]:
len(b)

5

In [None]:
b.decode('utf8')

'café'

## Byte Essentials
 1. There are 2 basic built-in types for binary sequences: immutable `bytes` type and mutable `bytearray`.
 2. Each item in `bytes` or `bytearray` is an integer from 0 to 255 and not a one-character string like in the Python 2 `str`.
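
A minimal sketch of the difference between the two types: item assignment fails on `bytes` but works on `bytearray`, as long as the value is an integer from 0 to 255. The variable names are only for illustration.

```python
data = b'caf\xc3\xa9'

# bytes is immutable: item assignment raises TypeError.
try:
    data[0] = 67
except TypeError as exc:
    print('bytes:', exc)

# bytearray is mutable: items are assigned as integers in range(256).
buf = bytearray(data)
buf[0] = 67            # 67 is the code for ASCII 'C'
buf.extend(b'!')       # in-place growth also works
print(buf)

# Values outside 0..255 are rejected.
try:
    buf[0] = 256
except ValueError as exc:
    print('bytearray:', exc)
```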

In [None]:
cafe = bytes('café', encoding='utf_8')
cafe

b'caf\xc3\xa9'

In [None]:
cafe[0] # each item is an integer in range(256)

99

In [None]:
cafe[:1] # slices of bytes are also bytes

b'c'

In [None]:
cafe_arr = bytearray(cafe)
cafe_arr # no literal syntax for bytearray

bytearray(b'caf\xc3\xa9')

In [None]:
cafe_arr[-1:] # slices of bytearray are also bytearray

bytearray(b'\xa9')

Although binary sequences are really sequences of integers, their literal notation reflects the fact that ASCII text is often embedded in them.
 - For bytes with decimal codes 32 to 126 (from space to `~`), the ASCII character itself is used
 - For bytes corresponding to tab, newline, carriage return, and `\` the escape sequences `\t`, `\n`, `\r`, `\\` are used.
 - If both string delimiters `'` and `"` appear in the byte sequence, the whole sequence is delimited by `'`, and any `'` inside is escaped as `\'`.
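
The display rules above can be checked directly with `repr`. This is only a sketch with made-up sample values; the last sample shows that any remaining byte value is displayed as a `\x` hex escape.

```python
samples = [
    bytes([72, 105, 32, 126]),     # printable ASCII: shown as characters
    bytes([9, 10, 13, 92]),        # tab, newline, carriage return, backslash
    b"it's",                       # only ' inside: delimited by "
    b'both \' and "',              # both delimiters inside: ' is escaped
    bytes([0, 127, 255]),          # other values: \x hex escapes
]
for sample in samples:
    # list() shows the underlying integers, repr() the literal notation
    print(list(sample), '->', repr(sample))
```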

In [None]:
test = "Hi there, \'test for encoding''"
bytes(test, 'utf_8')

b"Hi there, 'test for encoding''"

### Note
Both `bytes` and `bytearray` support every `str` method except those that do formatting and those that depend on Unicode data. In addition, the regular expression functions in the `re` module also work on binary sequences.
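
A short sketch of what this means in practice, using a made-up request line as input: the familiar `str` methods and `re` functions work, provided that arguments and patterns are also `bytes`.

```python
import re

raw = b'GET /index.html HTTP/1.1'

# Familiar str methods work, but take and return bytes.
method, path, version = raw.split()
print(method.lower(), path.endswith(b'.html'), version.startswith(b'HTTP'))

# Regular expressions need a bytes pattern to search bytes.
match = re.search(rb'HTTP/(\d)\.(\d)', raw)
print(match.groups())

# Formatting and Unicode-dependent methods are absent from bytes.
print(hasattr(raw, 'format'), hasattr(raw, 'casefold'))
```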

Binary sequences have a class method that `str` doesn't have, called `fromhex`, which builds a binary sequence by parsing pairs of hex digits optionally separated by spaces.

Other ways of building `bytes` or `bytearray` instances are to call their constructors with:
 1. An iterable providing items with values from 0 to 255.
 2. An object that implements the buffer protocol (e.g., `bytes`, `bytearray`, `memoryview`, `array.array`); the constructor copies the bytes from the source object to the newly created binary sequence.
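
Both constructor forms can be sketched in a few lines. The second part shows that the bytes really are copied: changing the source afterwards leaves the new object untouched. Variable names are illustrative only.

```python
# 1. From an iterable of integers in range(256).
from_iterable = bytes(range(65, 70))
print(from_iterable)

# 2. From an object supporting the buffer protocol; the bytes are copied.
source = bytearray(b'abc')
copy = bytes(source)
source[0] = ord('X')            # mutating the source afterwards...
print(source, copy)             # ...does not affect the copy
```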

In [None]:
bytes.fromhex('31 4B CE A9')

b'1K\xce\xa9'

In [None]:
import array
numbers = array.array('h', [-2, -1, 0, 1, 2]) # Typecode 'h' creates an array of short integers (16 bits = 2 bytes)
octets = bytes(numbers)
octets

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'

## Basic Encoders / Decoders
The Python distribution bundles more than 100 codecs for text-to-bytes conversion and vice versa.

In [None]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
  print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'


In [None]:
print("气".encode('latin1'))

UnicodeEncodeError: 'latin-1' codec can't encode character '\u6c14' in position 0: ordinal not in range(256)

In [None]:
city = 'São Paulo'
city.encode('utf-8')

b'S\xc3\xa3o Paulo'

In [None]:
city.encode('utf-16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [None]:
city.encode('iso8859_1')

b'S\xe3o Paulo'

In [None]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [None]:
city.encode('cp437', errors='ignore')

b'So Paulo'

In [None]:
city.encode('cp437', errors='replace')

b'S?o Paulo'

`xmlcharrefreplace` replaces unencodable characters with an XML character reference. If you can't use a UTF encoding and you can't afford to lose data, this is the only option.

In [None]:
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'

In [None]:
city.isascii()

False

In [None]:
"hello!@#$%^*".isascii()

True

Example of how using the wrong codec may produce gremlins (garbled characters) or a `UnicodeDecodeError`:

In [None]:
octets = b'Montr\xe9al' # encoded as latin1
octets.decode('cp1252') # works as intended because cp1252 is a superset of latin1

'Montréal'

In [None]:
octets.decode('iso8859_7') # intended for Greek so it was misinterpreted

'Montrιal'

In [None]:
octets.decode('koi8_r') # intended for Russian so it was misinterpreted

'MontrИal'

In [None]:
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [None]:
octets.decode('utf_8', errors='replace')

'Montr�al'

 - UTF-8 is the default source encoding for Python 3
 - ASCII is the default source encoding for Python 2

In [None]:
# coding: cp1252

print('Olá Mundo')

Olá Mundo


## Q. How do you find the encoding of a byte sequence?
In short, you can't. You must be told.
For example, HTTP and XML carry headers that explicitly state how the content is encoded.
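
As a hedged sketch of "being told", the snippet below reads the `charset` parameter from an HTTP-style `Content-Type` header with the standard library's `email.message.Message`. The header value and body are invented for the example, and the UTF-8 fallback is an assumption, not a rule.

```python
from email.message import Message

# Hypothetical response: the header value and body are made up for this sketch.
content_type = 'text/plain; charset=ISO-8859-1'
body = b'Montr\xe9al'

# The stdlib email machinery can parse the charset parameter of the header.
msg = Message()
msg['Content-Type'] = content_type
charset = msg.get_content_charset()   # lowercased charset, or None if absent
print(charset)

text = body.decode(charset or 'utf_8')  # assumption: fall back to UTF-8
print(text)
```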

In [None]:
' '.encode('utf-16')

b'\xff\xfe \x00'

In [None]:
u16 = 'El Niño'.encode('utf-16')
u16


b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

The first two bytes are `b'\xff\xfe'`. This is the BOM (byte-order mark), here denoting the little-endian byte ordering of the Intel CPU where the encoding was performed.
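
A small sketch of how the BOM is treated when decoding. The byte string is written out by hand so that the result does not depend on the machine's byte order. `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` are standard library constants.

```python
import codecs

u16_sample = b'\xff\xfeE\x00l\x00'   # BOM + 'El' in little-endian UTF-16

# The stdlib exposes the BOM values as constants.
print(codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The generic 'utf-16' codec reads the BOM to pick the byte order, then drops it.
print(u16_sample.decode('utf-16'))

# With an explicit byte order the BOM is not special: it is decoded as U+FEFF.
print(ascii(u16_sample.decode('utf-16le')))
```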

In [None]:
list(u16)

[255, 254, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

In [None]:
u16le = 'El Niño'.encode('utf-16le') # little-endian
list(u16le) # with an explicit byte order, the codec does not emit a BOM

[69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

In [None]:
u16be = 'El Niño'.encode('utf-16be') # big-endian
list(u16be)

[0, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111]

## Handling Text Files
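 - "Unicode sandwich": `bytes` should be decoded to `str` as early as possible on input, and `str` should be encoded to `bytes` as late as possible on output. The "filling" of the sandwich is the business logic of your program, where text is handled exclusively as `str`. We should never be encoding or decoding in the middle of other processing.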
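
The sandwich can be sketched with in-memory byte streams standing in for real input and output. The function `shout` and the choice of UTF-8 on both sides are assumptions made for this example.

```python
import io

def shout(text: str) -> str:
    """Business logic: works on str only, never on bytes."""
    return text.upper()

# Stand-ins for an input and an output byte stream (assumed UTF-8 here).
incoming = io.BytesIO('caf\u00e9 com leite'.encode('utf_8'))
outgoing = io.BytesIO()

text = incoming.read().decode('utf_8')      # bottom slice: decode bytes on input
result = shout(text)                        # filling: 100% str
outgoing.write(result.encode('utf_8'))      # top slice: encode str on output

print(result, outgoing.getvalue())
```

In everyday code, `open()` in text mode does the two outer steps for you, which is why the `encoding` argument matters.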

In [None]:
# specified UTF-8 encoding when writing the file
open('cafe.txt', 'w', encoding='utf-8').write('café')

4

In [None]:
# no encoding argument: open() falls back to the locale default, which may not be UTF-8
open('cafe.txt').read()

'café'

In [None]:
fp = open('cafe.txt', 'w', encoding='utf_8') # by default, open uses text mode and returns a TextIOWrapper object with a specific encoding

In [None]:
fp

<_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf_8'>

In [None]:
fp.write('café') # write on a TextIOWrapper returns the number of Unicode characters written

4

In [None]:
fp.close()

In [None]:
import os
os.stat('cafe.txt').st_size # os.stat says the file has 5 bytes; UTF-8 encodes é as 2 bytes

5

In [None]:
fp2 = open('cafe.txt')

In [None]:
fp2

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='UTF-8'>

In [None]:
fp2.encoding

'UTF-8'

In [None]:
fp2.read()

'café'

In [None]:
fp4 = open('cafe.txt', 'rb') # Do not open text files in binary mode unless you need to analyze the file contents

In [None]:
fp4

<_io.BufferedReader name='cafe.txt'>

In [None]:
fp4.read()

b'caf\xc3\xa9'

## Beware of Encoding Defaults
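
Code that omits `encoding=` when opening a text file relies on a platform-dependent default. The sketch below simulates what happens when a file written as UTF-8 is read on a system whose default is `cp1252`, by passing that encoding explicitly. The temporary path and variable names are only for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')

# Write with an explicit encoding...
with open(path, 'w', encoding='utf_8') as fp:
    fp.write('caf\u00e9')

# ...then read back with a different one, as a cp1252 default would do.
with open(path, encoding='cp1252') as fp:
    garbled = fp.read()

# Passing the same encoding on both sides avoids the mismatch.
with open(path, encoding='utf_8') as fp:
    clean = fp.read()

print(garbled, clean)
```

No exception is raised in the mismatched case; the text is silently corrupted, which is what makes relying on defaults risky.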



In [None]:
import locale
import sys

expressions = """
        locale.getpreferredencoding()
        type(my_file)
        my_file.encoding
        sys.stdout.isatty()
        sys.stdout.encoding
        sys.stdin.isatty()
        sys.stdin.encoding
        sys.stderr.isatty()
        sys.stderr.encoding
        sys.getdefaultencoding()
        sys.getfilesystemencoding()
    """

In [None]:
my_file = open("dummy", "w")
for exp in expressions.split():
  value = eval(exp)
  print(f"{exp:>30} -> {value!r}")

 locale.getpreferredencoding() -> 'UTF-8'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'UTF-8'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'utf-8'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


In [None]:
import sys
from unicodedata import name

print(sys.version)
print()
print('sys.stdout.isatty():', sys.stdout.isatty())
print('sys.stdout.encoding():', sys.stdout.encoding)
print()

test_chars = [
    '\N{HORIZONTAL ELLIPSIS}',
    '\N{INFINITY}',
    '\N{CIRCLED NUMBER FORTY TWO}',
]

for char in test_chars:
  print(f"Trying to output {name(char)}:")
  print(char)

3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

sys.stdout.isatty(): False
sys.stdout.encoding(): UTF-8

Trying to output HORIZONTAL ELLIPSIS:
…
Trying to output INFINITY:
∞
Trying to output CIRCLED NUMBER FORTY TWO:
㊷


## Normalizing Unicode for Reliable Comparisons

In [None]:
s1 = 'café'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'

In [None]:
s1, s2

('café', 'café')

In [None]:
len(s1), len(s2)

(4, 5)

In [None]:
s1 == s2

False

In [None]:
from unicodedata import normalize

len(normalize('NFC', s1)), len(normalize('NFC', s2))

(4, 4)

In [None]:
len(normalize('NFD', s1))

5

In [None]:
normalize('NFD', s1) == normalize('NFD', s2)
normalize('NFC', s1) == normalize('NFC', s2)

True

Note: keyboard drivers usually generate composed characters, so text typed by users will be in NFC by default.

In [None]:
from unicodedata import normalize, name

In [None]:
ohm = '\u2126'

In [None]:
name(ohm)

'OHM SIGN'

In [None]:
ohm_c = normalize('NFC', ohm)

In [None]:
name(ohm_c)

'GREEK CAPITAL LETTER OMEGA'

In [None]:
ohm_c == ohm

False

In [None]:
normalize('NFC', ohm) == normalize('NFC', ohm_c)

True

In [None]:
half = '\N{VULGAR FRACTION ONE HALF}'

In [None]:
print(half)

½


In [None]:
normalize('NFKC', half)

'1⁄2'

In [None]:
for char in normalize('NFKC', half):
  print(char, name(char), sep='\t')


1	DIGIT ONE
⁄	FRACTION SLASH
2	DIGIT TWO


In [None]:
four_squared="4²"
normalize('NFKC', four_squared)

'42'

In [None]:
micro='µ'
micro_kc = normalize('NFKC', micro)
micro, micro_kc

('µ', 'μ')

In [None]:
ord(micro), ord(micro_kc)

(181, 956)

In [None]:
name(micro), name(micro_kc)

('MICRO SIGN', 'GREEK SMALL LETTER MU')

In [None]:
name(micro)

'MICRO SIGN'

In [None]:
# casefold() is like lower(), with extra transformations for caseless matching
micro_cf = micro.casefold()
micro, micro_cf

('µ', 'μ')

In [None]:
name(micro_cf)

'GREEK SMALL LETTER MU'

In [None]:
eszett = 'ß'
name(eszett)

'LATIN SMALL LETTER SHARP S'

In [None]:
eszett_cf = eszett.casefold()

In [None]:
eszett, eszett_cf

('ß', 'ss')

In [None]:
def nfc_equal(str1, str2):
  return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
  return normalize('NFC', str1).casefold() == normalize('NFC', str2).casefold()
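
A brief usage sketch for the two helpers, repeated here so the snippet runs on its own. The sample strings are chosen for illustration.

```python
from unicodedata import normalize

def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
    return normalize('NFC', str1).casefold() == normalize('NFC', str2).casefold()

s1 = 'caf\u00e9'                          # composed é
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'     # e + combining accent

# Canonical equivalents: plain == fails, nfc_equal succeeds.
print(s1 == s2, nfc_equal(s1, s2))

# Case differences need fold_equal.
print(nfc_equal('A', 'a'), fold_equal('A', 'a'))

# casefold() also maps ß to 'ss'.
print(nfc_equal('Stra\u00dfe', 'strasse'), fold_equal('Stra\u00dfe', 'strasse'))
```

`nfc_equal` suits case-sensitive comparisons; `fold_equal` is the looser, case-insensitive variant.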

In [None]:
import string
import unicodedata

def shave_marks(txt):
  norm_txt = unicodedata.normalize('NFD', txt)
  shaved = ''.join(c for c in norm_txt if not unicodedata.combining(c))

  return unicodedata.normalize('NFC', shaved)

In [None]:
shave_marks("café")

'cafe'

In [None]:
order = '“Herr Voß: • ½ cup of Œtker™ caffè latte • bowl of açaí.”'
shave_marks(order)

'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'

In [None]:
Greek = "Ζέφυρος, Zéfiro"
shave_marks(Greek)

'Ζεφυρος, Zefiro'

In [None]:
def shave_marks_latin(txt):
  norm_txt = unicodedata.normalize('NFD', txt)
  latin_base = False
  preserve = []
  for c in norm_txt:
    if unicodedata.combining(c) and latin_base:
      continue # ignore diacritic on Latin base char
    preserve.append(c)
    if not unicodedata.combining(c):
      latin_base = c in string.ascii_letters
  shaved = ''.join(preserve)
  return unicodedata.normalize('NFC', shaved)

In [None]:
shave_marks_latin(Greek)

'Ζέφυρος, Zefiro'

In [None]:
shave_marks_latin(order)

'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'

In [None]:
single_map = str.maketrans("""‚ƒ„ˆ‹‘’“”•–—˜›""",
                           """'f"^<''""---~>""")

multi_map = str.maketrans({
    '€': 'EUR',
    '…': '...',
    'Æ': 'AE',
    'æ': 'ae',
    'Œ': 'OE',
    'œ': 'oe',
    '™': '(TM)',
    '‰': '<per mille>',
    '†': '**',
    '‡': '***',
})

In [None]:
multi_map.update(single_map)

In [None]:
def dewinize(txt):
  return txt.translate(multi_map)

In [None]:
type(multi_map)

dict

In [None]:
def asciize(txt):
  no_marks = shave_marks_latin(dewinize(txt))
  no_marks = no_marks.replace('ß', 'ss')
  return unicodedata.normalize('NFKC', no_marks)

In [None]:
dewinize(order)

'"Herr Voß: - ½ cup of OEtker(TM) caffè latte - bowl of açaí."'

In [None]:
asciize(order)

'"Herr Voss: - 1⁄2 cup of OEtker(TM) caffe latte - bowl of acai."'

## Sorting Unicode Text

In [2]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted(fruits)

['acerola', 'atemoia', 'açaí', 'caju', 'cajá']

The standard way to sort non-ASCII text in Python is to use the `locale.strxfrm` function, which transforms a string into one that can be used in locale-aware comparisons.
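
Because `locale.setlocale` changes global state and fails when the requested locale is not installed, a defensive wrapper can be useful. The following is only a sketch: `locale_sorted` is a made-up helper name, and falling back to plain `sorted` is one possible choice among several.

```python
import locale

def locale_sorted(words, locale_name='pt_BR.UTF-8'):
    """Sort words with locale.strxfrm; fall back to plain sorted() if the
    locale is not installed. The name locale_sorted is made up for this sketch."""
    previous = locale.setlocale(locale.LC_COLLATE)      # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(words)                            # code point order
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)   # restore global state

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
print(locale_sorted(fruits))
```

The printed order depends on whether the `pt_BR.UTF-8` locale is available on the machine running the code.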

In [10]:
!apt-get install language-pack-pt


In [1]:
import locale
my_locale = locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
print(my_locale)

pt_BR.UTF-8


In [4]:
sorted_fruits = sorted(fruits, key=locale.strxfrm)
print(sorted_fruits)

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']


* A simpler solution that works on Linux, macOS, and Windows: `pyuca`, a pure-Python implementation of the Unicode Collation Algorithm.

In [6]:
!pip install pyuca

Collecting pyuca
  Downloading pyuca-1.2-py2.py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 8.8 MB/s eta 0:00:00
Installing collected packages: pyuca
Successfully installed pyuca-1.2


In [8]:
import pyuca
coll = pyuca.Collator()
sorted_fruits = sorted(fruits, key=coll.sort_key)
sorted_fruits

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

In [9]:
from unicodedata import name
print(name('A'))
print(name('ã'))
print(name('♛'))
print(name('😸'))

LATIN CAPITAL LETTER A
LATIN SMALL LETTER A WITH TILDE
BLACK CHESS QUEEN
GRINNING CAT FACE WITH SMILING EYES


In [13]:
#!/usr/bin/env python3
import sys
import unicodedata

# sets defaults for the range of code points to search
START, END = ord(' '), sys.maxunicode + 1

# find accepts query_words and optional keyword-only arguments to
# limit the range of search
def find(*query_words, start=START, end=END):
  query = {w.upper() for w in query_words}
  for code in range(start, end):
    char = chr(code)
    name = unicodedata.name(char, None)
    if name and query.issubset(name.split()):
      print(f"U+{code:04X}\t{char}\t{name}")

def main(words):
  if words:
    find(*words)
  else:
    print('Please provide words to find.')

if __name__ == '__main__':
  main(sys.argv[1:])


In [24]:
!python ./cf.py dog

U+2EA8	⺨	CJK RADICAL DOG
U+2F5D	⽝	KANGXI RADICAL DOG
U+B3C5	독	HANGUL SYLLABLE DOG
U+1F32D	🌭	HOT DOG
U+1F415	🐕	DOG
U+1F436	🐶	DOG FACE
U+1F9AE	🦮	GUIDE DOG
