# Chapter 4: Unicode Text Versus Bytes

In this chapter, we will visit the following topics:
• Characters, code points, and byte representations
• Unique features of binary sequences: bytes, bytearray, and memoryview
• Encodings for full Unicode and legacy character sets
• Avoiding and dealing with encoding errors
• Best practices when handling text files
• The default encoding trap and standard I/O issues
• Safe Unicode text comparisons with normalization
• Utility functions for normalization, case folding, and brute-force diacritic removal
• Proper sorting of Unicode text with locale and the pyuca library
• Character metadata in the Unicode database
• Dual-mode APIs that handle str and bytes

## Unicode

In [1]:
# In Unicode, each character is identified by a code point

s = 'café'
b = s.encode('utf-8')
len(s), b, len(b), b.decode('utf-8') # é takes two bytes in UTF-8

(4, b'caf\xc3\xa9', 5, 'café')
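To make the difference between characters, code points, and bytes concrete, the sketch below lists each character of a string with its code point and its UTF-8 bytes. The helper name `describe` is hypothetical and only used for illustration.

```python
def describe(text):
    """Return (char, code point, UTF-8 bytes) for each character in text."""
    return [(ch, f'U+{ord(ch):04X}', ch.encode('utf-8')) for ch in text]

# 'caf\u00e9' is 'café' with é written as an escape
for ch, code_point, encoded in describe('caf\u00e9'):
    print(ch, code_point, encoded, len(encoded), sep='\t')
```

Only the last row should show a two-byte encoding; the three ASCII letters take one byte each.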

In [None]:
# bytes and bytearray - each item is an integer from 0 (0x00) to 255 (0xFF)
cafe = bytes('café', encoding='utf_8')
cafe_arr = bytearray(cafe)

# indexing gives an int; slicing gives a sequence of the same type
cafe, cafe_arr, cafe[0], cafe[:1], cafe_arr[-1:]

# bytes and bytearray support the str methods that don't depend on formatting or Unicode data

(b'caf\xc3\xa9', bytearray(b'caf\xc3\xa9'), 99, b'c', bytearray(b'\xa9'))
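A short sketch of what "str methods on binary sequences" means in practice, together with the main difference between the two types: bytes is immutable, bytearray is not. The variable names are only illustrative.

```python
data = bytes('caf\u00e9', encoding='utf_8')

# str-like methods act on the ASCII subset; other bytes are left untouched
upper = data.upper()
starts = data.startswith(b'caf')

# bytes is immutable; bytearray can be changed in place
buf = bytearray(data)
buf[0] = ord('C')          # item assignment takes an int in range(256)
buf.extend(b'!')

# bytes.fromhex builds a binary sequence from pairs of hex digits
from_hex = bytes.fromhex('63 61 66 c3 a9')

print(upper, starts, buf, from_hex == data)
```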

In [6]:
# Python ships with more than 100 codecs (encoders/decoders) for converting text to bytes and back

for codec in ('latin_1', 'utf_8', 'utf_16'):
    print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
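The `\xff\xfe` at the start of the utf_16 output is a byte order mark (BOM). The sketch below, using only the standard `codecs` constants, compares the generic codec with the explicit little-endian and big-endian variants, which write no BOM.

```python
import codecs

text = 'El Ni\u00f1o'
with_bom = text.encode('utf_16')     # BOM followed by platform byte order
little = text.encode('utf_16_le')    # explicit little-endian, no BOM
big = text.encode('utf_16_be')       # explicit big-endian, no BOM

has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom, little[:4], big[:4])

# The generic 'utf_16' decoder consumes the BOM to pick the byte order
round_trip = with_bom.decode('utf_16')
```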


In [7]:
# Error handling

# UnicodeEncodeError
city = 'São Paulo'

city.encode('utf_8'), city.encode('utf_16'), city.encode('iso8859_1')

(b'S\xc3\xa3o Paulo',
 b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00',
 b'S\xe3o Paulo')

In [None]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [None]:
(city.encode('cp437', errors="ignore"),
 city.encode('cp437', errors="replace"),
 city.encode('cp437', errors="xmlcharrefreplace"))
# pure-ASCII text encodes without errors in any of the codecs used here

(b'So Paulo', b'S?o Paulo', b'S&#227;o Paulo')
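The `errors` handlers silently lose data, so it can help to know when that happened. Below is a minimal sketch of a hypothetical helper, `encode_or_replace`, that falls back to `errors='replace'` only when strict encoding fails and reports whether the result is lossless.

```python
def encode_or_replace(text, codec):
    """Encode text, substituting '?' only when the codec cannot represent it.

    Returns (encoded_bytes, lossless) so callers can detect data loss.
    """
    try:
        return text.encode(codec), True
    except UnicodeEncodeError:
        return text.encode(codec, errors='replace'), False

print(encode_or_replace('Sao Paulo', 'cp437'))        # pure ASCII
print(encode_or_replace('S\u00e3o Paulo', 'cp437'))   # contains ã
print('Sao Paulo'.isascii(), 'S\u00e3o Paulo'.isascii())
```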

## Coping with Errors

In [None]:
# UnicodeDecodeError

octets = b'Montr\xe9al' 

octets.decode('cp1252'), octets.decode('iso8859_7'), octets.decode('koi8_r') # the last two can decode, but since the bytes map to something else
# in that codec, it outputs something random

('Montr√©al', 'MontrŒπal', 'Montr–òal')

In [13]:
octets.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [14]:
octets.decode('utf-8', errors="replace")

'MontrÔøΩal'

In [None]:
# the encoding of a byte sequence cannot be determined with certainty from the bytes alone, but the chardet library can guess it heuristically
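When the possible encodings are known in advance, a simpler alternative to heuristic detection is to try them in order. The sketch below is one way to do that; the function name `decode_with_fallback` and the default candidate list are assumptions, not part of any library.

```python
def decode_with_fallback(octets, candidates=('utf-8', 'cp1252')):
    """Try each candidate codec in order; return (text, codec_used).

    The candidate list and its order are assumptions: put strict codecs such
    as UTF-8 first, because single-byte codecs accept almost any byte string.
    """
    for codec in candidates:
        try:
            return octets.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD
    return octets.decode(candidates[0], errors='replace'), candidates[0]

print(decode_with_fallback(b'Montr\xe9al'))
print(decode_with_fallback(b'Montr\xc3\xa9al'))
```

A successful decode only means the bytes are valid in that codec, not that the codec is the right one, as the iso8859_7 and koi8_r results above show.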

### Handling Text Files

In [17]:
# Handling text files

# Usually, you want to decode from bytes to str as early as possible, then encode back to bytes as late as possible.
# This is known as the Unicode sandwich.

# The next cells reproduce a bug that shows up on Windows: write with one encoding, read with another
fp = open('cafe.txt', 'w', encoding='utf-8')
fp, fp.write('café'), fp.close()

(<_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf-8'>, 4, None)

In [None]:
import os
os.stat('cafe.txt').st_size # 5 since utf-8 encodes é as 2 bytes

5

In [21]:
fp2 = open('cafe.txt', encoding='cp1252') # on Windows, where cp1252 is typically the default, omitting the encoding gives the same result
fp2, fp2.read()

(<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='cp1252'>, 'cafÃ©')

In [22]:
fp3 = open('cafe.txt', encoding='utf-8') # on macOS, utf-8 is the default, so the explicit encoding wouldn't be needed there
# always pass an explicit encoding to avoid platform-dependent bugs like the one above

fp3.read()

'café'
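The Unicode sandwich can be sketched with the decode and encode steps written out by hand. In real code `open()` with an explicit `encoding` does both steps for you; the function name `shout_file` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

def shout_file(src, dst, encoding='utf-8'):
    """Read src, upper-case its text, write to dst; return bytes written."""
    raw = Path(src).read_bytes()        # bytes in
    text = raw.decode(encoding)         # decode as early as possible
    result = text.upper()               # the filling: work only with str
    out = result.encode(encoding)       # encode as late as possible
    return Path(dst).write_bytes(out)   # bytes out

with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, 'in.txt'), Path(tmp, 'out.txt')
    src.write_text('caf\u00e9', encoding='utf-8')
    written = shout_file(src, dst)
    print(written, dst.read_text(encoding='utf-8'))
```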

In [None]:
fp4 = open('cafe.txt', 'rb')
fp4, fp4.read() # bytes

(<_io.BufferedReader name='cafe.txt'>, b'caf\xc3\xa9')

In [None]:
import locale
import sys
expressions = """
 locale.getpreferredencoding()
 type(my_file)
 my_file.encoding
 sys.stdout.isatty()
 sys.stdout.encoding
 sys.stdin.isatty()
 sys.stdin.encoding
 sys.stderr.isatty()
 sys.stderr.encoding
 sys.getdefaultencoding()
 sys.getfilesystemencoding()
 """

my_file = open('cafe.txt', 'w')

for exp in expressions.split():
    value = eval(exp)
    print(f'{exp:>30} -> {value!r}') # on Windows, the locale and my_file.encoding would be cp1252.

 locale.getpreferredencoding() -> 'UTF-8'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'UTF-8'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'utf-8'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


## Normalizing Unicode

In [None]:
# normalizing Unicode

s1 = 'café'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'

# s1 and s2 look the same but are different sequences of code points, so they compare unequal
s1, s2, len(s1), len(s2), s1 == s2

('café', 'café', 4, 5, False)

In [None]:
from unicodedata import normalize

# NFD decomposes characters into base characters plus combining marks; NFC composes them again
# keyboards usually produce composed characters, so text typed by users is normally already in NFC
normalize('NFD', s1) == normalize('NFD', s2), normalize('NFC', s1) == normalize('NFC', s2)

(True, True)
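To see what each form does to the string itself, this small sketch prints the length and trailing code points of both spellings under NFC and NFD. The strings use escapes so the composed and decomposed forms are unambiguous.

```python
from unicodedata import normalize

composed = 'caf\u00e9'      # é as one code point
decomposed = 'cafe\u0301'   # e followed by COMBINING ACUTE ACCENT

for form in ('NFC', 'NFD'):
    a, b = normalize(form, composed), normalize(form, decomposed)
    tail = [f'U+{ord(c):04X}' for c in a][3:]
    print(form, len(a), len(b), tail)
```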

In [31]:
from unicodedata import name

ohm = '\u2126'
ohm_c = normalize('NFC', ohm)
name(ohm), name(ohm_c), ohm == ohm_c

('OHM SIGN', 'GREEK CAPITAL LETTER OMEGA', False)

In [33]:
# NFKC and NFKD are more aggressive: K stands for "compatibility", and compatibility
# characters are replaced by their plain equivalents, which can lose information

half = '\N{VULGAR FRACTION ONE HALF}'
half, normalize('NFKC', half)

('½', '1⁄2')

In [34]:
for char in normalize('NFKC', half):
    print(char, name(char), sep='\t')

1	DIGIT ONE
⁄	FRACTION SLASH
2	DIGIT TWO
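Compatibility normalization can change meaning, which is why it is better suited to search and indexing than to permanent storage. A brief sketch of two such cases:

```python
from unicodedata import normalize

formula = '4\u00b2'                       # DIGIT FOUR + SUPERSCRIPT TWO
flattened = normalize('NFKC', formula)    # the superscript becomes a plain digit
micro = normalize('NFKC', '\u00b5')       # MICRO SIGN becomes GREEK SMALL LETTER MU
print(formula, flattened, micro == '\u03bc')
```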


In [36]:
# case folding
# almost the same as .lower(), but with some special cases

eszett = 'ß'
eszett_cf = eszett.casefold()
name(eszett), eszett, eszett_cf

('LATIN SMALL LETTER SHARP S', 'ß', 'ss')

In [37]:
# when working with text in many languages, you can use
# functions like these to compare text

def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
    return (normalize('NFC', str1).casefold() ==
        normalize('NFC', str2).casefold())

s3 = 'Straße'
s4 = 'strasse'

nfc_equal(s3, s4), fold_equal(s3, s4)

(False, True)

In [1]:
# one hack is to remove diacritics (accents, cedillas)
# this can change the meaning of a word, but it helps with user-facing features
# such as Google search, where users often type words without accents

import unicodedata
import string

def remove_marks(txt):
    norm_txt = unicodedata.normalize('NFD', txt)
    shaved = ''.join(c for c in norm_txt
                     if not unicodedata.combining(c)) # filter out combining marks
    return unicodedata.normalize('NFC', shaved)

order = '“Herr Voß: • ½ cup of Œtker™ caffè latte • bowl of açaí.”'
remove_marks(order)

'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'

In [3]:
# a function that might make more sense is to remove attached marks
# only if the base character is from the Latin alphabet

def remove_marks_latin(txt):
    norm_txt = unicodedata.normalize('NFD', txt)
    latin_base = False

    preserve = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base:
            continue # skip diacritic on Latin base char
        preserve.append(c)

        # if not combining char, it's a new base char
        if not unicodedata.combining(c):
            latin_base = c in string.ascii_letters

    shaved = ''.join(preserve)

    return unicodedata.normalize('NFC', shaved)

greek = 'Ζέφυρος, Zéfiro'

remove_marks(greek), remove_marks_latin(greek)

('Ζεφυρος, Zefiro', 'Ζέφυρος, Zefiro')

In [41]:
# you can go even more extreme with translations
# single_map: one character replaced by one character
single_map = str.maketrans("""‚ƒ„ˆ‹‘’“”•–—˜›""",
 """'f"^<''""---~>""")

# multi_map: one character replaced by a string
multi_map = str.maketrans({
 '€': 'EUR',
 '…': '...',
 'Æ': 'AE',
 'æ': 'ae',
 'Œ': 'OE',
 'œ': 'oe',
 '™': '(TM)',
 '‰': '<per mille>',
 '†': '**',
 '‡': '***',
})

multi_map.update(single_map) # merge tables

def dewinize(txt: str):
    # replace cp1252-specific symbols with ASCII characters or sequences
    return txt.translate(multi_map)

def asciize(txt):
    no_marks = remove_marks_latin(dewinize(txt))
    no_marks = no_marks.replace('ß', 'ss')

    # NFKC replaces remaining compatibility characters, such as ½
    return unicodedata.normalize('NFKC', no_marks)

In [42]:
dewinize(order), asciize(order)

('"Herr Voß: - ½ cup of OEtker(TM) caffè latte - bowl of açaí."',
 '"Herr Voss: - 1⁄2 cup of OEtker(TM) caffe latte - bowl of acai."')

### Sorting Text

In [None]:
# you can sort text by comparing code points, but this gives the wrong
# order for non-ASCII characters

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted(fruits) # açaí should come first, and cajá should come before caju

['acerola', 'atemoia', 'açaí', 'caju', 'cajá']

In [None]:
# we can use locale.strxfrm instead:
import locale

my_locale = locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
my_locale, sorted(fruits, key=locale.strxfrm)
# setlocale changes a global setting, so don't call it in a library
# if the locale isn't installed in your OS, it raises locale.Error
# in short, YMMV

('pt_BR.UTF-8', ['açaí', 'acerola', 'atemoia', 'cajá', 'caju'])
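Because of those caveats, application code may want to guard the call and put the previous setting back afterwards. The sketch below is one way to do it; `sorted_with_locale` is a hypothetical helper, and since the setting is still process-wide while the function runs, it is not safe to call from multiple threads.

```python
import locale

def sorted_with_locale(words, locale_name):
    """Sort words using locale_name for collation, then restore the old locale.

    Returns None when the operating system does not provide locale_name.
    """
    previous = locale.setlocale(locale.LC_COLLATE)   # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)

fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']
print(sorted_with_locale(fruits, 'pt_BR.UTF-8'))
```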

In [2]:
# pyuca is a pure-Python implementation of the 
# Unicode Collation Algorithm (UCA)
import pyuca

In [3]:
coll = pyuca.Collator()
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted(fruits, key=coll.sort_key)

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

In [4]:
# the Unicode database stores metadata for each character, such as its official name
from unicodedata import name

name('A'), name('😍')

('LATIN CAPITAL LETTER A', 'SMILING FACE WITH HEART-SHAPED EYES')
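Names are only one piece of the metadata. As a small illustration, `unicodedata.category` returns the two-letter general category and `unicodedata.lookup` is the inverse of `name`:

```python
import unicodedata

# A, é, ½ and € written as escapes
for char in 'A\u00e9\u00bd\u20ac':
    print(char,
          unicodedata.name(char),
          unicodedata.category(char),   # two-letter general category
          sep='\t')

# lookup() goes from an official name back to the character
euro = unicodedata.lookup('EURO SIGN')
```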

### Example: Unicode character finder utility

In [17]:
# character finder utility
import sys, unicodedata

START, END = ord(' '), sys.maxunicode + 1

def find(*query_words, start=START, end=END):
    query = {w.upper() for w in query_words} # set comprehension
    for code in range(start, end):
        char = chr(code)
        name = unicodedata.name(char, None) # None for unnamed code points
        if name and query.issubset(name.split()):
            print(f'U+{code:04X}\t{char}\t{name}')

def search(words):
    if words:
        find(*words)
    else:
        print("Please give words")

# both calls print their matches and return None
search(['smiling', 'cat']), search(['tagalog'])

U+1F638	😸	GRINNING CAT FACE WITH SMILING EYES
U+1F63A	😺	SMILING CAT FACE WITH OPEN MOUTH
U+1F63B	😻	SMILING CAT FACE WITH HEART-SHAPED EYES
U+1700	ᜀ	TAGALOG LETTER A
U+1701	ᜁ	TAGALOG LETTER I
U+1702	ᜂ	TAGALOG LETTER U
U+1703	ᜃ	TAGALOG LETTER KA
U+1704	ᜄ	TAGALOG LETTER GA
U+1705	ᜅ	TAGALOG LETTER NGA
U+1706	ᜆ	TAGALOG LETTER TA
U+1707	ᜇ	TAGALOG LETTER DA
U+1708	ᜈ	TAGALOG LETTER NA
U+1709	ᜉ	TAGALOG LETTER PA
U+170A	ᜊ	TAGALOG LETTER BA
U+170B	ᜋ	TAGALOG LETTER MA
U+170C	ᜌ	TAGALOG LETTER YA
U+170D	ᜍ	TAGALOG LETTER RA
U+170E	ᜎ	TAGALOG LETTER LA
U+170F	ᜏ	TAGALOG LETTER WA
U+1710	ᜐ	TAGALOG LETTER SA
U+1711	ᜑ	TAGALOG LETTER HA
U+1712	ᜒ	TAGALOG VOWEL SIGN I
U+1713	ᜓ	TAGALOG VOWEL SIGN U
U+1714	᜔	TAGALOG SIGN VIRAMA
U+1715	᜕	TAGALOG SIGN PAMUDPOD
U+171F	ᜟ	TAGALOG LETTER ARCHAIC RA


(None, None)
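Printing inside the search function makes the results hard to reuse, which is why the calls above evaluate to `(None, None)`. A possible variation, sketched here under the hypothetical name `find_chars`, yields the matches and leaves the printing to the caller. The default upper bound of U+20000 is an assumption made to keep the scan short.

```python
import unicodedata

def find_chars(*query_words, start=0x20, end=0x20000):
    """Yield (code point, char, name) for names containing every query word.

    Pass end=sys.maxunicode + 1 to cover the whole code space.
    """
    query = {word.upper() for word in query_words}
    for code in range(start, end):
        char = chr(code)
        char_name = unicodedata.name(char, None)
        if char_name and query.issubset(char_name.split()):
            yield code, char, char_name

for code, char, char_name in find_chars('smiling', 'cat'):
    print(f'U+{code:04X}\t{char}\t{char_name}')
```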

In [15]:
# numerical character metadata
import unicodedata
import re

re_digit = re.compile(r'\d')

sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'

for char in sample:
    print(f'U+{ord(char):04x}',
        char.center(6),
        're_dig' if re_digit.match(char) else '-',
        'isdig' if char.isdigit() else '-',
        'isnum' if char.isnumeric() else '-',
        f'{unicodedata.numeric(char):5.2f}',
        unicodedata.name(char),
        sep='\t')

U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00bc	  ¼   	-	-	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00b2	  ²   	-	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136b	  ፫   	-	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216b	  Ⅻ   	-	-	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	-	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	-	-	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	-	-	isnum	 6.00	CIRCLED IDEOGRAPH SIX
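The table shows that `isdigit()` and `isnumeric()` accept characters that `int()` rejects. A minimal sketch of a safer guard uses `str.isdecimal()`, which is true only for decimal digits; the helper name `to_int` is hypothetical.

```python
def to_int(text):
    """Convert text to int if it consists only of decimal digits, else None."""
    return int(text) if text.isdecimal() else None

samples = {
    'ascii': '42',
    'devanagari': '\u0969\u0968',   # DEVANAGARI DIGIT THREE, DIGIT TWO
    'superscript': '\u00b2',        # isdigit() is True, but int() would raise
    'ethiopic': '\u136b',
    'roman': '\u216b',
}
for label, text in samples.items():
    print(f'{label:12}', to_int(text))
```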


In [19]:
# some functions in the standard library accept both str and bytes
# arguments and behave differently depending on the type

# examples include the re and os modules
import os
os.listdir('.'), os.listdir(b'.')

(['cafe.txt', '4.ipynb'], [b'cafe.txt', b'4.ipynb'])
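The `re` module shows the same dual behavior. In this sketch, a str pattern treats `\d` as any Unicode decimal digit, while the equivalent bytes pattern matches ASCII digits only. The sample sentence is made up for illustration.

```python
import re

# Tamil digits for 1729, followed by the same number in ASCII digits
text = 'Ramanujan saw \u0be7\u0bed\u0be8\u0bef as 1729'
octets = text.encode('utf-8')

re_numbers_str = re.compile(r'\d+')      # str pattern: any Unicode decimal digit
re_numbers_bytes = re.compile(rb'\d+')   # bytes pattern: ASCII 0-9 only

print(re_numbers_str.findall(text))
print(re_numbers_bytes.findall(octets))
```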

To conclude:

• One character does not mean one byte.
• Encoding and decoding are tricky. Watch out for defaults.
• Normalization is needed for reliable text matching.
• You can do a lot with the Unicode database, such as the utility above that searches for characters by name in fewer than 30 lines of code.

### Soapbox

Mojibake is the gibberish text you get when bytes are decoded with the wrong codec, as in the cp1252 example above.
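A minimal sketch of how mojibake arises and how it can sometimes be undone, assuming you know which wrong codec was applied and that the wrong step is reversible:

```python
original = 'caf\u00e9'                                   # 'café'
garbled = original.encode('utf-8').decode('cp1252')     # wrong codec on the way in
repaired = garbled.encode('cp1252').decode('utf-8')     # undo the wrong step
print(garbled, repaired)
```

If the wrong decode used `errors='replace'` or `errors='ignore'`, the original bytes are gone and this repair no longer works.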

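Storing str code points in RAM is flexible: since Python 3.3, CPython uses 1, 2, or 4 bytes per code point, depending on the widest character in the string. This is an implementation detail, not something your code should rely on.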