# Chapter 11: Unicode Deep Dive

Unicode is the universal standard for representing text from every writing system. Python's `str` type is built on Unicode, and the `unicodedata` module provides tools for inspecting and manipulating Unicode characters. This notebook covers code points, character properties, normalization forms, and comparison gotchas.

## Key Concepts
- **Code point**: A unique integer assigned to each character (e.g., U+0041 = 'A')
- **Character name**: A descriptive name for each code point (e.g., LATIN CAPITAL LETTER A)
- **Category**: Two-letter classification (e.g., Lu = Letter, uppercase)
- **Normalization**: Converting text to a canonical form so that equivalent code point sequences compare equal

## Unicode Code Points and Character Names

Every Unicode character has a unique code point (an integer) and a name. Python uses `ord()` and `chr()` to convert between characters and code points, and `unicodedata.name()` to look up names.

In [None]:
import unicodedata

# Code points and character names
characters: list[str] = ["A", "é", "α", "中", "€", "🐍"]

print(f"{'Char':<6} {'Code Point':<12} {'Name':<40} {'Category'}")
print("-" * 75)
for char in characters:
    cp = ord(char)
    name = unicodedata.name(char, "<unknown>")
    cat = unicodedata.category(char)
    # Pad the whole "U+XXXX" field so 5-digit code points still line up with the header
    print(f"{char:<6} {f'U+{cp:04X}':<12} {name:<40} {cat}")

# Reverse lookup: name -> character
euro = unicodedata.lookup("EURO SIGN")
print(f"\nunicodedata.lookup('EURO SIGN') = {euro!r}")

# Using \N{} escape in string literals
snowman = "\N{SNOWMAN}"
print(f"\\N{{SNOWMAN}} = {snowman}")
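
The cell above goes from character to code point with `ord()`. As a minimal sketch of the opposite direction, the snippet below round-trips a few code points through `chr()` and shows that several literal spellings produce the same character. The variable names are illustrative only.

```python
# chr() is the inverse of ord(): code point -> character
code_points = [0x41, 0xE9, 0x4E2D, 0x1F40D]
chars = [chr(cp) for cp in code_points]

for cp, ch in zip(code_points, chars):
    # Round trip: ord(chr(cp)) gives back the original code point
    assert ord(ch) == cp
    print(f"U+{cp:04X} -> {ch!r} -> U+{ord(ch):04X}")

# Four spellings of the same one-character string
same_char = [
    "\xe9",                                 # 2-digit hex escape
    "\u00e9",                               # 4-digit hex escape
    "\N{LATIN SMALL LETTER E WITH ACUTE}",  # named escape
    chr(0xE9),                              # built from the integer
]
all_equal = len(set(same_char)) == 1
print(f"All spellings equal: {all_equal}")

# Valid code points run from 0 to 0x10FFFF; chr() rejects anything larger
try:
    chr(0x110000)
    out_of_range_error = ""
except ValueError as exc:
    out_of_range_error = str(exc)
print(f"chr(0x110000) raised: {out_of_range_error}")
```

Code points above U+FFFF, such as the snake emoji, need the 8-digit `\U0001F40D` form when written as an escape.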

## The unicodedata Module

The `unicodedata` module provides access to the Unicode Character Database. Key functions include `name()`, `category()`, `normalize()`, `numeric()`, and `bidirectional()`.

In [None]:
import unicodedata

# Exploring character properties
char: str = "é"
print(f"Character: {char!r}")
print(f"Name: {unicodedata.name(char)}")
print(f"Category: {unicodedata.category(char)}")
print(f"Decimal: {unicodedata.decimal(char, None)}")
print(f"Numeric: {unicodedata.numeric(char, None)}")
print(f"Bidirectional: {unicodedata.bidirectional(char)}")

# Numeric values for various scripts
numeric_chars: list[str] = ["5", "\u0665", "\u0e55", "\u4e94", "\u2165"]
print(f"\n{'Char':<6} {'Name':<35} {'Numeric Value'}")
print("-" * 60)
for ch in numeric_chars:
    name = unicodedata.name(ch, "<unknown>")
    num = unicodedata.numeric(ch, None)
    print(f"{ch:<6} {name:<35} {num}")
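
The module exposes three related numeric lookups: `decimal()`, `digit()`, and `numeric()`. The paragraph above does not list `digit()`, but it is part of `unicodedata` as well. The sketch below, using a hypothetical helper named `numeric_profile`, shows how the three differ for a plain digit, a superscript, a fraction, and a letter.

```python
import unicodedata

def numeric_profile(ch: str) -> tuple:
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),   # only true decimal digits (category Nd)
        unicodedata.digit(ch, None),     # also digit-like forms such as superscripts
        unicodedata.numeric(ch, None),   # any numeric value, including fractions
    )

samples = ["5", "\u00b2", "\u00bd", "x"]  # 5, superscript two, one half, letter x
for ch in samples:
    print(f"{ch!r:<8} {numeric_profile(ch)}")
```

Passing a default (here `None`) as the second argument avoids the `ValueError` that each function raises when the character has no such value.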

## Unicode Categories

Each Unicode character belongs to a **general category** identified by a two-letter code. The first letter indicates the major class, the second letter the subclass.

| Code | Meaning | Example |
|------|---------|--------|
| Lu | Letter, uppercase | A, B, Z |
| Ll | Letter, lowercase | a, b, z |
| Nd | Number, decimal digit | 0-9, Arabic-Indic digits |
| Zs | Separator, space | space, no-break space |
| Mn | Mark, nonspacing | combining accents |
| Sc | Symbol, currency | $, €, £ |
| So | Symbol, other | emojis, misc symbols |

In [None]:
import unicodedata

# Demonstrating Unicode categories
test_chars: list[tuple[str, str]] = [
    ("A", "uppercase letter"),
    ("a", "lowercase letter"),
    ("3", "decimal digit"),
    (" ", "space"),
    ("\u0301", "combining acute accent"),
    ("$", "dollar sign"),
    ("€", "euro sign"),
    ("\u00A0", "no-break space"),
]

print(f"{'Char':<8} {'Repr':<12} {'Category':<10} {'Description'}")
print("-" * 55)
for char, desc in test_chars:
    cat = unicodedata.category(char)
    print(f"{char:<8} {char!r:<12} {cat:<10} {desc}")

# Spot-check a few categories with assertions
assert unicodedata.category("A") == "Lu"  # Letter, uppercase
assert unicodedata.category("3") == "Nd"  # Number, decimal digit
assert unicodedata.category(" ") == "Zs"  # Separator, space
print("\nAll category assertions passed.")
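
Because the first letter of a category code is the major class, a string can be profiled at either level of detail. The following is a small sketch, with hypothetical helper names, that tallies categories using `collections.Counter`.

```python
import unicodedata
from collections import Counter

def category_counts(text: str) -> Counter:
    """Count how many characters fall into each two-letter general category."""
    return Counter(unicodedata.category(ch) for ch in text)

def major_class_counts(text: str) -> Counter:
    """Count by major class only (first letter: L, N, P, Z, ...)."""
    return Counter(unicodedata.category(ch)[0] for ch in text)

sample = "Hello, World 42!"
print(category_counts(sample))
print(major_class_counts(sample))
```

For this sample the letters split into 2 uppercase (Lu) and 8 lowercase (Ll), and the comma and exclamation mark fall under Po (punctuation, other), a category not shown in the table above.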

## NFC vs NFD Normalization

Some characters can be represented in multiple ways. For example, 'é' can be written as:
- **NFC** (Composed): A single code point `\u00e9`
- **NFD** (Decomposed): Base letter `e` (U+0065) + combining acute accent (U+0301)

These look identical on screen but are different sequences of code points, which affects string comparison.

In [None]:
import unicodedata

# NFC (Composed) vs NFD (Decomposed)
nfc: str = "\u00e9"          # é as single codepoint (LATIN SMALL LETTER E WITH ACUTE)
nfd: str = "\u0065\u0301"   # e + combining acute accent

print(f"NFC: {nfc!r}  (len={len(nfc)})")
print(f"NFD: {nfd!r}  (len={len(nfd)})")
print(f"Look the same: '{nfc}' vs '{nfd}'")
print(f"But are NOT equal: {nfc == nfd}")

# Normalize to compare
nfc_normalized = unicodedata.normalize("NFC", nfd)
nfd_normalized = unicodedata.normalize("NFD", nfc)

print(f"\nNormalize NFD -> NFC: {nfc_normalized!r} (len={len(nfc_normalized)})")
print(f"Normalize NFC -> NFD: {nfd_normalized!r} (len={len(nfd_normalized)})")
print(f"After NFC normalization, equal: {nfc == nfc_normalized}")
print(f"After NFD normalization, equal: {nfd == nfd_normalized}")
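
When only a yes/no answer is needed, `unicodedata.is_normalized()` (available from Python 3.8) checks a string's form without building a new one. The sketch below assumes Python 3.8 or later; `ensure_nfc` is a hypothetical helper, not something defined earlier in this notebook.

```python
import unicodedata

composed = "caf\u00e9"      # é as one code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

for label, text in [("composed", composed), ("decomposed", decomposed)]:
    is_nfc = unicodedata.is_normalized("NFC", text)
    is_nfd = unicodedata.is_normalized("NFD", text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{label:<11} NFC={is_nfc!s:<6} NFD={is_nfd!s:<6} utf8_bytes={n_bytes}")

def ensure_nfc(text: str) -> str:
    """Return text in NFC, skipping the rewrite when it is already normalized."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

print(ensure_nfc(decomposed) == composed)
```

The two forms also differ in encoded size: the decomposed string takes one more byte in UTF-8, which matters when strings are hashed or stored.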

In [None]:
import unicodedata

# NFKC and NFKD: compatibility normalization
# These also replace compatibility characters with their plain equivalents
# (e.g. a ligature becomes separate letters), which can lose formatting information

# Example: the 'fi' ligature and the superscript '2'
ligature: str = "\ufb01"  # fi ligature
superscript: str = "\u00b2"  # superscript 2
half: str = "\u00bd"  # vulgar fraction one half

print(f"{'Original':<12} {'NFKC':<12} {'NFKD':<12} {'Description'}")
print("-" * 55)
for char, desc in [(ligature, "fi ligature"), (superscript, "superscript 2"), (half, "fraction 1/2")]:
    nfkc = unicodedata.normalize("NFKC", char)
    nfkd = unicodedata.normalize("NFKD", char)
    print(f"{char!r:<12} {nfkc!r:<12} {nfkd!r:<12} {desc}")

# Practical use: search normalization
def normalize_for_search(text: str) -> str:
    """Normalize text for case-insensitive, accent-insensitive search."""
    # NFKD decomposes, then we can strip combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove combining characters (category 'M')
    stripped = "".join(c for c in decomposed if not unicodedata.category(c).startswith("M"))
    return stripped.casefold()

print(f"\nSearch normalization examples:")
for word in ["café", "CAFÉ", "résumé", "naïve"]:
    print(f"  {word!r:>12} -> {normalize_for_search(word)!r}")
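
To see why NFD leaves a ligature alone while NFKD expands it, the raw mapping can be inspected with `unicodedata.decomposition()`. In its output, a leading tag in angle brackets marks a compatibility mapping, and an untagged mapping is canonical. The helper `decomposition_kind` below is a hypothetical name used only for this sketch.

```python
import unicodedata

def decomposition_kind(ch: str) -> str:
    """Classify a character's decomposition as 'canonical', 'compatibility', or 'none'."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings start with a tag such as <compat> or <super>
    return "compatibility" if mapping.startswith("<") else "canonical"

for ch in ["\u00e9", "\ufb01", "\u00b2", "A"]:
    mapping = unicodedata.decomposition(ch)
    print(f"{ch!r:<10} {mapping!r:<24} {decomposition_kind(ch)}")

# Canonical forms (NFC/NFD) apply only canonical mappings, so the ligature survives
ligature = "\ufb01"
print(unicodedata.normalize("NFD", ligature) == ligature)
print(unicodedata.normalize("NFKD", ligature))
```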

## Handling Combining Characters

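Combining characters are marks (general categories Mn, Mc, and Me; most diacritics are Mn, nonspacing) that attach to the preceding base character. Nonspacing marks take up no width of their own and only modify how the base character is displayed, yet each one still counts as a separate code point.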

In [None]:
import unicodedata

# Combining characters example
base: str = "e"
acute: str = "\u0301"      # COMBINING ACUTE ACCENT
cedilla: str = "\u0327"    # COMBINING CEDILLA
tilde: str = "\u0303"     # COMBINING TILDE

combined_e_acute = base + acute
combined_e_cedilla = base + cedilla
combined_n_tilde = "n" + tilde

print(f"e + combining acute:   {combined_e_acute!r}  displays as: {combined_e_acute}")
print(f"e + combining cedilla: {combined_e_cedilla!r}  displays as: {combined_e_cedilla}")
print(f"n + combining tilde:   {combined_n_tilde!r}  displays as: {combined_n_tilde}")

# Stacking multiple combining characters
stacked = "a" + "\u0300" + "\u0301" + "\u0302"  # grave + acute + circumflex
print(f"\nStacked accents: {stacked!r}  displays as: {stacked}")
print(f"Length: {len(stacked)} code points (but visually one character)")

# Detecting combining characters
text = "café"  # NFD form might have combining characters
nfd_text = unicodedata.normalize("NFD", text)
print(f"\n{text!r} in NFD form: {nfd_text!r}")
for i, ch in enumerate(nfd_text):
    cat = unicodedata.category(ch)
    name = unicodedata.name(ch)
    is_combining = cat.startswith("M")
    print(f"  [{i}] U+{ord(ch):04X} {cat} {name}{' (combining)' if is_combining else ''}")
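
When several marks follow one base character, their order can differ between strings that render the same. Normalization sorts such marks by their canonical combining class, which `unicodedata.combining()` reports (0 for base characters). A brief sketch with an acute accent and a cedilla on the same letter:

```python
import unicodedata

acute = "\u0301"    # COMBINING ACUTE ACCENT, drawn above the base
cedilla = "\u0327"  # COMBINING CEDILLA, attached below the base

for ch in ["e", acute, cedilla]:
    print(f"U+{ord(ch):04X} combining class = {unicodedata.combining(ch)}")

# Same marks in two different orders: different code point sequences
order_a = "e" + acute + cedilla
order_b = "e" + cedilla + acute
print(f"Raw equality: {order_a == order_b}")

# NFD puts marks with a lower combining class first, so both converge
nfd_a = unicodedata.normalize("NFD", order_a)
nfd_b = unicodedata.normalize("NFD", order_b)
print(f"After NFD:    {nfd_a == nfd_b}")
```

Marks that share a combining class, like the stacked grave, acute, and circumflex above, are not reordered, because their relative order changes the rendering.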

## String Comparison Gotchas with Unicode

Because the same visual character can have multiple representations, direct string comparison (`==`) can produce surprising results. Always normalize before comparing.

In [None]:
import unicodedata

# Gotcha 1: Same-looking strings that are not equal
s1: str = "caf\u00e9"         # 'é' as single code point
s2: str = "cafe\u0301"        # 'e' + combining acute

print(f"s1: {s1!r} -> {s1}")
print(f"s2: {s2!r} -> {s2}")
print(f"s1 == s2: {s1 == s2}  (surprise!)")
print(f"len(s1)={len(s1)}, len(s2)={len(s2)}")

# Fix: normalize before comparing
s1_nfc = unicodedata.normalize("NFC", s1)
s2_nfc = unicodedata.normalize("NFC", s2)
print(f"\nAfter NFC normalization:")
print(f"s1_nfc == s2_nfc: {s1_nfc == s2_nfc}")

# Gotcha 2: Different characters that look similar
latin_a: str = "A"         # U+0041 LATIN CAPITAL LETTER A
greek_a: str = "\u0391"   # U+0391 GREEK CAPITAL LETTER ALPHA
cyrillic_a: str = "\u0410"  # U+0410 CYRILLIC CAPITAL LETTER A

print(f"\nHomoglyphs (look-alikes):")
print(f"  Latin A:    {latin_a!r} (U+{ord(latin_a):04X})")
print(f"  Greek A:    {greek_a!r} (U+{ord(greek_a):04X})")
print(f"  Cyrillic A: {cyrillic_a!r} (U+{ord(cyrillic_a):04X})")
print(f"  All look like 'A' but: {latin_a == greek_a} {latin_a == cyrillic_a}")
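
Normalization does not help with homoglyphs, since they are different characters by definition. One rough way to flag suspicious input is to check whether a word mixes scripts. The sketch below is a heuristic only: `unicodedata` does not expose the Unicode Script property, so it assumes the first word of each letter's character name (LATIN, GREEK, CYRILLIC, ...) is a usable hint. The helper names are hypothetical.

```python
import unicodedata

def script_hints(text: str) -> set[str]:
    """Collect the first word of each letter's Unicode name as a rough script hint."""
    hints: set[str] = set()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                hints.add(name.split()[0])
    return hints

def looks_mixed_script(text: str) -> bool:
    """True when the letters in text appear to come from more than one script."""
    return len(script_hints(text)) > 1

genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

for word in [genuine, spoofed]:
    print(f"{word!r:<12} hints={sorted(script_hints(word))} mixed={looks_mixed_script(word)}")
```

Legitimate text can mix scripts too (for instance a product name inside a Russian sentence), so a check like this is better suited to single identifiers such as usernames or domain labels.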

## casefold() vs lower() for Case-Insensitive Comparison

`str.casefold()` is more aggressive than `str.lower()` and is the correct choice for case-insensitive comparisons, especially with non-ASCII text. For example, it folds the German eszett 'ß' (sharp s) to 'ss', which `lower()` leaves unchanged.

In [None]:
# casefold() vs lower()
# For ASCII text, they behave the same
ascii_text: str = "HELLO WORLD"
print(f"ASCII lower():    {ascii_text.lower()!r}")
print(f"ASCII casefold(): {ascii_text.casefold()!r}")

# For the German eszett, they differ
eszett: str = "Straße"  # German word with sharp s
print(f"\nOriginal:  {eszett!r}")
print(f"lower():   {eszett.lower()!r}")
print(f"casefold(): {eszett.casefold()!r}")
print(f"Note: casefold() converts 'ß' to 'ss'")

# Practical comparison function
def case_insensitive_equal(a: str, b: str) -> bool:
    """Compare strings case-insensitively using casefold."""
    return a.casefold() == b.casefold()

# Test cases
pairs: list[tuple[str, str]] = [
    ("hello", "HELLO"),
    ("Straße", "STRASSE"),
    ("café", "CAFÉ"),
    ("Σίσυφος", "ΣΊΣΥΦΟΣ"),
]
print(f"\n{'String A':<15} {'String B':<15} {'lower==':<10} {'casefold=='}")
print("-" * 55)
for a, b in pairs:
    lower_eq = a.lower() == b.lower()
    fold_eq = a.casefold() == b.casefold()
    print(f"{a:<15} {b:<15} {str(lower_eq):<10} {fold_eq}")
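
The eszett is not the only character where the two methods diverge. As a short sketch, the loop below compares `lower()` and `casefold()` on a few single characters, written as escapes so the code points are unambiguous.

```python
cases = [
    ("\u00df", "LATIN SMALL LETTER SHARP S"),
    ("\u1e9e", "LATIN CAPITAL LETTER SHARP S"),
    ("\ufb01", "LATIN SMALL LIGATURE FI"),
    ("\u03c2", "GREEK SMALL LETTER FINAL SIGMA"),
]

# Map each character to its (lower, casefold) pair
results = {ch: (ch.lower(), ch.casefold()) for ch, _ in cases}

for ch, label in cases:
    lowered, folded = results[ch]
    print(f"{label:<32} lower={lowered!r:<6} casefold={folded!r}")
```

Note that folding can change the length of a string (one sharp s becomes two letters), so positions found in a folded string do not necessarily map back to the original.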

In [None]:
import unicodedata

# Putting it all together: a robust string comparison utility
def unicode_equal(
    a: str,
    b: str,
    *,
    case_sensitive: bool = True,
    normalization: str = "NFC",
) -> bool:
    """Compare two Unicode strings with normalization and optional case folding."""
    a_norm = unicodedata.normalize(normalization, a)
    b_norm = unicodedata.normalize(normalization, b)
    if not case_sensitive:
        a_norm = a_norm.casefold()
        b_norm = b_norm.casefold()
    return a_norm == b_norm

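# Test with various tricky cases
# The escaped strings are bound to names first: a backslash inside an f-string
# expression is a SyntaxError before Python 3.12
composed, decomposed = "caf\u00e9", "cafe\u0301"
print("Robust Unicode comparison:")
print(f"  NFC vs NFD 'café':  {unicode_equal(composed, decomposed)}")
print(f"  Case-insensitive:   {unicode_equal('Café', 'CAFÉ', case_sensitive=False)}")
print(f"  Straße vs STRASSE:  {unicode_equal('Straße', 'STRASSE', case_sensitive=False)}")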
print(f"  Exact match:        {unicode_equal('hello', 'Hello', case_sensitive=True)}")
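
One refinement worth knowing: for a small number of characters, `casefold()` can return a string that is no longer normalized. Python's Unicode HOWTO suggests normalizing both before and after folding. The sketch below follows that recipe as an alternative to the function above, not a replacement mandated by it; the leading-underscore helper name is an arbitrary choice.

```python
import unicodedata

def _nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)

def compare_caseless(a: str, b: str) -> bool:
    """Caseless comparison that normalizes before and after case folding."""
    return _nfd(_nfd(a).casefold()) == _nfd(_nfd(b).casefold())

single_char = "\u00ea"        # LATIN SMALL LETTER E WITH CIRCUMFLEX
multiple_chars = "E\u0302"    # capital E + COMBINING CIRCUMFLEX ACCENT

print(compare_caseless(single_char, multiple_chars))
print(compare_caseless("Stra\u00dfe", "STRASSE"))
print(compare_caseless("cafe", "caf\u00e9"))
```

Accents still matter here: the last pair is unequal because this comparison ignores case and representation, not diacritics.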

## Summary

### Key Takeaways

- Every Unicode character has a **code point** (`ord()`) and a **name** (`unicodedata.name()`)
- Characters belong to **categories** (Lu, Ll, Nd, Zs, Mn, etc.) accessible via `unicodedata.category()`
- **NFC** (composed) and **NFD** (decomposed) are different representations of the same character
- **NFKC/NFKD** additionally replace compatibility characters (ligatures, superscripts)
- Always **normalize** strings before comparison to avoid false mismatches
- Use `casefold()` instead of `lower()` for case-insensitive comparison -- it handles special cases like German eszett
- **Combining characters** (categories starting with 'M', most commonly 'Mn') attach to the preceding base character and count as separate code points in `len()`
- **Homoglyphs** (look-alike characters from different scripts) are not equal despite appearing identical