# Chapter 4: Encoding and Annotation Schemes
## Unicode
Normalization and regexes

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## We import `regex`
The standard `re` module lacks support for Unicode property classes such as `\p{Greek}`. We use the third-party `regex` module instead.
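As a quick check of this limitation, the following sketch uses only the standard library and tries to compile a Unicode property class with `re`. It assumes a recent Python 3, where unknown letter escapes in a pattern raise an error. The helper name `supports_unicode_properties` is hypothetical and not part of the book's programs.

```python
import re as std_re  # standard-library module, aliased so it does not clash with `regex as re`


def supports_unicode_properties(module):
    """Return True if `module` can compile a Unicode property class."""
    try:
        module.compile(r'\p{Greek}+')
    except module.error:
        # The pattern was rejected: property classes are not supported
        return False
    return True


print(supports_unicode_properties(std_re))  # False
```

Passing the `regex` module to the same helper should return `True`, provided `regex` is installed.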

In [1]:
import regex as re
import unicodedata

## Code points

In [2]:
'\N{LATIN CAPITAL LETTER C}'

'C'

In [3]:
'\N{LATIN CAPITAL LETTER E WITH CIRCUMFLEX}'

'Ê'

In [4]:
'\N{LATIN CAPITAL LETTER E WITH CIRCUMFLEX}' == 'Ê'

True

In [5]:
'\N{GREEK CAPITAL LETTER GAMMA}'

'Γ'

In [6]:
ord('C'), ord('Γ')

(67, 915)

In [7]:
chr(67), chr(915)

('C', 'Γ')

In [8]:
hex(67), hex(915)

('0x43', '0x393')
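Unicode charts usually write code points in the `U+XXXX` notation rather than as Python hexadecimal literals. A minimal sketch of the conversion, with a hypothetical helper name `code_point_label`:

```python
def code_point_label(char):
    """Format a single character as U+XXXX, padded to at least four hex digits."""
    return f'U+{ord(char):04X}'


print([code_point_label(c) for c in 'CΓ'])  # ['U+0043', 'U+0393']
```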

## Composing characters

In [9]:
e_1 = '\N{LATIN CAPITAL LETTER E WITH CIRCUMFLEX}'
e_1

'Ê'

In [10]:
e_2 = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
e_2

'Ê'

Visually equivalent, but are they equal?

In [11]:
e_1 == e_2

False

In [12]:
[hex(ord(cp)) for cp in e_1]

['0xca']

In [13]:
[hex(ord(cp)) for cp in e_2]

['0x45', '0x302']

## Normalization

In [14]:
unicodedata.decomposition(e_1)

'0045 0302'

In [15]:
[hex(ord(cp)) for cp in unicodedata.normalize('NFD', e_1)]

['0x45', '0x302']

In [16]:
[hex(ord(cp)) for cp in unicodedata.normalize('NFD', e_2)]

['0x45', '0x302']

In [17]:
[hex(ord(cp)) for cp in unicodedata.normalize('NFC', e_1)]

['0xca']

In [18]:
[hex(ord(cp)) for cp in unicodedata.normalize('NFC', e_2)]

['0xca']

In [19]:
unicodedata.normalize('NFC', e_1) == unicodedata.normalize('NFC', e_2)

True
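The comparison above can be wrapped in a small helper so that visually equivalent strings compare as equal. This is a sketch: the name `nfc_equal` is hypothetical, and NFC is one possible choice. NFD would work equally well, as long as both sides use the same form.

```python
import unicodedata


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


precomposed = '\N{LATIN CAPITAL LETTER E WITH CIRCUMFLEX}'
decomposed = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

print(precomposed == decomposed)            # False
print(nfc_equal(precomposed, decomposed))   # True
```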

## Unicode Database

In [20]:
c = 'Γ'

In [21]:
ord(c), unicodedata.name(c), unicodedata.category(c)

(915, 'GREEK CAPITAL LETTER GAMMA', 'Lu')

### Western or Eastern Empire?

In [22]:
alphabet = 'αβγδεζηθικλμνξοπρστυφχψω'
match = re.search(r'^\p{InBasic_Latin}+$', alphabet)
match  # None

#### Eastern!

In [23]:
match = re.search(r'^\p{InGreek_and_Coptic}+$', alphabet)
match  # matches alphabet

<regex.Match object; span=(0, 24), match='αβγδεζηθικλμνξοπρστυφχψω'>

### Ἑλληνική

In [24]:
match = re.search(r'^\p{Greek}+$', alphabet)
match  # matches alphabet

<regex.Match object; span=(0, 24), match='αβγδεζηθικλμνξοπρστυφχψω'>
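A block and a script are different properties: a block is a contiguous range of code points, while a script groups the characters of a writing system wherever they are encoded. The sketch below illustrates the difference with the standard library only. The block ranges are taken from the Unicode block charts, and the test on the character name is merely a heuristic stand-in for the script property, which `unicodedata` does not expose. The helper name `describe` is hypothetical.

```python
import unicodedata

# Code point ranges of two Unicode blocks
GREEK_AND_COPTIC = range(0x0370, 0x0400)
GREEK_EXTENDED = range(0x1F00, 0x2000)


def describe(char):
    """Return (in Greek and Coptic?, in Greek Extended?, name starts with GREEK?)."""
    cp = ord(char)
    return (cp in GREEK_AND_COPTIC,
            cp in GREEK_EXTENDED,
            unicodedata.name(char).startswith('GREEK'))


print(describe('α'))           # (True, False, True)
print(describe(chr(0x1F19)))   # 'Ἑ': (False, True, True)
```

A polytonic letter such as U+1F19 lies outside the Greek and Coptic block, so a block-based class would be expected to reject it while the script-based class `\p{Greek}` accepts it.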

#### Searching with Unicode character names

In [25]:
match = re.search(
    r'\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}', alphabet)
match  # matches 'αβ'

<regex.Match object; span=(0, 2), match='αβ'>

#### Searching a string

In [26]:
match = re.search('αβ', alphabet)
match  # matches 'αβ'

<regex.Match object; span=(0, 2), match='αβ'>

## Sorting with a Locale

#### Using Python's `locale` module

`locale.strxfrm()`, used below as a sort key, relies on the collation rules of the underlying operating system.
It may therefore produce different results on different systems.
It did not work properly on earlier versions of macOS; as of macOS 14.1, it apparently does.

In [27]:
import locale

locale.getlocale()

('C', 'UTF-8')

In [28]:
locale.getlocale(locale.LC_CTYPE)

('C', 'UTF-8')

In [29]:
locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')

'fr_FR.UTF-8'

In [30]:
locale.getlocale(locale.LC_COLLATE)

('fr_FR', 'UTF-8')

In [31]:
# accented = 'aàäeéèêëiîïoôöœuûüαβγ'
accented = 'aäeé'
accented += accented.upper()
accented

'aäeéAÄEÉ'

In [32]:
sorted(accented)

['A', 'E', 'a', 'e', 'Ä', 'É', 'ä', 'é']

In [33]:
sorted(accented, key=locale.strxfrm)

['a', 'A', 'ä', 'Ä', 'e', 'E', 'é', 'É']

With an English locale

In [34]:
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

'en_US.UTF-8'

In [35]:
sorted(accented, key=locale.strxfrm)

['a', 'A', 'ä', 'Ä', 'e', 'E', 'é', 'É']

With a Swedish locale

In [36]:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')

'sv_SE.UTF-8'

In [37]:
sorted(accented, key=locale.strxfrm)

['a', 'A', 'e', 'E', 'é', 'É', 'ä', 'Ä']
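The cells above assume that the French, English, and Swedish locales are installed; `locale.setlocale()` raises `locale.Error` otherwise. Here is a minimal sketch of a more defensive variant that also restores the previous collation setting. The function name `sorted_with_locale` is hypothetical, and the resulting order still depends on the operating system.

```python
import locale


def sorted_with_locale(items, locale_name):
    """Sort `items` with the collation rules of `locale_name`.

    Returns None when the locale is not available on this system.
    The previous LC_COLLATE setting is restored before returning.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


for name in ('fr_FR.UTF-8', 'sv_SE.UTF-8'):
    print(name, sorted_with_locale('aäeéAÄEÉ', name))
```

Changing only `LC_COLLATE`, rather than `LC_ALL`, limits the side effects on the rest of the program.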

#### Using ICU

In [38]:
import icu

French locale

In [39]:
collator = icu.Collator.createInstance(icu.Locale('fr_FR.UTF8'))

In [40]:
sorted(accented, key=collator.getSortKey)

['a', 'A', 'ä', 'Ä', 'e', 'E', 'é', 'É']

Swedish locale

In [41]:
collator = icu.Collator.createInstance(icu.Locale('sv_SE.UTF8'))

In [42]:
sorted(accented, key=collator.getSortKey)

['a', 'A', 'e', 'E', 'é', 'É', 'ä', 'Ä']