In [None]:
# default_exp chars

# chars

> Set of functions used to preprocess french text characters.

External dependencies :

> pip install pandas

Configure tabular data display in this notebook :

In [None]:
# export
import pandas as pd
pd.options.display.max_rows = 100
pd.options.display.max_columns = 50

# Character set normalization for french

In [None]:
# export
from frenchtext.core import *

The config object from frenchtext.core defines the directory where the character normalization tables are located :

In [None]:
# export
chardatadir = config.libdata / "chars"

In [None]:
!ls {chardatadir}

charset-fr.csv		  latinletters.csv     unicode_categories.csv
charsetstats_norm.csv	  latinnumbers.csv     unicode_families.csv
charsetstats_raw.csv	  latinsymbols.csv     unsupported.stats.csv
combiningdiacritics.csv   normalizedchars.csv  utf8-windows1252-errors.csv
controlchars.csv	  stats		       windows1252-iso8859-errors.csv
cyrillic-greek-chars.csv  unicode_blocks.csv   windows1252-utf8-errors.csv


## 1. Explore french dataset characters

French datasets often contain several thousand distinct Unicode characters.

We need to reduce the number of distinct characters fed to our natural language processing applications, for three reasons :
- chars considered by the user as visually equivalent will often produce a different application behavior : this is a huge problem for the user experience
- with so many chars, the designer of the NLP application will not be able to reason about all possible combinations : this could harm the explainability of the system
- this huge number of distinct characters brings a significant amount of complexity that the NLP models will have to deal with
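
As a small illustration of the first point (a made-up example, not taken from the datasets studied here), the two strings below are rendered identically but compare as different in Python, because one uses precomposed letters and the other uses base letters followed by combining accents :

```python
# Two visually identical spellings of the word "été"
precomposed = "\u00e9t\u00e9"      # é stored as a single code point
decomposed = "e\u0301te\u0301"     # e followed by a combining acute accent

print(precomposed == decomposed)             # False
print(len(precomposed), len(decomposed))     # 3 5
```

An application that indexes or compares raw strings would treat these two spellings as unrelated words.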

### 1.1 Characters frequency in french datasets

In [None]:
dfcharstats = pd.read_csv(chardatadir / "charsetstats_raw.csv", sep=";")
dfcharstats

Unnamed: 0.1,Unnamed: 0,Code,Char,Name,Category,Subcategory,Block,CountBusiness,CountWikipedia,Count
0,0,101,e,Latin Small Letter E,Letter,Lowercase,Basic Latin,3.503992e+09,4.595437e+09,8.099428e+09
1,1,115,s,Latin Small Letter S,Letter,Lowercase,Basic Latin,1.960554e+09,2.534105e+09,4.494658e+09
2,2,97,a,Latin Small Letter A,Letter,Lowercase,Basic Latin,1.865590e+09,2.447239e+09,4.312829e+09
3,3,110,n,Latin Small Letter N,Letter,Lowercase,Basic Latin,1.819350e+09,2.388609e+09,4.207959e+09
4,5,105,i,Latin Small Letter I,Letter,Lowercase,Basic Latin,1.766427e+09,2.331461e+09,4.097888e+09
...,...,...,...,...,...,...,...,...,...,...
13497,13495,37294,醮,Cjk Unified Ideograph-91Ae,Letter,Other,CJK Unified Ideographs,0.000000e+00,1.000000e+00,1.000000e+00
13498,13496,35824,诰,Cjk Unified Ideograph-8Bf0,Letter,Other,CJK Unified Ideographs,0.000000e+00,1.000000e+00,1.000000e+00
13499,13497,26634,栊,Cjk Unified Ideograph-680A,Letter,Other,CJK Unified Ideographs,0.000000e+00,1.000000e+00,1.000000e+00
13500,13498,31787,簫,Cjk Unified Ideograph-7C2B,Letter,Other,CJK Unified Ideographs,0.000000e+00,1.000000e+00,1.000000e+00


### 1.2 Characters stats in Wikipedia dataset

- 35.7 billion chars

In [None]:
charsCountWikipedia = dfcharstats["CountWikipedia"].sum()
charsCountWikipedia

35682395281.0

- 13 502 distinct Unicode chars

In [None]:
distinctCharsWikipedia = len(dfcharstats[dfcharstats["CountWikipedia"]>0])
distinctCharsWikipedia

13502

- Only 1316 chars more frequent than 1 in 100 million

In [None]:
frequentCharsWikipedia = len(dfcharstats[dfcharstats["CountWikipedia"]>356])
frequentCharsWikipedia

1316

- Frequent chars represent 9.7 % of all distinct Unicode chars

In [None]:
pctFreqCharsWikipedia = frequentCharsWikipedia/distinctCharsWikipedia*100
pctFreqCharsWikipedia

9.74670419197156

- 99.9987 % of Wikipedia chars would be preserved if we only kept the frequent chars

In [None]:
pctPreservedCharsWikipedia = (1-dfcharstats[dfcharstats["CountWikipedia"]<=356]["CountWikipedia"].sum()/dfcharstats["CountWikipedia"].sum())*100
pctPreservedCharsWikipedia

99.99871204274157
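
The hard-coded threshold of 356 used in this section corresponds to one occurrence per 100 million characters (35 682 395 281 / 100 000 000 ≈ 356.8). The sketch below shows how such a threshold and the resulting statistics could be derived from the counts themselves. The `counts` dictionary holds invented values, chosen only to keep the arithmetic easy to follow :

```python
# Hypothetical character counts, for illustration only
counts = {"e": 900_000_000, "s": 99_999_000, "ℂ": 990, "醮": 10}

total = sum(counts.values())
# A char is "frequent" if it occurs more than once per 100 million chars
threshold = total / 100_000_000

frequent = {char for char, count in counts.items() if count > threshold}
preserved_pct = 100 * sum(counts[char] for char in frequent) / total

print(threshold)         # 10.0
print(sorted(frequent))
print(preserved_pct)
```

Deriving the threshold from the total avoids editing a constant by hand whenever the dataset changes.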

### 1.3 Characters stats in Business dataset

- 27.6 billion chars

In [None]:
charsCountBusiness = dfcharstats["CountBusiness"].sum()
charsCountBusiness

27577304956.0

- 3 763 distinct Unicode chars

In [None]:
distinctCharsBusiness = len(dfcharstats[dfcharstats["CountBusiness"]>0])
distinctCharsBusiness

3763

- Only 531 chars more frequent than 1 in 100 million

In [None]:
frequentCharsBusiness = len(dfcharstats[dfcharstats["CountBusiness"]>275])
frequentCharsBusiness

531

- Frequent chars represent 14.1 % of all distinct Unicode chars

In [None]:
pctFreqCharsBusiness = frequentCharsBusiness/distinctCharsBusiness*100
pctFreqCharsBusiness

14.11108158384268

- 99.9996 % of Business chars would be preserved if we only kept the frequent chars

In [None]:
pctPreservedCharsBusiness = (1-dfcharstats[dfcharstats["CountBusiness"]<=275]["CountBusiness"].sum()/dfcharstats["CountBusiness"].sum())*100
pctPreservedCharsBusiness

99.9996564385093

- 99.985 % of Wikipedia chars would be preserved if we only kept the frequent Business chars

In [None]:
pctPreservedBizCharsInWikipedia = (1-dfcharstats[dfcharstats["CountBusiness"]<=275]["CountWikipedia"].sum()/dfcharstats["CountWikipedia"].sum())*100
pctPreservedBizCharsInWikipedia

99.9848317525845

### 1.4 Character stats after Unicode normalization

After applying the normalization process defined below in this notebook, here are the remaining chars :

In [None]:
dfcharsnorm = pd.read_csv(chardatadir / "charset-fr.csv", sep=";")
dfcharsnorm

Unnamed: 0,FrCode,Category,SubCategory,Code,Char,CharName,CountBusiness
0,0,separator,control,0,,Reserved - End of string,0
1,1,separator,space,32,,Space,88494564
2,2,separator,space,10,\n,Char 10,9588147
3,3,separator,space,9,\t,Char 9,1522053
4,4,separator,punctuation,44,",",Comma,286106887
...,...,...,...,...,...,...,...
251,251,emoticon,object,9792,♀,Female Sign,515
252,252,emoticon,object,127881,🎉,Party Popper,356
253,253,emoticon,object,9997,✍,Writing Hand,157
254,254,emoticon,object,9993,✉,Envelope,55


#### Stats for the character families after normalization

The table below shows the number of chars in each category (after normalization) **per 100 million characters** :

In [None]:
dfblocks = dfcharsnorm.groupby(by=["Category","SubCategory"]).agg({"Char":["count","sum"],"CountBusiness":"sum"})
dfblocks["CountBusiness"] = (dfblocks["CountBusiness"] / charsCountBusiness * 100000000).astype(int)
dfblocks

Unnamed: 0_level_0,Unnamed: 1_level_0,Char,Char,CountBusiness
Unnamed: 0_level_1,Unnamed: 1_level_1,count,sum,sum
Category,SubCategory,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
emoticon,hand,12,💪👉👍👏🙏🙌👇👊👎👌✌✊,42
emoticon,head,28,🙂😉😀😂😁😊🙁😅😍😃😡🤣😄🤔😎😭👹😱😜😋🤩🙄😆😛🤪😢😇🤦,233
emoticon,object,16,⚠🔴🔥🏆⚽💡🚨💥⚡♫♂♀🎉✍✉✝,60
letter,digit,10,0123549876,3271115
letter,encoding,3,Ã�￼,249
letter,greek,2,λπ,2
letter,latin-fr,84,abcdefghijklmnopqrstuvwxyzàâäçèéêëîïôöùûüÿABCD...,91437146
letter,latin-other,25,áãåćčėğıíìńñóòõøšşßúÁÅŠÚŽ,712
letter,other,5,_&@\#,40814
separator,control,0,0,0


## 2. Characters normalization pipeline

After a detailed study of all the frequent chars, the goal is to design a normalization pipeline which can retain as much information as possible while greatly reducing the number of distinct chars.

We saw before that it is possible to preserve 99.9996% of the original chars while keeping only about 530 distinct chars. By replacing equivalent chars with a single normalized form, we can divide this number by 2 and still retain the same amount of information.

It may then be useful to limit the number of distinct characters after normalization to **255 distinct characters** : 
- if needed, french text chars can then be encoded with a single byte
- the list of supported chars can be memorized by NLP application developers and users
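
To make the single-byte argument concrete, here is a minimal sketch of such an encoding. The `supportedchars` list is a tiny invented subset, and `encodetext` / `decodetext` are hypothetical helper names that are not part of the library. The only assumption is that each supported char is identified by its position in the list :

```python
# Hypothetical, tiny supported charset : position in the list = byte value
supportedchars = ["\0", " ", "\n", ",", "a", "b", "c", "é"]
assert len(supportedchars) <= 256   # every index fits in one byte

chartocode = {char: code for code, char in enumerate(supportedchars)}

def encodetext(text):
    # One byte per character
    return bytes(chartocode[char] for char in text)

def decodetext(data):
    return "".join(supportedchars[code] for code in data)

encoded = encodetext("abc, é")
print(list(encoded))         # [4, 5, 6, 3, 1, 7]
print(decodetext(encoded))   # abc, é
```

With at most 255 supported chars plus the reserved code 0, every normalized text fits in exactly one byte per character.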

The normalization pipeline applies the following **14 steps**, which are explained and illustrated in the sections below.

- Fix encoding errors
  - fix windows1252 text read as iso8859-1
  - fix utf8 text read as windows1252
  - fix windows1252 text read as utf8
  - merge Unicode combining chars
  - ignore control chars
- Remove display attributes
  - replace latin letter symbols
  - replace latin letter ligatures
  - replace latin number symbols
- Normalize visually equivalent chars
  - replace equivalent chars 
  - replace cyrillic and greek chars looking like latin letters
- Encode infrequent chars while losing a little bit of information 
  - replace infrequent latin letters with diacritics
  - replace infrequent chars from other scripts
  - replace infrequent symbols 
  - ignore remaining chars with no glyph 

### 2.1 Frequent encoding errors : windows1252 read as iso8859-1

In [None]:
dfencodingwin1252 = pd.read_csv(chardatadir / "windows1252-iso8859-errors.csv", sep=";")
dfencodingwin1252.head(10)

Unnamed: 0,Code,Char,DecodedCode,DecodedChar
0,146,,8217,’
1,128,,8364,€
2,133,,8230,…
3,150,,8211,–
4,156,,339,œ
5,149,,8226,•
6,147,,8220,“
7,148,,8221,”
8,151,,8212,—
9,145,,8216,‘


In [None]:
print(f"{len(dfencodingwin1252)} frequent encoding errors seen in french datasets : a character encoded as windows1252 was incorrectly decoded as iso8859-1")

10 frequent encoding errors seen in french datasets : a character encoded as windows1252 was incorrectly decoded as iso8859-1


Columns :
- Code/Char : incorrectly decoded control char seen in french text
- DecodedCode/DecodedChar : properly decoded char which should replace the original control char
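
The origin of this error can be reproduced with the standard Python codecs. This is only an illustration of the mechanism, using an example string chosen for this sketch : the library itself relies on the table above, which works char by char.

```python
# The right single quotation mark is byte 0x92 in windows1252
raw = "l\u2019oeuvre".encode("cp1252")

# Decoding these bytes as iso8859-1 yields control char 146 instead of ’
wrong = raw.decode("latin-1")
print(ord(wrong[1]))              # 146

# Re-encoding as iso8859-1 and decoding as windows1252 repairs the text
fixed = wrong.encode("latin-1").decode("cp1252")
print(fixed == "l\u2019oeuvre")   # True
```

The whole-string round trip only works when every char of the text fits in iso8859-1, which is one reason to prefer a char-level replacement table on mixed text.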

### 2.2 Frequent encoding errors : utf8 read as windows1252

In [None]:
dfencodingutf8 = pd.read_csv(chardatadir / "utf8-windows1252-errors.csv", sep=";")
dfencodingutf8.head(10)

Unnamed: 0,ErrorSubstring,DecodedCode,DecodedChar
0,â‚¬,8364,€
1,â€š,8218,‚
2,Æ’,402,ƒ
3,â€ž,8222,„
4,â€¦,8230,…
5,â€,8224,†
6,â€¡,8225,‡
7,Ë†,710,ˆ
8,â€°,8240,‰
9,Å,352,Š


In [None]:
print(f"{len(dfencodingutf8)} very unlikely substrings produced when text encoded with UTF-8 is decoded by mistake as iso8859-1 or windows1252")

117 very unlikely substrings produced when text encoded with UTF-8 is decoded by mistake as iso8859-1 or windows1252


Columns :
- ErrorSubstring : unlikely substring of length 2 or 3 characters produced when UTF-8 text is decoded by mistake as windows1252
- DecodedCode/DecodedChar : properly decoded char which should be used to replace the unlikely substring
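
The error substrings can be reproduced by encoding a char as UTF-8 and decoding the bytes with the wrong codec. The helper name `simulateutf8error` is hypothetical and only used for this sketch :

```python
def simulateutf8error(char):
    # Encode correctly as UTF-8, then decode by mistake as windows1252
    return char.encode("utf-8").decode("cp1252")

print(simulateutf8error("€"))   # â‚¬
print(simulateutf8error("é"))   # Ã©

# The reverse operation recovers the original char
print("Ã©".encode("cp1252").decode("utf-8"))   # é
```

Python's `cp1252` codec rejects the few byte values that windows1252 leaves undefined, so this simulation raises an exception for some chars. A lookup table does not have that limitation.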

### 2.3 Frequent encoding errors : windows1252 read as utf8

In [None]:
dfencodingwin1252utf8 = pd.read_csv(chardatadir / "windows1252-utf8-errors.csv", sep=";")
dfencodingwin1252utf8.head()

Unnamed: 0,Code,Char,DecodedCodes,DecodedChars
0,38971,頻,"[233, 160, 187]",é »


In [None]:
print(f"{len(dfencodingwin1252utf8)} char very unlikely in french text produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8")

1 char very unlikely in french text produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8


Columns :
- Char : unlikely char produced when text encoded with iso8859-1 or windows1252 is decoded by mistake as UTF-8
- DecodedCodes/DecodedChars : properly decoded substring which should be used to replace the unlikely char
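
The single row of this table can be checked with the standard codecs : the three windows1252 bytes for "é", a no-break space and "»" happen to form one valid UTF-8 sequence.

```python
# "é", no-break space, "»" : code points 233, 160, 187
original = "\u00e9\u00a0\u00bb"

# Encoded as windows1252, then decoded by mistake as UTF-8
misread = original.encode("cp1252").decode("utf-8")

print(len(misread), ord(misread))   # 1 38971
```

Most other windows1252 byte sequences are invalid UTF-8 and make the decoder fail instead of silently producing a wrong char, which is consistent with the table containing a single entry.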

### 2.4 Unicode combining chars

In [None]:
dfcombiningchars = pd.read_csv(chardatadir / "combiningdiacritics.csv", sep=";")
dfcombiningchars.head()

Unnamed: 0,BaseChar,Code,Char,Diacritic,CombinedChar
0,A,769,́,Acute,Á
1,E,769,́,Acute,É
2,I,769,́,Acute,Í
3,O,769,́,Acute,Ó
4,U,769,́,Acute,Ú


In [None]:
print(f"{len(dfcombiningchars['Char'].unique())} combining chars {list(dfcombiningchars['Diacritic'].unique())} should be recombined with {len(dfcombiningchars)} base latin characters to produce standard latin characters with diacritics")

12 combining chars ['Acute', 'Grave', 'Circumflex', 'Cedilla', 'Tilde', 'Diaeresis', 'Long Stroke Overlay', 'Macron', 'Caron', 'Dot Below', 'Dot Above', 'Ring Above'] should be recombined with 274 base latin characters to produce standard latin characters with diacritics


Columns :
- BaseChar : latin char encountered first in the string, which will be modified by the combining char immediately following it
- Code/Char : combining char immediately following BaseChar, which should be combined with it to produce CombinedChar
- Diacritic : type of accent / diacritic applied by the combining char
- CombinedChar : latin char with diacritic produced by the combination of BaseChar and the combining Char following it
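
For comparison, the Python standard library performs a similar recombination through Unicode NFC normalization. This is not the method used by this notebook, which restricts the recombination to the base chars and diacritics listed in the table :

```python
import unicodedata

decomposed = "E\u0301"   # "E" followed by the combining acute accent (code 769)
combined = unicodedata.normalize("NFC", decomposed)

print(combined == "\u00c9")             # True : É as a single code point
print(len(decomposed), len(combined))   # 2 1
```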

### 2.5 Control chars

In [None]:
dfcontrolchars = pd.read_csv(chardatadir / "controlchars.csv", sep=";")
dfcontrolchars.loc[0,"Char"] = chr(0) # chr(0) can't be saved in CSV file
dfcontrolchars

Unnamed: 0,Code,Char,CharName
0,0,�,Char 0
1,1,,Char 1
2,2,,Char 2
3,3,,Char 3
4,4,,Char 4
...,...,...,...
120,65532,￼,Object Replacement Character
121,127995,🏻,Emoji Modifier Fitzpatrick Type-1-2
122,127996,🏼,Emoji Modifier Fitzpatrick Type-3
123,127997,🏽,Emoji Modifier Fitzpatrick Type-4


In [None]:
print(f"{len(dfcontrolchars)} control chars seen in french datasets, which can't be displayed and should be ignored")

125 control chars seen in french datasets, which can't be displayed and should be ignored


Columns :
- Code : Unicode code point for the character
- Char : control character
- CharName : name of the character in the Python Unicode database
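
As a rough point of comparison, control characters can also be detected with the Unicode general category `Cc`. The helper `iscontrol` below is hypothetical. Note that this simple test also matches `\n` and `\t`, which the normalized charset keeps, so it cannot replace the curated table :

```python
import unicodedata

def iscontrol(char):
    # "Cc" is the Unicode general category of control characters
    return unicodedata.category(char) == "Cc"

text = "oe\x7fuvre\x01"
cleaned = "".join(char for char in text if not iscontrol(char))

print(cleaned)           # oeuvre
print(iscontrol("\n"))   # True : newline is a control char too
```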

### 2.6 Latin letter symbols

In [None]:
dflatinsymbols = pd.read_csv(chardatadir / "latinsymbols.csv", sep=";")
dflatinsymbols.head(10)

Unnamed: 0,Code,Char,CharName,NormString,Layout
0,8253,‽,Interrobang,?!,
1,8265,⁉,Exclamation Question Mark,!?,
2,8448,℀,Account Of,a/c,
3,8449,℁,Addressed To The Subject,a/s,
4,8450,ℂ,Double-Struck Capital C,C,Double-Struck
5,8451,℃,Degree Celsius,°C,Unit
6,8453,℅,Care Of,c/o,
7,8454,℆,Cada Una,c/u,
8,8457,℉,Degree Fahrenheit,°F,Unit
9,8458,ℊ,Script Small G,g,Script


In [None]:
dflatinsymbols[230:240]

Unnamed: 0,Code,Char,CharName,NormString,Layout
230,119908,𝑤,Mathematical Italic Small W,w,Mathematical Italic
231,119909,𝑥,Mathematical Italic Small X,x,Mathematical Italic
232,119910,𝑦,Mathematical Italic Small Y,y,Mathematical Italic
233,119911,𝑧,Mathematical Italic Small Z,z,Mathematical Italic
234,119912,𝑨,Mathematical Bold Italic Capital A,A,Mathematical Bold Italic
235,119913,𝑩,Mathematical Bold Italic Capital B,B,Mathematical Bold Italic
236,119914,𝑪,Mathematical Bold Italic Capital C,C,Mathematical Bold Italic
237,119915,𝑫,Mathematical Bold Italic Capital D,D,Mathematical Bold Italic
238,119916,𝑬,Mathematical Bold Italic Capital E,E,Mathematical Bold Italic
239,119917,𝑭,Mathematical Bold Italic Capital F,F,Mathematical Bold Italic


In [None]:
print(f"{len(dflatinsymbols)} Unicode symbols which represent latin letters with a specific layout like {list(dflatinsymbols['Layout'].unique())}")

917 Unicode symbols which represent latin letters with a specific layout like [nan, 'Double-Struck', 'Unit', 'Script', 'Black-Letter', 'Turned', 'Rotated', 'Turned Sans-Serif', 'Reversed Sans-Serif', 'Double-Struck Italic', 'Parenthesized', 'Circled', 'Mathematical Bold', 'Mathematical Italic', 'Mathematical Bold Italic', 'Mathematical Script', 'Mathematical Bold Script', 'Mathematical Fraktur', 'Mathematical Double-Struck', 'Mathematical Bold Fraktur', 'Mathematical Sans-Serif', 'Mathematical Sans-Serif Bold', 'Mathematical Sans-Serif Italic', 'Mathematical Sans-Serif Bold Italic', 'Mathematical Monospace', 'Tortoise Shell Bracketed', 'Circled Italic', 'Squared', 'Negative Circled', 'Negative Squared', 'Crossed Negative Squared', 'Regional Indicator']


Columns :
- Code/Char/CharName : Unicode symbol representing a latin letter with a specific layout
- NormString : normalized string using only very frequent chars
- Layout : info about the specific layout applied to the latin char
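
Many of these replacements are close to what Unicode NFKC (compatibility) normalization produces. The sketch below uses the standard library to show this, and also why a dedicated table is still useful : some symbols, such as the interrobang, have no compatibility decomposition at all.

```python
import unicodedata

# ℂ (8450), ℃ (8451) and 𝑤 (119908) all have a compatibility decomposition
for symbol in ["\u2102", "\u2103", "\U0001d464"]:
    print(symbol, "=>", unicodedata.normalize("NFKC", symbol))

# The interrobang (8253) is left unchanged by NFKC
print(unicodedata.normalize("NFKC", "\u203d") == "\u203d")   # True
```

Unlike NFKC, the table also records which layout was removed, in the Layout column.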

### 2.7 Latin letters ligatures / Latin letters diacritics

In [None]:
dflatinletters = pd.read_csv(chardatadir / "latinletters.csv", sep=";")
dflatinletters[89:99]

Unnamed: 0,Code,Char,LetterName,IsUpper,UpperChar,IsLower,LowerChar,IsDiacritic,BaseChar,Diacritics,IsLigature,MultiChars,CharName,Block,Category,SubCategory
89,230,æ,Ae,False,Æ,True,æ,False,,,True,ae,Latin Small Letter Ae,Latin-1 Supplement,Letter,Lowercase
90,231,ç,C,False,Ç,True,ç,True,c,Cedilla,False,,Latin Small Letter C With Cedilla,Latin-1 Supplement,Letter,Lowercase
91,232,è,E,False,È,True,è,True,e,Grave,False,,Latin Small Letter E With Grave,Latin-1 Supplement,Letter,Lowercase
92,233,é,E,False,É,True,é,True,e,Acute,False,,Latin Small Letter E With Acute,Latin-1 Supplement,Letter,Lowercase
93,234,ê,E,False,Ê,True,ê,True,e,Circumflex,False,,Latin Small Letter E With Circumflex,Latin-1 Supplement,Letter,Lowercase
94,235,ë,E,False,Ë,True,ë,True,e,Diaeresis,False,,Latin Small Letter E With Diaeresis,Latin-1 Supplement,Letter,Lowercase
95,236,ì,I,False,Ì,True,ì,True,i,Grave,False,,Latin Small Letter I With Grave,Latin-1 Supplement,Letter,Lowercase
96,237,í,I,False,Í,True,í,True,i,Acute,False,,Latin Small Letter I With Acute,Latin-1 Supplement,Letter,Lowercase
97,238,î,I,False,Î,True,î,True,i,Circumflex,False,,Latin Small Letter I With Circumflex,Latin-1 Supplement,Letter,Lowercase
98,239,ï,I,False,Ï,True,ï,True,i,Diaeresis,False,,Latin Small Letter I With Diaeresis,Latin-1 Supplement,Letter,Lowercase


In [None]:
print(f"{len(dflatinletters)} chars representing latin letters, {len(dflatinletters[dflatinletters['IsUpper']])} upper case and {len(dflatinletters[dflatinletters['IsLower']])} lower case, {len(dflatinletters[dflatinletters['IsDiacritic']])} with diacritics like {list(dflatinletters[dflatinletters['IsDiacritic']]['Diacritics'].unique())[0:20]}, {len(dflatinletters[dflatinletters['IsLigature']])} representing multiple letters in ligature")

1230 chars representing latin letters, 459 upper case and 704 lower case, 1031 with diacritics like ['Grave', 'Acute', 'Circumflex', 'Tilde', 'Diaeresis', 'Ring Above', 'Cedilla', 'Stroke', 'Macron', 'Breve', 'Ogonek', 'Dot Above', 'Caron', 'Dotless', 'Middle Dot', 'Preceded By Apostrophe', 'Double Acute', 'Long', 'Hook', 'Topbar'], 88 representing multiple letters in ligature


Columns :
- Code/Char/CharName : Unicode character representing one or more latin letters
- LetterName : name of the latin letter (without case and diacritics qualifiers)
- IsUpper/UpperChar and IsLower/LowerChar : upper case or lower case equivalent chars
- IsDiacritic => BaseChar : equivalent char without any diacritic (accents ...), Diacritics : description of all diacritics applied to the char
- IsLigature => MultiChars : if the char represents multiple latin letters in a single ligature, string representing the equivalent list of letters
- Block/Category/SubCategory : Unicode classification for each char
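
The BaseChar and MultiChars columns can be partly approximated with the standard library, as sketched below with a hypothetical `stripdiacritics` helper. The last line shows a limit of that approach : "æ" has no Unicode decomposition, so only a table like this one can map it to "ae".

```python
import unicodedata

def stripdiacritics(text):
    # Decompose each char, then drop the combining marks
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(char for char in decomposed if not unicodedata.combining(char))

print(stripdiacritics("énième"))                              # enieme
print(unicodedata.normalize("NFKD", "\ufb03"))                # ffi
print(unicodedata.normalize("NFKD", "\u00e6") == "\u00e6")    # True
```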

### 2.8 Latin numbers and number symbols

In [None]:
dflatinnumbers = pd.read_csv(chardatadir / "latinnumbers.csv", sep=";")
dflatinnumbers[30:40]

Unnamed: 0,Code,Char,CharName,NormString,Layout
30,8327,₇,Subscript Seven,(7),Subscript
31,8328,₈,Subscript Eight,(8),Subscript
32,8329,₉,Subscript Nine,(9),Subscript
33,8528,⅐,Vulgar Fraction One Seventh,1/7,Vulgar Fraction
34,8529,⅑,Vulgar Fraction One Ninth,1/9,Vulgar Fraction
35,8530,⅒,Vulgar Fraction One Tenth,1/10,Vulgar Fraction
36,8531,⅓,Vulgar Fraction One Third,1/3,Vulgar Fraction
37,8532,⅔,Vulgar Fraction Two Thirds,2/3,Vulgar Fraction
38,8533,⅕,Vulgar Fraction One Fifth,1/5,Vulgar Fraction
39,8534,⅖,Vulgar Fraction Two Fifths,2/5,Vulgar Fraction


In [None]:
dflatinnumbers[200:210]

Unnamed: 0,Code,Char,CharName,NormString,Layout
200,12881,㉑,Circled Number Twenty One,(21),Circled
201,12882,㉒,Circled Number Twenty Two,(22),Circled
202,12883,㉓,Circled Number Twenty Three,(23),Circled
203,12884,㉔,Circled Number Twenty Four,(24),Circled
204,12885,㉕,Circled Number Twenty Five,(25),Circled
205,12886,㉖,Circled Number Twenty Six,(26),Circled
206,12887,㉗,Circled Number Twenty Seven,(27),Circled
207,12888,㉘,Circled Number Twenty Eight,(28),Circled
208,12889,㉙,Circled Number Twenty Nine,(29),Circled
209,12890,㉚,Circled Number Thirty,(30),Circled


In [None]:
print(f"{len(dflatinnumbers)} chars representing latin digits, some with specific layouts like {list(dflatinnumbers['Layout'].unique())[1:]}")

302 chars representing latin digits, some with specific layouts like ['Superscript', 'Vulgar Fraction', 'Subscript', 'Roman Numeral', 'Small Roman Numeral', 'Circled', 'Parenthesized', ' Full Stop', 'Negative Circled', 'Double Circled', 'Dingbat Negative Circled', 'Dingbat Circled Sans-Serif', 'Dingbat Negative Circled Sans-Serif ', 'Circled On Black Square', 'Fullwidth', 'Mathematical Bold', 'Mathematical Double-Struck', 'Mathematical Sans-Serif', 'Mathematical Sans-Serif Bold', 'Mathematical Monospace', 'Full Stop', 'Comma']


Columns :
- Code/Char/CharName : Unicode char representing one or more latin digits
- NormString : normalized string representing the equivalent number, plus punctuation if needed
- Layout : info about the specific layout applied to the latin digits
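
The Python Unicode database exposes the numeric value of these chars, which gives a quick way to cross-check the NormString column. This is only an illustration : the table additionally keeps a trace of the layout, for example by writing a subscript seven as `(7)`.

```python
import unicodedata

print(unicodedata.numeric("\u2153"))             # ⅓ (8531)
print(unicodedata.numeric("\u3251"))             # ㉑ (12881) => 21.0
print(unicodedata.normalize("NFKC", "\u2087"))   # ₇ (8327) => 7
```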

### 2.9 Variations on frequent chars to normalize

In [None]:
dfnormchars = pd.read_csv(chardatadir / "normalizedchars.csv", sep=";")
dfnormchars.head()

Unnamed: 0,Code,Char,CharName,NormCode,NormChar,NormCharName
0,11,,Char 11,10,\n,Char 10
1,13,\r,Char 13,10,\n,Char 10
2,182,¶,Pilcrow Sign,10,\n,Char 10
3,8232,,Line Separator,10,\n,Char 10
4,160,,No-Break Space,32,,Space


In [None]:
print(f"{len(dfnormchars)} alternative chars which are sometimes used as equivalent visual representations for {len(dfnormchars['NormChar'].unique())} other very frequent chars")

171 alternative chars which are sometimes used as equivalent visual representations for 53 other very frequent chars


Columns :
- Code/Char/CharName : alternative Unicode char often used as a visual equivalent of a more frequent char
- NormCode/NormChar/NormCharName : more frequent char which should be used to normalize text

In [None]:
normalizedchars = {}
for rowidx,row in dfnormchars.iterrows():
    normalizedchars[row["Char"]] = row["NormChar"]

## 2. Text normalization

### 2.1 Normalization functions

We need to apply several replacement functions in a row, each replacement function building on the replacements already applied by the previous ones.

We can't simply use replace statements on immutable strings to do this : we would need to allocate new strings for each replacement at each level, and this would put a high load on the garbage collector.

A better solution is to implement our normalization function as a chain of iterators on chars.

In [None]:
import functools
import itertools

def ignorechars(chariterator, charset):
    for char in chariterator:
        if not char in charset:
            yield char
            
def replacechars1to1(chariterator, chardict):
    for char in chariterator:
        if char in chardict:
            yield chardict[char]
        else:
            yield char
            
def replacechars1toN(chariterator, chardict):
    for char in chariterator:
        if char in chardict:
            for outchar in chardict[char]:
                yield outchar
        else:
            yield char

In [None]:
replaceWin1252ErrorChars = functools.partial(replacechars1to1, chardict=win1252errorchars)
ignoreControlChars = functools.partial(ignorechars, charset=controlchars)
replaceLatinLettersSymbols = functools.partial(replacechars1toN, chardict=latinlettersnolayout)
replaceLatinLettersLigatures = functools.partial(replacechars1toN, chardict=latinlettersnoligatures)
replaceLatinNumbersSymbols = functools.partial(replacechars1toN, chardict=latinnumbersnolayout)
replaceNormalizedChars = functools.partial(replacechars1to1, chardict=normalizedchars)

In [None]:
testString = "ABCabcd"

ignoreSet = set(['A','a'])            
ignoreAs = functools.partial(ignorechars, charset=ignoreSet)

result = ignoreAs(testString)
print("".join(result))

replace1to1Dict = {'A':'X','a':'x'}
replaceAs = functools.partial(replacechars1to1, chardict=replace1to1Dict)

result = replaceAs(testString)
print("".join(result))

replace1toNDict = {'B':'XY','b':'xyz'}
replaceBs = functools.partial(replacechars1toN, chardict=replace1toNDict)

result = replaceBs(testString)
print("".join(result))

BCbcd
XBCxbcd
AXYCaxyzcd


To match several chars in an iterator, we have to build a hierarchical dictionary structure.

For example, if we want to implement the following replacements :
```
ABC => 1
ABD => 2
AC  => 3
BC  => 4
```
We build the following dictionary structure :

```
A : { B : { C : 1
            D : 2 }
      C : 3 }

B : { C : 4 }
```
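
Written as a Python literal, the structure described above could look like the sketch below. The `lookup` helper is hypothetical and only illustrates how a pattern is matched by walking down one level per char :

```python
# The replacements ABC => 1, ABD => 2, AC => 3, BC => 4 as nested dicts
hdict = {
    "A": {"B": {"C": "1", "D": "2"},
          "C": "3"},
    "B": {"C": "4"},
}

def lookup(hdict, pattern):
    # Walk one level down for each char of the pattern
    node = hdict
    for char in pattern:
        if not isinstance(node, dict) or char not in node:
            return None
        node = node[char]
    # Only a leaf (non-dict value) is a complete match
    return None if isinstance(node, dict) else node

print(lookup(hdict, "ABD"))   # 2
print(lookup(hdict, "AC"))    # 3
print(lookup(hdict, "AB"))    # None : "AB" is only a prefix
```

A dict value means "more chars are needed", while any other value is the replacement to emit.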

In [None]:
def buildhierarchicaldict(idict):
    hdict = {}
    odicts = []
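    for key in idict:
        if len(key) > 1:
            firstchar = key[0]
            remainingstring = key[1:]
            # Create the sub-dictionary on first use (it overrides a single-char leaf)
            if not isinstance(hdict.get(firstchar), dict):
                newdict = {}
                hdict[firstchar] = newdict
                odicts.append((firstchar,newdict))
            hdict[firstchar][remainingstring] = idict[key]
        elif len(key) == 1 and key not in hdict:
            # Single-char key : keep its value as a leaf, so that "AC" is not lost
            # when "ABC" is also present
            hdict[key] = idict[key]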
    for pkey,odict in odicts: 
        dictwithlongkey = False
        for key in odict:
            if len(key)>1:
                dictwithlongkey = True
                break
        if dictwithlongkey:
            hdict[pkey] = buildhierarchicaldict(odict)
    return hdict

In [None]:
utf8errorshdict = buildhierarchicaldict(utf8errorchars)
# utf8errorshdict

In [None]:
combiningcharshdict = buildhierarchicaldict(combiningchars)
#combiningcharshdict

In [None]:
def replacecharsNto1(chariterator, hierarchicaldict):
    candidatechars = []
    candidatedicts = []
    for char in chariterator:
        # Try to match previously started patterns
        if len(candidatechars)>0:    
            for idx,candidatedict in enumerate(candidatedicts):
                if not candidatedict is None:
                    if char in candidatedict:
                        value = candidatedict[char]
                        if isinstance(value,dict):
                            candidatedicts[idx] = value
                        else:   
                            # Success : found a char to return
                            for ridx in range(0,idx):
                                yield candidatechars[ridx]
                            candidatechars = []
                            candidatedicts = []
                            char = None
                            yield value
                            break
                    else:   
                        candidatedicts[idx] = None
            # Clean oldest failed attemps and return accumulated chars           
            while len(candidatedicts)>0 and candidatedicts[0] is None:
                candidatedicts.pop(0)                  
                yield candidatechars.pop(0)
        # Handle the current char     
        if not char is None:
            if len(candidatechars)==0:
                if char in hierarchicaldict:
                    value = hierarchicaldict[char]
                    if isinstance(value,dict):
                        candidatechars.append(char)
                        candidatedicts.append(value)
                    else:
                        yield value
                else:
                    yield char
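            else:
                # Other patterns are still pending : buffer the current char
                value = hierarchicaldict.get(char)
                if isinstance(value,dict):
                    # The current char may also start a new pattern
                    candidatechars.append(char)
                    candidatedicts.append(value)
                else:
                    # No new pattern starts here : buffer the char or its direct replacement
                    candidatechars.append(char if value is None else value)
                    candidatedicts.append(None)
    # Flush the chars still buffered when the input ends in the middle of a pattern
    for pendingchar in candidatechars:
        yield pendingchar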

In [None]:
replaceUtf8Errors = functools.partial(replacecharsNto1, hierarchicaldict=utf8errorshdict)
replaceCombiningChars = functools.partial(replacecharsNto1, hierarchicaldict=combiningcharshdict)

In [None]:
testString = "XABCDEFDXYEZ"

hdict = {"A": {"B": {"C":'1'}}, "B": {"C":'2'}, "C": {"D":'3'}, "D": {"E":'4'}, "E":'5', "F":'6', "X": {"Y":'0'} } # , "A":'9'
replaceTest = functools.partial(replacecharsNto1, hierarchicaldict=hdict)

result = replaceTest(testString)
print("".join(result))

X146D05Z


#### Unicode normalization pipeline 

In [None]:
def compose(*functions):
    def compose2(f, g):
        return lambda x: f(g(x))
    return functools.reduce(compose2, functions, lambda x: x)

def tostring(iterator):
    return "".join(iterator)
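
The order of the arguments passed to `compose` matters : the last function listed is applied first. The self-contained sketch below repeats the definition of `compose` and checks this with two toy functions, `addone` and `double`, which are invented for the example :

```python
import functools

def compose(*functions):
    def compose2(f, g):
        return lambda x: f(g(x))
    return functools.reduce(compose2, functions, lambda x: x)

def addone(x):
    return x + 1

def double(x):
    return x * 2

# double runs first, then addone : addone(double(5))
print(compose(addone, double)(5))   # 11
# Reversed order : double(addone(5))
print(compose(double, addone)(5))   # 12
```

This is why the pipeline lists `tostring` first and the encoding fixes last : the encoding fixes must see the raw text before any other replacement.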

In [None]:
unicodeNorm = compose(tostring, replaceNormalizedChars, replaceLatinNumbersSymbols, replaceLatinLettersLigatures, replaceLatinLettersSymbols, ignoreControlChars, replaceCombiningChars, replaceUtf8Errors, replaceWin1252ErrorChars)

In [None]:
teststring = chr(127995)+"① l`"+chr(156)+"uv"+chr(127)+"re est¨ "+chr(147)+"belle"+chr(148)+"¸ Ã  Â½ â‚¬ énième â€° "+chr(133)+" ⁽🇪ﬃc🇦ce⁾ ！"
teststring

'🏻① l`\x9cuv\x7fre est¨ \x93belle\x94¸ Ã\xa0\xa0Â½ â‚¬ énième â€° \x85 ⁽🇪ﬃc🇦ce⁾ ！'

In [None]:
#[(ord(char),char) for char in unicodeNorm(teststring)]
unicodeNorm(teststring)

"(1) l'oeuvre est «belle», à 1/2 € énième ‰ … (EfficAce) !"

### 2.2 Normalization class with change tracking

In [None]:
latinletterstolower = {}
for rowidx,row in dflatinletters.iterrows():
    if row["Char"] != row["LowerChar"]:
        latinletterstolower[row["Char"]] = row["LowerChar"]

In [None]:
latinlettersnodiacritics = {}
latinlettersremoveddiacritics = {}
for rowidx,row in dflatinletters.iterrows():
    if row["IsDiacritic"]:
        latinlettersnodiacritics[row["Char"]] = row["BaseChar"]
        latinlettersremoveddiacritics[row["Char"]] = row["Diacritics"]

In [None]:
import pandas as pd
from functools import partial
from operator import itemgetter
from io import StringIO

    
class TextNormalizer():
    
    def __init__(self, rootdir):
        
        # 1. Load Unicode character set data for latin script
        chardatadir = rootdir / "libdata" / "chars"
        # 1.1 Frequent encoding errors : windows1252 read as iso8859-1
        dfencodingwin1252 = pd.read_csv(chardatadir / "windows1252-iso8859-errors.csv", sep=";")
        win1252errorchars = {}
        for rowidx,row in dfencodingwin1252.iterrows():
            win1252errorchars[row["Char"]] = row["DecodedChar"]
        # 1.2 Frequent encoding errors : utf8 read as windows1252
        dfencodingutf8 = pd.read_csv(chardatadir / "utf8-windows1252-errors.csv", sep=";")
        utf8errorchars = {}
        for rowidx,row in dfencodingutf8.iterrows():
            utf8errorchars[row["ErrorSubstring"]] = row["DecodedChar"]
        utf8errorshdict = self.buildhierarchicaldict(utf8errorchars)
        # 1.3 Frequent encoding errors : windows1252 read as utf8
        dfencodingwin1252utf8 = pd.read_csv(chardatadir / "windows1252-utf8-errors.csv", sep=";")
        win1252utf8errorchars = {}
        for rowidx,row in dfencodingwin1252utf8.iterrows():
            win1252utf8errorchars[row["Char"]] = row["DecodedChars"]
        # 1.4 Unicode combining chars
        dfcombiningchars = pd.read_csv(chardatadir / "combiningdiacritics.csv", sep=";")
        combiningchars = {}
        for rowidx,row in dfcombiningchars.iterrows():
            combiningchars[row["BaseChar"]+row["Char"]] = row["CombinedChar"]
        combiningcharshdict = self.buildhierarchicaldict(combiningchars)
        # 1.5 Control chars
        dfcontrolchars = pd.read_csv(chardatadir / "controlchars.csv", sep=";")
        dfcontrolchars.loc[0,"Char"] = chr(0) # chr(0) can't be saved in CSV file
        controlchars = set(dfcontrolchars["Char"])
        # 1.6 Latin letter symbols
        dflatinsymbols = pd.read_csv(chardatadir / "latinsymbols.csv", sep=";")
        latinlettersnolayout = {}
        latinlettersremovedlayout = {}
        for rowidx,row in dflatinsymbols.iterrows():
            latinlettersnolayout[row["Char"]] = row["NormString"]
            latinlettersremovedlayout[row["Char"]] = row["Layout"]
        # 1.7 Latin letters
        dflatinletters = pd.read_csv(chardatadir / "latinletters.csv", sep=";")
        latinletterstoupper = {}
        for rowidx,row in dflatinletters.iterrows():
            if row["Char"] != row["UpperChar"]:
                latinletterstoupper[row["Char"]] = row["UpperChar"]
        latinlettersnodiacritics = {}
        latinlettersremoveddiacritics = {}
        for rowidx,row in dflatinletters.iterrows():
            if row["IsDiacritic"]:
                latinlettersnodiacritics[row["Char"]] = row["BaseChar"]
                latinlettersremoveddiacritics[row["Char"]] = row["Diacritics"]
        latinlettersnoligatures = {}
        for rowidx,row in dflatinletters.iterrows():
            if row["IsLigature"]:
                latinlettersnoligatures[row["Char"]] = row["MultiChars"]
        # 1.8 Latin numbers and number symbols
        dflatinnumbers = pd.read_csv(chardatadir / "latinnumbers.csv", sep=";")
        latinnumbersnolayout = {}
        latinnumbersremovedlayout = {}
        for rowidx,row in dflatinnumbers.iterrows():
            # Skip the first 10 rows : presumably the plain digits 0 to 9, which need no replacement
            if rowidx < 10:
                continue
            latinnumbersnolayout[row["Char"]] = row["NormString"]
            latinnumbersremovedlayout[row["Char"]] = row["Layout"]
        # 1.9 Variations on frequent chars to normalize
        dfnormchars = pd.read_csv(chardatadir / "normalizedchars.csv", sep=";")
        normalizedchars = {}
        for rowidx,row in dfnormchars.iterrows():
            normalizedchars[row["Char"]] = row["NormChar"]
        # 1.10 Optional replacement of cyrillic and greek chars looking like latin letters
        dfcgnormchars = pd.read_csv(chardatadir / "cyrillic-greek-chars.csv", sep=";")
        cgnormalizedchars = {}
        for rowidx,row in dfcgnormchars.iterrows():
            cgnormalizedchars[row["Char"]] = row["NormChar"]
        # 1.11 Final supported french charset
        dfsupportedchars = pd.read_csv(chardatadir / "charset-fr.csv", sep=";", quotechar='"')
        dfsupportedchars.loc[0,"Char"] = chr(0) # chr(0) can't be saved in CSV file
        supportedchars = set(dfsupportedchars["Char"])
    
        # 2.1 List successive transformations    
        self.transformsDescs = []
        transforms = []
        self.transformsDescs.append("Fix encoding errors : windows1252 read as iso8859-1")
        transforms.append(partial(self.replacechars1to1, 0, win1252errorchars))
        self.transformsDescs.append("Fix encoding errors : utf8 read as windows1252")
        transforms.append(partial(self.replacecharsNto1, 1, utf8errorshdict))
        self.transformsDescs.append("Fix encoding errors :  windows1252 read as utf8")
        transforms.append(partial(self.replacechars1toN, 2, win1252utf8errorchars))
        self.transformsDescs.append("Merge Unicode combining chars")
        transforms.append(partial(self.replacecharsNto1, 3, combiningcharshdict))
        self.transformsDescs.append("Ignore control chars")
        transforms.append(partial(self.ignorechars, 4, controlchars))
        self.transformsDescs.append("Replace latin letter symbols")
        transforms.append(partial(self.replacechars1toN, 5, latinlettersnolayout))
        self.transformsDescs.append("Replace latin letter ligatures")
        transforms.append(partial(self.replacechars1toN, 6, latinlettersnoligatures))
        self.transformsDescs.append("Replace latin number symbols")
        transforms.append(partial(self.replacechars1toN, 7, latinnumbersnolayout))
        self.transformsDescs.append("Normalize equivalent chars") 
        transforms.append(partial(self.replacechars1to1, 8, normalizedchars))   
        self.transformsDescs.append("Replace cyrillic and greek chars looking like latin letters") 
        transforms.append(partial(self.replacechars1to1,9, cgnormalizedchars))  
        self.transformsDescs.append("Replace infrequent chars : latin letters with diacritics") 
        transforms.append(partial(self.replacecharsnotinset, 10, supportedchars, latinlettersnodiacritics))  
        self.transformsDescs.append("Replace infrequent chars : other scripts") 
        transforms.append(partial(self.replaceotherscripts, 11, supportedchars))
        self.transformsDescs.append("Replace infrequent chars : symbols") 
        transforms.append(partial(self.replacesymbols, 12, supportedchars)) 
        self.transformsDescs.append("Replace infrequent chars : chars to ignore") 
        transforms.append(partial(self.ignoreotherchars, 13, supportedchars))        
        
        # 2.2 Combine all transformations
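        def func(x,y):
            # x : iterable of input chars, y : NormResult object recording the changes
            # Each transform is a lazy generator consuming the output of the previous one
            ci = transforms[0](x,y)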
            for transform in transforms[1:]:
                ci = transform(ci,y)
            return ci
        self.transformsFunc = func

    def __repr__(self):
        desc = StringIO()
        for idx,transformDesc in enumerate(self.transformsDescs):
            desc.write(f'{idx+1} - {transformDesc}\n')
        return desc.getvalue()
        
    def __call__(self, inputText):
        result = NormResult(inputText, self.transformsDescs)
        result.setOutput(self.tostring(self.transformsFunc(inputText,result)))
        return result
        
    @staticmethod
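    def buildhierarchicaldict(idict):
        # Convert a flat dict with multi-char keys into nested dicts indexed by successive chars :
        # for example {"ab": "x"} becomes {"a": {"b": "x"}}. Single-char keys are skipped.
        hdict = {}
        odicts = []
        for key in idict:
            if len(key) > 1:
                firstchar = key[0]
                remainingstring = key[1:]
                if firstchar not in hdict: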
                    newdict = {}
                    hdict[firstchar] = newdict
                    odicts.append((firstchar,newdict))
                hdict[firstchar][remainingstring] = idict[key]
        for pkey,odict in odicts: 
            dictwithlongkey = False
            for key in odict:
                if len(key)>1:
                    dictwithlongkey = True
                    break
            if dictwithlongkey:
                hdict[pkey] = TextNormalizer.buildhierarchicaldict(odict)
        return hdict

    @staticmethod
    def ignorechars(layer, charset, chariterator, result):
        for index,char in enumerate(chariterator):
            if char not in charset:
                yield char
            else:
                result.addChange(layer, index, char, '')

    @staticmethod  
    def replacechars1to1(layer, chardict, chariterator, result):
        for index,char in enumerate(chariterator):
            if char in chardict:
                resChar = chardict[char]
                result.addChange(layer, index, char, resChar)
                yield resChar
            else:
                yield char

    @staticmethod  
    def replacechars1toN(layer, chardict, chariterator, result):
        for index,char in enumerate(chariterator):
            if char in chardict:
                resStr = chardict[char]
                result.addChange(layer, index, char, resStr)
                for outchar in resStr:
                    yield outchar
            else:
                yield char

    @staticmethod
    def replacecharsNto1(layer, hierarchicaldict, chariterator, result):
        candidatechars = []
        candidatedicts = []
        for index,char in enumerate(chariterator):
            # Try to match previously started patterns
            if len(candidatechars)>0:    
                for idx,candidatedict in enumerate(candidatedicts):
                    if candidatedict is not None:
                        if char in candidatedict:
                            value = candidatedict[char]
                            if isinstance(value,dict):
                                candidatedicts[idx] = value
                            else:   
                                # Success : found a char to return
                                for ridx in range(0,idx):
                                    yield candidatechars[ridx]
                                replacedStr = "".join(candidatechars[idx:]) + char
                                result.addChange(layer, index-len(replacedStr)+1, replacedStr, value)
                                candidatechars = []
                                candidatedicts = []
                                char = None
                                yield value
                                break
                        else:   
                            candidatedicts[idx] = None
                # Clean oldest failed attempts and return accumulated chars
                while len(candidatedicts)>0 and candidatedicts[0] is None:
                    candidatedicts.pop(0)                  
                    yield candidatechars.pop(0)
            # Handle the current char  
            if char is not None:
                if len(candidatechars)==0:
                    if char in hierarchicaldict:
                        value = hierarchicaldict[char]
                        if isinstance(value,dict):
                            candidatechars.append(char)
                            candidatedicts.append(value)
                        else:
                            result.addChange(layer, index, char, value)
                            yield value
                    else:
                        yield char
                else:
                    candidatechars.append(char)
                    if char in hierarchicaldict:
                        value = hierarchicaldict[char]
                        candidatedicts.append(value)
                    else:
                        candidatedicts.append(None)     
        if len(candidatechars)>0:
            for char in candidatechars:
                yield char
    
    @staticmethod
    def replacecharsnotinset(layer, charset, replacedict, chariterator, result):
        for index,char in enumerate(chariterator):
            if char in charset:
                yield char
            else:
                if char in replacedict:
                    resChar = replacedict[char]
                    result.addChange(layer, index, char, resChar)
                    yield resChar
                else:
                    yield char            
    
    @staticmethod
    def replaceotherscripts(layer, charset, chariterator, result):
        for index,char in enumerate(chariterator):
            if char in charset:
                yield char
            else:
                family = blockfamily(charblock(char))
                if family not in ("Symbols","Ignore"):
                    resStr = chr(65532) + str(ord(char)) + '_'
                    result.addChange(layer, index, char, resStr)
                    for outchar in resStr:
                        yield outchar
                else:
                    yield char           
    
    @staticmethod
    def replacesymbols(layer, charset, chariterator, result):
        for index,char in enumerate(chariterator):
            if char in charset:
                yield char
            else:
                family = blockfamily(charblock(char))
                if family == "Symbols":
                    resStr ='$' + charname(char).replace(' ','') + '_'
                    result.addChange(layer, index, char, resStr)
                    for outchar in resStr:
                        yield outchar
                else:
                    yield char          
    
    @staticmethod
    def ignoreotherchars(layer, charset, chariterator, result):
        for index,char in enumerate(chariterator):
            if char in charset:
                yield char
            else:
                family = blockfamily(charblock(char))
                if family == "Ignore":
                    result.addChange(layer, index, char, '')
                else:
                    yield char            
    
    @staticmethod
    def tostring(iterator):
        return "".join(iterator)
    
    
class NormResult():
    
    def __init__(self, inputText, transformsDescs):
        self.input, self.transforms = inputText, transformsDescs
        self.layerChanges = None
        self.output = ""
    
    def addChange(self, layer, index, charsInput, charsOutput, removedInfo=None):
        if self.layerChanges is None:
            self.layerChanges = []
        if layer > (len(self.layerChanges)-1):
            for i in range(0,layer-len(self.layerChanges)+1):
                self.layerChanges.append([])
        changes = self.layerChanges[layer]
        change = NormChange(layer,index,charsInput,charsOutput,removedInfo)
        changes.append(change)   
        
    def describeChanges(self):
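        if self.layerChanges is None:
            return 'No change'
        else:
            desc = StringIO()
            previousString = self.input
            for changes in self.layerChanges:
                # addChange pads the layers without any change with empty lists : skip them
                if len(changes) == 0:
                    continue
                layer = changes[0].layer
                layerDesc = self.transforms[layer]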
                desc.write(layerDesc+"\n")                
                dispInparts = []     
                outparts = []
                dispOutparts = []
                lastIndex = 0
                for change in changes:
                    if change.index > lastIndex:
                        samePart = previousString[lastIndex:change.index]
                        dispInparts.append(samePart)
                        outparts.append(samePart)
                        dispOutparts.append(samePart) 
                    dispInpart = change.input
                    outpart = change.output
                    dispOutpart = outpart
                    if len(dispInpart)>len(outpart):
                        dispOutpart = outpart + ("_"*(len(dispInpart)-len(outpart)))
                    elif len(outpart)>len(dispInpart):
                        dispInpart = dispInpart + (" "*(len(outpart)-len(dispInpart)))
                    dispInparts.append(' ['+dispInpart+'] ')
                    outparts.append(outpart)
                    dispOutparts.append(' ['+dispOutpart+'] ')
                    lastIndex = change.index + len(change.input)
                if lastIndex < len(previousString):
                    samePart = previousString[lastIndex:]
                    dispInparts.append(samePart)
                    outparts.append(samePart)
                    dispOutparts.append(samePart)
                previousString = "".join(outparts)
                desc.write(" < ")
                for inpart in dispInparts:
                    desc.write(inpart)
                desc.write('\n')
                desc.write(" < ")
                for outpart in dispOutparts:
                    desc.write(outpart)
                desc.write('\n')
            return desc.getvalue()
            
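    def mapOutputIndexToInput(self,outputIndex):
        inputIndex = outputIndex
        # Without any recorded change, output and input indexes are identical
        if self.layerChanges is None:
            return inputIndex
        for changes in self.layerChanges: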
            outputIndex = inputIndex
            for change in changes:
                if outputIndex < change.index:
                    break
                elif outputIndex > (change.index + len(change.output)):
                    inputIndex = inputIndex + (len(change.input)-len(change.output))
                else:
                    inputIndex = inputIndex -(outputIndex-change.index)
                    break
        return inputIndex        
            
    def setOutput(self, outputText):
        self.output = outputText
        
    def __repr__(self):
        return self.output
    
class NormChange():
    
    def __init__(self, layer, index, charsInput, charsOutput, removedInfo=None):
        self.layer, self.index, self.input, self.output, self.removedInfo = layer, index, charsInput, charsOutput, removedInfo
        
    def __repr__(self):
        return f"{self.layer} - {self.index} : {self.input} => {self.output}"
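
The N-to-1 replacements rely on nested dictionaries indexed by successive characters. The sketch below restates that idea in isolation : `build_trie` and `replace_sequences` are hypothetical helper names, the three sample entries are generated on the fly rather than read from the CSV tables, and the matching works on a whole string instead of a stream of characters as `replacecharsNto1` does. It also assumes that no key is a prefix of another key.

```python
def build_trie(mapping):
    """Nest multi-char keys by successive characters (hypothetical helper)."""
    trie = {}
    for key, value in mapping.items():
        if len(key) < 2:
            continue  # single-char keys are not indexed
        node = trie
        for ch in key[:-1]:
            node = node.setdefault(ch, {})
        node[key[-1]] = value
    return trie

def replace_sequences(text, trie):
    """Replace every sequence found in the trie, scanning the text left to right."""
    out, i = [], 0
    while i < len(text):
        node, j = trie.get(text[i]), i + 1
        # Follow the nested dicts as long as the next chars keep matching
        while isinstance(node, dict) and j < len(text):
            node = node.get(text[j])
            j += 1
        if isinstance(node, str):
            out.append(node)     # complete sequence : emit the decoded char
            i = j
        else:
            out.append(text[i])  # no match : keep the current char unchanged
            i += 1
    return "".join(out)

# Illustrative entries : UTF-8 bytes of each char wrongly decoded as windows-1252
sample = {ch.encode("utf-8").decode("cp1252"): ch for ch in "éè€"}
trie = build_trie(sample)

garbled = "été 5€".encode("utf-8").decode("cp1252")
print(replace_sequences(garbled, trie))  # → été 5€
```

Each lookup only inspects the characters that can still start or continue a known sequence, so the cost stays proportional to the length of the text whatever the size of the table.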

In [None]:
%time norm = TextNormalizer(rootdir)
norm

Wall time: 1.35 s


1 - Fix encoding errors : windows1252 read as iso8859-1
2 - Fix encoding errors : utf8 read as windows1252
3 - Fix encoding errors :  windows1252 read as utf8
4 - Merge Unicode combining chars
5 - Ignore control chars
6 - Replace latin letter symbols
7 - Replace latin letter ligatures
8 - Replace latin number symbols
9 - Normalize equivalent chars
10 - Replace cyrillic and greek chars looking like latin letters
11 - Replace infrequent chars : latin letters with diacritics
12 - Replace infrequent chars : other scripts
13 - Replace infrequent chars : symbols
14 - Replace infrequent chars : chars to ignore
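
The steps listed above are chained as generators : each one is bound to its lookup table with `functools.partial` and consumes the characters produced by the previous step, so no intermediate string is built. A minimal sketch of this pattern with two made-up steps (the names `replace_1to1`, `drop_chars`, `pipeline` and the small tables are illustrative, not the library's, and the change tracking is left out) :

```python
from functools import partial

def replace_1to1(table, chars):
    """Yield each char, swapped for its replacement when the table has one."""
    for ch in chars:
        yield table.get(ch, ch)

def drop_chars(unwanted, chars):
    """Yield only the chars which are not in the unwanted set."""
    for ch in chars:
        if ch not in unwanted:
            yield ch

# Bind each generator function to its own table, in the order of application
steps = [
    partial(replace_1to1, {"`": "'", "！": "!"}),
    partial(drop_chars, {"\x7f"}),
]

def pipeline(text):
    stream = iter(text)
    for step in steps:
        stream = step(stream)  # nothing is computed yet : generators are lazy
    return "".join(stream)     # joining pulls every char through all the steps

print(pipeline("l`a\x7fmi ！"))  # → l'ami !
```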

In [None]:
teststring = chr(127995)+"① l`"+chr(156)+"uv"+chr(127)+"re est¨ "+chr(147)+"belle"+chr(148)+"¸ Ã  Â½ â‚¬ énième â€° "+chr(133)+" ⁽🇪ﬃc🇦ce⁾ ！"
teststring

'🏻① l`\x9cuv\x7fre est¨ \x93belle\x94¸ Ã  Â½ â‚¬ énième â€° \x85 ⁽🇪ﬃc🇦ce⁾ ！'

In [None]:
result = norm(teststring)
result

(1) l'oeuvre est «belle», Ã  1/2 € énième ‰ … (EfficAce) !

In [None]:
print(result.describeChanges())

IndexError: list index out of range

In [None]:
result.output[0:12]

"(1) l'oeuvre"

In [None]:
result.input[result.mapOutputIndexToInput(0):result.mapOutputIndexToInput(12)]

'🏻① l`\x9cuv\x7fre'

In [None]:
result.output[3:10]

" l'oeuv"

In [None]:
result.input[result.mapOutputIndexToInput(3):result.mapOutputIndexToInput(10)]

' l`\x9cuv\x7f'
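
The calls above map offsets of the normalized output back to the original text. The sketch below illustrates the principle for a single layer of replacements only : `map_output_to_input` is a hypothetical function, not the algorithm used by `mapOutputIndexToInput`, and it assumes the changes are sorted by input index and do not overlap.

```python
def map_output_to_input(changes, out_index):
    """changes : list of (input_index, replaced_text, replacement) tuples."""
    shift = 0  # output offset minus input offset, accumulated over previous changes
    for in_index, old, new in changes:
        out_start = in_index + shift
        if out_index < out_start:
            break                # the offset is located before this change
        if out_index < out_start + len(new):
            return in_index      # inside a replacement : start of the replaced span
        shift += len(new) - len(old)
    return out_index - shift

source = "① l`a"
changes = [(0, "①", "(1)"), (3, "`", "'")]  # this layer produces "(1) l'a"

for out_index in (0, 3, 5, 6):
    print(out_index, "->", map_output_to_input(changes, out_index))
```

With several layers, the same mapping would have to be applied layer by layer, starting from the last transformation and going back to the first one.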

In [None]:
%timeit -n100 norm(teststring)

197 µs ± 19.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
