# frenchtext

> NLP library to process French text.

In this early pre-release, the library provides :
- datasets to train business-oriented French text models
- a character normalization pipeline tailored for French text

## Install

`pip install frenchtext`

## Dependencies

- [pandas](https://pandas.pydata.org/)
- [pyarrow](https://arrow.apache.org/docs/python/)
- [requests](https://requests.readthedocs.io/en/latest/)
- [fastprogress](https://github.com/fastai/fastprogress)

## Licence

Apache License 2.0 : https://www.apache.org/licenses/LICENSE-2.0

## How to use

The detailed documentation for each module is available through the menu on the left side of this page.

You will find below an overview of the library.

## French datasets

### Data sources

The text content of the main French websites in the domain of finance and business (plus Wikipedia) was extracted in September 2019 using [nlptextdoc](https://github.com/laurentprudhon/nlptextdoc).

This extraction was done as "politely" as possible:
- extract only freely and publicly available content
- respect the robots.txt directives of each website (pages forbidden for indexing, maximum extraction rate)
- detect when websites use tools to prevent indexing (like Datadome) and abort the crawl

**IMPORTANT: The original authors of the websites own the copyright on all text blocks in this dataset.**

To be able to link each text block to its original author, we track the origin URL of each text block throughout the whole process.

**YOU CAN'T REUSE THE TEXT BLOCKS FOR ANY PURPOSE EXCEPT TRAINING A NATURAL LANGUAGE PROCESSING MODEL.**

See the new European copyright rules : [European Parliament approves new copyright rules for the internet](https://www.europarl.europa.eu/news/en/headlines/priorities/copyright/20190321IPR32110/european-parliament-approves-new-copyright-rules-for-the-internet)

"*The directive aims to make it easier for copyrighted material to be used freely through text and data mining, thereby removing a significant competitive disadvantage that European researchers currently face.*"

=> 131 websites and 2 564 755 HTML pages

### Data preparation

The text blocks were then:
- deduplicated to keep only distinct text blocks for each website (forgetting part of the original document structure), 
- tagged (but not filtered) by language (using https://fasttext.cc/docs/en/language-identification.html),
- grouped in categories according to the main theme of the original website,
- split in [Pandas](https://pandas.pydata.org/) dataframes of size < 2 GB.

=> 10 categories: 'Assurance', 'Banque', 'Bourse', 'Comparateur', 'Crédit', 'Forum', 'Institution', 'Presse', 'SiteInfo', 'Wikipedia'

In each dataframe, the text blocks were additionally **SHUFFLED IN A RANDOM ORDER** to make it very difficult to reconstruct the original articles (a safety measure to help protect the authors' copyright).
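
The deduplication and shuffling steps can be illustrated with a minimal sketch using only the Python standard library. This is not the code that produced the datasets: `prepare_blocks` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random

def prepare_blocks(blocks, seed=0):
    """Keep distinct text blocks, then shuffle them in a random order."""
    # dict.fromkeys drops duplicates while keeping the first occurrence of each block
    distinct = list(dict.fromkeys(blocks))
    # A fixed seed is used here only to make the sketch reproducible
    rng = random.Random(seed)
    rng.shuffle(distinct)
    return distinct

blocks = ["Mon devis sur mesure", "Nos agences", "Mon devis sur mesure", "Contact"]
prepared = prepare_blocks(blocks)
print(prepared)
```

After this step the order of the blocks no longer reflects the structure of the original page, which is the point of the safety measure.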

The results of this second step can be downloaded to the *config.datasets* directory, as dataframes serialized in the [feather format](https://arrow.apache.org/docs/python/ipc.html?highlight=feather#feather-format), in files named according to the 'DatasetFile' column of the datasets table.

=> 19 dataset files: 'assurance', 'banque', 'bourse', 'comparateur', 'crédit', 'forum', 'institution', 'presse-1', 'presse-2', 'presse-3', 'presse-4', 'presse-5', 'presse-6', 'siteinfo', 'wikipedia-1', 'wikipedia-2', 'wikipedia-3', 'wikipedia-4', 'wikipedia-5'

### Dataset size

The number of words in each text block was computed using the default French tokenizer from [spaCy](https://spacy.io/) v2.1.

This business-oriented dataset contains **2 billion French words**.

Here is a summary of the number of words contributed by each category **in millions**:

- Assurance : 12
- Banque : 20
- Bourse : 26
- Comparateur : 20
- Crédit : 1
- Forum : 152
- Institution : 4
- Presse : 963
- SiteInfo : 78
- Wikipedia : 727
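
As a quick consistency check, the per-category figures above can be summed to recover the overall size of the dataset. The snippet below is only an illustration built from the numbers in the list; the variable names are not part of the library.

```python
# Word counts per category, in millions, copied from the list above
words_millions = {
    "Assurance": 12, "Banque": 20, "Bourse": 26, "Comparateur": 20,
    "Crédit": 1, "Forum": 152, "Institution": 4, "Presse": 963,
    "SiteInfo": 78, "Wikipedia": 727,
}

total = sum(words_millions.values())
shares = {name: count / total for name, count in words_millions.items()}

print(f"total : {total} million words")
# Show the three largest contributors with their share of the total
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{name} : {share:.1%}")
```

The press and Wikipedia categories dominate the word count, so models trained on the full dataset will mostly see these two sources.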

### Dataset files

In [None]:
from frenchtext.core import *
from frenchtext.datasets import *

List available dataset files :

In [None]:
datasetfiles = list_dataset_files()
datasetfiles

['assurance',
 'banque',
 'bourse',
 'comparateur',
 'crédit',
 'forum',
 'institution',
 'presse-1',
 'presse-2',
 'presse-3',
 'presse-4',
 'presse-5',
 'presse-6',
 'siteinfo',
 'wikipedia-1',
 'wikipedia-2',
 'wikipedia-3',
 'wikipedia-4',
 'wikipedia-5']

Source websites and number of words in each dataset file :

In [None]:
datasetsdf = list_datasets()
datasetsdf[["DatasetFile","Url","Pages","Words"]].iloc[80:100]

| | DatasetFile | Url | Pages | Words |
|---|---|---|---|---|
| 80 | comparateur | https://www.panorabanques.com/ | 4341 | 2584038 |
| 81 | crédit | https://www.cetelem.fr/ | 274 | 157191 |
| 82 | crédit | https://www.cofidis.fr/ | 347 | 243904 |
| 83 | crédit | https://www.cofinoga.fr/ | 413 | 86796 |
| 84 | crédit | https://www.sofinco.fr/ | 916 | 597221 |
| 85 | crédit | https://www.younited-credit.com/ | 1341 | 665115 |
| 86 | forum | https://droit-finances.commentcamarche.com/ | 96450 | 56120562 |
| 87 | forum | http://forum.doctissimo.fr/famille/argent-budg... | 26981 | 61020453 |
| 88 | forum | http://forum.doctissimo.fr/viepratique/finance... | 5745 | 4962230 |
| 89 | forum | http://forum.doctissimo.fr/viepratique/Impots/... | 2338 | 1422143 |


### Download dataset files

In [None]:
download_dataset_file("assurance")

Downloading dataset file : assurance (17 MB)


In [None]:
download_all_datasets()

Downloading dataset file : assurance (17 MB)
Downloading dataset file : banque (28 MB)
Downloading dataset file : bourse (38 MB)
Downloading dataset file : comparateur (28 MB)
Downloading dataset file : crédit (2 MB)
Downloading dataset file : forum (220 MB)
Downloading dataset file : institution (5 MB)
Downloading dataset file : presse-1 (218 MB)
Downloading dataset file : presse-2 (196 MB)
Downloading dataset file : presse-3 (190 MB)
Downloading dataset file : presse-4 (234 MB)
Downloading dataset file : presse-5 (269 MB)
Downloading dataset file : presse-6 (334 MB)
Downloading dataset file : siteinfo (116 MB)
Downloading dataset file : wikipedia-1 (131 MB)
Downloading dataset file : wikipedia-2 (182 MB)
Downloading dataset file : wikipedia-3 (263 MB)
Downloading dataset file : wikipedia-4 (269 MB)
Downloading dataset file : wikipedia-5 (267 MB)


You can change the local directory where the dataset files are downloaded :

In [None]:
config.datasets

PosixPath('/home/laurent/.frenchtext/datasets')

In [None]:
config["datasets_path"] = "/tmp/datasets"
config.datasets.mkdir(parents=True, exist_ok=True)

In [None]:
config.datasets

PosixPath('/tmp/datasets')

### Read dataset files

In [None]:
datasetdf = read_dataset_file("assurance")
datasetdf

Loaded dataframe for dataset assurance : 563613 text blocks


| | Website | DocId | DocEltType | DocEltCmd | NestingLevel | Text | Lang | Words | Unique |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 22332 | ListItem | Text | 2 | 5 tournages catastrophe pour un assureur | fr | 6 | True |
| 1 | 74 | 710 | Section | Start | 1 | Tout connaitre sur la nouvelle formation post-... | fr | 7 | True |
| 2 | 11 | 12082 | TextBlock | Text | 1 | Votre Agent Mandataire AXA - Civry Marie Claud... | ? | 18 | True |
| 3 | 87 | 461 | TextBlock | Text | 4 | 60 ans et 4 mois | fr | 5 | True |
| 4 | 7 | 200 | TextBlock | Text | 1 | Mon devis sur mesure | fr | 4 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 563608 | 138 | 255 | Section | Start | 2 | Les autres pouvoirs de police | fr | 5 | True |
| 563609 | 11 | 19483 | TextBlock | Text | 1 | Yves Nicolau assurance Laon | ? | 4 | True |
| 563610 | 106 | 1644 | ListItem | Text | 3 | Evènements sportifs | fr | 2 | True |
| 563611 | 58 | 4155 | Section | Start | 1 | Agence Groupama Chalon | ? | 3 | True |


### Access text blocks in dataset files

Filter and iterate over the rows of a dataset file :

In [None]:
rowsiterator = get_rows_from_datasetdf(datasetdf, minwords=None, maxwords=5, lang="?")
show_first_rows(rowsiterator,10)

12 - COORDONNEES
41 - 01 30 41 67 33
49 - Dmitriy G.
57 - Les atouts du Multisupport CONFIANCE
74 - 01XXL meribel hiver
76 - Garantie en cas de vol
87 - Par AXA, le 01/08/2016
96 - mgr@enderby.eu
127 - 18 place De Strasbourg
131 - Saint Gaudens
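
The filtering logic can be pictured with a small generator over plain Python dictionaries. This is a sketch under assumptions, not the library's implementation: `filter_rows` is a hypothetical helper, the bounds are assumed to be inclusive, and the sample rows are taken from the dataframe preview shown earlier.

```python
def filter_rows(rows, minwords=None, maxwords=None, lang=None):
    """Yield (row index, text) for rows matching the word-count and language filters."""
    for index, row in enumerate(rows):
        if minwords is not None and row["Words"] < minwords:
            continue  # too short
        if maxwords is not None and row["Words"] > maxwords:
            continue  # too long
        if lang is not None and row["Lang"] != lang:
            continue  # language tag does not match
        yield index, row["Text"]

rows = [
    {"Text": "5 tournages catastrophe pour un assureur", "Lang": "fr", "Words": 6},
    {"Text": "60 ans et 4 mois", "Lang": "fr", "Words": 5},
    {"Text": "Yves Nicolau assurance Laon", "Lang": "?", "Words": 4},
    {"Text": "Agence Groupama Chalon", "Lang": "?", "Words": 3},
]

# Short text blocks whose language could not be identified
selected = list(filter_rows(rows, maxwords=5, lang="?"))
for index, text in selected:
    print(f"{index} - {text}")
```

Blocks tagged `?` are typically names, addresses or phone numbers, which is why filtering on this tag is a convenient way to inspect noisy content.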


Filter and iterate over the text blocks of a full dataset (across multiple files) :

In [None]:
textiterator = get_textblocks_from_dataset("Assurance", minwords=None, maxwords=10, lang="fr")
show_first_textblocks(textiterator,skip=2000,count=10)

Loaded dataframe for dataset assurance : 563613 text blocks
2001 - Rééquipement à neuf à vie
2002 - Définition Conducteur secondaire- Lexique
2003 - Comment éviter les fraudes
2004 - Comment demander un remboursement santé - GENERALI
2005 - Simulateur pour connaître les obligations de votre accord de branche
2006 - Complémentaire Epargne retraite des indépendants et TNS - Malakoff Médéric
2007 - Experts-Comptables, découvrez la mission épargne salariale
2008 - Vous n’êtes pas encore client :
2009 - Actualités (Page 6) | ameli.fr | Pharmacien
2010 - Dépression : quelle prise en charge ? - Matmut


Access a specific row :

In [None]:
get_text_from_rowindex(datasetdf,100)

'Les inondations de plaine : débordement de cours d’eau avec une durée d’immersion longue (prévisibles plusieurs jours ou heures à l’avance).'

Find text blocks with a specific char or substring :

In [None]:
find_textblocks_with_chars(datasetdf,"rétroviseur",count=20,ctxsize=15)

350594     ore dans notre rétroviseur gauche lorsque 
149029     de glace ? Les rétroviseurs ainsi que les 
51349      ace. Quant aux rétroviseurs, ils le sont d
310354     vant, arrière, rétroviseurs et vitres laté
489866    \naussi dans le rétroviseur pour ne pas se 
364550     ôté ou sous le rétroviseur intérieur de vo
560539     tionnement des rétroviseurs.              
560700     é (pare-brise, rétroviseurs…),            
223621     riorations des rétroviseurs et des phares.
543903     es miroirs des rétroviseurs lorsqu’ils peu
502075      logo dans son rétroviseur et par un signa
53237      vous cassez le rétroviseur d’une voiture. 
310456      éraflures, un rétroviseur abîmé, ou un au
375158     ant, moteur de rétroviseurs…              
539914     nt et arrière, rétroviseurs intérieurs et 
171367     t utilisez vos rétroviseurs               
485058      ainsi que les rétroviseurs ne sont pas ga
277390     ant, moteur de rétroviseurs...    

In [None]:
find_textblocks_with_chars(datasetdf,64257,count=10,wrap=True)

175413    x besoins de diversi[ﬁ]cation des placements
337398    e 30 villes ont béné[ﬁ]cié de ces animations
265114    nt règlementaire et [ﬁ]nancier, nous accompa
74267          La Fondation a [ﬁ]nancé depuis 2009, l’
424584    tion de l’équilibre [ﬁ]nancier des régimes d
219195    d, Jérôme Powell con[ﬁ]rmera que, dans l’att
489511    s besoins de diversi[ﬁ]cation de la clientèl
517563    si en présence d’un [ﬁ]nancement par crédit,
479694    nt règlementaire et [ﬁ]nancier, La Mondiale 
252202    n de disponibilités [ﬁ]nancières mais aussi,
Name: Text, dtype: object

### Track the source URL for each text block 

Optionally, download and read the URLs file to track the origin of each text block :

In [None]:
urlsdf = read_urls_file()
urlsdf.head()

Loaded datasets urls : 2668787 urls


| | Website | DocId | DocUrl | Words | fr | en | de | es | ? | %fr | %en | %de | %es | %? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | https://www.afer.fr/ | 573.0 | 524.0 | 3.0 | 0.0 | 0.0 | 46.0 | 0.914485 | 0.005236 | 0.0 | 0.0 | 0.080279 |
| 1 | 4 | 2 | https://www.afer.fr/afer/adhesion/ | 74.0 | 74.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 4 | 3 | https://www.afer.fr/afer/adhesion/adherent-ass... | 475.0 | 457.0 | 5.0 | 0.0 | 0.0 | 13.0 | 0.962105 | 0.010526 | 0.0 | 0.0 | 0.027368 |
| 3 | 4 | 4 | https://www.afer.fr/afer/adhesion/adherer-assu... | 519.0 | 519.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 4 | 5 | https://www.afer.fr/afer/adhesion/parrainage-a... | 355.0 | 345.0 | 0.0 | 0.0 | 0.0 | 10.0 | 0.971831 | 0.0 | 0.0 | 0.0 | 0.028169 |


In [None]:
get_text_from_rowindex(datasetdf,100)

'Les inondations de plaine : débordement de cours d’eau avec une durée d’immersion longue (prévisibles plusieurs jours ou heures à l’avance).'

In [None]:
get_url_from_rowindex(datasetdf, 100)

'https://www.maif.fr/conseils-prevention/risques-majeurs/inondation.html'

## Characters normalization pipeline

### Motivation

French datasets often contain several thousand distinct Unicode characters.

Characters stats in Wikipedia dataset :
- 35.6 billion chars
- 13 502 distinct Unicode chars

Characters stats in Business dataset :
- 27.5 billion chars
- 3 763 distinct Unicode chars

We need to reduce the number of distinct characters fed to our natural language processing applications, for three reasons :
- chars that the user considers visually equivalent will often produce different application behavior : this is a huge problem for the user experience
- with so many chars, the designer of the NLP application cannot reason about all possible combinations : this could harm the explainability of the system
- this huge number of distinct characters adds a significant amount of complexity that the NLP models have to deal with

Characters stats in Wikipedia dataset :
- Only 1316 chars more frequent than 1 in 100 million
- 99.9987 % of Wikipedia chars would be preserved if we only kept the frequent chars

Characters stats in Business dataset :
- Only 531 chars more frequent than 1 in 100 million
- 99.9996 % of Business chars would be preserved if we only kept the frequent chars
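
The kind of statistic quoted above can be reproduced on any text with a few lines of standard Python. The sketch below is an illustration only: `coverage_of_frequent_chars` is a hypothetical helper, and the tiny sample string and the 2 % threshold stand in for the real datasets and the 1 in 100 million threshold.

```python
from collections import Counter

def coverage_of_frequent_chars(text, min_frequency):
    """Return (distinct chars, frequent chars, share of text kept with frequent chars only)."""
    counts = Counter(text)
    total = sum(counts.values())
    # A char is "frequent" when its relative frequency exceeds the threshold
    frequent = {char for char, n in counts.items() if n / total > min_frequency}
    kept = sum(n for char, n in counts.items() if char in frequent)
    return len(counts), len(frequent), kept / total

# Toy sample : 100 chars, one of which is a rare ligature
sample = "a" * 60 + "b" * 39 + "\ufb01"
n_distinct, n_frequent, coverage = coverage_of_frequent_chars(sample, 0.02)
print(n_distinct, n_frequent, f"{coverage:.2%}")
```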

We can be smarter than that and replace rare chars with equivalent (or mostly equivalent) more frequent chars to preserve a maximum of information.

### Target characters set

After a detailed study of all the frequent chars, the goal is to design a normalization pipeline which retains as much information as possible while greatly reducing the number of distinct chars.

We saw above that it is possible to preserve 99.9996 % of the original chars of the Business dataset while keeping only about 500 distinct chars (531). By replacing equivalent chars, we can divide this number by 2 and still retain the same amount of information.

It may then be useful to limit the number of distinct characters after normalization to **255 distinct characters** : 
- if needed, french text chars can then be encoded with a single byte
- the list of supported chars can be memorized by NLP application developers and users
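
To make the single-byte argument concrete, here is a minimal sketch of such an encoding. It is not part of the library: `CHARSET` is a hypothetical, much shorter stand-in for the 255-character list, and `encode` / `decode` are illustrative names.

```python
# Hypothetical stand-in for the supported character set (the real one has 255 entries)
CHARSET = ["\0", " ", "\n", "\t", ",", "."] + list("abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùû'")

# Every code must fit in one byte
assert len(CHARSET) <= 256
CHAR_TO_CODE = {char: code for code, char in enumerate(CHARSET)}

def encode(text):
    """Encode normalized text with one byte per character."""
    return bytes(CHAR_TO_CODE[char] for char in text)

def decode(data):
    """Inverse of encode."""
    return "".join(CHARSET[code] for code in data)

text = "l'été à la plage"
encoded = encode(text)
print(len(encoded), len(text.encode("utf-8")))
```

With this scheme the encoded length always equals the number of characters, whereas UTF-8 needs two bytes for each accented letter.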

In [None]:
from frenchtext.core import *
from frenchtext.chars import *

255 supported characters after normalization : 

In [None]:
import pandas as pd
dfcharsnorm = pd.read_csv(chardatadir / "charset-fr.csv", sep=";")
dfcharsnorm

| | FrCode | Category | SubCategory | Code | Char | CharName | CountBusiness |
|---|---|---|---|---|---|---|---|
| 0 | 0 | separator | control | 0 | | Reserved - End of string | 0 |
| 1 | 1 | separator | space | 32 | | Space | 88494564 |
| 2 | 2 | separator | space | 10 | \n | Char 10 | 9588147 |
| 3 | 3 | separator | space | 9 | \t | Char 9 | 1522053 |
| 4 | 4 | separator | punctuation | 44 | , | Comma | 286106887 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 251 | 251 | emoticon | object | 9792 | ♀ | Female Sign | 515 |
| 252 | 252 | emoticon | object | 127881 | 🎉 | Party Popper | 356 |
| 253 | 253 | emoticon | object | 9997 | ✍ | Writing Hand | 157 |
| 254 | 254 | emoticon | object | 9993 | ✉ | Envelope | 55 |


The table below shows the number of chars in each category (after normalization) **per 100 million characters** :

In [None]:
dfblocks = dfcharsnorm.groupby(by=["Category","SubCategory"]).agg({"Char":["count","sum"],"CountBusiness":"sum"})
dfblocks["CountBusiness"] = (dfblocks["CountBusiness"] / 27577304956 * 100000000).astype(int)
dfblocks

Unnamed: 0_level_0,Unnamed: 1_level_0,Char,Char,CountBusiness
Unnamed: 0_level_1,Unnamed: 1_level_1,count,sum,sum
Category,SubCategory,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
emoticon,hand,12,üí™üëâüëçüëèüôèüôåüëáüëäüëéüëå‚úå‚úä,42
emoticon,head,28,üôÇüòâüòÄüòÇüòÅüòäüôÅüòÖüòçüòÉüò°ü§£üòÑü§îüòéüò≠üëπüò±üòúüòãü§©üôÑüòÜüòõü§™üò¢üòáü§¶,233
emoticon,object,16,‚ö†üî¥üî•üèÜ‚öΩüí°üö®üí•‚ö°‚ô´‚ôÇ‚ôÄüéâ‚úç‚úâ‚úù,60
letter,digit,10,0123549876,3271115
letter,encoding,3,√ÉÔøΩÔøº,249
letter,greek,2,ŒªœÄ,2
letter,latin-fr,84,abcdefghijklmnopqrstuvwxyz√†√¢√§√ß√®√©√™√´√Æ√Ø√¥√∂√π√ª√º√øABCD...,91437146
letter,latin-other,25,√°√£√•ƒáƒçƒóƒüƒ±√≠√¨≈Ñ√±√≥√≤√µ√∏≈°≈ü√ü√∫√Å√Ö≈†√ö≈Ω,712
letter,other,5,_&@\#,40814
separator,control,0,0,0


### Normalization pipeline overview

The normalization pipeline applies the following **14 steps**, which are explained and illustrated in the sections below.

- Fix encoding errors
  - fix windows1252 text read as iso8859-1
  - fix utf8 text read as windows1252
  - fix windows1252 text read as utf8
  - merge Unicode combining chars
  - ignore control chars
- Remove display attributes
  - replace latin letter symbols
  - replace latin letter ligatures
  - replace latin number symbols
- Normalize visually equivalent chars
  - replace equivalent chars 
  - replace cyrillic and greek chars looking like latin letters
- Encode infrequent chars while losing a little bit of information 
  - replace infrequent latin letters with diacritics
  - replace infrequent chars from other scripts
  - replace infrequent symbols 
  - ignore remaining chars with no glyph 
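
To give an intuition for what some of these steps do, here is a rough sketch of three of them using only the Python standard library. This is not the library's implementation (which, judging from its output, works on substrings and tracks every change) : the function names and the `LIGATURES` table are assumptions made for illustration.

```python
import unicodedata

def fix_utf8_read_as_windows1252(text):
    """Undo a classic mojibake : UTF-8 bytes that were decoded as windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of encoding error : leave the text untouched
        return text

def merge_combining_chars(text):
    """Merge base letters and combining marks into precomposed chars (NFC)."""
    return unicodedata.normalize("NFC", text)

# Hypothetical, partial ligature table
LIGATURES = {"\ufb01": "fi", "\ufb03": "ffi", "\u0153": "oe"}

def replace_ligatures(text):
    """Expand latin letter ligatures into their component letters."""
    return text.translate(str.maketrans(LIGATURES))

print(fix_utf8_read_as_windows1252("\u00e2\u201a\u00ac"))  # mojibake form of the euro sign
print(merge_combining_chars("e\u0301nie\u0300me"))
print(replace_ligatures("e\ufb03cace"))
```

This whole-string approach only repairs text that is entirely mis-decoded; mixed content, as in the test string used later on this page, needs a finer-grained treatment.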

The statistics below count the number of chars normalized **for 1 million chars** in 4 distinct parts of the french datasets : business websites, forums, news, wikipedia.

The first line of the table below shows that :
- in 1 million chars extracted from forum pages (raw users input), 41.8 chars will be encoding errors (windows1252 read as iso8859-1)
- in 1 million chars extracted from wikipedia (curated content), only 0.006 chars will be encoding errors

These numbers show that **characters normalization is much more important in real world applications** than in academic papers based on clean wikipedia text. 

In [None]:
normstats = pd.read_csv(chardatadir / "stats" / "normalization.total.stats.csv")
normstats[["Transform","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]]

| | Transform | FreqBusiness | FreqForum | FreqPresse | FreqWikipedia |
|---|---|---|---|---|---|
| 0 | Fix encoding errors : windows1252 read as iso8... | 0.51056 | 41.818746 | 0.813485 | 0.006025 |
| 1 | Fix encoding errors : utf8 read as windows1252 | 0.126815 | 0.058024 | 0.072456 | 0.001037 |
| 2 | Fix encoding errors : windows1252 read as utf8 | 0.0 | 0.0 | 0.019315 | 0.0 |
| 3 | Merge Unicode combining chars | 2.811983 | 0.432638 | 0.568146 | 0.00014 |
| 4 | Ignore control chars | 6.450737 | 349.052995 | 6.454367 | 4.118586 |
| 5 | Replace latin letter symbols | 0.01936 | 0.039701 | 0.297372 | 0.15055 |
| 6 | Replace latin letter ligatures | 6.603815 | 6.54148 | 10.09729 | 17.204422 |
| 7 | Replace latin number symbols | 2.528338 | 4.162482 | 2.560933 | 0.429792 |
| 8 | Normalize equivalent chars | 814.327384 | 1248.410777 | 684.33373 | 242.391239 |
| 9 | Replace cyrillic and greek chars looking like ... | 0.062432 | 0.760424 | 0.491996 | 7.479907 |


Most frequent chars produced by replacing equivalent characters :

In [None]:
replacestats = pd.read_csv(chardatadir / "stats" / "normalization.layer8.stats.csv")
replacestats[["Char","CharName","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]].head(20)

| | Char | CharName | FreqBusiness | FreqForum | FreqPresse | FreqWikipedia |
|---|---|---|---|---|---|---|
| 0 | ' | Apostrophe | 486.034805 | 160.264219 | 376.104982 | 134.658673 |
| 1 | | Space | 310.411117 | 1082.845985 | 288.635983 | 87.877649 |
| 2 | - | Hyphen-Minus | 14.431203 | 2.903761 | 12.828203 | 16.223154 |
| 3 | « | Left-Pointing Double Angle Quotation Mark | 1.429478 | 0.680513 | 3.002426 | 0.559632 |
| 4 | » | Right-Pointing Double Angle Quotation Mark | 1.323524 | 0.533926 | 2.46188 | 0.544134 |
| 5 | \| | Vertical Line | 0.003452 | 0.001018 | 0.005488 | 0.875894 |
| 6 | • | Bullet | 0.204104 | 0.243295 | 0.189664 | 0.543237 |
| 7 | . | Full Stop | 0.05928 | 0.078893 | 0.85623 | 0.069278 |
| 8 | " | Quotation Mark | 0.085093 | 0.023413 | 0.011504 | 0.292385 |
| 9 | : | Colon | 0.00015 | 0.000509 | 5.3e-05 | 0.169047 |


For example, the list of all Unicode chars which will be projected to a regular 'apostrophe' :

In [None]:
replacechars = pd.read_csv(chardatadir / "normalizedchars.csv", sep=';')
replacechars[replacechars["NormChar"]=="'"][["Code","Char","CharName"]]

| | Code | Char | CharName |
|---|---|---|---|
| 23 | 96 | `` ` `` | Grave Accent |
| 24 | 180 | ´ | Acute Accent |
| 25 | 697 | ʹ | Modifier Letter Prime |
| 26 | 699 | ʻ | Modifier Letter Turned Comma |
| 27 | 700 | ʼ | Modifier Letter Apostrophe |
| 28 | 702 | ʾ | Modifier Letter Right Half Ring |
| 29 | 703 | ʿ | Modifier Letter Left Half Ring |
| 30 | 712 | ˈ | Modifier Letter Vertical Line |
| 31 | 714 | ˊ | Modifier Letter Acute Accent |
| 32 | 715 | ˋ | Modifier Letter Grave Accent |
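
A projection of this kind can be sketched with `str.translate`. The snippet below is an illustration, not the library's code : it uses only the ten code points visible in the table above (the complete list in `normalizedchars.csv` may be longer), and `normalize_apostrophes` is a hypothetical name.

```python
# Code points shown in the table above (possibly a partial list)
APOSTROPHE_LIKE_CODES = [96, 180, 697, 699, 700, 702, 703, 712, 714, 715]

# str.translate accepts a mapping from code points to replacement strings
TO_APOSTROPHE = {code: "'" for code in APOSTROPHE_LIKE_CODES}

def normalize_apostrophes(text):
    """Replace apostrophe look-alikes with the regular apostrophe (code 39)."""
    return text.translate(TO_APOSTROPHE)

normalized = normalize_apostrophes("l`oeuvre d\u00b4art")
print(normalized)
```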


Frequency of characters from other scripts (chinese, arabic, cyrillic ...) :

In [None]:
scriptsstats = pd.read_csv(chardatadir / "stats" / "normalization.layer11.stats.csv")
scriptsstats[["CharFamily","FreqBusiness","FreqForum","FreqPresse","FreqWikipedia"]]

| | CharFamily | FreqBusiness | FreqForum | FreqPresse | FreqWikipedia |
|---|---|---|---|---|---|
| 0 | ChineseJapaneseKorean | 0.012456 | 0.177127 | 0.194677 | 4.059173 |
| 1 | Arabic | 0.012306 | 0.026467 | 0.46028 | 3.14012 |
| 2 | Cyrillic | 0.024462 | 0.166438 | 0.237159 | 3.118961 |
| 3 | Greek | 0.016058 | 0.022904 | 0.031347 | 2.423996 |
| 4 | Hebrew | 0.00015 | 0.0 | 0.184914 | 1.132155 |
| 5 | Other | 0.00075 | 0.029012 | 0.004063 | 0.800871 |
| 6 | Indian | 0.00075 | 0.037665 | 0.033458 | 0.737955 |
| 7 | Phonetic | 0.002401 | 0.001527 | 0.001636 | 0.298579 |
| 8 | Latin | 0.013507 | 0.006108 | 0.007283 | 0.269377 |
| 9 | Math | 0.001801 | 0.000509 | 0.000528 | 0.240707 |


### Normalization pipeline API

Initialize a text normalizer :

In [None]:
%time norm = TextNormalizer()
norm

CPU times: user 1.83 s, sys: 15.6 ms, total: 1.84 s
Wall time: 2 s


1 - Fix encoding errors : windows1252 read as iso8859-1
2 - Fix encoding errors : utf8 read as windows1252
3 - Fix encoding errors :  windows1252 read as utf8
4 - Merge Unicode combining chars
5 - Ignore control chars
6 - Replace latin letter symbols
7 - Replace latin letter ligatures
8 - Replace latin number symbols
9 - Normalize equivalent chars
10 - Replace cyrillic and greek chars looking like latin letters
11 - Replace infrequent chars : latin letters with diacritics
12 - Replace infrequent chars : other scripts
13 - Replace infrequent chars : symbols
14 - Replace infrequent chars : chars to ignore

Normalize text :

In [None]:
teststring = chr(127995)+"① l`"+chr(156)+"uv"+chr(127)+"re est¨ "+chr(147)+"belle"+chr(148)+"¸ Ã  Â½ â‚¬ énième â€° "+chr(133)+" ⁽🇪ﬃc🇦ce⁾ ！"
teststring

'🏻① l`\x9cuv\x7fre est¨ \x93belle\x94¸ Ã  Â½ â‚¬ énième â€° \x85 ⁽🇪ﬃc🇦ce⁾ ！'

In [None]:
result = norm(teststring)
result

(1) l'oeuvre est «belle», Ã  1/2 € énième ‰ … (EfficAce) !

Describe the changes applied by the normalization pipeline :

In [None]:
print(result.describeChanges())

Fix encoding errors : windows1252 read as iso8859-1
 < 🏻① l` [\x9c] uvre est¨  [\x93] belle [\x94] ¸ Ã  Â½ â‚¬ énième â€°  [\x85]  ⁽🇪ﬃc🇦ce⁾ ！
 < 🏻① l` [œ] uvre est¨  [“] belle [”] ¸ Ã  Â½ â‚¬ énième â€°  […]  ⁽🇪ﬃc🇦ce⁾ ！
Fix encoding errors : utf8 read as windows1252
 < 🏻① l`œuvre est¨ “belle”¸ Ã   [Â½]   [â‚¬]  énième  [â€°]  … ⁽🇪ﬃc🇦ce⁾ ！
 < 🏻① l`œuvre est¨ “belle”¸ Ã   [½_]   [€__]  énième  [‰__]  … ⁽🇪ﬃc🇦ce⁾ ！
Merge Unicode combining chars
 < 🏻① l`œuvre est¨ “belle”¸ Ã  ½ €  [é] ni [è] me ‰ … ⁽🇪ﬃc🇦ce⁾ ！
 < 🏻① l`œuvre est¨ “belle”¸ Ã  ½ €  [é_] ni [è_] me ‰ … ⁽🇪ﬃc🇦ce⁾ ！
Ignore control chars
 <  [🏻] ① l`œuv [] re est [¨]  “belle”¸ Ã  ½ € énième ‰ … ⁽🇪ﬃc🇦ce⁾ ！
 <  [_] ① l`œuv [_] re est [_]  “belle”¸ Ã  ½ € éni

Compute spans for equivalent substrings before and after normalization :

In [None]:
result.output[0:12]

"(1) l'oeuvre"

In [None]:
result.input[result.mapOutputIndexToInput(0):result.mapOutputIndexToInput(12)]

'🏻① l`\x9cuv\x7fre'

In [None]:
result.output[3:10]

" l'oeuv"

In [None]:
result.input[result.mapOutputIndexToInput(3):result.mapOutputIndexToInput(10)]

' l`\x9cuv\x7f'
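
One simple way to support this kind of span mapping is to record, for every output character, the index of the input character that produced it. The sketch below illustrates the idea with per-character replacements only; it is not how `mapOutputIndexToInput` is implemented, and `normalize_with_mapping` is a hypothetical name.

```python
def normalize_with_mapping(text, replacements):
    """Apply per-character replacements and record, for each output char,
    the index of the input char that produced it."""
    output_chars = []
    output_to_input = []
    for input_index, char in enumerate(text):
        # A replacement may be empty (char ignored) or longer than one char
        replacement = replacements.get(char, char)
        for out_char in replacement:
            output_chars.append(out_char)
            output_to_input.append(input_index)
    # Sentinel so that an end-of-slice index equal to len(output) can be mapped too
    output_to_input.append(len(text))
    return "".join(output_chars), output_to_input

replacements = {"\u2460": "(1)", "`": "'", "\u0153": "oe", "\x7f": ""}
source = "\u2460 l`\u0153uv\x7fre"
output, mapping = normalize_with_mapping(source, replacements)

# Map the output span [3:10] back to the input
span = source[mapping[3]:mapping[10]]
print(repr(output[3:10]), repr(span))
```

In this sketch an ignored character (here the control char `\x7f`) ends up inside the mapped input span because the end index points at the next input character that produced output.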

Performance test : **2500 sentences per second** => fast enough but will be optimized in a later version.

In [None]:
%timeit -n100 norm(teststring)

397 µs ± 89.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
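
The throughput figure follows directly from the mean time per call, treating the test string as one sentence :

```python
# Mean time per call reported by %timeit, in seconds
mean_seconds_per_call = 397e-6

# Number of calls (one test sentence each) that fit in one second
calls_per_second = 1 / mean_seconds_per_call
print(round(calls_per_second))
```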


### Appendix : Unicode utility functions

Unicode characters properties :

In [None]:
charname("🙂")

'Slightly Smiling Face'

In [None]:
charcategory("🙂")

'Symbol'

In [None]:
charsubcategory("🙂")

'Other'

In [None]:
charblock("🙂")

'Emoticons'

In [None]:
blockfamily('Emoticons')

'Symbols'
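
For comparison, part of this information is also available from Python's standard `unicodedata` module, which returns the official upper-case name and a two-letter general category code. The snippet below uses only the standard library and does not depend on frenchtext.

```python
import unicodedata

char = "\U0001F642"  # slightly smiling face

# Official Unicode character name, in upper case
name = unicodedata.name(char)
# Two-letter general category : "So" stands for "Symbol, other"
category = unicodedata.category(char)

print(name, category)
```

The standard module does not expose Unicode blocks, so a lookup such as `charblock` needs its own data table.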