# spaCy tokenization

## Algorithm

https://spacy.io/usage/linguistic-features#how-tokenizer-works

1. The text is split on **whitespace characters**, similar to `text.split(' ')`.

*Whitespace characters are those characters defined in the Unicode character database as “Other” or “Separator” and those with bidirectional property being one of “WS”, “B”, or “S”.*

2. The tokenizer processes the text from left to right : on each substring, it performs three checks


3. Does the substring match patterns of **tokens that should never be split** ?

*For example URLs or numbers.*

4. Does the substring match **special cases** of the tokenizer ? 

*For example, “don’t” doesn't contain any whitespace, but should be split into two tokens : “do” and “n’t”, while “U.K.” should always remain one token.*

5. Can a **prefix**, **suffix** or **infix** be split off (in this order) ?

*For example punctuation like commas, periods, hyphens or quotes.*

6. If we consume a prefix, suffix or infix, go back to step 3 (so that exceptions always get priority).


7. Return a token if there is no more prefix, suffix or infix to consume.

https://github.com/explosion/spaCy/blob/master/spacy/tokenizer.pyx
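
The steps above can be sketched in plain Python. This is a simplified illustration, not spaCy's implementation: the rule sets (`SPECIAL_CASES`, `token_match`, `prefix_search`, `suffix_search`, `infix_finditer`) are tiny hypothetical stand-ins chosen for the example, and the order of the checks follows the list above.

```python
import re

# Toy rules for illustration only; spaCy's real data is far larger.
SPECIAL_CASES = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
token_match = re.compile(r"^https?://\S+$").match
prefix_search = re.compile(r"^[(\[{]").search
suffix_search = re.compile(r"[)\]},.!?;:]$").search
infix_finditer = re.compile(r"(?<=[a-zA-Z])-(?=[a-zA-Z])").finditer


def tokenize(text):
    tokens = []
    for substring in text.split():  # step 1: whitespace split
        suffixes = []  # suffixes are emitted after the core of the substring
        while substring:
            # Steps 3 and 4: never-split patterns and special cases come first.
            if token_match(substring):
                tokens.append(substring)
                break
            if substring in SPECIAL_CASES:
                tokens.extend(SPECIAL_CASES[substring])
                break
            # Step 5: try a prefix, then a suffix; step 6: loop back to the checks.
            prefix = prefix_search(substring)
            if prefix:
                tokens.append(prefix.group())
                substring = substring[prefix.end():]
                continue
            suffix = suffix_search(substring)
            if suffix:
                suffixes.insert(0, suffix.group())
                substring = substring[:suffix.start()]
                continue
            # No affix left: split on infixes and emit what remains (step 7).
            offset = 0
            for match in infix_finditer(substring):
                tokens.append(substring[offset:match.start()])
                tokens.append(match.group())
                offset = match.end()
            tokens.append(substring[offset:])
            break
        tokens.extend(suffixes)
    return tokens


print(tokenize("(don't panic, well-known U.K.)"))
```

With these toy rules, "U.K.)" first loses its closing parenthesis as a suffix, and the remaining "U.K." is then caught by the special-case check before the final period can be stripped. That is the priority rule described in step 6.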

## Configuration

https://spacy.io/usage/linguistic-features#native-tokenizers

Configuration elements :

- Special cases : contractions, units of measurement, emoticons, certain abbreviations, etc.
- Preceding punctuation : open quotes, open brackets, ...
- Succeeding punctuation : commas, periods, close quotes, ...
- Infixes : non-whitespace separators, such as hyphens ...
- Boolean function token_match : matching strings that should never be split, overriding the previous rules. Useful for things like URLs or numbers.

Two levels of configuration :

- Base data : char classes, prefixes/suffixes/infixes, tokenizer exceptions
- Language data (en,fr,de,es) : prefixes/suffixes/infixes, tokenizer exceptions

## Base data

### Char classes

https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

https://unicode-table.com/en/blocks/

```
LATIN_BASIC = _latin_standard + _latin_standard_fullwidth + _latin_supplement + _latin_extendedA      

LATIN_LOWER_BASIC, LATIN_UPPER_BASIC

LATIN = LATIN_BASIC + _latin_extendedB + _latin_extendedC + _latin_extendedD + _latin_extendedE + _latin_phonetic + _latin_diacritics         

LATIN_LOWER, LATIN_UPPER

ALPHA = LATIN + _russian + _tatar + _greek + _ukrainian + _bengali + _hebrew + _persian + _sinhala + _hindi

ALPHA_LOWER, ALPHA_UPPER

UNITS = km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb TB T G M K % км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм ...non latin...

CURRENCY = $ £ € ¥ ฿ US$ C$ A$ ₽ ﷼ ₴

QUOTES = ' " ” “ ` ‘ ´ ’ ‚ , „ » « 「 」 『 』 （ ） 〔 〕 【 】 《 》 〈 〉

PUNCT = … …… , : ; ! ? ¿ ؟ ¡ ( ) [ ] { } < > _ # \* & 。 ？ ！ ， 、 ； ： ～ · । ، ۔ ؛ ٪

HYPHENS = - – — -- --- —— ~

ELLIPSES = ..+ …

ICONS = Various symbols like dingbats, but also emoji
```

Notes :
- CURRENCY - unicode symbols : 36 Dollar Sign | 163 Pound Sign | 8364 Euro Sign | 165 Yen Sign | 3647 Thai Currency Symbol Baht | 8381 Ruble Symbol | 65020 Rial Sign | 8372 Hryvnia Sign 

=> CURRENCY misses many symbols from https://unicode-table.com/en/blocks/currency-symbols/


- QUOTES - unicode symbols : 39 Apostrophe | 34 Quotation Mark | 8221 Right Double Quotation Mark | 8220 Left Double Quotation Mark | 96 Grave Accent | 180 Acute Accent | 8216 Left Single Quotation Mark | 8217 Right Single Quotation Mark | 8218 Single Low-9 Quotation Mark | 44 Comma | 8222 Double Low-9 Quotation Mark | 187 Right-Pointing Double Angle Quotation Mark | 171 Left-Pointing Double Angle Quotation Mark | 12300 Left Corner Bracket | 12301 Right Corner Bracket | 12302 Left White Corner Bracket | 12303 Right White Corner Bracket | 65288 Fullwidth Left Parenthesis | 65289 Fullwidth Right Parenthesis | 12308 Left Tortoise Shell Bracket | 12309 Right Tortoise Shell Bracket | 12304 Left Black Lenticular Bracket | 12305 Right Black Lenticular Bracket | 12298 Left Double Angle Bracket | 12299 Right Double Angle Bracket | 12296 Left Angle Bracket | 12297 Right Angle Bracket


- PUNCT - unicode symbols : 8230 Horizontal Ellipsis (1x 2x) | 44 Comma | 58 Colon | 59 Semicolon | 33 Exclamation Mark | 63 Question Mark | 191 Inverted Question Mark | 1567 Arabic Question Mark | 161 Inverted Exclamation Mark | 40 Left Parenthesis | 41 Right Parenthesis | 91 Left Square Bracket | 93 Right Square Bracket | 123 Left Curly Bracket | 125 Right Curly Bracket | 60 Less-Than Sign | 62 Greater-Than Sign | 95 Low Line | 35 Number Sign | 42 Asterisk | 38 Ampersand | 12290 Ideographic Full Stop | 65311 Fullwidth Question Mark | 65281 Fullwidth Exclamation Mark | 65292 Fullwidth Comma | 12289 Ideographic Comma | 65307 Fullwidth Semicolon | 65306 Fullwidth Colon | 65374 Fullwidth Tilde | 183 Middle Dot | 2404 Devanagari Danda | 1548 Arabic Comma | 1563 Arabic Semicolon | 1748 Arabic Full Stop | 1642 Arabic Percent Sign

=> PUNCT doesn't contain . or +

- HYPHENS - unicode symbols : 45 Hyphen-Minus (1x 2x 3x) | 8211 En Dash | 8212 Em Dash (1x 2x) | 126 Tilde


- ELLIPSES - unicode symbols : 8230 Horizontal Ellipsis


- ICONS - list of symbols : https://www.compart.com/en/unicode/category/So
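
The decimal code points in the notes above can be checked with the standard library. The helper below is a hypothetical convenience (`describe` is not a spaCy function): it takes a space-separated class string in the style of `char_classes.py`, skips multi-character entries such as `US$`, and returns the code point and Unicode name of each single character.

```python
import unicodedata


def describe(chars):
    """Map each single-character entry to (decimal code point, Unicode name)."""
    return {
        ch: (ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in chars.split()
        if len(ch) == 1  # skip multi-character entries like "US$"
    }


CURRENCY = "$ £ € ¥ ฿ US$ C$ A$ ₽ ﷼ ₴"
for ch, (code, name) in describe(CURRENCY).items():
    print(code, name.title())
```

Running the same helper on the QUOTES or PUNCT strings gives a quick way to re-derive the lists in the notes when the upstream file changes.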

### Prefixes


https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

- punct, ellipses, quotes, currency, icons
- "§" (167 Section Sign), "%", "=", "—" (8212 Em Dash), "–" (8211 En Dash)
- "+" not followed by a number : "\+(?![0-9])"
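
As a rough illustration of how such entries become a single prefix matcher, the sketch below anchors each entry at the start of the string and joins them with `|`. The `PREFIXES` list is a small hand-picked subset with a few escaped punctuation and currency characters added as stand-ins, `first_prefix` is a hypothetical helper, and the "+" rule is assumed to be `\+(?![0-9])`.

```python
import re


def compile_prefix_regex(entries):
    """Anchor every entry at the start of the string and join them with '|'."""
    return re.compile("|".join("^" + piece for piece in entries if piece.strip()))


# Small subset of the prefix entries; the real list also includes icons, quotes, etc.
PREFIXES = ["§", "%", "=", "—", "–", r"\+(?![0-9])", r"\(", r"\[", r"\$", "€"]
prefix_search = compile_prefix_regex(PREFIXES).search


def first_prefix(text):
    """Return the prefix that would be split off, or None."""
    match = prefix_search(text)
    return match.group() if match else None


print(first_prefix("+abc"))  # "+" is split off before a letter
print(first_prefix("+33"))   # None: "+" stays attached to a number
```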

### Suffixes

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py
    
- punct, ellipses, quotes, icons
- English suffixes : "'s", "'S", "’s", "’S"
- "—" (8212 Em Dash), "–" (8211 En Dash)
- "+" preceded by numbers : "(?<=[0-9])\+"		
- "." preceded by temperature : "(?<=°[FfCcKk])\."		
- "." preceded by alphanumeric lower, "%" "²" "-" "+", quotes : "(?<=[0-9{al}{e}(?:{q})])\."
- "." preceded by at least two alphabetic upper : "(?<=[{au}][{au}])\."		
- currency preceded by number : "(?<=[0-9])(?:{c})"		
- unit preceded by number : "(?<=[0-9])(?:{u})"
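
The lookbehind rules are easier to read with a few concrete strings. The sketch below is an approximation: `AL`, `AU`, `UNITS` and `CURRENCY` are reduced ASCII-centric stand-ins for the real character classes, the quote characters are left out of the third rule, and `last_suffix` is a hypothetical helper. Each entry is anchored at the end of the string with `$`.

```python
import re


def compile_suffix_regex(entries):
    """Anchor every entry at the end of the string and join them with '|'."""
    return re.compile("|".join(piece + "$" for piece in entries if piece.strip()))


AL, AU = "a-z", "A-Z"          # stand-ins for ALPHA_LOWER / ALPHA_UPPER
UNITS = "km|kg|cm|mm|m|g|%"    # tiny subset of UNITS
CURRENCY = r"\$|€|£"           # tiny subset of CURRENCY

SUFFIXES = [
    r"(?<=[0-9])\+",                             # "+" preceded by a number
    r"(?<=°[FfCcKk])\.",                         # "." preceded by a temperature
    r"(?<=[0-9{al}%²\-+])\.".format(al=AL),      # "." after digit, lowercase, % ² - +
    r"(?<=[{au}][{au}])\.".format(au=AU),        # "." after two uppercase letters
    r"(?<=[0-9])(?:{c})".format(c=CURRENCY),     # currency preceded by a number
    r"(?<=[0-9])(?:{u})".format(u=UNITS),        # unit preceded by a number
]
suffix_search = compile_suffix_regex(SUFFIXES).search


def last_suffix(text):
    """Return the suffix that would be split off, or None."""
    match = suffix_search(text)
    return match.group() if match else None


for sample in ["18+", "20°C.", "etc.", "USA.", "U.K.", "10km", "5€"]:
    print(sample, "->", last_suffix(sample))
```

Note that "U.K." yields no suffix under these rules: the final period follows a single uppercase letter, so neither the lowercase rule nor the two-uppercase rule applies.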

### Infixes

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

- ellipses, icons
- "+" "-" "\*" "^" preceded by number and followed by number or "-" : "(?<=[0-9])[+\-\*^](?=[0-9-])"
- "." preceded by alphabetic lower or quote and followed by alphabetic upper or quote : "(?<=[{al}{q}])\.(?=[{au}{q}])"
- "," preceded by alphabetic and followed by alphabetic : "(?<=[{a}]),(?=[{a}])"
- hyphen preceded by alphabetic and followed by alphabetic : "(?<=[{a}])(?:{h})(?=[{a}])"
- ":" "<" ">" "=" "/" preceded by alphanumeric and followed by alphabetic : "(?<=[{a}0-9])[:<>=/](?=[{a}])"
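
A minimal sketch of the infix rules in action, under simplifying assumptions: `A`, `AL` and `AU` are ASCII stand-ins for the alphabetic classes, quotes and icons are omitted, and `split_infixes` is a hypothetical helper that cuts a string at every infix match. Infix patterns are joined without anchors because they can match anywhere inside the substring.

```python
import re


def compile_infix_regex(entries):
    """Join the entries with '|'; no anchoring, infixes match anywhere."""
    return re.compile("|".join(piece for piece in entries if piece.strip()))


A, AL, AU = "a-zA-Z", "a-z", "A-Z"   # stand-ins for ALPHA / ALPHA_LOWER / ALPHA_UPPER
HYPHENS = "-|–|—|--|---|——|~"

INFIXES = [
    r"\.\.+",                                               # ellipses
    "…",
    r"(?<=[0-9])[+\-\*^](?=[0-9-])",                        # arithmetic between numbers
    r"(?<=[{al}])\.(?=[{au}])".format(al=AL, au=AU),        # "." between lower and upper
    r"(?<=[{a}]),(?=[{a}])".format(a=A),                    # "," between letters
    r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=A, h=HYPHENS),   # hyphen between letters
    r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=A),           # : < > = / before a letter
]
infix_finditer = compile_infix_regex(INFIXES).finditer


def split_infixes(text):
    """Split text into the pieces before, at, and after each infix match."""
    pieces, offset = [], 0
    for match in infix_finditer(text):
        if match.start() > offset:
            pieces.append(text[offset:match.start()])
        pieces.append(match.group())
        offset = match.end()
    if offset < len(text):
        pieces.append(text[offset:])
    return pieces


for sample in ["well-known", "10-20", "end.Start", "key=value", "3.14"]:
    print(sample, "->", split_infixes(sample))
```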


### Exceptions

https://github.com/explosion/spaCy/blob/master/spacy/lang/tokenizer_exceptions.py

TOKEN_MATCH (never split)

- URL_PATTERN

BASE_EXCEPTIONS

- SPACE : space (32), non-breaking space (160), tab, "\t", newline, "\n", em dash (8212) => POS:SPACE
- special cases : \\"), &lt;space&gt;, '', C++
- enumerations with letters :  "a.", "b.", …, "z.", "ä.", "ö.", "ü."
- emoticons : 126 combinations like ":-)", ">:o", "^__^", "(ಠ_ಠ)", "¯\(ツ)/¯"

Notes : 

- URL_PATTERN described here https://mathiasbynens.be/demo/url-regex with a few modifications.


- emoticons - chars included : ": ; . - = * / \ ^ _ ( ) " ' [ ] { } < > | 0 1 3 8 D o O p P v V x X @ ¬ ಠ ︵ ¯ ツ ╯ ° □ ┻ ━"

## French data

### Char classes

--> Combining Base data and French data <--

https://github.com/explosion/spaCy/blob/master/spacy/lang/fr/punctuation.py

ELISION = ' (39 Apostrophe) ’ (8217 Right Single Quotation Mark) 

HYPHENS = - (45 Hyphen-Minus) – (8211 En Dash) — (8212 Em Dash) ‐ (8208 Hyphen) ‑ (8209 Non-Breaking Hyphen)

Notes - differences with Base version :

- ELISION is new and specific to French


- HYPHENS : REMOVED 45 Hyphen-Minus (2x 3x) | 8212 Em Dash (2x) | 126 Tilde, ADDED 8208 Hyphen | 8209 Non-Breaking Hyphen
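
The REMOVED / ADDED summary for HYPHENS can be reproduced with a set difference. The two lists below are transcribed from this page (using `\u` escapes for the non-ASCII dashes), not read from spaCy itself, so treat them as assumptions.

```python
# Base HYPHENS: - (45), en dash (8211), em dash (8212), --, ---, double em dash, ~
BASE_HYPHENS = ["-", "\u2013", "\u2014", "--", "---", "\u2014\u2014", "~"]
# French HYPHENS: - (45), en dash, em dash, hyphen (8208), non-breaking hyphen (8209)
FRENCH_HYPHENS = ["-", "\u2013", "\u2014", "\u2010", "\u2011"]

removed = sorted(set(BASE_HYPHENS) - set(FRENCH_HYPHENS))
added = sorted(set(FRENCH_HYPHENS) - set(BASE_HYPHENS))

# Show each entry as a tuple of decimal code points for easy comparison.
print("REMOVED:", [tuple(ord(c) for c in entry) for entry in removed])
print("ADDED:  ", [tuple(ord(c) for c in entry) for entry in added])
```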

### Prefixes

--> Using Base data <--

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

- punct, ellipses, quotes, currency, icons
- "§" (167 Section Sign), "%", "=", "—" (8212 Em Dash), "–" (8211 En Dash)
- "+" not followed by a number : "\+(?![0-9])"

### Suffixes

--> Using French data <--

https://github.com/explosion/spaCy/blob/master/spacy/lang/fr/punctuation.py
  
- punct, ellipses, quotes
- "+" preceded by numbers : "(?<=[0-9])\+"		
- "." preceded by temperature : "(?<=°[FfCcKk])\."		
- "." preceded by alphanumeric lower, "%" "²" "-" "+", quotes : "(?<=[0-9{al}{e}(?:{q})])\."
- "." preceded by at least two alphabetic upper : "(?<=[{au}][{au}])\."
- currency preceded by number : "(?<=[0-9])(?:{c})"		
- unit preceded by number : "(?<=[0-9])(?:{u})"
- temperature unit after number : "(?<=[0-9])°[FfCcKk]"
- % after number : "(?<=[0-9])%"

Notes - differences with Base version :

- REMOVED - icons *[seems needed in French too]*


- REMOVED - english suffixes : "'s", "'S", "’s", "’S"


- REMOVED - "—" (8212 Em Dash), "–" (8211 En Dash)


- ADDED - temperature unit after number : "(?<=[0-9])°[FfCcKk]" *[seems needed in English too]*


- ADDED - % after number : "(?<=[0-9])%" *[redundant with unit preceded by number, because unit contains %]*

### Infixes

--> Combining Base data and French data <--

https://github.com/explosion/spaCy/blob/master/spacy/lang/fr/punctuation.py


French data :
- (alphabetic elision) | alphabetic : "(?<=[{a}][{el}])(?=[{a}])"	

Base data :
- ellipses, icons
- "+" "-" "\*" "^" preceded by number and followed by number or "-" : "(?<=[0-9])[+\-\*^](?=[0-9-])"
- "." preceded by alphabetic lower or quote and followed by alphabetic upper or quote : "(?<=[{al}{q}])\.(?=[{au}{q}])"
- "," preceded by alphabetic and followed by alphabetic : "(?<=[{a}]),(?=[{a}])"
- hyphen preceded by alphabetic and followed by alphabetic : "(?<=[{a}])(?:{h})(?=[{a}])"
- ":" "<" ">" "=" "/" preceded by alphanumeric and followed by alphabetic : "(?<=[{a}0-9])[:<>=/](?=[{a}])"
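
The French elision rule is a zero-width infix: it matches the empty position right after a letter followed by an elision character, as long as another letter comes next. A small sketch of the effect, where `ALPHA` is a rough Latin-1 stand-in for the real alphabetic class and `split_elision` is a hypothetical helper:

```python
import re

ALPHA = "a-zA-ZÀ-ÖØ-öø-ÿ"   # rough stand-in for the ALPHA class
ELISION = "'\u2019"          # apostrophe and right single quotation mark

elision_infix = re.compile(
    r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION)
)


def split_elision(word):
    """Cut the word at every zero-width elision match."""
    pieces, start = [], 0
    for match in elision_infix.finditer(word):
        pieces.append(word[start:match.start()])
        start = match.start()
    pieces.append(word[start:])
    return pieces


print(split_elision("l'avion"))       # ["l'", "avion"]
print(split_elision("aujourd'hui"))   # ["aujourd'", "hui"]
```

The second call suggests why "aujourd'hui" has to be listed as a tokenizer exception: without one, the elision rule alone would cut it in two.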

### Exceptions

--> Combining Base data and French data <--

https://github.com/explosion/spaCy/blob/master/spacy/lang/fr/tokenizer_exceptions.py

TOKEN_MATCH (never split)
[in addition to URL_PATTERN]

\- char - Prefixes :

- "anti", 
- "apr[èe]s", "arrières?", "avant", "bas(?:ses?)?",
- "am[ée]ricano", "anglo", "arabo",
-  "a[ée]ro", "audio",
- "abat", "a[fg]ro", "aigues?", "arcs?", "archi", "bec?", "banc", "blanc",
- "avion", "bateaux?", "auto", "bio?",
- "belles?", "beau", "bien",
- "after", "best",
... 193 prefixes ...
- Cities prefixes : "Fontaine", "La Chapelle", "Marie", "Le Mesnil", "Neuville", "Pierre", "Val", "Vaux"

=> "^{prefix}[{hyphen}][{al}][{hyphen}{al}{elision}]\*$"

\- char - Compound words :

```
"^a[{hyphen}]sexualis[{al}]+$", "^binge[{hyphen}]watch[{al}]+$", "^black[{hyphen}]out[{al}]*$", "^bouche[{hyphen}]por[{al}]+$", "^burn[{hyphen}]out[{al}]*$", .... 50 compound words ...... "^teuf[{hyphen}]teuf[{al}]*$", "^yo[{hyphen}]yo[{al}]+$", "^zig[{hyphen}]zag[{al}]*$", "^z[{elision}]yeut[{al}]+$"
```

\- char - Double compound words (like saut-de-ski, pet-en-l'air) :

- "l[èe]s?", "la", "en", "des?", "d[eu]", "sur", "sous", "aux?", "à", "et", "près", "saint"

=> "^[{a}]+[{hyphen}]{hyphen_combo}[{hyphen}](?:l[{elision}])?[{a}]+$"

\' char - Prefixes :

- "r?é?entr", "grande?s?", "r"

=> "^{prefix}[{elision}][{al}][{hyphen}{al}{elision}]\*$"
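
To make the templates above concrete, the sketch below expands two of them (hyphen prefixes and double compound words) for a handful of entries and joins the results into one never-split matcher. The character classes are approximations (`ALPHA`, `ALPHA_LOWER` cover ASCII plus Latin-1 letters only) and the prefix list is limited to five of the entries shown above.

```python
import re

ALPHA = "a-zA-ZÀ-ÖØ-öø-ÿ"                  # rough stand-in for ALPHA
ALPHA_LOWER = "a-zß-öø-ÿ"                  # rough stand-in for ALPHA_LOWER
HYPHEN = "\\-\u2013\u2014\u2010\u2011"     # French HYPHENS, "-" escaped for use in a class
ELISION = "'\u2019"

HYPHEN_PREFIXES = ["anti", "apr[èe]s", "arrières?", "avant", "bas(?:ses?)?"]
HYPHEN_COMBOS = ["l[èe]s?", "la", "en", "des?", "d[eu]", "sur", "sous",
                 "aux?", "à", "et", "près", "saint"]

patterns = [
    "^{prefix}[{hyphen}][{al}][{hyphen}{al}{elision}]*$".format(
        prefix=prefix, hyphen=HYPHEN, al=ALPHA_LOWER, elision=ELISION
    )
    for prefix in HYPHEN_PREFIXES
]
patterns += [
    "^[{a}]+[{hyphen}]{combo}[{hyphen}](?:l[{elision}])?[{a}]+$".format(
        a=ALPHA, hyphen=HYPHEN, combo=combo, elision=ELISION
    )
    for combo in HYPHEN_COMBOS
]
token_match = re.compile("|".join(patterns)).match


def never_split(word):
    """True if the word should be kept as a single token."""
    return token_match(word) is not None


for word in ["anti-inflammatoire", "avant-garde", "saut-de-ski",
             "pet-en-l'air", "porte-monnaie"]:
    print(word, "->", never_split(word))
```

With only these five prefixes, "porte-monnaie" is not protected and would be split by the hyphen infix rule; the full list of 193 prefixes exists to cover such cases.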

TOKENIZER_EXCEPTIONS
[in addition to Base data]

. char - Dates :
- "janv." "févr." "avr." "juill." "sept." "oct." "nov." "déc." => "janvier" ... "décembre"
- "av." "apr." "J.-C." => "avant" "après" "Jésus" "Christ"

. char - Titles :
- "M." "Mr." "Mme." "Mlle." "Dr." => "monsieur" ... "docteur"
- "St." "Ste." => "saint" "sainte"

. char - Abbreviations :
- "etc."

° char - Abbreviations :
- "n°" "d°" => "numéro" "degrés"

\- char - "-t-" :
- "a" "est" "semble" "indique" "moque" "passe" +++ "-t" +++ "-elle", "-il", "-on"

\- char - "-ce" :
- "est" +++ "-ce"

' and \- chars :
- "qu'" "n'" +++ "est" +++ "-ce"

' char :
- "aujourd'hui" "Aujourd'hui"
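
The "+++" notation above describes exceptions that are generated by combination rather than listed one by one. A minimal sketch of that generation, where the output shape (surface string mapped to a list of token strings) is a simplification of spaCy's exception dictionaries and `build_exceptions` is a hypothetical name:

```python
VERBS = ["a", "est", "semble", "indique", "moque", "passe"]
PRONOUNS = ["elle", "il", "on"]
ELIDED = ["qu'", "n'"]


def build_exceptions():
    """Return {surface form: list of token strings} for the generated cases."""
    exceptions = {}
    # verb + "-t" + "-pronoun", e.g. "a-t-il" -> "a", "-t", "-il"
    for verb in VERBS:
        for pronoun in PRONOUNS:
            orth = "{}-t-{}".format(verb, pronoun)
            exceptions[orth] = [verb, "-t", "-" + pronoun]
    # "est-ce", optionally preceded by an elided word
    exceptions["est-ce"] = ["est", "-ce"]
    for elided in ELIDED:
        exceptions[elided + "est-ce"] = [elided, "est", "-ce"]
    return exceptions


EXCEPTIONS = build_exceptions()
print(len(EXCEPTIONS), EXCEPTIONS["a-t-il"], EXCEPTIONS["qu'est-ce"])
```

In every entry the token strings concatenate back to the surface form, which is the invariant a tokenizer exception has to respect.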


Notes - suggested improvements :

- For ' and - characters, we should generate all variants of the exceptions for HYPHENS chars and ELISION chars
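
One possible way to implement this suggestion, sketched with the standard library: every elision or hyphen character in an exception string is replaced in turn by each member of its class. `variants` is a hypothetical helper, and the two character lists are the French ELISION and HYPHENS sets described earlier.

```python
from itertools import product

ELISION = ["'", "\u2019"]
HYPHENS = ["-", "\u2013", "\u2014", "\u2010", "\u2011"]


def variants(orth):
    """Return every spelling of orth obtained by swapping elision/hyphen chars."""
    choices = []
    for ch in orth:
        if ch in ELISION:
            choices.append(ELISION)
        elif ch in HYPHENS:
            choices.append(HYPHENS)
        else:
            choices.append([ch])
    return {"".join(combo) for combo in product(*choices)}


print(sorted(variants("aujourd'hui")))
print(len(variants("qu'est-ce")))  # 2 elision chars x 5 hyphen chars
```

The number of variants grows multiplicatively with the number of such characters in the string, so this is cheap for short exceptions like the ones listed here.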