In [2]:
import itertools
import spacy
from spacy import displacy

# Language (Model)

Each spaCy `Language` class defines its own default tokenization rules (prefixes, infixes, suffixes), which are exposed through `nlp.Defaults`.

In [5]:
nlp = spacy.load("en_core_web_sm")

# Prefix, Infix, Suffix

The tokenizer splits prefix, infix, and suffix characters off the text as separate tokens. The rules are defined in the defaults of the language model.

* [Tokenization](https://spacy.io/usage/linguistic-features#tokenization)

> During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. 
> 
> The prefixes, suffixes and infixes mostly define punctuation rules – for example, when to split off periods (at the end of a sentence), and when to leave tokens containing periods intact (abbreviations like “U.S.”).
> 
> * Prefix: Character(s) at the beginning, e.g. ```$, (, “, ¿```
> * Infix: Character(s) in between, e.g. ```-, --, /, …```
> * Suffix: Character(s) at the end, e.g. ```km, ), ”, !```
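To make the three rule types concrete, here is a minimal sketch of an affix-splitting loop in plain Python. It is not spaCy's implementation: the three regexes are hypothetical, heavily reduced stand-ins for the real default rule sets, and `toy_tokenize` is a made-up helper name. The idea it illustrates is only the order of operations: split on whitespace, peel prefixes and suffixes off each chunk, then split what remains around infixes.

```python
import re

# Hypothetical, tiny rule sets for illustration; spaCy's defaults are far larger.
PREFIX_RE = re.compile(r'^[#("$]')
SUFFIX_RE = re.compile(r'(?:(?<=[0-9])km|\.\.+|[)!"])$')
INFIX_RE = re.compile(r'(?<=[0-9])-(?=[0-9])')


def toy_tokenize(text):
    """Split on whitespace, peel prefixes/suffixes, then split on infixes."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel affixes from both ends until neither rule matches any more.
        while chunk:
            match = PREFIX_RE.search(chunk)
            if match:
                prefixes.append(match.group())
                chunk = chunk[match.end():]
                continue
            match = SUFFIX_RE.search(chunk)
            if match:
                suffixes.insert(0, match.group())
                chunk = chunk[:match.start()]
                continue
            break
        # Split the remainder around infix matches, keeping the infix itself.
        middle, start = [], 0
        for match in INFIX_RE.finditer(chunk):
            middle.append(chunk[start:match.start()])
            middle.append(match.group())
            start = match.end()
        middle.append(chunk[start:])
        tokens.extend(prefixes + [m for m in middle if m] + suffixes)
    return tokens


print(toy_tokenize("#cool 100-150km (ok) run omg... !"))
```

In this sketch `km` is only split off when it follows a digit (the lookbehind), which mirrors why a unit after a number becomes its own token while a word that merely ends in "km" would stay intact.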

## Prefixes

In [6]:
list(itertools.islice(nlp.Defaults.prefixes, 10))

['§', '%', '=', '—', '–', '\\+(?![0-9])', '…', '……', ',', ':']

## Infixes

In [22]:
nlp.Defaults.infixes[0]

'\\.\\.+'

## Suffixes

In [24]:
list(itertools.islice(nlp.Defaults.suffixes, 10))

['…', '……', ',', ':', ';', '\\!', '\\?', '¿', '؟', '¡']

In [28]:
doc = nlp("#cool 100-150km (😳) run omg... !")
for token in doc:
    print(token.text)

#
cool
100
-
150
km
(
😳
)
run
omg
...
!


## Example

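* ```#``` and ```(``` are split off by the prefix rules.
* ```-``` (between ```100``` and ```150```) is split out by the infix rules.
* ```km```, ```)```, and ```...``` (at the end of ```omg...```) are split off by the suffix rules.
* ```!``` is already separated from ```omg...``` by whitespace.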
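One way to check which rule produced each token is `nlp.tokenizer.explain`, available in recent spaCy versions, which returns `(rule, text)` pairs. The sketch below uses `spacy.blank("en")` instead of `en_core_web_sm` so that it runs without a model download; that this blank pipeline carries the same default tokenizer rules is an assumption here, not something the notebook verifies.

```python
import spacy

# Assumption: a blank English pipeline uses the same default tokenizer rules
# as en_core_web_sm, so no trained model is needed for this check.
nlp = spacy.blank("en")

text = "#cool 100-150km (😳) run omg... !"

# Each entry is (rule label, token text); labels include PREFIX, SUFFIX,
# INFIX, and TOKEN, plus labels for special cases.
explained = nlp.tokenizer.explain(text)
for rule, piece in explained:
    print(f"{rule:<10} {piece}")
```

The labels reflect the first rule that matched during splitting, so a character that appears in more than one rule set may be reported under a different label than its position in the sentence suggests.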