Tokenization using spaCy

When working with natural language processing (NLP), understanding how text is broken down is fundamental. At the heart of this process are tokens: the words, numbers, and punctuation marks that a text is split into before any further analysis. In this post, we’ll explore what tokens are, why they matter, and how you can work with them in spaCy, one of the most popular NLP libraries.

spaCy makes tokenization simple and efficient: load a pipeline, pass it a string, and iterate over the tokens of the resulting Doc.

In [1]:
import spacy
from spacy.symbols import ORTH

In [5]:
nlp = spacy.load("en_core_web_sm")

text = '''Your data text goes here. It can be multiple sentences, e.g.:
Apple is looking at buying "U.K." startup for $1 billion!'''
doc = nlp(text)

print("\n======= Tokens =======")




In [6]:
for token in doc:
    print(token.text)


Your
data
text
goes
here
.
It
can
be
multiple
sentences
,
e.g.
:


Apple
is
looking
at
buying
"
U.K.
"
startup
for
$
1
billion
!


In [7]:
print("\n======= Tokenization explanation =======")
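# explain() returns (rule label, token text) pairs that show which
# tokenizer rule produced each token
tok_exp = nlp.tokenizer.explain(text)
for rule, token_text in tok_exp:
    print(token_text, "\t", rule)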



Your 	 TOKEN
data 	 TOKEN
text 	 TOKEN
goes 	 TOKEN
here 	 TOKEN
. 	 SUFFIX
It 	 TOKEN
can 	 TOKEN
be 	 TOKEN
multiple 	 TOKEN
sentences 	 TOKEN
, 	 SUFFIX
e.g. 	 SPECIAL-1
: 	 SUFFIX
Apple 	 TOKEN
is 	 TOKEN
looking 	 TOKEN
at 	 TOKEN
buying 	 TOKEN
" 	 PREFIX
U.K. 	 TOKEN
" 	 SUFFIX
startup 	 TOKEN
for 	 TOKEN
$ 	 PREFIX
1 	 TOKEN
billion 	 TOKEN
! 	 SUFFIX
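
To make the labels above more concrete, here is a minimal pure-Python sketch of the idea behind them: split on whitespace, then repeatedly check each chunk for a special case, a prefix, or a suffix until only a plain token is left. This is not spaCy's implementation. The regexes, the `SPECIAL_CASES` table, and the function name `explain_tokens` are assumptions made up for illustration, and spaCy's real rule set is far larger (it also handles infixes and URLs, among other things).

```python
import re

# Hypothetical, heavily simplified rules (assumptions for this sketch only).
PREFIX_RE = re.compile(r'^[\$"(\[]')
SUFFIX_RE = re.compile(r'[.,:;!?")\]]$')
SPECIAL_CASES = {"e.g.": ["e.g."], "gonna": ["gon", "na"]}


def explain_tokens(text):
    """Return (label, token_text) pairs, similar in shape to tokenizer.explain()."""
    result = []
    for chunk in text.split():
        suffixes = []  # collected from the right edge inward, reversed at the end
        while chunk:
            # Special cases are checked first, so they win over punctuation rules.
            if chunk in SPECIAL_CASES:
                result.extend(("SPECIAL", part) for part in SPECIAL_CASES[chunk])
                break
            prefix = PREFIX_RE.search(chunk)
            if prefix:
                result.append(("PREFIX", prefix.group()))
                chunk = chunk[prefix.end():]
                continue
            suffix = SUFFIX_RE.search(chunk)
            if suffix:
                suffixes.append(("SUFFIX", suffix.group()))
                chunk = chunk[:suffix.start()]
                continue
            # Nothing left to strip: the remainder is an ordinary token.
            result.append(("TOKEN", chunk))
            break
        result.extend(reversed(suffixes))
    return result


for label, token_text in explain_tokens("sentences, e.g.: $1 billion!"):
    print(token_text, "\t", label)
```

Because the special-case lookup runs again after every prefix or suffix is stripped, a chunk such as `e.g.:` first loses its trailing colon and is then matched as a whole, which is the same pattern visible in the `SPECIAL-1` and `SUFFIX` rows of the spaCy output.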


In [8]:
print("\n======= Tokens information =======")





In [9]:
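# Per-token attributes: the lemma and part of speech come from the loaded
# pipeline, while is_alpha and is_stop are lexical flags
for token in doc:
    print(
        f"token: {token.text},    "
        f"lemmatization: {token.lemma_},    "
        f"pos: {token.pos_},    "
        f"is_alpha: {token.is_alpha},    "
        f"is_stopword: {token.is_stop}"
    )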

print("\n======= Customization =======")

token: Your,    lemmatization: your,    pos: PRON,    is_alpha: True,    is_stopword: True
token: data,    lemmatization: data,    pos: NOUN,    is_alpha: True,    is_stopword: False
token: text,    lemmatization: text,    pos: NOUN,    is_alpha: True,    is_stopword: False
token: goes,    lemmatization: go,    pos: VERB,    is_alpha: True,    is_stopword: False
token: here,    lemmatization: here,    pos: ADV,    is_alpha: True,    is_stopword: True
token: .,    lemmatization: .,    pos: PUNCT,    is_alpha: False,    is_stopword: False
token: It,    lemmatization: it,    pos: PRON,    is_alpha: True,    is_stopword: True
token: can,    lemmatization: can,    pos: AUX,    is_alpha: True,    is_stopword: True
token: be,    lemmatization: be,    pos: AUX,    is_alpha: True,    is_stopword: True
token: multiple,    lemmatization: multiple,    pos: ADJ,    is_alpha: True,    is_stopword: False
token: sentences,    lemmatization: sentence,    pos: NOUN,    is_alpha: True,    is_stopword: Fa

In [10]:
# Customization: first check the default tokenization, before adding any rule
text = "Gonna to the Beach"
doc = nlp(text)
print([w.text for w in doc])

['Gon', 'na', 'to', 'the', 'Beach']


In [11]:
# Add special case rule
special_case = [{ORTH: "Gon"}, {ORTH: "na"}]
nlp.tokenizer.add_special_case("Gonna", special_case)

In [None]:
# Check the tokenization after adding the rule. The result is unchanged here,
# because the default English rules already split "Gonna" into "Gon" and "na".
print([w.text for w in nlp("Gonna to the Beach")])

['Gon', 'na', 'to', 'the', 'Beach']


In [None]:
# Special-case rules take precedence over punctuation splitting
doc = nlp("....Gonna now !!!! there ** </>")  # phrase to tokenize
print([w.text for w in doc])

['....', 'Gon', 'na', 'now', '!', '!', '!', '!', 'there', '*', '*', '<', '/', '>']
