# Tokenization

Tokenization is the process of breaking raw text down into its component pieces, or tokens. Tokens carry an identifiable meaning; they are often words, but they can also be spaces, punctuation marks, negation particles, and so on, because all of those have an identifiable meaning too.

A figure from [Spacy: Linguistic Features](https://spacy.io/usage/linguistic-features) gives a good example:

![Tokenization (from the Spacy website)](../pics/tokenization.png)

Note that tokenization does not change the text: tokens are pieces of the original text, and tokenization only splits it into smaller units. Splitting occurs wherever one of these elements is found:
- White space: ` `
- Prefixes: characters at the beginning: `" $ (`
- Suffixes: characters at the end: `km ) !`
- Infixes: characters in-between: `/ -`
- Exceptions: tokens are split or prevented from splitting depending on the case: `let's`, `U.S.`

However, punctuation and similar symbols that belong to email addresses, URLs and the like are kept as part of the token.
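
The splitting rules listed above can be sketched in plain Python. This is only a toy illustration of the idea (split on white space, then peel off prefixes and suffixes, then split on infixes, unless an exception applies); it is not spaCy's actual implementation. The function names `toy_tokenize` and `split_chunk` and the tiny rule sets are assumptions made for this example.

```python
import re

# Illustrative rule sets only (assumed for this sketch); spaCy's real
# rules are much larger and language-specific.
PREFIXES = ('"', '$', '(')
SUFFIXES = ('km', ')', '!', '"', ',', '.')
INFIX_RE = re.compile(r'([/-])')
EXCEPTIONS = {
    "We're": ["We", "'re"],   # split into two tokens
    "Let's": ["Let", "'s"],
    "U.S.": ["U.S."],         # protected from splitting
    "L.A.": ["L.A."],
}


def split_chunk(chunk):
    """Split one white-space-delimited chunk into tokens."""
    leading, trailing = [], []
    # Peel prefixes and suffixes until nothing matches or an exception is hit.
    while chunk and chunk not in EXCEPTIONS:
        prefix = next((p for p in PREFIXES if chunk.startswith(p)), None)
        if prefix:
            leading.append(prefix)
            chunk = chunk[len(prefix):]
            continue
        suffix = next((s for s in SUFFIXES if chunk.endswith(s)), None)
        if suffix:
            trailing.insert(0, suffix)
            chunk = chunk[:-len(suffix)]
            continue
        break
    if chunk in EXCEPTIONS:
        middle = list(EXCEPTIONS[chunk])
    else:
        # Split on infix characters and keep them as separate tokens.
        middle = [piece for piece in INFIX_RE.split(chunk) if piece]
    return leading + middle + trailing


def toy_tokenize(text):
    """Tokenize text with the toy rules above."""
    tokens = []
    for chunk in text.split():  # white space is the first split point
        tokens.extend(split_chunk(chunk))
    return tokens


print(toy_tokenize('"We\'re moving to L.A.!"'))
print(toy_tokenize('A 5km NYC cab ride costs $10.30'))
```

The sketch has no special handling for email addresses or URLs; a real tokenizer needs extra rules to keep those in one piece.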

Overview of contents:
1. Tokenization Examples
2. Accessing and Handling Tokens in a `Doc`
3. Named Entities
4. Noun Chunks (Noun Phrases)
5. Visualizers: Syntactic Dependencies & Entities

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided through a download link; I haven't found a repository to fork from.*

## 1. Tokenization Examples

In [34]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

In [35]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [36]:
# Create a Doc object and explore tokens
doc = nlp(mystring)
for token in doc:
    print(token.text, end=' | ')

" | We | 're | moving | to | L.A. | ! | " | 

In [37]:
# Punctuation and similar symbols that belong to email addresses or URLs
# are kept as part of the token
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

In [38]:
for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


In [47]:
# The dollar sign and the km unit are split from their amounts, and NYC is its own token;
# the decimal point inside a number is not split
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

In [40]:
for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


In [48]:
# Punctuation that belongs to place names and abbreviations is preserved
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

In [50]:
for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


## 2. Accessing and Handling Tokens in a `Doc`

In [52]:
doc5 = nlp(u'It is better to give than to receive.')

In [53]:
# Number of tokens in a Doc
len(doc5)

9

In [54]:
# Retrieve the third token
doc5[2]

better

In [55]:
# Retrieve three tokens from the middle
doc5[2:5]

better to give

In [56]:
# Retrieve the last four tokens:

doc5[-4:]

than to receive.

In [59]:
# Tokens in a Doc cannot be reassigned
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')

In [58]:
# Try to change "My dinner was horrible" to "My dinner was delicious"
doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
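
Since a `Doc` does not support item assignment, one possible workaround is to build a new string from the token texts and run it through the pipeline again. The snippet below is a minimal sketch of that idea, not an official recipe. It uses `spacy.blank("en")` (a tokenizer-only pipeline) so that no trained model is needed; that choice and the name `new_doc` are assumptions made for this example.

```python
import spacy

# Assumption for this sketch: a blank English pipeline (tokenizer only),
# so the example does not depend on a downloaded model.
nlp = spacy.blank("en")
doc6 = nlp("My dinner was horrible.")

# Swap the text of the token at index 3 and keep every token's
# original trailing white space, so the sentence is rebuilt faithfully.
pieces = [
    ("delicious" if token.i == 3 else token.text) + token.whitespace_
    for token in doc6
]
new_doc = nlp("".join(pieces))

print(doc6.text)     # the original Doc is left untouched
print(new_doc.text)
```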

## 3. Named Entities

[Named Entities](https://spacy.io/usage/linguistic-features#named-entities) are real-world objects that have a name, such as a person, a company, a place or a monetary amount. They are often proper nouns, and the context decides which label a span receives.

In [60]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

In [63]:
for token in doc8:
    print(token.text, end=' | ')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [65]:
# Named entities can be accessed through .ents
# Named entities are: companies, place names, money amounts, etc.
for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


## 4. Noun Chunks (Noun Phrases)

[Noun chunks](https://spacy.io/usage/linguistic-features) are groups of words that denote a concept together with its properties. The head word is a noun, and the remaining words describe that noun (in English, the descriptors are often other nouns). In Spanish, as far as I know, noun chunks are called *sintagmas nominales*.

In [66]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

In [67]:
# Access all noun chunks
for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [68]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

In [69]:
for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


In [72]:
# Note the sophistication when detecting that long noun chunk!
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

In [73]:
for chunk in doc11.noun_chunks:
    print(chunk.text)

He
a one-eyed, one-horned, flying, purple people-eater


## 5. Visualizers: Syntactic Dependencies & Entities

In [74]:
from spacy import displacy

In [75]:
doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')

In [81]:
# Syntactic dependency display of the Doc
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [79]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')

In [82]:
# Entity display
displacy.render(doc, style='ent', jupyter=True)
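
The calls above use `jupyter=True`, which only makes sense inside a notebook. Outside one, `displacy.render` can return the markup as a string instead. The snippet below is a small sketch of that usage. To avoid depending on a trained model, it uses a blank pipeline and assigns one entity by hand; that hand-made `ORG` span is an assumption for illustration only, whereas a trained pipeline would predict the entities itself.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption for this sketch: blank pipeline plus one hand-assigned entity,
# so that no trained model is required.
nlp = spacy.blank("en")
doc = nlp("Apple is going to build a U.K. factory for $6 million.")
doc.ents = [Span(doc, 0, 1, label="ORG")]  # mark "Apple" as ORG by hand

# jupyter=False makes render() return the HTML markup as a string;
# page=True wraps it in a complete HTML page.
html = displacy.render(doc, style="ent", jupyter=False, page=True)
print(type(html).__name__, len(html))
```

The returned string can then be written to an `.html` file and opened in a browser.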