# Tokenization

The first step in creating a `Doc` object is to break down the incoming text into component pieces or "tokens".

Tokens are the basic building blocks of a `Doc` object: everything that helps us understand the meaning of the text is derived from tokens and their relationships to one another.

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [3]:
doc = nlp(mystring)

for token in doc:
  print(token.text, end=' | ')

" | We | 're | moving | to | L.A. | ! | " | 

spaCy first splits the text on whitespace, then checks each resulting chunk against four kinds of rules:

-  **Prefix**: Character(s) at the beginning ▸ `$ ( “ ¿`
-  **Suffix**: Character(s) at the end ▸ `km ) , . ! ”`
-  **Infix**: Character(s) in between ▸ `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied ▸ `St. U.S.`
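
The way these four rule types interact can be sketched in plain Python. This is only a toy illustration and not spaCy's actual implementation: the `PREFIXES`, `SUFFIXES`, `INFIXES` and `EXCEPTIONS` tables and the `toy_tokenize` function are hypothetical names invented for this sketch, and spaCy's real rule sets are far larger.

```python
import re

# Hypothetical, heavily simplified rule tables (assumptions made for this sketch).
PREFIXES = ('$', '(', '"', '¿')
SUFFIXES = ('km', ')', ',', '.', '!', '"')
INFIXES = re.compile(r'(--|-|/)')
EXCEPTIONS = {"St.": ["St."], "U.S.": ["U.S."], "L.A.": ["L.A."], "We're": ["We", "'re"]}


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():          # 1. split on whitespace
        trailing = []                   # suffixes peeled off so far, kept in text order
        while chunk:
            if chunk in EXCEPTIONS:     # 2. special cases win over punctuation rules
                tokens.extend(EXCEPTIONS[chunk])
                chunk = ''
                break
            prefix = next((p for p in PREFIXES if chunk.startswith(p)), None)
            if prefix:                  # 3. peel one prefix off the front
                tokens.append(prefix)
                chunk = chunk[len(prefix):]
                continue
            suffix = next((s for s in SUFFIXES if chunk.endswith(s)), None)
            if suffix:                  # 4. peel one suffix off the end
                trailing.insert(0, suffix)
                chunk = chunk[:-len(suffix)]
                continue
            break
        if chunk:                       # 5. split whatever is left on infixes
            tokens.extend(part for part in INFIXES.split(chunk) if part)
        tokens.extend(trailing)
    return tokens


print(toy_tokenize('"We\'re moving to L.A.!"'))
```

On the example string, the sketch peels the opening quotation mark off as a prefix, resolves `We're` and `L.A.` through the exception table, and peels `!` and the closing quotation mark off as suffixes.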

Notice that tokens are pieces of the original text. 

That is, we don't see any conversion to word stems or lemmas (base forms of words), and we haven't seen anything about organizations, places, money, etc. **Tokens are the basic building blocks of a `Doc` object: everything that helps us understand the meaning of the text is derived from tokens and their relationships to one another.**

# Prefixes, Suffixes and Infixes

spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

In [4]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
  print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


In [6]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:
  print(t)

A
5
km
NYC
cab
ride
costs
$
10.30
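
Tokens that keep their internal punctuation can also be identified programmatically. The sketch below uses the `like_email`, `like_url` and `like_num` token attributes. It assumes a blank English pipeline (`spacy.blank('en')`), which carries the tokenizer rules but no trained components; the name `nlp_blank` and the sample sentence are made up for this example.

```python
import spacy

# A blank pipeline is enough here because only the rule-based tokenizer is needed.
nlp_blank = spacy.blank('en')

doc = nlp_blank("Email support@oursite.com or visit http://www.oursite.com for $10.30!")

for token in doc:
    # Each flag is a boolean computed from the token's text alone
    print(f"{token.text:<25} email={token.like_email} url={token.like_url} num={token.like_num}")
```

These flags come from the token text itself, so they do not require a statistical model.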


# Exceptions

Punctuation that exists as part of a known abbreviation will be kept as part of the token.

In [7]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
  print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.
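
If an abbreviation or slang form is not covered by the built-in exceptions, a special case can be registered on the tokenizer with `add_special_case`. The following is a minimal sketch, assuming a blank English pipeline; `Bldg.` and `gimme` are example strings chosen for illustration, and `nlp_blank` is a hypothetical variable name.

```python
import spacy
from spacy.symbols import ORTH

nlp_blank = spacy.blank('en')

# Keep "Bldg." together as a single token, including its period
nlp_blank.tokenizer.add_special_case("Bldg.", [{ORTH: "Bldg."}])

# Split "gimme" into two tokens; the pieces must add up to the original string
nlp_blank.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

kept = [t.text for t in nlp_blank("Meet me at Bldg. 7")]
split = [t.text for t in nlp_blank("gimme that")]
print(kept)
print(split)
```

A special case only applies when the whole whitespace-delimited chunk, after any prefixes and suffixes are peeled off, matches the registered string exactly.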


# Counting Tokens

`len()` on a `Doc` returns the number of tokens it contains:

In [25]:
len(doc4)

11

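# Counting Vocab Entries

`len()` on a `Vocab` returns the number of entries it currently holds. The exact figure depends on the model version and on how much text has been processed so far, so your number will likely differ: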

In [26]:
len(doc4.vocab)

841

# Tokens can be retrieved by index position and slice

`Doc` objects can be thought of as lists of `Token` objects. As such, individual tokens can be retrieved by index position, and spans of tokens can be retrieved through slicing:

In [10]:
doc5 = nlp(u'It is better to give than to receive.')

# Retrieve the third token:
doc5[2]

better

In [11]:
# Retrieve three tokens from the middle:
doc5[2:5]

better to give

In [12]:
# Retrieve the last four tokens:
doc5[-4:]

than to receive.
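
Indexing and slicing return different object types, which the following sketch makes explicit. It assumes a blank English pipeline, which is sufficient because only tokenization is involved; the variable names are made up for this example.

```python
import spacy
from spacy.tokens import Span, Token

nlp_blank = spacy.blank('en')
doc = nlp_blank('It is better to give than to receive.')

single = doc[2]      # indexing returns a Token
several = doc[2:5]   # slicing returns a Span, a view onto the Doc

print(type(single).__name__, '->', single.text)
print(type(several).__name__, '->', several.text)

# start and end are the token offsets of the span within the Doc
print(several.start, several.end)
```

Because a `Span` is a view rather than a copy, it stays tied to the `Doc` it was sliced from.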

# Tokens cannot be reassigned

Although `Doc` objects can be considered lists of tokens, they do not support item reassignment:

In [13]:
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')

In [14]:
# Try to change "My dinner was horrible" to "My dinner was delicious"
doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
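
One way around this restriction is to build a new `Doc` from an edited word list instead of mutating the existing one. The sketch below is one possible approach under stated assumptions, not the only one: it uses a blank English pipeline, the variable names are hypothetical, and the new `Doc` carries no annotations (tags, entities, etc.) from the original.

```python
import spacy
from spacy.tokens import Doc

nlp_blank = spacy.blank('en')
doc_a = nlp_blank('My dinner was horrible.')
doc_b = nlp_blank('Your dinner was delicious.')

# Item assignment is rejected, so catch the error to show it
try:
    doc_a[3] = doc_b[3]
except TypeError as err:
    print('TypeError:', err)

# Copy the token texts, swap one word, and record which tokens are followed by a space
words = [t.text for t in doc_a]
words[3] = doc_b[3].text
spaces = [bool(t.whitespace_) for t in doc_a]

doc_c = Doc(nlp_blank.vocab, words=words, spaces=spaces)
print(doc_c.text)
```

To get annotations on the edited text, it is usually simpler to pass the new string through the pipeline again.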

# Named Entities

Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [15]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
  print(token.text, end=' | ')

print('\n----')

for ent in doc8.ents:
  print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 
----
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [16]:
len(doc8.ents)

3

# Noun Chunks

Similar to `Doc.ents`, `Doc.noun_chunks` is another property of the `Doc` object. Noun chunks are "base noun phrases": flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing it.

In [17]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
  print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [19]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
  print(chunk.text)

Red cars
higher insurance rates


In [20]:
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
    print(chunk.text)

He
a one-eyed, one-horned, flying, purple people-eater


# Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**.

## Visualizing the dependency parse

In [23]:
from spacy import displacy

doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})

## Visualizing the entity recognizer

In [24]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)