In [17]:
import os 
path = 'C:/Users/OYO/NLP/Course_Material'
os.chdir(path)

### Tokenization is the process of breaking raw text into its component pieces (tokens).

In [27]:
import spacy
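# Load the English model by its full package name; the 'en' shortcut is not
# available in spaCy v3+. Assumes the small model, en_core_web_sm, is installed.
nlp = spacy.load('en_core_web_sm')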

In [14]:
# Create a string that includes opening and closing quotation marks

mystring = '"We\'re moving to L.A.!"'
print(mystring)
doc = nlp(mystring)


for token in doc:
    print(token.text, end='|')

"We're moving to L.A.!"
"|We|'re|moving|to|L.A.|!|"|

**Tokens are the basic building blocks of a Doc object. Everything that helps us understand the meaning of the text is derived from tokens and their relationships to one another.**

## Prefixes, Suffixes and Infixes
spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

- **Prefix** : Character(s) at the beginning &#9656; `$ ( “ ¿`
- **Suffix** : Character(s) at the end &#9656; `km ) , . ! ”` 
- **Infix**  : Character(s) in between &#9656; `- -- / ...`
- **Exception** : Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`
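
The prefix, suffix, and exception rules above can be sketched in plain Python. This is a simplified illustration, not spaCy's actual algorithm: the rule sets and the `toy_tokenize` helper are hypothetical, and infixes and contractions such as `We're` are deliberately left unhandled.

```python
# Toy rule sets chosen only for illustration; spaCy's real rules are far larger.
PREFIXES = ('$', '(', '"', '¿')
SUFFIXES = ('km', ')', ',', '.', '!', '"')
EXCEPTIONS = {'St.', 'U.S.', 'L.A.'}


def toy_tokenize(text):
    """Split on whitespace, then peel prefixes and suffixes off each chunk."""
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        # Keep peeling until the chunk is a known exception
        # or no prefix/suffix rule applies any more.
        while chunk and chunk not in EXCEPTIONS:
            prefix = next((p for p in PREFIXES if chunk.startswith(p)), None)
            suffix = next((s for s in SUFFIXES if chunk.endswith(s)), None)
            if prefix and len(chunk) > len(prefix):
                leading.append(prefix)
                chunk = chunk[len(prefix):]
            elif suffix and len(chunk) > len(suffix):
                trailing.append(suffix)
                chunk = chunk[:-len(suffix)]
            else:
                break
        tokens.extend(leading)
        tokens.append(chunk)
        # Suffixes were collected from the outside in, so restore their order.
        tokens.extend(reversed(trailing))
    return tokens


print(toy_tokenize('A 5km NYC cab ride costs $10.30'))
```

On the sample sentences used in this notebook, the sketch separates `km`, `$`, and sentence-final punctuation while keeping `St.`, `U.S.`, and `10.30` intact.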

In [21]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for token in doc2:
    print(token.text, token.pos_)

We PRON
're AUX
here ADV
to PART
help VERB
! PUNCT
Send VERB
snail NOUN
- PUNCT
mail NOUN
, PUNCT
email NOUN
support@oursite.com X
or CCONJ
visit VERB
us PRON
at ADP
http://www.oursite.com X
! PUNCT


<font color=green>Note that the exclamation points, comma, and the hyphen in 'snail-mail' are assigned their own tokens, yet both the email address and website are preserved.</font>

In [22]:
doc3 = nlp(u"A 5km NYC cab ride costs $10.30")

for t in doc3:
    print(t.text, t.pos_)

A DET
5 NUM
km NOUN
NYC PROPN
cab NOUN
ride NOUN
costs VERB
$ SYM
10.30 NUM


<font color=green>Here the distance unit and dollar sign are assigned their own tokens, yet the dollar amount is preserved.</font>

### Exceptions
Punctuation that exists as part of a known abbreviation will be kept as part of the token.

In [25]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
    print(t.text,t.pos_)
print(len(doc4))

Let VERB
's PRON
visit VERB
St. PROPN
Louis PROPN
in ADP
the DET
U.S. PROPN
next ADJ
year NOUN
. PUNCT
11


<font color=green>Here the abbreviations for "Saint" and "United States" are both preserved.</font>

In [32]:
# Check the vocab object
len(doc4.vocab)

512

In [33]:
doc5 = nlp(u"It is better to give than to receive.")

In [34]:
doc5[0]

It

In [35]:
doc[2:5]

're moving to

### A Doc object does not support item assignment.

In [36]:
doc5[0] = 'Text'

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
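
Because a `Doc` cannot be modified in place, one workaround is to copy the token texts into an ordinary Python list and edit that instead. The sketch below assumes a tokenizer-only pipeline created with `spacy.blank('en')`; the names `blank_nlp`, `demo_doc`, and `words` are illustrative and separate from the objects used elsewhere in this notebook.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
blank_nlp = spacy.blank('en')
demo_doc = blank_nlp("It is better to give than to receive.")

# Copy the token texts into a plain list, which can be edited freely.
words = [token.text for token in demo_doc]
words[0] = 'Text'

print(words)
print(demo_doc[0].text)  # the original Doc is unchanged
```

The list is only a copy of the strings, so linguistic annotations stay with the original `Doc`.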

___
# Named Entities
Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [39]:
doc6 = nlp(u"Apple to build a Hong Kong factory for $6 million")

for token in doc6:
    print(token.text, end = " | ")

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [45]:
for entity in doc6.ents:
    print(entity)
    print(entity.label_)
    print(str(spacy.explain(entity.label_)))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


Hong Kong
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




<font color=green>Note how two tokens combine to form the entity `Hong Kong`, and three tokens combine to form the monetary entity:  `$6 million`</font>

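The way entities sit on top of tokens can be sketched without calling spaCy at all. In the illustration below the token list is copied from the tokenizer output above, and the `(start, end, label)` triples are written by hand to mirror the three entities; spaCy's own `Span` objects expose comparable `start` and `end` token indices.

```python
# Tokens copied from the tokenizer output shown above for this sentence.
tokens = ['Apple', 'to', 'build', 'a', 'Hong', 'Kong',
          'factory', 'for', '$', '6', 'million']

# Hand-written (start, end, label) triples; end is exclusive, as in a Python slice.
entity_spans = [(0, 1, 'ORG'), (4, 6, 'GPE'), (8, 11, 'MONEY')]

# Joining with spaces is a simplification: spaCy keeps the original
# whitespace, so it reports '$6 million' rather than '$ 6 million'.
summary = [(' '.join(tokens[start:end]), end - start, label)
           for start, end, label in entity_spans]

for text, n_tokens, label in summary:
    print(f'{text!r}: {n_tokens} token(s), label {label}')
```
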
---
# Noun Chunks
Like `Doc.ents`, `Doc.noun_chunks` is another property of a `Doc` object. *Noun chunks* are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in [Sheb Wooley's 1958 song](https://en.wikipedia.org/wiki/The_Purple_People_Eater), a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk.

In [47]:
doc7 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc7.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


___
# Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.


In [48]:
from spacy import displacy

## Visualizing the dependency parse

In [49]:
doc = nlp(u"Apple is going to build a U.K. factory for $6 million.")

In [54]:
displacy.render(doc,style='dep',jupyter=True, options={'distance':80})

## Visualizing the entity parse

In [55]:
doc1 = nlp(u"Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.")

In [56]:
displacy.render(doc1, style='ent',jupyter=True)

## Visualizing outside of Jupyter

In [58]:
doc2 = nlp(u"This is a sentence.")
displacy.serve(doc2,style='dep')

Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [59]:
# While the server is running, open http://127.0.0.1:5000 in a new browser tab
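
As an alternative to running a server, `displacy.render` can return the markup as a string when `jupyter=False`, and that string can be written to a file. The sketch below is a minimal example under stated assumptions: it uses displaCy's manual mode with a hand-written parse for "This is a sentence" (the tags and arcs are illustrative, not model output), and `dep_parse.html` is a hypothetical file name.

```python
from pathlib import Path

from spacy import displacy

# Hand-written parse in displaCy's manual input format (illustrative values only).
manual_parse = {
    'words': [
        {'text': 'This', 'tag': 'DT'},
        {'text': 'is', 'tag': 'VBZ'},
        {'text': 'a', 'tag': 'DT'},
        {'text': 'sentence', 'tag': 'NN'},
    ],
    'arcs': [
        {'start': 0, 'end': 1, 'label': 'nsubj', 'dir': 'left'},
        {'start': 2, 'end': 3, 'label': 'det', 'dir': 'left'},
        {'start': 1, 'end': 3, 'label': 'attr', 'dir': 'right'},
    ],
}

# page=True wraps the SVG in a full HTML page; jupyter=False returns a string.
html = displacy.render(manual_parse, style='dep', manual=True,
                       page=True, jupyter=False)

# Hypothetical output file; open it in any browser to view the parse.
output_path = Path('dep_parse.html')
output_path.write_text(html, encoding='utf-8')
```

With a loaded model, passing a `Doc` such as `doc2` in place of `manual_parse` (and dropping `manual=True`) should produce the same kind of file from a real parse.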