
## Import spaCy and load the language library

In [0]:
import spacy
nlp = spacy.load('en_core_web_sm')


## spaCy Objects

After importing the spacy module in the cell above, we loaded a model and named it nlp.

Next we create a Doc object by applying the model to our text, and name it doc.

## Create a Document Object

In [0]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [0]:
for token in doc:
  print(token.text,'---',token.pos_,'---',token.dep_)

Tesla --- PROPN --- nsubj
is --- VERB --- aux
looking --- VERB --- ROOT
at --- ADP --- prep
buying --- VERB --- pcomp
U.S. --- PROPN --- compound
startup --- NOUN --- dobj
for --- ADP --- prep
$ --- SYM --- quantmod
6 --- NUM --- compound
million --- NUM --- pobj


## Pipeline

When we run nlp, our text enters a processing pipeline that first breaks the text down into tokens and then performs a series of operations:
- tag (Tagger)
- parse (DependencyParser)
- describe the data (EntityRecognizer, i.e. named entity recognition)

In [0]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fd546028588>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fd5434e5b88>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fd5434e5be8>)]
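
The idea of a pipeline as an ordered list of named components can be sketched in plain Python. This is a toy stand-in, not spaCy's implementation: the functions `toy_tokenizer`, `toy_tagger`, `toy_ner` and `run_pipeline`, the dictionary used in place of a Doc, and the capitalization rule are all hypothetical.

```python
def toy_tokenizer(text):
    # Stand-in for the tokenizer: split on whitespace only.
    return {"text": text, "tokens": text.split()}

def toy_tagger(doc):
    # Hypothetical rule: capitalized tokens are PROPN, everything else is X.
    doc["tags"] = ["PROPN" if t[0].isupper() else "X" for t in doc["tokens"]]
    return doc

def toy_ner(doc):
    # Hypothetical rule: every PROPN token counts as an entity.
    doc["ents"] = [t for t, tag in zip(doc["tokens"], doc["tags"]) if tag == "PROPN"]
    return doc

# Same shape as nlp.pipeline: a list of (name, component) pairs.
toy_pipeline = [("tagger", toy_tagger), ("ner", toy_ner)]

def run_pipeline(text):
    doc = toy_tokenizer(text)            # tokenization always comes first
    for name, component in toy_pipeline:
        doc = component(doc)             # each component annotates the same doc
    return doc

toy_doc = run_pipeline("Tesla is looking at buying a startup")
print([name for name, _ in toy_pipeline])
print(toy_doc["ents"])
```

Each component receives the output of the previous one, which is why the order of the list matters: here the entity step relies on the tags written by the tagging step.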

## Tokenization

The first step in processing text is to split it into its component parts (words and punctuation), called "tokens". These tokens are annotated inside the Doc object with descriptive information.

In [0]:
doc[0],doc[2]

(Tesla, looking)

## POS - Parts of Speech

In [0]:
doc[0].pos_,doc[2].pos_

('PROPN', 'VERB')

## Dependencies
The loop above also printed the syntactic dependency assigned to each token.

In [0]:
doc[0].dep_,doc[2].dep_

('nsubj', 'ROOT')

## To see the full name of a tag, use spacy.explain(tag)

In [0]:
spacy.explain('nsubj'),spacy.explain('ROOT'),spacy.explain('PROPN')

('nominal subject', None, 'proper noun')

## Additional Token Attributes

## text - The original word text

In [0]:
doc[0].text,doc[2].text

('Tesla', 'looking')

## lemma_ - The base form of the word

In [0]:
doc[0].lemma_,doc[2].lemma_,doc[5].lemma_

('Tesla', 'look', 'U.S.')

## pos_ - The simple part-of-speech tag

In [0]:
doc[0].pos_,doc[5].pos_

('PROPN', 'PROPN')

## tag_ - The detailed part-of-speech tag

In [0]:
doc[0].tag_,doc[5].tag_

('NNP', 'NNP')

## Shape - The word shape – capitalization, punctuation, digits

In [0]:
doc[0].shape_,doc[5].shape_

('Xxxxx', 'X.X.')
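
A word shape can be thought of as a character-by-character mapping. The sketch below is a plain-Python approximation written for this walkthrough, not code taken from spaCy: the function name `word_shape` and the assumption that runs of the same shape character are capped at four are illustrative choices.

```python
def word_shape(text, max_run=4):
    """Map letters to X/x and digits to d, keeping other characters as they are."""
    shape = []
    last, run = "", 0
    for ch in text:
        if ch.isalpha():
            code = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            code = "d"
        else:
            code = ch
        # Count how many times the same shape character repeats in a row.
        run = run + 1 if code == last else 1
        last = code
        # Assumption: long runs are truncated so that long words share a shape.
        if run <= max_run:
            shape.append(code)
    return "".join(shape)

for word in ["Tesla", "U.S.", "10.30"]:
    print(word, "--->", word_shape(word))
```

For the two tokens used above, this sketch gives the same shapes that the notebook prints.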

## is_alpha - Does the token consist of alphabetic characters?

In [0]:
doc[0].is_alpha,doc[5].is_alpha

(True, False)

## is_stop - Is the token part of a stop list, i.e. the most common words of the language?

In [0]:
doc[0].is_stop,doc[5].is_stop

(False, False)

## Spans
Large Doc objects can be hard to work with at times.

A span is a slice of a Doc object in the form Doc[start:stop].

In [0]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [0]:
doc3.text

'Although commonly attributed to John Lennon from his song "Beautiful Boy", the phrase "Life is what happens to us while we are making other plans" was written by cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.'

In [0]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [0]:
type(life_quote)

spacy.tokens.span.Span

## Sentences - Doc.sents

In [0]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sentence in doc4.sents:
  print(sentence)
  print("\n")

This is the first sentence.


This is another sentence.


This is the last sentence.




## Tokenization
The first step in creating a Doc object is to break down the incoming text into component pieces or "tokens".

In [0]:
mystring = '"We\'re moving to L.A.!"'
print(mystring)
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

"We're moving to L.A.!"
" | We | 're | moving | to | L.A. | ! | " | 

## Prefixes, Suffixes and Infixes
spaCy will isolate punctuation that does not form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

In [0]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


#### Note that the exclamation points, comma, and the hyphen in 'snail-mail' are assigned their own tokens, yet both the email address and website are preserved.

In [0]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


#### Here the distance unit and dollar sign are assigned their own tokens, yet the dollar amount is preserved.

## Exceptions
Punctuation that exists as part of a known abbreviation will be kept as part of the token.

In [0]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.
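
One way to picture the rules from the last two sections is as a loop that peels punctuation off both ends of each whitespace-separated chunk unless the chunk is a known exception. The sketch below is a toy in plain Python, not spaCy's tokenizer: the rule tables (`EXCEPTIONS`, `PREFIXES`, `SUFFIXES`), the regular expression and the function names are all made up for illustration, and infixes (the hyphen in snail-mail) and contractions ('re, 's) are not handled.

```python
import re

# Hypothetical rule tables; a real tokenizer ships far larger, language-specific ones.
EXCEPTIONS = {"U.S.", "St.", "L.A."}
PREFIXES = ('"', "$", "(")
SUFFIXES = ('"', "!", ".", ",", ")")
# Assumed pattern: URLs and email addresses that end in a word character stay whole.
URL_OR_EMAIL = re.compile(r"^(https?://\S*\w|\S+@\S+\.\w+)$")

def split_chunk(chunk):
    """Split one whitespace-delimited chunk into prefix, core and suffix tokens."""
    prefixes, suffixes = [], []
    while chunk:
        if chunk in EXCEPTIONS or URL_OR_EMAIL.match(chunk):
            break                          # protected: keep inner punctuation
        if chunk.startswith(PREFIXES):
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif chunk.endswith(SUFFIXES):
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        else:
            break
    core = [chunk] if chunk else []
    # Suffixes were collected from the outside in, so reverse them.
    return prefixes + core + suffixes[::-1]

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens

print(toy_tokenize("visit St. Louis in the U.S. next year."))
print(toy_tokenize("email support@oursite.com or visit http://www.oursite.com!"))
print(toy_tokenize("A cab ride costs $10.30"))
```

The exception check runs before any suffix is stripped, which is what keeps the period in U.S. attached while the sentence-final period after year becomes its own token.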


## Counting Tokens - Doc objects have a set number of tokens:

In [0]:
len(doc3)

9

## Counting Vocab Entries - Vocab objects contain a full library of items!

In [0]:
len(doc3.vocab)

551

## Tokens can be retrieved by index position and slice

Doc objects can be thought of as lists of token objects. As such, individual tokens can be retrieved by index position, and spans of tokens can be retrieved through slicing:

In [0]:
doc5 = nlp(u'It is better to give than to receive.')

# Retrieve the third token:
doc5[2]

better

In [0]:
# Retrieve three tokens from the middle:
doc5[2:5]

better to give
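
Because indexing and slicing follow ordinary Python sequence rules, the same behaviour can be previewed on a plain list. The list below is a hand-written stand-in for the tokens of doc5 (an assumption about how the sentence is split, with the final period as its own token), so no spaCy objects are involved.

```python
# Hand-written stand-in for the tokens of doc5; not produced by spaCy.
tokens = ["It", "is", "better", "to", "give", "than", "to", "receive", "."]

third = tokens[2]        # index positions start at 0
middle = tokens[2:5]     # start is inclusive, stop is exclusive
last_two = tokens[-2:]   # negative positions count from the end

print(third)
print(middle, len(middle))
print(last_two)
```

A slice `[start:stop]` always contains `stop - start` items, which is a quick way to check that a span covers the tokens you intended.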

## Tokens cannot be reassigned
Although Doc objects can be considered lists of tokens, they do not support item reassignment:

In [0]:
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')

In [0]:
# Try to change "My dinner was horrible" to "My dinner was delicious"
doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment

## Named Entities
The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the ents property of a Doc object.

In [0]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')
for token in doc8:
    print(token.text, end=' | ')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [0]:
doc8.ents

(Apple, Hong Kong, $6 million)

In [0]:
for ent in doc8.ents:
    print(ent.text+' --- '+ent.label_+' --- '+str(spacy.explain(ent.label_)))

Apple --- ORG --- Companies, agencies, institutions, etc.
Hong Kong --- GPE --- Countries, cities, states
$6 million --- MONEY --- Monetary values, including unit


In [0]:
len(doc8.ents)

3

## Noun Chunks
Similar to Doc.ents, Doc.noun_chunks is another property of the Doc object. Noun chunks are "base noun phrases" – flat phrases that have a noun as their head.

In [0]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [0]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


In [0]:
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
    print(chunk.text)

He
a one-eyed, one-horned, flying, purple people-eater


## Built-in Visualizers

spaCy includes a built-in visualization tool called displaCy. 

displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

In [0]:
from spacy import displacy

doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

#### Note:
The optional 'distance' argument sets the distance between tokens. If the distance is made too small, text that appears beneath short arrows may become too compressed to read.

## Visualizing the entity recognizer

In [0]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)

## Creating Visualizations Outside of Jupyter

In [0]:
doc = nlp(u'This is a sentence.')
displacy.serve(doc, style='dep')


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Stemming
Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. 

**In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization.**
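
To make the "chopping" idea concrete, here is a deliberately naive suffix stripper in plain Python. It is not the Porter algorithm and not part of NLTK or spaCy: the suffix list, the minimum stem length and the name `naive_stem` are assumptions chosen only to show why such rules are crude.

```python
# Hypothetical suffix list for this example; real stemmers apply many ordered rules.
SUFFIXES = ("ing", "ly", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["runs", "running", "ran", "meeting", "easily"]:
    print(w, "--->", naive_stem(w))
```

With these rules runs becomes run, but running becomes runn and ran is left untouched, which illustrates the kind of exception that pure suffix removal cannot handle.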

## PorterStemmer

In [0]:
# Import the Porter stemmer class from NLTK
from nltk.stem.porter import PorterStemmer

In [0]:
p_stemmer = PorterStemmer()

In [65]:
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
  print(word+ ' ---> '+p_stemmer.stem(word))

run ---> run
runner ---> runner
running ---> run
ran ---> ran
runs ---> run
easily ---> easili
fairly ---> fairli


## Snowball Stemmer

The name is something of a misnomer, as Snowball is the name of a stemming language developed by Martin Porter. The algorithm used here is more accurately called the "English Stemmer" or "Porter2 Stemmer". It offers a slight improvement over the original Porter stemmer, both in logic and speed. Since nltk uses the name SnowballStemmer, we'll use it here.

In [66]:
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language='english')

for word in words:
  print(word + '---> '+s_stemmer.stem(word))

run---> run
runner---> runner
running---> run
ran---> ran
runs---> run
easily---> easili
fairly---> fair


## Stemming has its drawbacks

If given the token **saw**, **stemming** might always return **saw**, whereas **lemmatization** would likely return either **see** or **saw** depending on whether the use of the token was as a verb or a noun. 

In [67]:
phrase = 'I am meeting him tomorrow at the meeting'
for word in phrase.split():
    print(word+' --> '+p_stemmer.stem(word))

I --> I
am --> am
meeting --> meet
him --> him
tomorrow --> tomorrow
at --> at
the --> the
meeting --> meet


## Lemmatization
In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a morphological analysis to words. 
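
The key difference from stemming is that the lemma can depend on the part of speech. The sketch below mimics that with a tiny lookup table keyed on (word, POS). It is a hypothetical illustration, not how spaCy stores or computes lemmas: the table `LEMMA_TABLE` and the function `toy_lemma` are invented here, and the entries simply mirror the examples used in this notebook.

```python
# Hypothetical lookup table; a real lemmatizer combines rules, exceptions and large word lists.
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("mice", "NOUN"): "mouse",
}

def toy_lemma(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

for word, pos in [("saw", "VERB"), ("saw", "NOUN"), ("meeting", "VERB"), ("meeting", "NOUN")]:
    print(word, pos, "--->", toy_lemma(word, pos))
```

A stemmer sees only the characters of the word, so it has to give both uses of meeting the same stem; a lemmatizer that knows the part of speech can keep them apart.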

In [69]:
doc_1 = nlp(u'I am meeting him tomorrow at the meeting')
for token in doc_1:
  print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

I 	 PRON 	 561228191312463089 	 -PRON-
am 	 VERB 	 10382539506755952630 	 be
meeting 	 VERB 	 6880656908171229526 	 meet
him 	 PRON 	 561228191312463089 	 -PRON-
tomorrow 	 NOUN 	 3573583789758258062 	 tomorrow
at 	 ADP 	 11667289587015813222 	 at
the 	 DET 	 7425985699627899538 	 the
meeting 	 NOUN 	 14798207169164081740 	 meeting


In [70]:
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

I 	 PRON 	 561228191312463089 	 -PRON-
am 	 VERB 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 ADP 	 16950148841647037698 	 because
I 	 PRON 	 561228191312463089 	 -PRON-
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 ADP 	 10066841407251338481 	 since
I 	 PRON 	 561228191312463089 	 -PRON-
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today


## Function to display lemmas

In [0]:
def show_lemmas(doc):
  # Print each token's text, coarse POS tag, lemma hash and lemma string in aligned columns
  for token in doc:
    print(f'{token.text:{12}}{token.pos_:{6}}{token.lemma:<{22}}{token.lemma_}')
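
The format specifiers in show_lemmas may be easier to read in isolation. The sketch below applies the same format string to hard-coded values, copied by hand from the notebook output rather than taken from a real token. It suggests that a nested width such as `{12}` behaves like a plain `12`, and that `<` left-aligns the integer hash, which would otherwise be right-aligned.

```python
# Values copied by hand from the notebook output; no spaCy objects are involved.
text, pos, lemma_hash, lemma = "saw", "VERB", 11925638236994514241, "see"

# Same format string as show_lemmas, applied to plain Python values.
row = f"{text:{12}}{pos:{6}}{lemma_hash:<{22}}{lemma}"

# The nested form {12} and the literal form 12 produce the same padding.
same_width = f"{text:12}" == f"{text:{12}}"

# Strings are left-aligned by default; integers are right-aligned unless "<" is given.
right_aligned = f"{lemma_hash:{22}}"

print(repr(row))
print(same_width)
print(repr(right_aligned))
```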

In [74]:
doc2 = nlp(u"I saw eighteen mice today!")
show_lemmas(doc2)

I           PRON  561228191312463089    -PRON-
saw         VERB  11925638236994514241  see
eighteen    NUM   9609336664675087640   eighteen
mice        NOUN  1384165645700560590   mouse
today       NOUN  11042482332948150395  today
!           PUNCT 17494803046312582752  !


#### Notice that the lemma of saw is see and the lemma of mice is mouse, yet eighteen is its own number, not an expanded form of eight.

In [75]:
doc3 = nlp(u"I am meeting him tomorrow at the meeting.")
show_lemmas(doc3)

I           PRON  561228191312463089    -PRON-
am          VERB  10382539506755952630  be
meeting     VERB  6880656908171229526   meet
him         PRON  561228191312463089    -PRON-
tomorrow    NOUN  3573583789758258062   tomorrow
at          ADP   11667289587015813222  at
the         DET   7425985699627899538   the
meeting     NOUN  14798207169164081740  meeting
.           PUNCT 12646065887601541794  .


#### Here the lemma of meeting is determined by its part-of-speech tag.

In [76]:
doc4 = nlp(u"That's an enormous automobile")
show_lemmas(doc4)

That        DET   4380130941430378203   that
's          VERB  10382539506755952630  be
an          DET   15099054000809333061  an
enormous    ADJ   17917224542039855524  enormous
automobile  NOUN  7211811266693931283   automobile


#### Note that lemmatization does not reduce words to their most basic synonym - that is, enormous doesn't become big and automobile doesn't become car.