# Working with spaCy

In [1]:
#Importing spacy
import spacy

In [2]:
# Load the English language model
# Run `python -m spacy download en_core_web_sm` once beforehand to download the model
nlp=spacy.load('en_core_web_sm')

In [3]:
#Applying the model to the text
doc=nlp(u'Tesla is looking at buying U.S. startup for $6 million')

Here, the language model that we loaded parses the entire string into separate components known as tokens. Roughly speaking, each word becomes a token, as do punctuation marks and symbols such as `$`.

The `nlp()` function from spaCy takes raw text and automatically performs a series of operations to tag, parse, and describe the text data.

In [4]:
for token in doc:
    print(token, token.pos, token.pos_)
# token.pos is the integer ID of the coarse part-of-speech (POS) tag.
# token.pos_ is the readable name of that tag, such as ADV, VERB, NOUN or PRON.

Tesla 96 PROPN
is 87 AUX
looking 100 VERB
at 85 ADP
buying 100 VERB
U.S. 96 PROPN
startup 92 NOUN
for 85 ADP
$ 99 SYM
6 93 NUM
million 93 NUM
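
Since `token.pos` and `token.pos_` are two views of the same tag, the pairs printed above can be checked in plain Python. The sketch below is only an illustration: the `rows` list is copied by hand from the output rather than produced by spaCy, and the names `rows`, `id_to_name` and `counts` are made up for this example.

```python
from collections import Counter

# (text, pos, pos_) triples copied by hand from the output above;
# in a live session they would come from iterating over doc.
rows = [
    ("Tesla", 96, "PROPN"), ("is", 87, "AUX"), ("looking", 100, "VERB"),
    ("at", 85, "ADP"), ("buying", 100, "VERB"), ("U.S.", 96, "PROPN"),
    ("startup", 92, "NOUN"), ("for", 85, "ADP"), ("$", 99, "SYM"),
    ("6", 93, "NUM"), ("million", 93, "NUM"),
]

# Every integer ID should map to exactly one readable name.
id_to_name = {}
for _, pos_id, pos_name in rows:
    assert id_to_name.setdefault(pos_id, pos_name) == pos_name

# Tally how often each coarse tag occurs in the sentence.
counts = Counter(pos_name for _, _, pos_name in rows)

print(id_to_name)
print(counts.most_common())
```

The tally is a quick way to summarise what kinds of words a sentence contains once the tags are available.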


In [5]:
#Pipeline object
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x106b98b0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x10983040>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x10935100>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x109356a0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x10979040>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x10a01940>)]

The first step in processing any text is to split it into its component parts, i.e. the words and punctuation, known as tokens.

In [6]:
doc2=nlp(u"Tesla isn't  looking into startups anymore.")

In [7]:
for token in doc2:
    print(token.text,token.pos_,token.dep_)

# token.dep_ is the syntactic dependency label of the token

Tesla PROPN nsubj
is AUX aux
n't PART neg
  SPACE dep
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


In [8]:
#Using indexing to grab token
doc2[0], doc2[0].pos_

(Tesla, 'PROPN')

### Additional Token Attributes
Other information that spaCy assigns to tokens:

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

### Spans
Large `Doc` objects can be hard to work with at times. A **span** is a slice of a `Doc` object in the form `Doc[start:stop]`.

In [9]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [10]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [11]:
type(life_quote)

spacy.tokens.span.Span

### Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`.

In [12]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [13]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [14]:
# Check whether this token starts a sentence (True here; a token that does not start one returns False)
doc4[6].is_sent_start

True
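
To make the idea of "start of sentence" tags concrete, here is a minimal plain-Python sketch of how such flags can be turned into sentence segments. It does not use spaCy: the token list and the flags are written by hand, and the helper `group_by_sentence_start` is a hypothetical name, not part of the spaCy API. spaCy's own `Doc.sents` implementation may differ in detail.

```python
def group_by_sentence_start(tokens, starts):
    """Group tokens into sentences, opening a new one at each True flag."""
    sentences = []
    for token, is_start in zip(tokens, starts):
        # Open a new sentence at a start flag (or for the very first token).
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(token)
    return sentences


# Hand-written stand-ins for the tokens of a Doc and their is_sent_start flags.
tokens = ["This", "is", "the", "first", "sentence", ".",
          "This", "is", "another", "sentence", "."]
starts = [True, False, False, False, False, False,
          True, False, False, False, False]

for sentence in group_by_sentence_start(tokens, starts):
    print(" ".join(sentence))
```

Note that index 6 carries a `True` flag here, which mirrors the `doc4[6].is_sent_start` check above.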

## 1. Tokenization
The first step in creating a `Doc` object is to break down the incoming text into component pieces or "tokens".

In [15]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [16]:
# Create a Doc object and explore tokens
doc = nlp(mystring)

for token in doc:
    print(token.text)

"
We
're
moving
to
L.A.
!
"


In [17]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


In [18]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


In [19]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


In [20]:
# Count the number of tokens
len(doc4)

11

In [21]:
#Vocab objects contain a full library of items!
len(doc.vocab)

831

### Tokens can be retrieved by index position and slice

In [22]:
doc5 = nlp(u'It is better to give than to receive.')

# Retrieve the third token:
doc5[2]

better

In [23]:
# Retrieve three tokens from the middle:
doc5[2:5]

better to give

In [24]:
# Retrieve the last four tokens:
doc5[-4:]

than to receive.

### Tokens cannot be reassigned
Although `Doc` objects can be considered lists of tokens, they do *not* support item reassignment:

In [25]:
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')

In [26]:
# Trying to change "My dinner was horrible" to "My dinner was delicious" raises a TypeError,
# so the assignment is left commented out
#doc6[3] = doc7[3]

## 2. Named Entities
Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [27]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
    print(token.text, end=' | ')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [28]:
# Print each named entity, its label (e.g. ORG for an organization, MONEY for $6 million) and spaCy's explanation of the label
for entity in doc8.ents:
    print(entity)
    print(entity.label_)
    print(str(spacy.explain(entity.label_)))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


Hong Kong
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




### Noun Chunks
Similar to `Doc.ents`, `Doc.noun_chunks` is another property of `Doc` objects. *Noun chunks* are "base noun phrases": flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun.

In [29]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [30]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


### Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more information, see the spaCy documentation at spacy.io.

In [31]:
from spacy import displacy

In [32]:
doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')

In [33]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})
# style='dep' draws the syntactic dependency parse; the 'distance' option sets the spacing between tokens

In [34]:
#Visualizing the entity recognizer
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)

# style='ent' highlights the named entities

In [35]:
# Outside Jupyter, use displacy.serve instead
# doc = nlp(u'This is a sentence.')
# displacy.serve(doc, style='dep')

## 3. Stemming
spaCy does not include a stemmer, so we will perform stemming using NLTK.

### Porter Stemming Algorithm

In [36]:
import nltk

In [37]:
from nltk.stem.porter import PorterStemmer

In [38]:
p_stemmer=PorterStemmer()

In [39]:
words=['run','runner','ran','runs','easily','fairly']

In [40]:
#Stemming the words
for word in words:
    print(word + '---------->' + p_stemmer.stem(word))

run---------->run
runner---------->runner
ran---------->ran
runs---------->run
easily---------->easili
fairly---------->fairli


### Snowball Stemming Algorithm

In [41]:
from nltk.stem.snowball import SnowballStemmer

In [42]:
s_stemmer=SnowballStemmer(language="english")

In [43]:
words=['run','runner','ran','runs','easily','fairly']

In [44]:
#Stemming the words
for word in words:
    print(word + '---------->' + s_stemmer.stem(word))

run---------->run
runner---------->runner
ran---------->ran
runs---------->run
easily---------->easili
fairly---------->fair
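
The two result lists are easy to compare programmatically. The sketch below copies the printed stems into two dictionaries by hand (it does not call NLTK) and reports the words on which the stemmers disagree. The variable names are chosen for this example only.

```python
# Stems copied by hand from the Porter and Snowball outputs above.
porter = {"run": "run", "runner": "runner", "ran": "ran",
          "runs": "run", "easily": "easili", "fairly": "fairli"}
snowball = {"run": "run", "runner": "runner", "ran": "ran",
            "runs": "run", "easily": "easili", "fairly": "fair"}

# Keep only the words whose stems differ between the two algorithms.
differences = {word: (porter[word], snowball[word])
               for word in porter if porter[word] != snowball[word]}

print(differences)
```

For this word list, `fairly` is the only word the two stemmers treat differently.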


In [45]:
#Another example
words=['generous','generation','generously','generate']

In [46]:
#Stemming the words
for word in words:
    print(word + '---------->' + s_stemmer.stem(word))

generous---------->generous
generation---------->generat
generously---------->generous
generate---------->generat


## 4. Lemmatization

In [47]:
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

In [48]:
for token in doc1:
    print(token.text,'\t',token.pos_,'\t',token.lemma_)

I 	 PRON 	 I
am 	 AUX 	 be
a 	 DET 	 a
runner 	 NOUN 	 runner
running 	 VERB 	 run
in 	 ADP 	 in
a 	 DET 	 a
race 	 NOUN 	 race
because 	 SCONJ 	 because
I 	 PRON 	 I
love 	 VERB 	 love
to 	 PART 	 to
run 	 VERB 	 run
since 	 SCONJ 	 since
I 	 PRON 	 I
ran 	 VERB 	 run
today 	 NOUN 	 today


## 5. Stop Words

In [49]:
# List the default stop words
print(nlp.Defaults.stop_words)

{'none', 'back', 'noone', '’ve', 'since', 'twenty', 'first', 'those', 'would', 'move', 'too', 'twelve', 'meanwhile', 'due', 'wherein', 'forty', 'me', 'without', "'ll", 'quite', 'regarding', 'fifty', 'yourself', 'thence', '’ll', 'toward', 'itself', 'indeed', 'behind', 'did', '’d', 'whose', 'a', 'become', 'moreover', 'himself', 'around', 'by', 'empty', 'not', 'five', '‘s', 'over', 'neither', 'say', 'mostly', 'sixty', 'among', 'an', 'his', 'seems', 'whoever', 'could', 'hereupon', 'thru', 'latterly', 'every', 'four', 'three', 'keep', 'using', 'of', 'yourselves', 'as', 'nobody', 'out', 'she', 'yours', 'give', "'s", 'doing', 'n‘t', 'be', 'in', 'than', "n't", 'or', 'your', 'formerly', 'almost', 'so', 'hundred', 'ever', 'show', 'six', 'he', 'bottom', 'nevertheless', 'now', 'them', 'been', 'had', 'someone', 'part', 'sometimes', '‘m', 'sometime', 'amount', 'already', 'down', '’re', 'everyone', 'amongst', 'otherwise', 'it', 'n’t', 'least', 'few', 'serious', 'beforehand', 'several', 'hereby', 'her

In [50]:
len(nlp.Defaults.stop_words)

326

In [51]:
#Checking that the word is a stop word
nlp.vocab['the'].is_stop

True

In [52]:
nlp.vocab['mystery'].is_stop

False

In [53]:
#Adding word to stop words
nlp.Defaults.stop_words.add('btw')

In [54]:
len(nlp.Defaults.stop_words)

327

In [55]:
nlp.vocab['btw'].is_stop

True

In [56]:
#Removing the word from the list of stop words
nlp.Defaults.stop_words.remove('beyond')

In [57]:
nlp.vocab['beyond'].is_stop

False
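
A common use of a stop-word list is to filter tokens before further analysis. The following is a plain-Python sketch of that idea, assuming a tiny hand-picked subset of stop words rather than spaCy's full set. With a real `Doc` you would typically test `token.is_stop` instead.

```python
# A small, hand-picked subset of stop words (an assumption for illustration;
# spaCy's default set is much larger).
stop_words = {"the", "a", "of", "in", "it"}

tokens = ["the", "mystery", "of", "a", "red", "car"]

# Keep only the tokens that are not stop words (case-insensitive check).
content_tokens = [t for t in tokens if t.lower() not in stop_words]
print(content_tokens)

# Sets support the same add/remove operations used on the default list above.
stop_words.add("btw")
stop_words.remove("of")
print("btw" in stop_words, "of" in stop_words)
```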

## 6. Vocabulary and Matching

**Rule Based Matching**

spaCy offers a rule-matching tool called Matcher that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [58]:
from spacy.matcher import Matcher

In [59]:
matcher=Matcher(nlp.vocab)

In [60]:
#SolarPower
pattern1 = [{'LOWER': 'solarpower'}]
#Solar power
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
#Solar-power
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

matcher.add('SolarPower',[pattern1, pattern2, pattern3])

In [61]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

In [73]:
found_matches = matcher(doc)
print(found_matches)

# Each tuple is (match_id, start, end): the match ID, then the start and end token indices of the match in the Doc

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]
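
Each tuple can be read as `(match_id, start, end)`, where `start` and `end` are token indices and `end` is exclusive. The sketch below reproduces that reading in plain Python. The token list is written by hand to mirror how the sentence above appears to be tokenized (an assumption consistent with the printed indices). With a live `Doc`, `doc[start:end]` returns the matched span and `nlp.vocab.strings[match_id]` recovers the rule name `'SolarPower'`.

```python
# Hand-written stand-in for the tokens of the Doc (assumed tokenization).
tokens = ["The", "Solar", "Power", "industry", "continues", "to", "grow",
          "as", "demand", "for", "solarpower", "increases", ".",
          "Solar", "-", "power", "cars", "are", "gaining", "popularity", "."]

# Match tuples copied from the output above.
found_matches = [(8656102463236116519, 1, 3),
                 (8656102463236116519, 10, 11),
                 (8656102463236116519, 13, 16)]

# Slice the token list with each half-open [start, end) range.
matched_spans = [tokens[start:end] for _, start, end in found_matches]

for (match_id, start, end), span in zip(found_matches, matched_spans):
    print(match_id, start, end, span)
```

All three tuples share one match ID because the three patterns were added under the same rule name.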
