# Installation of Spacy

**!pip3 install spacy**

spaCy does not download the English model automatically, so fetch it separately:

**!python3 -m spacy download en**

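You can now load the model via **nlp=spacy.load('en')**. The `en` shortcut works in spaCy v2, which the outputs in this notebook come from; spaCy v3 removed shortcuts, so there the model is downloaded and loaded by its full name, `en_core_web_sm`.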

# Some Tutorials
* https://nlpforhackers.io/complete-guide-to-spacy/


### Basics


In [1]:
import spacy
nlp = spacy.load('en')
doc = nlp('Hello     World!')
for token in doc:
    print('"' + token.text + '"', token.idx) 

"Hello" 0
"    " 6
"World" 10
"!" 15


In [2]:
doc = nlp("Next week I'll   be in Madrid.  Won't I?")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

Next	0	next	False	False	Xxxx	ADJ	JJ
week	5	week	False	False	xxxx	NOUN	NN
I	10	-PRON-	False	False	X	PRON	PRP
'll	11	will	False	False	'xx	VERB	MD
  	15	  	False	True	  	SPACE	_SP
be	17	be	False	False	xx	AUX	VB
in	20	in	False	False	xx	ADP	IN
Madrid	23	Madrid	False	False	Xxxxx	PROPN	NNP
.	29	.	True	False	.	PUNCT	.
 	31	 	False	True	 	SPACE	_SP
Wo	32	will	False	False	Xx	VERB	MD
n't	34	not	False	False	x'x	PART	RB
I	38	-PRON-	False	False	X	PRON	PRP
?	39	?	True	False	?	PUNCT	.


### Sentence detection

In [3]:
doc = nlp("These are apples. The webpage is www.google.com.")
 
for sent in doc.sents:
    print(sent)

These are apples.
The webpage is www.google.com.


### Part Of Speech Tagging

Source: 
1. https://www.nltk.org/book/ch05.html
2. https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

* CC coordinating conjunction
* CD cardinal number
* DT determiner
* EX existential there (like: "there is" ... think of it like "there exists")
* FW foreign word
* IN preposition/subordinating conjunction
* JJ adjective 'big'
* JJR adjective, comparative 'bigger'
* JJS adjective, superlative 'biggest'
* LS list marker 1)
* MD modal could, will
* NN noun, singular 'desk'
* NNS noun plural 'desks'
* NNP proper noun, singular 'Harrison'
* NNPS proper noun, plural 'Americans'
* PDT predeterminer 'all the kids'
* POS possessive ending parent's
* PRP personal pronoun I, he, she
* PRP$ possessive pronoun my, his, hers
* RB adverb very, silently,
* RBR adverb, comparative better
* RBS adverb, superlative best
* RP particle give up
* TO to go 'to' the store.
* UH interjection errrrrrrrm
* VB verb, base form take
* VBD verb, past tense took
* VBG verb, gerund/present participle taking
* VBN verb, past participle taken
* VBP verb, sing. present, non-3rd person take
* VBZ verb, 3rd person sing. present takes
* WDT wh-determiner which
* WP wh-pronoun who, what
* WP$ possessive wh-pronoun whose
* WRB wh-adverb where, when

In [4]:
doc = nlp("Next week I'll be in Madrid.")
print([(token.text, token.tag_) for token in doc])

[('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('Madrid', 'NNP'), ('.', '.')]


## Named Entity Recognition And Chunks

In [33]:
doc = nlp("Next week I'll be in Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Next week DATE
Madrid GPE


In [34]:
doc = nlp(u"Bank of America to build a Tech Center in Plano for \
            $20 Million dollars!!!")

for ent in doc.ents:
    print(f"{ ent.text:{20}} {ent.label_:{10}} ")
    #print(f"{token.pos_:{10}} {token.text:{10}} {token.dep_:{10}} {token} " )



Bank of America      ORG        
Tech Center          ORG        
Plano                GPE        
$20 Million dollars  MONEY      


In [43]:
for chunk in doc.noun_chunks:
    print(chunk)

Bank
America
a Tech Center
Plano
$20 Million dollars


### Computing Similarity


In [6]:
nlp = spacy.load('en_core_web_lg')
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']
 
print(dog.similarity(animal), dog.similarity(fruit)) 
print(banana.similarity(fruit), banana.similarity(animal))

0.66185343 0.23552851
0.67148364 0.24272855
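
The scores above are cosine similarities between word vectors. As a rough sketch of that computation, here is the same measure in plain Python. The 3-dimensional vectors are made up for illustration (the vectors in `en_core_web_lg` have 300 dimensions), so the numbers it prints will not match the output above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from any real model.
toy_vectors = {
    "dog":    [0.9, 0.1, 0.3],
    "animal": [0.8, 0.2, 0.4],
    "fruit":  [0.1, 0.9, 0.2],
}

dog_animal = cosine_similarity(toy_vectors["dog"], toy_vectors["animal"])
dog_fruit = cosine_similarity(toy_vectors["dog"], toy_vectors["fruit"])
print(round(dog_animal, 3), round(dog_fruit, 3))
```

Vectors that point in nearly the same direction score close to 1, and unrelated directions score close to 0, which is the pattern seen in the dog/animal versus dog/fruit comparison.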


In [7]:
nlp = spacy.load('en')
print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x11a292f50>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x11a3af440>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x11a29d280>)]


# Basics

In [20]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Bank of America is looking to buy a U.S. startup for 6 millions and sell it for 100 Billions')
for token in doc:
    #print(token.text, token, token.pos_)
    print(f"{token.pos_:{10}} {token.text:{10}} {token.dep_:{10}} {token} " )
    

PROPN      Bank       nsubj      Bank 
ADP        of         prep       of 
PROPN      America    pobj       America 
AUX        is         aux        is 
VERB       looking    ROOT       looking 
PART       to         aux        to 
VERB       buy        xcomp      buy 
DET        a          det        a 
PROPN      U.S.       compound   U.S. 
NOUN       startup    dobj       startup 
ADP        for        prep       for 
NUM        6          nummod     6 
NOUN       millions   pobj       millions 
CCONJ      and        cc         and 
VERB       sell       conj       sell 
PRON       it         dobj       it 
ADP        for        prep       for 
NUM        100        nummod     100 
NOUN       Billions   pobj       Billions 


## Pipeline

In [21]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x11ee9e450>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1224e8050>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x11ef51de0>)]

In [22]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [24]:
doc2 = nlp(u"Tesla isn't looking into         startups anymore.")
for token in doc2:
    #print(token.text, token.pos_, token.dep_)
    print(f"{token.pos_:{10}} {token.text:{10}} {token.dep_:{10}} {token} " )


PROPN      Tesla      nsubj      Tesla 
AUX        is         aux        is 
PART       n't        neg        n't 
VERB       looking    ROOT       looking 
ADP        into       prep       into 
SPACE                                     
NOUN       startups   pobj       startups 
ADV        anymore    advmod     anymore 
PUNCT      .          punct      . 


## Spans

Large documents can be difficult to work with. A span is a slice of the document, taken in the form **Doc[start:stop]**, where start and stop are token indices.

In [25]:
doc3 = nlp(u'Health officials in Dallas City and Los Angeles County \
           are signaling a change in local strategy when it comes to \
           coronavirus testing, recommending that doctors avoid testing \
           patients except in cases where a test result would \
           significantly change the course of treatment. A news release \
           from the Los Angeles Department of Public Health this week \
           advised doctors not to test those experiencing only mild \
           respiratory symptoms unless “a diagnostic result will change \
           clinical management or inform public health response. The \
           recommendation reflects a "shifting from a strategy of case \
           containment to slowing disease transmission and averting \
           excess morbidity and mortality," according to the statement. \
           The guidance said coronavirus testing at L.A. County public \
           health labs will prioritized those with symptoms, health care \
           workers, residents of long-term care facilities, paramedics \
           and other high-risk situations. Others are encouraged to \
           simply stay at home. At about the same time, the New York \
           City Department of Health directed all healthcare facilities \
           to immediately stop testing non-hospitalized patients for \
           Covid-19.')
shift_quote = doc3[87:107]
print (shift_quote)

"shifting from a strategy of case containment to slowing disease transmission and averting excess morbidity and mortality,"


In [26]:
print (type(shift_quote), type(doc3) )

<class 'spacy.tokens.span.Span'> <class 'spacy.tokens.doc.Doc'>


## Tokens
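Tokens are the basic building blocks of a Doc object. Everything that helps us understand the meaning of the text is derived from the tokens and their relationships to one another.

When the tokenizer splits a chunk of text, it looks for three kinds of characters:

* **prefix**: character(s) at the beginning, such as `$` or `(`
* **suffix**: character(s) at the end, such as `)` or `!`
* **infix**: character(s) in between, such as `-` or `/`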
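
To make the prefix, suffix, and infix idea concrete, here is a minimal toy tokenizer in plain Python. It is only a sketch of the general approach (split on whitespace, peel characters off both ends, then split on infixes). The character sets and the `toy_tokenize` name are assumptions made for this example, and it ignores the special-case exceptions (for instance "U.K.") that a real tokenizer such as spaCy's handles.

```python
import re

# Hypothetical character sets chosen for this sketch.
PREFIX_CHARS = set("([$\"'")
SUFFIX_CHARS = set(")].,!?\"'")
INFIX_PATTERN = re.compile(r"(-|/)")

def toy_tokenize(text):
    """Split text on whitespace, then peel prefixes, suffixes and infixes."""
    tokens = []
    for chunk in text.split():
        trailing = []
        # Peel prefix characters off the front, one at a time.
        while chunk and chunk[0] in PREFIX_CHARS:
            tokens.append(chunk[0])
            chunk = chunk[1:]
        # Peel suffix characters off the end, keeping their original order.
        while chunk and chunk[-1] in SUFFIX_CHARS:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Split what is left on infix characters, keeping the infix itself.
        if chunk:
            tokens.extend(part for part in INFIX_PATTERN.split(chunk) if part)
        tokens.extend(trailing)
    return tokens

print(toy_tokenize("(Hello, world!) A co-op costs $5."))
```

Each piece that is split off becomes its own token, which is why punctuation shows up as separate rows in the token tables in this notebook.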

## Displacy

In [45]:
from spacy import displacy
doc = nlp(u"Apple is going to build a U.K. factory for $6 million.")
# style='dep' draws the syntactic dependency parse; style='ent' highlights named entities
displacy.render(doc, style='dep', jupyter=True, options={'distance':70})

In [46]:
displacy.render(doc, style='ent')

In [48]:
doc_1 = nlp(u"This is a sentence")
displacy.render(doc_1, style='dep')  # render the new doc_1, not the earlier doc

# Stemming 


* When searching for the keyword "write", a search might also return "writes", "writing", and "wrote". In this case "write" is the stem of "write", "writing", "writes", and so on.
* Stemming is a method for cataloging related words. It essentially chops letters off the end of a word until the stem is reached. This does not handle every case, and that is where lemmatization comes into the picture.
* One of the most common and effective stemming tools is Porter's algorithm, developed by Martin Porter in 1980. It applies five phases of word reduction, each with its own set of mapping rules. Snowball is a stemming language also developed by Martin Porter; its English stemmer is called the **English Stemmer** or **Porter2 Stemmer**.
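
To give a feel for what a "set of mapping rules" looks like, here is a sketch of only the first rule set (step 1a) from Porter's 1980 description, which handles plural endings. The function name `porter_step_1a` is made up for this example. The full algorithm has several more steps with conditions on the remaining stem, and library implementations may differ in details.

```python
def porter_step_1a(word):
    """Apply the plural-suffix rules of Porter's step 1a (sketch only)."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS, e.g. caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # IES -> I, e.g. ponies -> poni
    if word.endswith("ss"):
        return word        # SS -> SS, e.g. caress stays caress
    if word.endswith("s"):
        return word[:-1]   # S -> (nothing), e.g. cats -> cat
    return word

for word in ["caresses", "ponies", "caress", "cats", "runs"]:
    print(f"{word:{10}} {porter_step_1a(word):{10}}")
```

The rules are tried in order and only the first matching one applies, which is why "caress" is left alone instead of losing its final "s".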

In [61]:
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()
words = ['run', 'runner', 'running', 'ran', 
         'easily', 'fairly', 'fairness','runs']

for word in words:
    print (f"{word:{10}} {p_stemmer.stem(word):{10}}")


run        run       
runner     runner    
running    run       
ran        ran       
easily     easili    
fairly     fairli    
fairness   fair      
runs       run       


In [65]:
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language='english')
for word in words:
    print (f"{word:{10}} {s_stemmer.stem(word):{10}}")

run        run       
runner     runner    
running    run       
ran        ran       
easily     easili    
fairly     fair      
fairness   fair      
runs       run       


## Lemmatization
Lemmatization goes beyond simple word reduction: it considers a language's full vocabulary and applies morphological analysis to each word. The lemma of 'was' is 'be', the lemma of 'mice' is 'mouse', and the lemma of 'meeting' may be 'meet' or 'meeting' depending on its use in the sentence. spaCy provides only lemmatization; unlike NLTK, it does not include stemming.
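
The point that a lemma can depend on how a word is used can be sketched with a tiny lookup table keyed by word and part of speech. This is a toy illustration only: `LEMMA_TABLE` and `toy_lemma` are hypothetical names, the table holds just the examples from the paragraph above, and real lemmatizers rely on far larger vocabularies and rules.

```python
# Hypothetical mini lexicon keyed by (lowercased word, part of speech).
LEMMA_TABLE = {
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}

def toy_lemma(word, pos):
    """Look up the lemma; fall back to the lowercased word if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemma("meeting", "NOUN"))  # used as a noun: "a long meeting"
print(toy_lemma("meeting", "VERB"))  # used as a verb: "we are meeting today"
print(toy_lemma("Mice", "NOUN"))
```

A stemmer has no access to the part of speech, so it cannot make this distinction.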

In [71]:
doc = nlp(u"I am a runner running in a race because \
            I love to run since I ran today")
for token in doc:
    print (f"{token.text:{10}} {token.pos_:{10}} {token.lemma:{30}} {token.lemma_:{10}} ")
    #print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

I          PRON                   561228191312463089 -PRON-     
am         AUX                  10382539506755952630 be         
a          DET                  11901859001352538922 a          
runner     NOUN                 12640964157389618806 runner     
running    VERB                 12767647472892411841 run        
in         ADP                   3002984154512732771 in         
a          DET                  11901859001352538922 a          
race       NOUN                  8048469955494714898 race       
because    SCONJ                16950148841647037698 because    
             SPACE                 7059649282071487404              
I          PRON                   561228191312463089 -PRON-     
love       VERB                  3702023516439754181 love       
to         PART                  3791531372978436496 to         
run        VERB                 12767647472892411841 run        
since      SCONJ                10066841407251338481 since      
I          PRON      

## Stop Words
[DEFINITION]: **Stopwords** are very common words that carry little or no meaning compared to other keywords.

Words like 'a' and 'the' should be filtered out. spaCy holds 326 English stop words.
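
As a minimal sketch of what filtering stop words means in practice, the snippet below removes them from a sentence using a tiny hand-written set. `TOY_STOP_WORDS` and `remove_stop_words` are assumptions made for this example and stand in for the much larger list a library provides.

```python
# Hypothetical, deliberately tiny stop-word set for illustration.
TOY_STOP_WORDS = {"a", "the", "is", "of", "to", "and", "in"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop words found in the stop set."""
    return [word for word in text.lower().split() if word not in TOY_STOP_WORDS]

print(remove_stop_words("The cat is in a box"))
```

Only the content-bearing words survive, which is usually what later steps such as keyword search or classification care about.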


In [73]:
print(nlp.Defaults.stop_words, len(nlp.Defaults.stop_words))

{'much', "'m", 'had', 'seemed', 'next', 'also', 'was', 'always', 'he', 'up', 'your', 'after', 'we', 'hundred', 'last', 'former', 'anyhow', 'there', 'whereupon', 'into', 'him', 'not', 'thru', 'per', 'sometimes', 'still', 'n‘t', 'two', 'behind', 'elsewhere', 'it', 'twenty', 'through', 'whereafter', 'moreover', 'top', 'due', 'be', 'toward', 'nobody', 'what', 'take', 'beyond', 'nine', 'very', "n't", 'than', 'together', 'perhaps', 'out', 'least', 'me', 'yourselves', 'among', 'from', 'part', 'meanwhile', 'four', 'nor', 'made', "'re", 'while', 'afterwards', 'quite', 'mostly', 'eight', 'their', 'thereafter', 'whereby', 'the', '’re', 'between', 'serious', 'others', 'whose', 'at', 'her', 'somewhere', 'or', 'various', 'i', 'but', 'empty', '‘d', 'therein', 're', 'where', 'could', 'does', 'latterly', 'whence', 'yourself', 'some', 'say', 'above', 'another', 'please', 'name', 'us', 'except', 'rather', 'own', "'ll", 'anyway', 'may', 'thereupon', 'more', '‘ve', 'so', 'n’t', 'give', 'because', 'whom', '

In [75]:
print (nlp.vocab['is'])
print (nlp.vocab['mystery'].is_stop)

<spacy.lexeme.Lexeme object at 0x147298550>
False


### Manually adding stop words

In [76]:
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop

True

### Vocabulary and Phrase Matching with Spacy

**Not VERY CLEAR TO ME**

In [78]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# SolarPower
pattern1 = [{'LOWER': 'solarpower'}]
# Solar-power
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]
# Solar Power
pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]

matcher.add('SolarPower', None, pattern1, pattern2, pattern3 )
doc = nlp(u"The Solar Power continues to grow solarpower for betterment of solar-power")
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 6, 7), (8656102463236116519, 10, 13)]
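
Each tuple in the output above is `(match_id, start, end)`: the hash of the rule name 'SolarPower', followed by the token indices where the match starts and ends. Every pattern is a list of dictionaries, one dictionary per token. The following plain-Python sketch mimics that idea for the two attributes used here, `LOWER` and `IS_PUNCT`. It is not how spaCy implements its Matcher; the helper names are made up, and the token list is written by hand on the assumption that "solar-power" is split into three tokens, as the `(10, 13)` match suggests.

```python
import string

def is_punct(token):
    """Toy check: every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

def token_matches(token, spec):
    """Check one token against one pattern dictionary."""
    checks = {
        "LOWER": lambda value: token.lower() == value,
        "IS_PUNCT": lambda value: is_punct(token) == value,
    }
    return all(checks[key](value) for key, value in spec.items())

def toy_match(tokens, patterns):
    """Return (start, end) token index pairs for every pattern hit."""
    matches = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            window = tokens[start:end]
            if len(window) == len(pattern) and all(
                token_matches(tok, spec) for tok, spec in zip(window, pattern)
            ):
                matches.append((start, end))
    return matches

# Hand-written tokens for the sentence used above (an assumption).
tokens = ["The", "Solar", "Power", "continues", "to", "grow", "solarpower",
          "for", "betterment", "of", "solar", "-", "power"]
patterns = [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
]
print(toy_match(tokens, patterns))
```

The index pairs it finds line up with the `start` and `end` values in the Matcher output: "Solar Power" at tokens 1 to 3, "solarpower" at 6 to 7, and "solar-power" at 10 to 13.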


In [81]:
#nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", None, pattern)

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, '\n', string_id,'\n', start, end, span.text)

15578876784678163569 
 HelloWorld 
 0 3 Hello, world


In [90]:
import nltk
nltk.download('punkt')  # Punkt models used by nltk.sent_tokenize

[nltk_data] Downloading package punkt to /Users/gshyam/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [93]:
import nltk
from nltk.stem import PorterStemmer
#from nltk.corpus import stopwords

paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
print (sentences)


['I have three visions for India.', 'In 3000 years of our history, people from all over \n               the world have come and invaded us, captured our lands, conquered our minds.', 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n               the French, the Dutch, all of them came and looted us, took over what was ours.', 'Yet we have not done this to any other nation.', 'We have not conquered anyone.', 'We have not grabbed their land, their culture, \n               their history and tried to enforce our way of life on them.', 'Why?', 'Because we respect the freedom of others.That is why my \n               first vision is that of freedom.', 'I believe that India got its first vision of \n               this in 1857, when we started the War of Independence.', 'It is this freedom that\n               we must protect and nurture and build on.', 'If we are not free, no one will respect us.', 'My second vision for India’s development.', 'For 