# NLP with spaCy in Python

### Nima Parandavar

   # How does NLP work?
   Computers are emotionless, so how is it possible to train them to understand human language?
   We need a system that can translate our language into numbers.
   
   ### Look at these examples
   - the 0.0897 0.0160 -0.0571 0.0405 -0.0696  ...
   - and -0.0314 0.0149 -0.0205 0.0557 0.0205  ...
   - of -0.0063 -0.0253 -0.0338 0.0178 -0.0966 ...
   - to 0.0495 0.0411 0.0041 0.0309 -0.0044    ...
   - in -0.0234 -0.0268 -0.0838 0.0386 -0.0321 ...
   
   If we visualize these word vectors, we can see that the words "the", "and", "of", "to", and "in" lie close together.
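
How close two words are can be measured with the cosine similarity of their vectors. The sketch below is a minimal illustration that uses only the five components printed in the list above for each word; the real vectors are much longer, so the scores are only indicative. The helper name `cosine_similarity` is our own and not part of spaCy.

```python
import math

# First five components of each word vector, copied from the list above.
# The full vectors are longer, so these scores are only illustrative.
vectors = {
    "the": [0.0897, 0.0160, -0.0571, 0.0405, -0.0696],
    "and": [-0.0314, 0.0149, -0.0205, 0.0557, 0.0205],
    "of": [-0.0063, -0.0253, -0.0338, 0.0178, -0.0966],
    "to": [0.0495, 0.0411, 0.0041, 0.0309, -0.0044],
    "in": [-0.0234, -0.0268, -0.0838, 0.0386, -0.0321],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Compare "the" with every other word in the list.
for word in ("and", "of", "to", "in"):
    score = cosine_similarity(vectors["the"], vectors[word])
    print(f"the vs {word}: {score:.3f}")
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are unrelated.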
   

# How does machine learning help us with NLP?
Machine learning allows you to accomplish three tasks: syntactic dependency parsing (determining the relationships between words in a sentence), part-of-speech tagging (identifying nouns, verbs, and other parts of speech), and named entity recognition (sorting proper nouns into categories like people, organizations, and locations).

# Working with spaCy
#### spaCy is an open-source library that you can use to build a variety of applications. It supports 74+ languages.
#### It uses machine learning to process text

![alt text for screen reader](./img1.png)

#### spaCy uses neural networks for training and for predicting words and grammar

![alt text for screen reader](./img2.png)

# What can spaCy do?
#### spaCy uses neural models for syntactic dependency parsing, part-of-speech tagging, and named entity recognition.
# What can spaCy not do?
#### One thing spaCy can’t do for you is recognize the user’s intent.
## Look at the following example:
#### I want to order a pair of jeans
![alt text for screen reader](./img3.png)
#### Notice that spaCy doesn’t mark anything as the user’s intent in the generated tree. In fact, it would be strange if it did so. 

# THE TEXT-PROCESSING PIPELINE

### Set up spaCy

### Follow these commands to install spaCy
- pip install -U spacy
- python -m spacy info
### Install a statistical model for spaCy
#### spaCy doesn't ship with any models, so we have to install one before it can process text
- python -m spacy download en_core_web_sm
### Now let's start :)

In [2]:
import spacy

In [3]:
# There are many models for spaCy
# language_core_web_sm / md / lg (trained on text collected from the web)
# language_core_wiki_sm / md / lg (trained on text collected from Wikipedia)

nlp = spacy.load("en_core_web_sm")

# Basic NLP operations with spaCy
![alt text for screen reader](./img4.png)

# Tokenization
#### The very first action any NLP application typically performs on a text is parsing that text into tokens
#### Tokenization is the first operation because all the other operations require you to have tokens already in place.
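
To get a feel for what splitting text into tokens involves, here is a toy tokenizer built from a single regular expression. It is only a sketch and not how spaCy tokenizes: spaCy applies language-specific rules and exception lists, so results will differ on harder input. The names `TOKEN_PATTERN` and `toy_tokenize` are our own.

```python
import re

# Toy pattern: a clitic such as 'm, or a run of word characters,
# or a single punctuation mark. This is NOT spaCy's tokenizer.
TOKEN_PATTERN = re.compile(r"'\w+|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into rough tokens with a single regular expression."""
    return TOKEN_PATTERN.findall(text)


print(toy_tokenize("I'm flying to London."))
```

For this sentence the toy version yields the tokens `I`, `'m`, `flying`, `to`, `London`, and `.`, but a single pattern like this breaks down quickly on contractions such as "don't", abbreviations, or URLs.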

In [11]:
string = u"I'm flying to London."
doc = nlp(string)
for token in doc:
    print(token)

I
'm
flying
to
London
.


In [10]:
string = u"I have presentation about NLP."
doc = nlp(string)
[token.text for token in doc]

['I', 'have', 'presentation', 'about', 'NLP', '.']

# Lemmatization
A lemma is the base form of a token. You can think of it as the form in which the token would appear if it were listed in a dictionary. For example, the lemma for the token “flying” is “fly.” Lemmatization is the process of reducing word forms to their lemma. The following script provides a simple example of how to do lemmatization with spaCy:

In [15]:
string = u"this product integrates both libraries for downloading and applying patches"
doc = nlp(string)
for token in doc:
    print(token.text, "\t\t", token.lemma_)

this 		 this
product 		 product
integrates 		 integrate
both 		 both
libraries 		 library
for 		 for
downloading 		 download
and 		 and
applying 		 apply
patches 		 patch


# Part-of-speech tagging
#### A part-of-speech tag tells you the part of speech (noun, verb, and so on) of a given word in a given sentence.
#### Part-of-speech tags can include detailed information about a token. In the case of verbs, they might tell you the following features: tense (past, present, or future), aspect (simple, progressive, or perfect), person (1st, 2nd, or 3rd), and number (singular or plural).

In [17]:
string = "I have flown to LA, now flying to London."
doc = nlp(string)
for token in doc:
    print(token.text, "\t", token.pos_)

I 	 PRON
have 	 AUX
flown 	 VERB
to 	 ADP
LA 	 PROPN
, 	 PUNCT
now 	 ADV
flying 	 VERB
to 	 ADP
London 	 PROPN
. 	 PUNCT


In [19]:
string = "I have flown to LA, now flying to London."
doc = nlp(string)
for token in doc:
    print(token.text, "\t", token.tag_)

I 	 PRP
have 	 VBP
flown 	 VBN
to 	 IN
LA 	 NNP
, 	 ,
now 	 RB
flying 	 VBG
to 	 IN
London 	 NNP
. 	 .


![alt text for screen reader](./img5.png)

# So, how can part-of-speech tags help us?


In [21]:
string = u'I have flown to LA. Now I am flying to Frisco.'
doc = nlp(string)
[token.text for token in doc if token.tag_ == "VBG" or token.tag_ == "VB"]

['flying']

# Syntactic relations
Now let’s combine the proper nouns with the verb that the part-of-speech tagger selected earlier. Recall that the list of verbs you could potentially use to identify the intent of the discourse contains only the verb “flying” in the second sentence. How can you get the verb/proper noun pair that best describes the intent behind the discourse? A human would obviously compose the verb/proper noun pairs from words found in the same sentence. Because the verb “flown” in the first sentence doesn’t meet the condition specified (remember that only infinitive and present progressive forms meet the condition), you’d be able to compose such a pair for the second sentence only: “flying, Frisco.”

![alt text for screen reader](./img6.png)
# Here are some common dependency labels
![alt text for screen reader](./img7.png)

In [23]:
string = "I have flown to LA. Now I am flying to Frisco."
doc = nlp(string)
for token in doc:
    print(token.text, "\t", token.pos_, "\t", token.dep_)

I 	 PRON 	 nsubj
have 	 AUX 	 aux
flown 	 VERB 	 ROOT
to 	 ADP 	 prep
LA 	 PROPN 	 pobj
. 	 PUNCT 	 punct
Now 	 ADV 	 advmod
I 	 PRON 	 nsubj
am 	 AUX 	 aux
flying 	 VERB 	 ROOT
to 	 ADP 	 prep
Frisco 	 PROPN 	 pobj
. 	 PUNCT 	 punct


#### But what it doesn’t show you is how words are related to each other in a sentence by means of the commonly called dependency arcs explained at the beginning of this section. To look at the dependency arcs in the sample discourse, replace the loop in the preceding script with the following one:

In [26]:
string = "I have flown to LA. Now I am flying to Frisco."
doc = nlp(string)
for token in doc:
    print(token.text, "\t", token.head.text  , "\t", token.pos_, "\t", token.dep_)

I 	 flown 	 PRON 	 nsubj
have 	 flown 	 AUX 	 aux
flown 	 flown 	 VERB 	 ROOT
to 	 flown 	 ADP 	 prep
LA 	 to 	 PROPN 	 pobj
. 	 flown 	 PUNCT 	 punct
Now 	 flying 	 ADV 	 advmod
I 	 flying 	 PRON 	 nsubj
am 	 flying 	 AUX 	 aux
flying 	 flying 	 VERB 	 ROOT
to 	 flying 	 ADP 	 prep
Frisco 	 to 	 PROPN 	 pobj
. 	 flying 	 PUNCT 	 punct
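
The head column is enough to rebuild the dependency tree. The sketch below does this for the second sentence using the (token, head) pairs copied from the output above, so it runs without a model. It keys on token text only because every token in that sentence is unique; with real spaCy tokens you would key on `token.i` instead. The helper `show` is our own.

```python
from collections import defaultdict

# (token, head) pairs for the second sentence, copied from the output above.
# Every token text is unique in this sentence, so text can serve as a key;
# with real spaCy tokens you would key on token.i instead.
rows = [
    ("Now", "flying"),
    ("I", "flying"),
    ("am", "flying"),
    ("flying", "flying"),
    ("to", "flying"),
    ("Frisco", "to"),
    (".", "flying"),
]

children = defaultdict(list)
root = None
for text, head in rows:
    if text == head:
        root = text  # the root is the only token that is its own head
    else:
        children[head].append(text)


def show(node, depth=0):
    """Print the tree rooted at node, indenting each level."""
    print("  " * depth + node)
    for child in children.get(node, []):
        show(child, depth + 1)


show(root)
```

Each arc in the dependency diagram corresponds to one (head, child) entry in `children`.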


In [27]:
string = "I have flown to LA, Now I am flying to Frisco."
doc = nlp(string)
for token in doc:
    print(token.text, "\t", token.head.text  , "\t", token.pos_, "\t", token.dep_)

I 	 flown 	 PRON 	 nsubj
have 	 flown 	 AUX 	 aux
flown 	 flying 	 VERB 	 ccomp
to 	 flown 	 ADP 	 prep
LA 	 to 	 PROPN 	 pobj
, 	 flying 	 PUNCT 	 punct
Now 	 flying 	 ADV 	 advmod
I 	 flying 	 PRON 	 nsubj
am 	 flying 	 AUX 	 aux
flying 	 flying 	 VERB 	 ROOT
to 	 flying 	 ADP 	 prep
Frisco 	 to 	 PROPN 	 pobj
. 	 flying 	 PUNCT 	 punct


## Example: return the root and the object of the preposition of each sentence

##### You can traverse the sentences of a doc with doc.sents

In [30]:
string = u'I have flown to LA. Now I am flying to Frisco.'
doc = nlp(string)
for sent in doc.sents:
    print([token.text for token in sent if token.dep_ == "ROOT" or token.dep_ == "pobj"])

['flown', 'LA']
['flying', 'Frisco']
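
Combining this with the earlier tag filter gives the verb/object pair that describes the intent. The following sketch works on hand-copied rows rather than a live `doc`, so it runs without a model: the dependency labels come from the output above, while the fine-grained tags are assumed to match the earlier tagging output. The names `INTENT_TAGS` and `intent_pairs` are our own.

```python
# One list per sentence of (text, fine-grained tag, dependency label).
# Dependency labels come from the output above; the tags for "flown",
# "LA" and "Frisco" are assumed to match the earlier tagging output.
sentences = [
    [("flown", "VBN", "ROOT"), ("LA", "NNP", "pobj")],
    [("flying", "VBG", "ROOT"), ("Frisco", "NNP", "pobj")],
]

INTENT_TAGS = {"VB", "VBG"}  # base form and present participle only


def intent_pairs(tagged_sentences):
    """Return (verb, object) pairs for sentences whose root verb qualifies."""
    pairs = []
    for sentence in tagged_sentences:
        verbs = [t for t, tag, dep in sentence if dep == "ROOT" and tag in INTENT_TAGS]
        objects = [t for t, tag, dep in sentence if dep == "pobj"]
        if verbs and objects:
            pairs.append((verbs[0], objects[0]))
    return pairs


print(intent_pairs(sentences))
```

Only the second sentence survives the filter, which gives the pair "flying, Frisco" discussed earlier.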


# Named Entity Recognition

In [36]:
string = u'I have flown to LA. Now I am flying to Frisco.'
doc = nlp(string)
for token in doc:
    print(token.text, "\t", token.ent_type)

I 	 0
have 	 0
flown 	 0
to 	 0
LA 	 384
. 	 0
Now 	 0
I 	 0
am 	 0
flying 	 0
to 	 0
Frisco 	 383
. 	 0


# You can also see what kind of entity the word is.

In [39]:
string = u'I have flown to LA. Now I am flying to London.'
doc = nlp(string)
for token in doc:
    print(token.text, "\t", token.ent_type_)

I 	 
have 	 
flown 	 
to 	 
LA 	 GPE
. 	 
Now 	 
I 	 
am 	 
flying 	 
to 	 
London 	 GPE
. 	 


# GPE stands for geopolitical entity

# Find the left children of a word

In [63]:
string = u"I want a green apple"
doc = nlp(string)
print(len(doc), " - counting the words")
for token in doc:
    print(list(token.lefts), "\t\t", token.text)

5  - counting the words
[] 		 I
[I] 		 want
[] 		 a
[] 		 green
[a, green] 		 apple


In [64]:
# If you want the left children of a single word ...
[w for w in doc[4].lefts]

[a, green]

In [65]:
string = "A severe storm hit the beach. It started to rain."
doc = nlp(string)
for token in doc:
    print(token.text, "\t", list(token.children))

A 	 []
severe 	 []
storm 	 [A, severe]
hit 	 [storm, beach, .]
the 	 []
beach 	 [the]
. 	 []
It 	 []
started 	 [It, rain, .]
to 	 []
rain 	 [to]
. 	 []


# You can also use enumerate(...)

In [66]:
for i, sent in enumerate(doc.sents):
    # Reset the counter so that each sentence gets its own count
    counter = 0
    for token in sent:
        if token.pos_ == "VERB":
            counter += 1
    print(f"sentence {i} has {counter} VERB token(s)")

sentence 0 has 1 VERB token(s)
sentence 1 has 2 VERB token(s)


# Use token.i to get a token's index in the doc

In [67]:
for token in doc:
    print(token.text, "\t", token.i)

A 	 0
severe 	 1
storm 	 2
hit 	 3
the 	 4
beach 	 5
. 	 6
It 	 7
started 	 8
to 	 9
rain 	 10
. 	 11


# EXTRACTING AND USING LINGUISTIC FEATURES
### With part-of-speech tags
### Suppose we want to extract the price from a sentence

In [75]:
string = "The firm earned $1.5 million in 2017."
#  let’s extract the coarse-grained part-of-speech features from the tokens
doc = nlp(string)
for token in doc:
    print(token.text, "\t", token.pos_, "\t",  spacy.explain(token.pos_))

The 	 DET 	 determiner
firm 	 NOUN 	 noun
earned 	 VERB 	 verb
$ 	 SYM 	 symbol
1.5 	 NUM 	 numeral
million 	 NUM 	 numeral
in 	 ADP 	 adposition
2017 	 NUM 	 numeral
. 	 PUNCT 	 punctuation


In [60]:
# fine-grained part-of-speech tags 
for token in doc:
    print(token.text, "\t", token.tag_, "\t",  spacy.explain(token.tag_))

The 	 DT 	 determiner
firm 	 NN 	 noun, singular or mass
earned 	 VBD 	 verb, past tense
$ 	 $ 	 symbol, currency
1.5 	 CD 	 cardinal number
million 	 CD 	 cardinal number
in 	 IN 	 conjunction, subordinating or preposition
2017 	 CD 	 cardinal number
. 	 . 	 punctuation mark, sentence closer


# What is the difference between fine-grained and coarse-grained tags?
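
One way to see the difference: the coarse-grained tag (`token.pos_`) names the broad word class, while the fine-grained tag (`token.tag_`) adds detail such as tense or number. The sketch below groups the tag pairs observed in the outputs shown earlier by their coarse tag. It is only a summary of those outputs, not a complete tag table.

```python
from collections import defaultdict

# (coarse tag, fine-grained tag) pairs observed in the outputs shown earlier.
observed = [
    ("PRON", "PRP"), ("AUX", "VBP"), ("VERB", "VBN"), ("VERB", "VBG"),
    ("VERB", "VBD"), ("ADP", "IN"), ("PROPN", "NNP"), ("ADV", "RB"),
    ("DET", "DT"), ("NOUN", "NN"), ("SYM", "$"), ("NUM", "CD"),
    ("PUNCT", ","), ("PUNCT", "."),
]

# Collect every fine-grained tag seen under each coarse-grained tag.
fine_by_coarse = defaultdict(set)
for coarse, fine in observed:
    fine_by_coarse[coarse].add(fine)

for coarse in sorted(fine_by_coarse):
    print(coarse, "->", sorted(fine_by_coarse[coarse]))
```

Several fine-grained tags (VBD, VBG, VBN) collapse into the single coarse tag VERB, which is why the fine-grained tag is the one to check when tense matters.
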
### Now let's extract the money description

In [71]:
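doc = nlp(u"The firm earned $1.5 million in 2017")
phrase = ""
for token in doc:
    if token.tag_ == "$":
        # Start the phrase at the currency symbol
        phrase = token.text + " "
        i = token.i + 1
        # Append every cardinal number that directly follows it,
        # stopping at the end of the doc
        while i < len(doc) and doc[i].tag_ == "CD":
            phrase += doc[i].text + " "
            i += 1
print(phrase)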

$ 1.5 million 


# Turning statement into question
Suppose your NLP application must be able to generate a question from a submitted statement. For example, one way chatbots maintain conversations with the user is by asking the user a confirmatory question. When a user says, “I am sure,” the chatbot might ask something like, “Are you really sure?” To do this, the chatbot must be able to generate a relevant question.

Let’s say the user’s submitted sentence is this:

#### I can promise it is worth your time.

#### Can you suggest how we can turn it into "Can you really promise it is worth my time?"

# First, let's see part-of-speech labels

In [76]:
doc = nlp(u"I can promise it is worth your time.")
for token in doc:
    print(token.text, "\t", token.pos_, "\t", token.tag_,  "\t",  spacy.explain(token.tag_))

I 	 PRON 	 PRP 	 pronoun, personal
can 	 AUX 	 MD 	 verb, modal auxiliary
promise 	 VERB 	 VB 	 verb, base form
it 	 PRON 	 PRP 	 pronoun, personal
is 	 AUX 	 VBZ 	 verb, 3rd person singular present
worth 	 ADJ 	 JJ 	 adjective (English), other noun-modifier (Chinese)
your 	 PRON 	 PRP$ 	 pronoun, possessive
time 	 NOUN 	 NN 	 noun, singular or mass
. 	 PUNCT 	 . 	 punctuation mark, sentence closer


# Second, we should replace "I" with "you" because the chatbot is the one asking the question
In other words, a chatbot refers to itself as “I” or “me,” and it refers to a user as “you.”

The following steps outline what we need to do to generate a question from the original statement:

   - Change the order of words in the original sentence from “subject + modal auxiliary verb + infinitive verb” to “modal auxiliary verb + subject + infinitive verb.”
   - Replace the personal pronoun “I” (the sentence’s subject) with “you.”
   - Replace the possessive pronoun “your” with “my.”
   - Place the adverbial modifier “really” before the verb “promise” to emphasize the latter.
   - Replace the punctuation mark “.” with “?” at the end of the sentence.

![](./img8.png)

# Third, let's dive into code ;)

In [87]:
doc = nlp(u"I can promise it is worth your time.")
sent = ''
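for index, token in enumerate(doc):
    # Look for "pronoun + modal + base-form verb", staying inside the doc
    if index + 2 < len(doc) and token.tag_ == "PRP" and doc[index + 1].tag_ == "MD" and doc[index + 2].tag_ == "VB":
        # Swap the subject and the modal, then append the rest of the sentence
        sent = f"{doc[index + 1].text} {doc[index].text} {doc[index + 2:].text}"
        break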

doc = nlp(sent) 
doc

can I promise it is worth your time.

In [88]:
for i,token in enumerate(doc):
    if token.tag_ == 'PRP' and token.text == 'I':
        sent = doc[:i].text + ' you ' + doc[i+1:].text
        break

doc = nlp(sent)
doc

can you promise it is worth your time.

In [89]:
for i,token in enumerate(doc):
    if token.tag_ == 'PRP$' and token.text == 'your':
        sent = doc[:i].text + ' my ' + doc[i+1:].text
        break

doc = nlp(sent)
doc

can you promise it is worth my time.

In [90]:
for i,token in enumerate(doc):
    if token.tag_ == 'VB':
        sent = doc[:i].text + ' really ' + doc[i:].text
        break
doc = nlp(sent)
doc

can you really promise it is worth my time.

In [91]:
sent = doc[:len(doc)-1].text + '?'
print(sent)

can you really promise it is worth my time?
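
The cells above can also be folded into a single function. The sketch below is one possible way to do that, working on the (text, tag) pairs copied from the tagging output shown earlier so that it runs without a model. The function name `to_question` and the `swaps` table are our own, and replacing every "I" and "your" regardless of position is a simplification.

```python
# (text, fine-grained tag) pairs copied from the tagging output shown earlier.
tagged = [
    ("I", "PRP"), ("can", "MD"), ("promise", "VB"), ("it", "PRP"),
    ("is", "VBZ"), ("worth", "JJ"), ("your", "PRP$"), ("time", "NN"),
    (".", "."),
]


def to_question(tokens):
    """Turn 'subject + modal + verb ...' into a confirmatory question."""
    words = [text for text, _ in tokens]
    tags = [tag for _, tag in tokens]

    # Find the first "pronoun, modal, base verb" run; give up if absent.
    for i in range(len(tags) - 2):
        if tags[i : i + 3] == ["PRP", "MD", "VB"]:
            break
    else:
        return None

    # Swap subject and modal, then add "really" before the verb.
    reordered = words[:i] + [words[i + 1], words[i], "really"] + words[i + 2 :]
    swaps = {"I": "you", "your": "my"}
    reordered = [swaps.get(w, w) for w in reordered]

    # Drop the final period, join the words, and close with a question mark.
    if reordered[-1] == ".":
        reordered.pop()
    sentence = " ".join(reordered) + "?"
    return sentence[0].upper() + sentence[1:]


print(to_question(tagged))
```

Unlike the step-by-step cells, this version also capitalizes the first word, which gives "Can you really promise it is worth my time?".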


# That is not the whole story :(
Part-of-speech tags are a powerful tool for smart text processing. But in practice, you might need to know more about a sentence’s tokens to process it intelligently.

For example, you might need to know whether a personal pronoun is the subject of a sentence or a grammatical object. Sometimes, this task is easy. The personal pronouns “I,” “he,” “she,” “they,” and “we” will almost always be the subject. When used as an object, “I” turns into “me,” as in “A postman brought me a letter.”

# Look at this example
#### "I know you. You know me."
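Here we cannot rely on part-of-speech tags alone to build the question, because “you” gets the same tag whether it is the subject or the object. We have to use dependency labels.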

In [94]:
doc = nlp(u'I can promise it is worth your time.')
for token in doc:
    print(token.text, "\t", token.pos_, "\t", token.tag_, "\t", token.dep_, "\t", spacy.explain(token.dep_))

I 	 PRON 	 PRP 	 nsubj 	 nominal subject
can 	 AUX 	 MD 	 aux 	 auxiliary
promise 	 VERB 	 VB 	 ROOT 	 root
it 	 PRON 	 PRP 	 nsubj 	 nominal subject
is 	 AUX 	 VBZ 	 ccomp 	 clausal complement
worth 	 ADJ 	 JJ 	 acomp 	 adjectival complement
your 	 PRON 	 PRP$ 	 poss 	 possession modifier
time 	 NOUN 	 NN 	 npadvmod 	 noun phrase as adverbial modifier
. 	 PUNCT 	 . 	 punct 	 punctuation


#### Combining part-of-speech tags and dependency labels can give you a better picture of the grammatical role of each token in a sentence
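
For "I know you. You know me.", the dependency label tells us which replacement a pronoun needs. The sketch below illustrates the idea on hand-annotated tokens: the labels are written by hand in spaCy's naming style (`nsubj` for subject, `dobj` for direct object) and are an assumption, not model output. The names `SWAPS` and `swap_pronouns` are our own.

```python
# (text, dependency label) pairs for "I know you. You know me."
# The labels are hand-written assumptions in spaCy's naming style
# (nsubj = subject, dobj = direct object), not output from a model.
annotated = [
    ("I", "nsubj"), ("know", "ROOT"), ("you", "dobj"), (".", "punct"),
    ("You", "nsubj"), ("know", "ROOT"), ("me", "dobj"), (".", "punct"),
]

# The replacement depends on both the word and its grammatical role.
SWAPS = {
    ("i", "nsubj"): "you",
    ("you", "nsubj"): "I",
    ("you", "dobj"): "me",
    ("me", "dobj"): "you",
}


def swap_pronouns(tokens):
    """Swap first and second person, choosing the case from the role."""
    return [SWAPS.get((text.lower(), dep), text) for text, dep in tokens]


print(swap_pronouns(annotated))
```

The word "you" is replaced by "me" when it is an object and by "I" when it is a subject, a choice that the part-of-speech tag alone cannot make.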

# Word similarity

#### Look at the example
![](./img9.png)
![](./img10.png)
##### spaCy’s small models (those whose names end in `sm`) don’t include word vectors. You can still use the similarity method with these models to compare tokens, spans, and documents, but the results won’t be as accurate.
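
When word vectors are available, spaCy's default is to represent a span or document by the average of its token vectors and to compare two such vectors with cosine similarity. The sketch below reproduces that idea in plain Python. The three-dimensional vectors are invented for illustration only, and the helper names `mean_vector` and `cosine` are our own.

```python
import math

# Invented 3-dimensional vectors, for illustration only;
# real word vectors typically have hundreds of dimensions.
toy_vectors = {
    "green": [0.2, 0.7, 0.1],
    "apple": [0.1, 0.9, 0.3],
    "orange": [0.2, 0.8, 0.4],
    "book": [0.9, 0.1, 0.2],
}


def mean_vector(words):
    """Average the vectors of the given words, component by component."""
    rows = [toy_vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Represent the phrase "green apple" by the mean of its word vectors.
fruit_phrase = mean_vector(["green", "apple"])
print(cosine(fruit_phrase, toy_vectors["orange"]))
print(cosine(fruit_phrase, toy_vectors["book"]))
```

With these made-up numbers the phrase scores higher against "orange" than against "book", which is the behavior real vectors are meant to capture.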

# Let's see how to measure similarity

In [95]:
doc = nlp("I want a green apple.")
doc.similarity(doc[2:5])

0.5759161577864177

#### The warning says that we are using a small model, so the similarity scores are not reliable (they are not based on real word vectors)
#### We can download fastText vectors from its website, which provide 1+ million word vectors
#### After downloading and converting the vectors, we have to use spacy.load(...) to load the model
- [Download vectors from fastText](https://fasttext.cc/docs/en/english-vectors.html)
- python -m spacy init-model en /tmp/en_vectors_wiki_lg --vectors-loc wiki-news-300d-1M.vec
- nlp = spacy.load('/tmp/en_vectors_wiki_lg')

In [96]:
doc.similarity(doc)

1.0

In [97]:
doc[2:5].similarity(doc[2:5])

1.0

In [99]:
apple = nlp("apple")
orange = nlp("orange")
apple.similarity(orange)

0.35815912418710183

# Another example

In [100]:
token = nlp(u'fruits')[0]
doc = nlp(u'I want to buy this beautiful book at the end of the week. Sales of citrus have increased over the last year. How much do you know about this type of tree?')
for sent in doc.sents:
    print(sent.text)
    print('similarity to', token.text, 'is', token.similarity(sent),'\n')

I want to buy this beautiful book at the end of the week.
similarity to fruits is 0.009302210994064808 

Sales of citrus have increased over the last year.
similarity to fruits is 0.12056183815002441 

How much do you know about this type of tree?
similarity to fruits is 0.006569376215338707 


# OK, suggest a way to extract the entities from these texts and compare them to each other
“Google Search, often referred to as simply Google, is the most used search engine nowadays. It handles a huge number of searches each day.”

“Microsoft Windows is a family of proprietary operating systems developed and sold by Microsoft. The company also produces a wide range of other software for desktops and servers.”

“Titicaca is a large, deep, mountain lake in the Andes. It is known as the highest navigable lake in the world.”

In [103]:
#first sample text
doc1 = nlp(u'Google Search, often referred to as simply Google, is the most used search engine nowadays. It handles a huge number of searches each day.')

#second sample text
doc2 = nlp(u'Microsoft Windows is a family of proprietary operating systems developed and sold by Microsoft. The company also produces a wide range of other software for desktops and servers.')

#third sample text
doc3 = nlp(u'Titicaca is a large, deep, mountain lake in the Andes. It is known as the highest navigable lake in the world.')

docs = [doc1,doc2,doc3]
spans = {}

for j, doc in enumerate(docs):
    # Keep only the tokens that belong to a named entity
    named_entity_span = [token.text for token in doc if token.ent_type != 0]
    print(named_entity_span)
    # Re-parse the entity tokens as their own doc so it can be compared
    spans[j] = nlp(' '.join(named_entity_span))

['Google', 'Search', 'Google']
['Microsoft', 'Windows', 'Microsoft']
['Titicaca', 'Andes']


In [104]:
print('doc1 is similar to doc2:',spans[0].similarity(spans[1]))
print('doc1 is similar to doc3:',spans[0].similarity(spans[2]))
print('doc2 is similar to doc3:',spans[1].similarity(spans[2]))

doc1 is similar to doc2: 0.8342535026823839
doc1 is similar to doc3: 0.7576086055423097
doc2 is similar to doc3: 0.7173840097389196


# VISUALIZATIONS
Perhaps the simplest way to discover insights in data is to represent that data graphically. Visualizations allow you to immediately identify patterns within your data.

# displaCy Dependency Visualizer
The displaCy dependency visualizer generates a syntactic dependency visualization for a submitted text. To use its interactive demo, navigate to the [displaCy website](https://explosion.ai/demos/displacy/). Replace the sample sentence in the “Text to parse” text box with your text, and then click the search icon (magnifying glass) at the right of the box to generate a visualization.

# Visualizing Dependency Parsing

In [5]:
from spacy import displacy

In [6]:
doc = nlp(u'I want a Greek pizza.')
displacy.serve(doc, style='dep')




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [7]:
string = u"""
Microsoft Windows is a family of proprietary operating systems developed and
sold by Microsoft."""

doc = nlp(string)
displacy.serve(doc, style='ent')



Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


# Text To Speech

Use pyttsx3 to convert text to speech.
Run the command below to install __pyttsx3__:

- pip install pyttsx3

In [8]:
import pyttsx3 as tts

### Initialize the engine

In [9]:
engine = tts.init()

### Get rate

In [10]:
engine.getProperty("rate")

200

### Get volume

In [11]:
engine.getProperty("volume")

1.0

### Get voices

In [12]:
for voice in engine.getProperty("voices"):
    print(voice)
    print("------")

<Voice id=HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\TTS_MS_EN-US_DAVID_11.0
          name=Microsoft David Desktop - English (United States)
          languages=[]
          gender=None
          age=None>
------
<Voice id=HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\TTS_MS_EN-US_ZIRA_11.0
          name=Microsoft Zira Desktop - English (United States)
          languages=[]
          gender=None
          age=None>
------


### Set properties

In [13]:
engine.setProperty('rate', 150)
engine.setProperty('volume', 0.8)

voice = engine.getProperty("voices")[0].id
engine.setProperty('voice', voice)
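
Picking a voice by list position is fragile because the order can differ between machines. A small helper that selects a voice by name is one alternative. The sketch below uses stand-in objects so that it runs without pyttsx3 installed: the names mirror the output above, the ids are shortened placeholders, and `pick_voice` is our own helper, not part of pyttsx3.

```python
from types import SimpleNamespace


def pick_voice(voices, keyword):
    """Return the id of the first voice whose name contains keyword, else None."""
    for voice in voices:
        if keyword.lower() in voice.name.lower():
            return voice.id
    return None


# Stand-ins for the objects returned by engine.getProperty("voices");
# the names mirror the output above and the ids are shortened placeholders.
voices = [
    SimpleNamespace(id="TTS_MS_EN-US_DAVID_11.0",
                    name="Microsoft David Desktop - English (United States)"),
    SimpleNamespace(id="TTS_MS_EN-US_ZIRA_11.0",
                    name="Microsoft Zira Desktop - English (United States)"),
]

print(pick_voice(voices, "zira"))
```

With a real engine you would pass `engine.getProperty("voices")` as the first argument and call `engine.setProperty('voice', ...)` only when the result is not `None`.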


### Now write and speak

In [14]:
string = u"""
Microsoft Windows is a family of proprietary operating systems developed and
sold by Microsoft. Bill Gates announced Microsoft Windows on November 10,
1983. Microsoft first released Windows for sale on November 20, 1985. Windows
1.0 was initially sold for $100.00, and its sales surpassed 500,000 copies in
April 1987. For comparison, more than a million copies of Windows 95 were sold
in just the first 4 days.
"""

engine.say(string)
engine.runAndWait()
engine.stop()

### Save to a file

In [13]:
engine.save_to_file(string, "test.mp3")
engine.runAndWait()