## Natural Language Processing Techniques using spaCy 

This notebook walks through common NLP techniques using spaCy, a Python library.

### Contents 

1. About spaCy  
2. Installation  
3. Dataset Preparation for spaCy  
4. Tokenization - Word and Sentences  
5. Text Cleaning - Stopwords, Punctuations, Lemmatization  
6. Part of Speech Tagging  
7. Entity Extraction  
8. Noun Phrase Chunking  
9. Dependency Parsing  
10. Word Vector Notations  

### 1. About spaCy 

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. The key features of spaCy are:

- Non-destructive tokenization
- Named entity recognition
- Support for 34+ languages
- 13 statistical models for 8 languages
- Pre-trained word vectors
- Easy deep learning integration
- Part-of-speech tagging
- Labelled dependency parsing
- Syntax-driven sentence segmentation
- Built in visualizers for syntax and NER
- Convenient string-to-hash mapping
- Export to numpy data arrays
- Efficient binary serialization
- Easy model packaging and deployment
- State-of-the-art speed
- Robust, rigorously evaluated accuracy  

With so many features built in, spaCy is considered one of the most powerful NLP libraries.

### 2. Installation 

Installing spaCy takes two steps:

2.1 Install the spaCy package using pip

    pip install spacy

2.2 Download a pre-trained spaCy model (here, the small English model)

    python -m spacy download en_core_web_sm  
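
Before moving on, it can help to confirm that both installation steps succeeded. The sketch below is one possible check, assuming spaCy itself is importable; it uses `spacy.util.is_package`, which reports whether a model package is installed via pip. The variable names are illustrative only.

```python
import spacy
from spacy.util import is_package

# Name of the model downloaded in step 2.2
model_name = "en_core_web_sm"

# True if the model package is installed in the current environment
model_installed = is_package(model_name)

print("spaCy version:", spacy.__version__)
print(f"{model_name} installed:", model_installed)
```

If the second line prints `False`, re-running the download command from step 2.2 is usually enough.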
    
### 3. Dataset Preparation  

We first load the required libraries and prepare the data for spaCy. We will load the spaCy model (downloaded during the installation step) and create an object named "nlp".

In [1]:
import spacy

In [3]:
# !python -m spacy download en_core_web_sm

In [4]:
nlp = spacy.load('en_core_web_sm')

Consider a text document containing a movie plot obtained from Wikipedia.

In [5]:
doc = """Alice follows a large white rabbit down a "Rabbit-hole". She finds a tiny door. When she finds a bottle labeled "Drink me", she does, and shrinks, but not enough to pass through the door. She then eats something labeled "Eat me" and grows larger. She finds a fan when enables her to shrink enough to get into the "Garden" and try to get a "Dog" to play with her. She enters the "White Rabbit's tiny House," but suddenly resumes her normal size. In order to get out, she has to use the "magic fan."
She enters a kitchen, in which there is a cook and a woman holding a baby. She persuades the woman to give her the child and takes the infant outside after the cook starts throwing things around. The baby then turns into a pig and squirms out of her grip. "The Duchess's Cheshire Cat" appears and disappears a couple of times to Alice and directs her to the Mad Hatter's "Mad Tea-Party." After a while, she leaves.
The Queen invites Alice to join the "ROYAL PROCESSION": a parade of marching playing cards and others headed by the White Rabbit. When Alice "unintentionally offends the Queen", the latter summons the "Executioner". Alice "boxes the ears", then flees when all the playing cards come for her. Then she wakes up and realizes it was all a dream."""

Before using the features of spaCy, we need to convert the text into a spaCy document using the "nlp" object created above.

In [6]:
spacy_doc = nlp(doc)
spacy_doc

Alice follows a large white rabbit down a "Rabbit-hole". She finds a tiny door. When she finds a bottle labeled "Drink me", she does, and shrinks, but not enough to pass through the door. She then eats something labeled "Eat me" and grows larger. She finds a fan when enables her to shrink enough to get into the "Garden" and try to get a "Dog" to play with her. She enters the "White Rabbit's tiny House," but suddenly resumes her normal size. In order to get out, she has to use the "magic fan."
She enters a kitchen, in which there is a cook and a woman holding a baby. She persuades the woman to give her the child and takes the infant outside after the cook starts throwing things around. The baby then turns into a pig and squirms out of her grip. "The Duchess's Cheshire Cat" appears and disappears a couple of times to Alice and directs her to the Mad Hatter's "Mad Tea-Party." After a while, she leaves.
The Queen invites Alice to join the "ROYAL PROCESSION": a parade of marching playing ca

### 4. Tokenization 

First, we will see how to perform tokenization at the word level and the sentence level using spaCy.

**Word Level Tokenization**: To obtain the word tokens, we only need to iterate over the spaCy document, for example by converting it to a list. No extra call is needed because tokenization was already performed when the text was converted into a spaCy document.

In [7]:
tokens = list(spacy_doc)
print (tokens)

[Alice, follows, a, large, white, rabbit, down, a, ", Rabbit, -, hole, ", ., She, finds, a, tiny, door, ., When, she, finds, a, bottle, labeled, ", Drink, me, ", ,, she, does, ,, and, shrinks, ,, but, not, enough, to, pass, through, the, door, ., She, then, eats, something, labeled, ", Eat, me, ", and, grows, larger, ., She, finds, a, fan, when, enables, her, to, shrink, enough, to, get, into, the, ", Garden, ", and, try, to, get, a, ", Dog, ", to, play, with, her, ., She, enters, the, ", White, Rabbit, 's, tiny, House, ,, ", but, suddenly, resumes, her, normal, size, ., In, order, to, get, out, ,, she, has, to, use, the, ", magic, fan, ., ", 
, She, enters, a, kitchen, ,, in, which, there, is, a, cook, and, a, woman, holding, a, baby, ., She, persuades, the, woman, to, give, her, the, child, and, takes, the, infant, outside, after, the, cook, starts, throwing, things, around, ., The, baby, then, turns, into, a, pig, and, squirms, out, of, her, grip, ., ", The, Duchess, 's, Cheshire, C
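
The feature list above mentions non-destructive tokenization. The sketch below illustrates what that means: every token keeps its character offset and trailing whitespace, so the original text can be rebuilt exactly. To stay runnable without the downloaded model, it assumes spaCy v3 and uses a blank English pipeline (`spacy.blank("en")`), which contains only the tokenizer; the sample sentence is the first sentence of the plot.

```python
import spacy

# A blank pipeline contains only the tokenizer (no trained components)
nlp = spacy.blank("en")

text = 'Alice follows a large white rabbit down a "Rabbit-hole".'
doc = nlp(text)

# Each token records its index, character offset, and trailing whitespace
for token in doc:
    print(token.i, token.idx, repr(token.text), repr(token.whitespace_))

# Joining every token with its trailing whitespace restores the input
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because the offsets are preserved, `text[token.idx : token.idx + len(token)]` gives back the token's own text, which is useful when results must be mapped to positions in the raw document.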

Similarly, for **sentence level tokenization**, we can access the sentences using the following syntax, which returns a generator object.

    spacy_doc.sents


In [8]:
for index, sentence in enumerate(spacy_doc.sents):
    print (index, sentence)

0 Alice follows a large white rabbit down a "Rabbit-hole".
1 She finds a tiny door.
2 When she finds a bottle labeled "Drink me", she does, and shrinks, but not enough to pass through the door.
3 She then eats something labeled "Eat me" and grows larger.
4 She finds a fan when enables her to shrink enough to get into the "Garden" and try to get a "Dog" to play with her.
5 She enters the "White Rabbit's tiny House," but suddenly resumes her normal size.
6 In order to get out, she has to use the "magic fan."

7 She enters a kitchen, in which there is a cook and a woman holding a baby.
8 She persuades the woman to give her the child and takes the infant outside after the cook starts throwing things around.
9 The baby then turns into a pig and squirms out of her grip.
10 "
11 The Duchess's Cheshire Cat" appears and disappears a couple of times to Alice and directs her to the Mad Hatter's "Mad Tea-Party.
12 " After a while, she leaves.

13 The Queen invites Alice to join the "ROYAL PROCESSI
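
The sentence boundaries above come from the model's dependency parse. When only sentence splitting is needed, a rule-based alternative can be sketched as follows. This assumes spaCy v3, where the built-in "sentencizer" component is added by name; it splits on punctuation and does not require a downloaded model. It is shown as an alternative, not as the method used in this notebook.

```python
import spacy

# Blank English pipeline plus the rule-based sentence splitter
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = (
    "She finds a tiny door. "
    'She then eats something labeled "Eat me" and grows larger.'
)
doc = nlp(text)

# doc.sents is a generator; materialize it to index or count sentences
sentences = [sent.text for sent in doc.sents]
for index, sentence in enumerate(sentences):
    print(index, sentence)
```

The rule-based splitter is faster than the parser, but it relies only on punctuation and may behave differently from the parser on text with unusual punctuation.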

### 5. Text Cleaning 

In this section, we will see how to access the properties of spaCy tokens that can be used to remove noise from the text. The available properties of a token can be listed using the following syntax.

In [9]:
first_word = tokens[4]   # 'white'
print (dir(first_word))

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'morph', 'n_lefts', 

We can see that there are many properties of every token. Let's view some of these properties. 

In [10]:
type(first_word)

spacy.tokens.token.Token

In [11]:
print ("is_bracket: ", first_word.is_bracket)
print ("like_num: ", first_word.like_num)
print ("right_edge: ", first_word.right_edge)
print ("sentiment: ", first_word.sentiment)
print ("dep_: ", first_word.dep_)

is_bracket:  False
like_num:  False
right_edge:  white
sentiment:  0.0
dep_:  amod


Using these properties, we can clean the text data. The following cells show how to remove punctuation and stopword tokens and then lemmatize the remaining ones.

#### Removal of Punctuation 

property : is_punct

In [12]:
tokens = [t for t in tokens if not t.is_punct]

#### Removal of Stopwords 

property : is_stop

In [13]:
tokens = [t for t in tokens if not t.is_stop]

#### Token lemmatization

property : lemma_

In [14]:
lemmatized_words = [token.lemma_ for token in tokens]

Finally, we join these lemmatized words to produce a cleaned document. 

In [15]:
cleaned_doc = " ".join(lemmatized_words)
cleaned_doc

'Alice follow large white rabbit rabbit hole find tiny door find bottle label drink shrink pass door eat label eat grow large find fan enable shrink Garden try Dog play enter White Rabbit tiny house suddenly resume normal size order use magic fan \n enter kitchen cook woman hold baby persuade woman child take infant outside cook start throw thing baby turn pig squirm grip Duchess Cheshire Cat appear disappear couple time Alice direct Mad Hatter Mad Tea Party leave \n Queen invite Alice join ROYAL PROCESSION parade march playing card head White Rabbit Alice unintentionally offend Queen summon Executioner Alice box ear flee playing card come wake realize dream'
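
The cleaned text above still contains "\n" because newline tokens are neither punctuation nor stopwords. The sketch below combines the three cleaning steps into one pass and also drops whitespace tokens via "is_space". The helper name `clean_doc` is hypothetical. So that the example runs without the downloaded model, the demo document is built by hand with spaCy v3's `Doc` constructor and manually supplied lemmas; with the notebook's document, `clean_doc(spacy_doc)` would be the equivalent call.

```python
import spacy
from spacy.tokens import Doc


def clean_doc(doc):
    """Drop punctuation, stopwords and whitespace tokens, then join the lemmas."""
    kept = [
        token.lemma_
        for token in doc
        if not (token.is_punct or token.is_stop or token.is_space)
    ]
    return " ".join(kept)


# Hand-annotated demo document (assumption: lemmas typed in manually,
# since a blank pipeline has no lemmatizer)
nlp = spacy.blank("en")
words = ["She", "finds", "a", "tiny", "door", ".", "\n"]
lemmas = ["she", "find", "a", "tiny", "door", ".", "\n"]
demo_doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

cleaned = clean_doc(demo_doc)
print(cleaned)
```

Filtering in a single comprehension avoids building intermediate token lists, and the extra "is_space" check keeps stray line breaks out of the cleaned text.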

### 6. Part of Speech Tagging 

Next, we look at how to obtain the part-of-speech tag associated with every token.

In [9]:
for sent in spacy_doc.sents:
    for token in sent:
        print ("Token: ", token, "POS Tag: ", token.pos_)
    break

Token:  Alice POS Tag:  PROPN
Token:  follows POS Tag:  VERB
Token:  a POS Tag:  DET
Token:  large POS Tag:  ADJ
Token:  white POS Tag:  ADJ
Token:  rabbit POS Tag:  NOUN
Token:  down POS Tag:  ADP
Token:  a POS Tag:  DET
Token:  " POS Tag:  PUNCT
Token:  Rabbit POS Tag:  NOUN
Token:  - POS Tag:  PUNCT
Token:  hole POS Tag:  NOUN
Token:  " POS Tag:  PUNCT
Token:  . POS Tag:  PUNCT
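
The tag abbreviations in the output can be hard to read at first. A small sketch of how to decode them with `spacy.explain`, which looks a label up in spaCy's built-in glossary and needs no model; the list of tags is simply the set that appears in the output above.

```python
import spacy

# POS tags seen in the first sentence of the plot
tags = ["PROPN", "VERB", "DET", "ADJ", "NOUN", "ADP", "PUNCT"]

# Map each tag to its human-readable description
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(f"{tag:<6} {description}")
```

The same function also accepts entity labels and dependency labels, so it can be reused in the following sections.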


### 7. Entity Extraction 

spaCy can also extract named entities using the model's named entity recognizer. Each entity is a span of tokens with a label such as PERSON or ORG. Let's see how.

In [17]:
for ent in spacy_doc.ents:
    if ent.text.strip():
        print ("Entity:", ent.text, "(Label:", ent.label_, ")")

Entity: Garden (Label: WORK_OF_ART )
Entity: the "White Rabbit's (Label: LAW )
Entity: House (Label: ORG )
Entity: The Duchess's Cheshire Cat (Label: WORK_OF_ART )
Entity: Alice (Label: PERSON )
Entity: the Mad Hatter's (Label: LAW )
Entity: Mad Tea-Party (Label: ORG )
Entity: Queen (Label: PERSON )
Entity: Alice (Label: PERSON )
Entity: the White Rabbit (Label: ORG )
Entity: Alice (Label: PERSON )
Entity: Executioner (Label: WORK_OF_ART )


spaCy also provides a display rendering tool, displaCy, to visualize these entities and their labels. For example:

In [18]:
spacy.displacy.render(spacy_doc, style='ent', jupyter=True)

### 8. Noun Chunking 

Similar to entity extraction, spaCy makes it easy to extract noun chunks, that is, flat phrases that have a noun as their head. 

In [40]:
for idx, sentence in enumerate(spacy_doc.sents):
    for noun in sentence.noun_chunks:
        print(f"sentence {idx+1}", noun)

sentence 1 Alice
sentence 1 a large white rabbit
sentence 1 a "Rabbit-hole
sentence 2 She
sentence 2 a tiny door
sentence 3 she
sentence 3 a bottle
sentence 3 me
sentence 3 she
sentence 3 shrinks
sentence 3 the door
sentence 4 She
sentence 4 something
sentence 4 me
sentence 5 She
sentence 5 a fan
sentence 5 her
sentence 5 the "Garden
sentence 5 a "Dog
sentence 5 her
sentence 6 She
sentence 6 the "White Rabbit's tiny House
sentence 6 her normal size
sentence 7 order
sentence 7 she
sentence 7 the "magic fan
sentence 9 She
sentence 9 a kitchen
sentence 9 a cook
sentence 9 a woman
sentence 9 a baby
sentence 10 She
sentence 10 the woman
sentence 10 her
sentence 10 the child
sentence 10 the infant
sentence 10 the cook
sentence 10 things
sentence 11 The baby
sentence 11 a pig
sentence 11 her grip
sentence 13 The Duchess's Cheshire Cat
sentence 13 a couple
sentence 13 times
sentence 13 Alice
sentence 13 her
sentence 13 the Mad Hatter's "Mad Tea-Party
sentence 15 a while
sentence 15 she
sentenc

### 9. Dependency Parsing

Finally, we look at how to obtain the dependency grammar relations in a sentence. We will use the "dep_" property of each token for this purpose.

In [19]:
for token in list(spacy_doc.sents)[0]:
    print ("Token: ", token.text, "(", token.dep_, ")")

Token:  Alice ( nsubj )
Token:  follows ( ROOT )
Token:  a ( det )
Token:  large ( amod )
Token:  white ( amod )
Token:  rabbit ( dobj )
Token:  down ( prep )
Token:  a ( det )
Token:  " ( punct )
Token:  Rabbit ( compound )
Token:  - ( punct )
Token:  hole ( pobj )
Token:  " ( punct )
Token:  . ( punct )
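
Besides its label, each token also knows which token it attaches to ("head") and which tokens attach to it ("children"); both appear in the property list printed earlier. The sketch below shows how these can be navigated. To stay runnable without the downloaded model, it assumes spaCy v3 and builds a shortened, hand-annotated version of the first sentence; with the notebook's document the parser fills in the same attributes.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse (assumption: heads are absolute token indices,
# and the ROOT token points to itself)
words = ["Alice", "follows", "a", "rabbit"]
heads = [1, 1, 3, 1]
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Print every token with its relation and the token it attaches to
for token in doc:
    print(f"{token.text:<8} --{token.dep_}--> {token.head.text}")

# The root of the sentence and its direct dependents
root = next(token for token in doc if token.dep_ == "ROOT")
root_children = [child.text for child in root.children]
print("root:", root.text, "children:", root_children)
```

Walking from the root through "children" is a common way to pull out subject-verb-object triples from a parsed sentence.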


Let's visualize the dependency tree as well.

In [20]:
options = {'compact': True, 'bg': '#09a3d5', 'color': 'white', 'font': 'Trebuchet MS'}
spacy.displacy.render(list(spacy_doc.sents)[0], jupyter=True, style='dep', options=options)

### 10. Vector Notations 

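Apart from these features, spaCy also provides a vector representation of every token, which can be used in machine learning tasks.

For example, we can obtain the vector for the word "king". Because "nlp" returns a document rather than a single token, the "vector" property below is the vector of a one-word document.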

In [10]:
token = nlp(u'king')
token.vector

array([ 2.9139571e+00, -1.0016742e+00, -2.0298631e+00,  4.4045860e-01,
       -1.0259339e+00, -1.2732615e+00,  2.3712342e+00,  2.1475315e-02,
       -3.2159476e+00,  3.0835030e+00,  8.9685857e-01, -1.1318004e+00,
       -2.7188659e-04,  9.1702992e-01,  1.1167799e+00, -2.8955281e-01,
       -5.2588457e-01,  1.3412244e+00, -2.5638943e+00,  3.1032321e+00,
       -7.6287359e-02, -7.2169220e-01,  1.8055359e+00, -5.2969289e-01,
       -9.8598915e-01, -9.1488719e-01,  4.5880005e-01, -1.2740700e+00,
        1.5132586e+00, -1.8060904e+00, -1.7065288e+00, -1.3216652e+00,
        4.8007736e+00, -1.3426158e-01, -9.9779117e-01,  4.3230295e+00,
        3.6356062e-01, -1.6250321e+00, -1.2807019e+00, -1.2762814e+00,
        1.4773902e+00,  5.6884813e+00,  1.1288882e+00, -8.6327398e-01,
        7.0138943e-01, -3.8181849e+00,  3.5870168e-01,  4.8530350e+00,
       -1.0350798e+00, -3.5739625e-01, -4.5156498e+00, -8.2374805e-01,
        2.0293598e+00, -3.8062379e-01, -1.6465067e+00,  1.8852234e+00,
      

Let's also print the vector for the word "man".

In [11]:
token2 = nlp(u'man')
token2.vector

array([ 0.7088207 , -0.37977344, -0.39255697,  1.5627415 , -2.867544  ,
        1.7395227 ,  1.2475766 ,  0.30651206,  1.6436899 ,  3.5380862 ,
        0.19831753, -1.7071538 ,  0.5145009 ,  0.69034517, -0.14347413,
        1.3977473 ,  1.4867508 ,  0.06117857, -2.7757516 ,  1.8932886 ,
        1.6243601 , -0.16163284, -0.17025247, -1.3997722 , -0.15662324,
       -0.79690444, -0.9151953 , -1.0312278 ,  3.9805896 , -1.37117   ,
       -0.91109014, -0.10652265,  1.6504273 , -1.8971703 ,  1.5930436 ,
        1.9742754 ,  0.21027347, -0.28525767, -1.2498834 , -0.3620946 ,
        3.277402  ,  3.273175  , -0.64561605, -0.86335456,  2.0143979 ,
       -2.1200714 ,  1.1332449 ,  1.0214756 , -2.935066  , -1.0889438 ,
       -2.7789528 , -0.67337126,  3.032012  , -1.5588952 , -1.6748934 ,
        0.5104773 ,  0.7180389 , -0.9961107 , -1.346349  , -0.5172305 ,
       -0.25415683, -1.8575006 ,  2.0072165 ,  0.17103535,  1.9901856 ,
        1.7427328 ,  1.8114712 , -0.72808653, -0.6118781 , -0.80

In [12]:
len(token2.vector)

96

We can use these vectors to measure the similarity of two documents. For example:

In [13]:
token.similarity(token2)


0.6450098646874819

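The similarity score of the two documents is about 0.645, where 1 means the two vectors point in the same direction (by default spaCy computes cosine similarity). Note that small models such as en_core_web_sm do not ship with static word vectors, so this score is derived from context-sensitive tensors and is less reliable than a score from a larger model package such as en_core_web_md.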
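
To make the similarity score less of a black box, here is a minimal sketch of cosine similarity, the measure spaCy's "similarity" method uses by default. The function name and the toy vectors are illustrative assumptions; only the Python standard library is used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for zero vectors; return 0.0 by convention
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors: same direction, orthogonal, and opposite
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Applied to the two vectors from the cells above, `cosine_similarity(token.vector, token2.vector)` should give approximately the same value as `token.similarity(token2)`, up to floating-point rounding.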

## End Notes 

This notebook explains the basics of NLP techniques applied using spaCy. These techniques can be used to generate text-based features, generate word vectors, compute document similarity, improve topic models, improve machine learning and deep learning models, and build knowledge graphs. 