<a href="https://colab.research.google.com/github/pulkitmehtawork/ML_Practice/blob/master/NER_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Named Entity Recognition and POS tagging using spaCy 

This notebook covers basic text preprocessing, part-of-speech tagging, and named entity recognition using spaCy, a Python library.

### Contents 

1. About spaCy  
2. Installation  
3. Dataset Preparation for spaCy  
4. Tokenization - Word and Sentences  
5. Text Cleaning - Stopwords, Punctuations, Lemmatization  
6. Part of Speech Tagging  
7. Entity Extraction  
8. Noun Phrase Chunking  

### 1. About spaCy 

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. The key features of spaCy are:

- Non-destructive tokenization
- Named entity recognition
- Support for 34+ languages
- 13 statistical models for 8 languages
- Pre-trained word vectors
- Easy deep learning integration
- Part-of-speech tagging
- Labelled dependency parsing
- Syntax-driven sentence segmentation
- Built in visualizers for syntax and NER
- Convenient string-to-hash mapping
- Export to numpy data arrays
- Efficient binary serialization
- Easy model packaging and deployment
- State-of-the-art speed
- Robust, rigorously evaluated accuracy  

With so many built-in features, spaCy is considered one of the most powerful NLP libraries.
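As a small illustration of two of the features listed above (string-to-hash mapping and export to NumPy arrays), here is a minimal sketch. It assumes a blank English pipeline created with `spacy.blank("en")`, which contains only the tokenizer, so it should run without downloading a model. The sample sentence is made up for illustration.

```python
import spacy
from spacy.attrs import ORTH, IS_PUNCT

# Tokenizer-only pipeline; no pre-trained model is needed for this sketch.
nlp = spacy.blank("en")
doc = nlp("Alice follows a white rabbit.")

# spaCy stores each string once and refers to it by a 64-bit hash.
rabbit = doc[4]
rabbit_hash = nlp.vocab.strings["rabbit"]
print(rabbit.orth == rabbit_hash)       # the token stores the hash of its text
print(nlp.vocab.strings[rabbit_hash])   # the hash maps back to the string

# Export selected token attributes as a NumPy array (one row per token).
array = doc.to_array([ORTH, IS_PUNCT])
print(array.shape)
```

Each row of `array` corresponds to one token, and each column to one of the requested attributes.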

### 2. Installation 

Installing spaCy takes two steps:

2.1 Install the spaCy package using pip

    pip install spacy

2.2 Download a pre-trained spaCy model

    python -m spacy download en_core_web_sm  
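After both steps, a quick check from Python can confirm that the library and the model are available. This is a minimal sketch that relies on `spacy.util.is_package`, which reports whether a model package is installed in the current environment. `MODEL_NAME` is just a local variable introduced for this example.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"

print("spaCy version:", spacy.__version__)

# True if the model was installed as a Python package, False otherwise.
model_installed = is_package(MODEL_NAME)
print(f"{MODEL_NAME} installed:", model_installed)

if not model_installed:
    print(f"Run: python -m spacy download {MODEL_NAME}")
```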
    
### 3. Dataset Preparation  

We first load the required libraries and prepare the data for use with spaCy. We load the spaCy model (downloaded during the installation step) and create an object named `nlp`.

In [1]:
!pip install spacy



In [3]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [0]:
import spacy
nlp = spacy.load('en_core_web_sm')

Consider a text document containing a movie plot summary obtained from Wikipedia.

In [0]:
doc = """Alice follows a large white rabbit down a "Rabbit-hole". She finds a tiny door. When she finds a bottle labeled "Drink me", she does, and shrinks, but not enough to pass through the door. She then eats something labeled "Eat me" and grows larger. She finds a fan when enables her to shrink enough to get into the "Garden" and try to get a "Dog" to play with her. She enters the "White Rabbit's tiny House," but suddenly resumes her normal size. In order to get out, she has to use the "magic fan."
She enters a kitchen, in which there is a cook and a woman holding a baby. She persuades the woman to give her the child and takes the infant outside after the cook starts throwing things around. The baby then turns into a pig and squirms out of her grip. "The Duchess's Cheshire Cat" appears and disappears a couple of times to Alice and directs her to the Mad Hatter's "Mad Tea-Party." After a while, she leaves.
The Queen invites Alice to join the "ROYAL PROCESSION": a parade of marching playing cards and others headed by the White Rabbit. When Alice "unintentionally offends the Queen", the latter summons the "Executioner". Alice "boxes the ears", then flees when all the playing cards come for her. Then she wakes up and realizes it was all a dream."""

Before using spaCy's features, we need to convert the text into a spaCy document (a `Doc` object) using the `nlp` object created above.

In [6]:
spacy_doc = nlp(doc)
spacy_doc

Alice follows a large white rabbit down a "Rabbit-hole". She finds a tiny door. When she finds a bottle labeled "Drink me", she does, and shrinks, but not enough to pass through the door. She then eats something labeled "Eat me" and grows larger. She finds a fan when enables her to shrink enough to get into the "Garden" and try to get a "Dog" to play with her. She enters the "White Rabbit's tiny House," but suddenly resumes her normal size. In order to get out, she has to use the "magic fan."
She enters a kitchen, in which there is a cook and a woman holding a baby. She persuades the woman to give her the child and takes the infant outside after the cook starts throwing things around. The baby then turns into a pig and squirms out of her grip. "The Duchess's Cheshire Cat" appears and disappears a couple of times to Alice and directs her to the Mad Hatter's "Mad Tea-Party." After a while, she leaves.
The Queen invites Alice to join the "ROYAL PROCESSION": a parade of marching playing ca

### 4. Tokenization 

First, we will see how to perform tokenization at the word level and the sentence level using spaCy.

**Word-level tokenization**: To obtain word tokens, we only need to iterate over the spaCy document, for example by converting it to a list. No extra step is required because tokenization was already performed when the text was converted into a spaCy document.

In [7]:
tokens = list(spacy_doc)
print (tokens)

[Alice, follows, a, large, white, rabbit, down, a, ", Rabbit, -, hole, ", ., She, finds, a, tiny, door, ., When, she, finds, a, bottle, labeled, ", Drink, me, ", ,, she, does, ,, and, shrinks, ,, but, not, enough, to, pass, through, the, door, ., She, then, eats, something, labeled, ", Eat, me, ", and, grows, larger, ., She, finds, a, fan, when, enables, her, to, shrink, enough, to, get, into, the, ", Garden, ", and, try, to, get, a, ", Dog, ", to, play, with, her, ., She, enters, the, ", White, Rabbit, 's, tiny, House, ,, ", but, suddenly, resumes, her, normal, size, ., In, order, to, get, out, ,, she, has, to, use, the, ", magic, fan, ., ", 
, She, enters, a, kitchen, ,, in, which, there, is, a, cook, and, a, woman, holding, a, baby, ., She, persuades, the, woman, to, give, her, the, child, and, takes, the, infant, outside, after, the, cook, starts, throwing, things, around, ., The, baby, then, turns, into, a, pig, and, squirms, out, of, her, grip, ., ", The, Duchess, 's, Cheshire, C
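Tokenization in spaCy is non-destructive: each token keeps its character offset and trailing whitespace, so the original text can be rebuilt exactly. The sketch below illustrates this on the first sentence of the plot. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that it runs without the model download.

```python
import spacy

nlp = spacy.blank("en")
text = 'Alice follows a large white rabbit down a "Rabbit-hole".'
doc = nlp(text)

for token in doc:
    # token.idx is the character offset of the token in the original text;
    # token.whitespace_ is the trailing space (if any) that followed it.
    print(token.i, repr(token.text), token.idx, repr(token.whitespace_))

# Joining every token with its trailing whitespace restores the input.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```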

Similarly, for **sentence-level tokenization**, we can access the sentences using the following syntax, which returns a generator object.

    spacy_doc.sents


In [8]:
for index, sentence in enumerate(spacy_doc.sents):
    print (index, sentence)

0 Alice follows a large white rabbit down a "Rabbit-hole".
1 She finds a tiny door.
2 When she finds a bottle labeled "Drink me", she does, and shrinks, but not enough to pass through the door.
3 She then eats something labeled "Eat me" and grows larger.
4 She finds a fan when enables her to shrink enough to get into the "Garden" and try to get a "Dog" to play with her.
5 She enters the "White Rabbit's tiny House," but suddenly resumes her normal size.
6 In order to get out, she has to use the "magic fan.
7 "

8 She enters a kitchen, in which there is a cook and a woman holding a baby.
9 She persuades the woman to give her the child and takes the infant outside after the cook starts throwing things around.
10 The baby then turns into a pig and squirms out of her grip.
11 "
12 The Duchess's Cheshire Cat" appears and disappears a couple of times to Alice and directs her to the Mad Hatter's "Mad Tea-Party.
13 "
14 After a while, she leaves.

15 The Queen invites Alice to join the "ROYAL P

### 5. Text Cleaning 

In this section, we will see how to access the properties of spaCy tokens that can be used to remove noise from the text. The available properties of a token can be listed as follows.

In [9]:
first_word = tokens[0]
print (dir(first_word))

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'n_lefts', 'n_rights', 'nbor', 'nor

We can see that there are many properties of every token. Let's view some of these properties. 

In [10]:
print ("is_bracket: ", first_word.is_bracket)
print ("like_num: ", first_word.like_num)
print ("right_edge: ", first_word.right_edge)
print ("sentiment: ", first_word.sentiment)
print ("dep_: ", first_word.dep_)

is_bracket:  False
like_num:  False
right_edge:  Alice
sentiment:  0.0
dep_:  nsubj
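Looking at one token in isolation hides how these flags differ across tokens. The sketch below prints a few lexical flags side by side for a short made-up sentence. It assumes a blank English pipeline (`spacy.blank("en")`), which is enough for lexical attributes such as `is_stop` and `like_num`, but not for model-based ones such as `dep_`.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The Queen summons 3 guards.")

# Header row, then one row of flags per token.
print(f"{'token':<10}{'is_alpha':<10}{'is_stop':<10}{'is_punct':<10}{'like_num':<10}")
for token in doc:
    print(
        f"{token.text:<10}{token.is_alpha!s:<10}{token.is_stop!s:<10}"
        f"{token.is_punct!s:<10}{token.like_num!s:<10}"
    )
```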


Using these properties, we can clean the text data. For example, the following cells show how to remove punctuation and stopword tokens and lemmatize the remaining ones.

#### Removal of Punctuation

property : is_punct

In [0]:
tokens = [t for t in tokens if not t.is_punct]

In [12]:
print(tokens)

[Alice, follows, a, large, white, rabbit, down, a, Rabbit, hole, She, finds, a, tiny, door, When, she, finds, a, bottle, labeled, Drink, me, she, does, and, shrinks, but, not, enough, to, pass, through, the, door, She, then, eats, something, labeled, Eat, me, and, grows, larger, She, finds, a, fan, when, enables, her, to, shrink, enough, to, get, into, the, Garden, and, try, to, get, a, Dog, to, play, with, her, She, enters, the, White, Rabbit, 's, tiny, House, but, suddenly, resumes, her, normal, size, In, order, to, get, out, she, has, to, use, the, magic, fan, 
, She, enters, a, kitchen, in, which, there, is, a, cook, and, a, woman, holding, a, baby, She, persuades, the, woman, to, give, her, the, child, and, takes, the, infant, outside, after, the, cook, starts, throwing, things, around, The, baby, then, turns, into, a, pig, and, squirms, out, of, her, grip, The, Duchess, 's, Cheshire, Cat, appears, and, disappears, a, couple, of, times, to, Alice, and, directs, her, to, the, Mad, 

#### Removal of Stopwords 

property : is_stop

In [0]:
tokens = [t for t in tokens if not t.is_stop]

In [14]:
print(tokens)

[Alice, follows, large, white, rabbit, Rabbit, hole, finds, tiny, door, finds, bottle, labeled, Drink, shrinks, pass, door, eats, labeled, Eat, grows, larger, finds, fan, enables, shrink, Garden, try, Dog, play, enters, White, Rabbit, tiny, House, suddenly, resumes, normal, size, order, use, magic, fan, 
, enters, kitchen, cook, woman, holding, baby, persuades, woman, child, takes, infant, outside, cook, starts, throwing, things, baby, turns, pig, squirms, grip, Duchess, Cheshire, Cat, appears, disappears, couple, times, Alice, directs, Mad, Hatter, Mad, Tea, Party, leaves, 
, Queen, invites, Alice, join, ROYAL, PROCESSION, parade, marching, playing, cards, headed, White, Rabbit, Alice, unintentionally, offends, Queen, summons, Executioner, Alice, boxes, ears, flees, playing, cards, come, wakes, realizes, dream]


#### Token lemmatization

property : lemma_

In [0]:
lemmatized_words = [token.lemma_ for token in tokens]

In [16]:
print(lemmatized_words)

['Alice', 'follow', 'large', 'white', 'rabbit', 'rabbit', 'hole', 'find', 'tiny', 'door', 'find', 'bottle', 'label', 'drink', 'shrink', 'pass', 'door', 'eat', 'label', 'eat', 'grow', 'large', 'find', 'fan', 'enable', 'shrink', 'Garden', 'try', 'Dog', 'play', 'enter', 'White', 'Rabbit', 'tiny', 'House', 'suddenly', 'resume', 'normal', 'size', 'order', 'use', 'magic', 'fan', '\n', 'enter', 'kitchen', 'cook', 'woman', 'hold', 'baby', 'persuade', 'woman', 'child', 'take', 'infant', 'outside', 'cook', 'start', 'throw', 'thing', 'baby', 'turn', 'pig', 'squirm', 'grip', 'Duchess', 'Cheshire', 'Cat', 'appear', 'disappear', 'couple', 'time', 'Alice', 'direct', 'Mad', 'Hatter', 'mad', 'Tea', 'Party', 'leave', '\n', 'Queen', 'invite', 'Alice', 'join', 'ROYAL', 'procession', 'parade', 'march', 'playing', 'card', 'head', 'White', 'Rabbit', 'Alice', 'unintentionally', 'offend', 'queen', 'summon', 'Executioner', 'Alice', 'box', 'ear', 'flee', 'playing', 'card', 'come', 'wake', 'realize', 'dream']


Finally, we join these lemmatized words to produce a cleaned document. 

In [17]:
cleaned_doc = " ".join(lemmatized_words)
cleaned_doc

'Alice follow large white rabbit rabbit hole find tiny door find bottle label drink shrink pass door eat label eat grow large find fan enable shrink Garden try Dog play enter White Rabbit tiny House suddenly resume normal size order use magic fan \n enter kitchen cook woman hold baby persuade woman child take infant outside cook start throw thing baby turn pig squirm grip Duchess Cheshire Cat appear disappear couple time Alice direct Mad Hatter mad Tea Party leave \n Queen invite Alice join ROYAL procession parade march playing card head White Rabbit Alice unintentionally offend queen summon Executioner Alice box ear flee playing card come wake realize dream'
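The cleaned output above still contains `\n` entries, because newline tokens are neither punctuation nor stopwords. One way to handle this is to also filter on `is_space`. The sketch below wraps the cleaning steps in a single helper. `clean_tokens` and its `use_lemma` parameter are hypothetical names introduced here, and the demo uses a blank English pipeline (tokenizer only), so it falls back to the raw token text instead of lemmas.

```python
import spacy


def clean_tokens(doc, use_lemma=True):
    """Return the text of tokens that are not punctuation, whitespace, or stopwords."""
    kept = []
    for token in doc:
        if token.is_punct or token.is_space or token.is_stop:
            continue
        # Fall back to the raw text when no lemma is available (e.g. blank pipeline).
        kept.append(token.lemma_ if use_lemma and token.lemma_ else token.text)
    return kept


nlp = spacy.blank("en")
sample = nlp("She finds a tiny door.\nThe Queen invites Alice.")
cleaned = clean_tokens(sample, use_lemma=False)
print(" ".join(cleaned))
```

With a loaded model such as `en_core_web_sm`, calling `clean_tokens(spacy_doc)` would return lemmas, as in the cells above, but without the newline entries.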

### 6. Part of Speech Tagging 

Next, we look at how to obtain the part-of-speech tag associated with every token.

In [18]:
for sent in spacy_doc.sents:
    for token in sent:
        print ("Token: ", token, "POS Tag: ", token.pos_)
    break

Token:  Alice POS Tag:  PROPN
Token:  follows POS Tag:  VERB
Token:  a POS Tag:  DET
Token:  large POS Tag:  ADJ
Token:  white POS Tag:  ADJ
Token:  rabbit POS Tag:  NOUN
Token:  down POS Tag:  ADP
Token:  a POS Tag:  DET
Token:  " POS Tag:  PUNCT
Token:  Rabbit POS Tag:  NOUN
Token:  - POS Tag:  PUNCT
Token:  hole POS Tag:  PROPN
Token:  " POS Tag:  PUNCT
Token:  . POS Tag:  PUNCT
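The tag names in the output are abbreviations. spaCy's `spacy.explain` function returns a short description for a tag or label, which helps when reading results like these. A minimal sketch, using the tags that appear in the output above:

```python
import spacy

tags = ["PROPN", "VERB", "DET", "ADJ", "NOUN", "ADP", "PUNCT"]

for tag in tags:
    # spacy.explain looks the tag up in spaCy's built-in glossary.
    print(f"{tag:<6} {spacy.explain(tag)}")
```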


### 7. Entity Extraction 

spaCy's pre-trained pipeline also includes a named entity recognizer. The entities it detects are available through the document's `ents` property. Let's see how:

In [19]:
for ent in spacy_doc.ents:
    if ent.text.strip():
        print ("Entity:", ent.text, "(Label:", ent.label_, ")")

Entity: the "Garden" (Label: FAC )
Entity: a "Dog (Label: WORK_OF_ART )
Entity: the "White Rabbit's (Label: FAC )
Entity: House (Label: ORG )
Entity: The Duchess's Cheshire Cat (Label: WORK_OF_ART )
Entity: Alice (Label: PERSON )
Entity: the Mad Hatter's (Label: ORG )
Entity: Mad Tea-Party (Label: WORK_OF_ART )
Entity: Queen (Label: PERSON )
Entity: Alice (Label: PERSON )
Entity: the White Rabbit (Label: ORG )
Entity: Alice (Label: PERSON )
Entity: Queen (Label: PERSON )
Entity: Executioner (Label: WORK_OF_ART )
Entity: Alice (Label: PERSON )
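Once entities are extracted, it is often convenient to group them by label. The sketch below shows one way to do that. So that it runs without a downloaded model, it assumes a blank English pipeline and assigns three entities by hand (mirroring labels from the output above) as stand-ins for model predictions. With a loaded model, the same loop would work on `spacy_doc.ents`.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The Queen invites Alice to join a parade headed by the White Rabbit.")

# Hand-assigned entities (assumption: stand-ins for what a trained model predicts).
doc.ents = [
    Span(doc, 1, 2, label="PERSON"),   # Queen
    Span(doc, 3, 4, label="PERSON"),   # Alice
    Span(doc, 10, 13, label="ORG"),    # the White Rabbit
]

# Collect entity texts under their label.
entities_by_label = defaultdict(list)
for ent in doc.ents:
    entities_by_label[ent.label_].append(ent.text)

for label, texts in entities_by_label.items():
    print(label, texts)
```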


spaCy also provides a rendering tool, displaCy, to visualize these entities and their labels. For example:

In [20]:
spacy.displacy.render(spacy_doc, style='ent', jupyter=True)
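Outside a notebook, displaCy can return the visualization as an HTML string instead of displaying it. The sketch below uses displaCy's manual mode, where entities are given as character offsets in a dictionary, so no model is needed. The sentence and its two entities are made up for illustration.

```python
from spacy import displacy

text = "The Queen invites Alice to join the parade."


def char_span(word, label):
    """Build a displaCy entity entry from the first occurrence of `word`."""
    start = text.index(word)
    return {"start": start, "end": start + len(word), "label": label}


example = {
    "text": text,
    # Entities must be listed in order of their start offset.
    "ents": [char_span("Queen", "PERSON"), char_span("Alice", "PERSON")],
    "title": None,
}

# jupyter=False forces render() to return the markup; page=True wraps it in a full page.
html = displacy.render(example, style="ent", manual=True, page=True, jupyter=False)
print(len(html), "characters of HTML")
```

The returned string can be written to an `.html` file and opened in a browser.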

### 8. Noun Chunking 

Similar to entity extraction, noun chunks (base noun phrases) can be extracted easily with spaCy.

In [21]:
for idx, sentence in enumerate(spacy_doc.sents):
    for noun in sentence.noun_chunks:
        print(f"sentence {idx+1}, noun chunk '{noun}'")

sentence 1, noun chunk 'Alice'
sentence 1, noun chunk 'a large white rabbit'
sentence 1, noun chunk 'a "Rabbit-hole'
sentence 2, noun chunk 'She'
sentence 2, noun chunk 'a tiny door'
sentence 3, noun chunk 'she'
sentence 3, noun chunk 'a bottle'
sentence 3, noun chunk 'me'
sentence 3, noun chunk 'she'
sentence 3, noun chunk 'shrinks'
sentence 3, noun chunk 'the door'
sentence 4, noun chunk 'She'
sentence 4, noun chunk 'something'
sentence 4, noun chunk 'me'
sentence 5, noun chunk 'She'
sentence 5, noun chunk 'a fan'
sentence 5, noun chunk 'her'
sentence 5, noun chunk 'the "Garden'
sentence 5, noun chunk 'a "Dog'
sentence 5, noun chunk 'her'
sentence 6, noun chunk 'She'
sentence 6, noun chunk 'the "White Rabbit's tiny House'
sentence 6, noun chunk 'her normal size'
sentence 7, noun chunk 'order'
sentence 7, noun chunk 'she'
sentence 7, noun chunk 'the "magic fan'
sentence 9, noun chunk 'She'
sentence 9, noun chunk 'a kitchen'
sentence 9, noun chunk 'a cook'
sentence 9, noun chunk 'a wom