# NLP - Session 4 - SpaCy Introduction for NLP | Linguistic Features Extraction

#### Getting Started with spaCy
This tutorial is a concise introduction to spaCy and the linguistic features it offers. We will perform several NLP tasks: tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and visualization using displaCy.

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It is designed specifically for production use and helps you build applications that process and understand large volumes of text. It is written in Cython and can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

#### Linguistic Features in spaCy
Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing.

spaCy is designed to handle this: you put in raw text and get back a `Doc` object that comes with a variety of linguistic annotations.

spaCy acts as a one-stop shop for common NLP tasks, such as tokenization, lemmatization, part-of-speech (POS) tagging, named entity recognition, dependency parsing, sentence segmentation, word-to-vector transformations, and other text cleaning and normalization methods.

![image-3.png](attachment:image-3.png)

![image.png](attachment:image.png)

#### Install the Required Packages

`conda install --name tensorflow20 -c conda-forge spacy`

`conda install --name tensorflow20 -c conda-forge spacy-model-en_core_web_sm`

`conda install --name tensorflow20 -c conda-forge spacy-lookups-data`

In [1]:
import spacy

`en_core_web_sm` is a small trained English model for spaCy.

In [2]:
nlp = spacy.load("en_core_web_sm")

## Tokenization
Tokenization is the task of splitting a text into meaningful segments called tokens. The input to the tokenizer is a Unicode string and the output is a `Doc` object.

A Doc is a sequence of Token objects. Each Doc consists of individual tokens, and we can iterate over them.

In [3]:
doc = nlp("Apple is looking at buying U.K startup for $1 billion")

In [4]:
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K
startup
for
$
1
billion


Now look at how spaCy tokenizes `"isn't"` into `"is"` and `"n't"`. The tokenizer applies language-specific exception rules for contractions, so the negation becomes its own token.

In [5]:
doc = nlp("Apple isn't looking at buying U.K startup for $1 billion")

In [6]:
for token in doc:
    print(token.text)

Apple
is
n't
looking
at
buying
U.K
startup
for
$
1
billion
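The tokenizer can also be tried without a trained model. The sketch below assumes a blank English pipeline (`spacy.blank("en")`, which contains only the rule-based tokenizer) and the variable names `nlp_blank` and `rebuilt`, which are not part of the original notebook. It prints each token's index and character offset and checks that the tokens, joined with their trailing whitespace, reproduce the input.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no trained model has to be downloaded for this sketch.
nlp_blank = spacy.blank("en")

text = "Apple isn't looking at buying U.K startup for $1 billion"
doc = nlp_blank(text)

# token.i is the position in the Doc, token.idx the character offset in the text
for token in doc:
    print(f"{token.i:<3} {token.idx:<4} {token.text!r}")

# Tokenization is non-destructive: joining each token with its trailing
# whitespace gives back the original string.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because the original text can always be rebuilt, annotations such as entities can be mapped back to character positions in the source string.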


## Lemmatization
Closely related to tokenization, lemmatization is the process of reducing a word to its base or dictionary form. This reduced form is called a lemma.

For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma.

Lemmatization is necessary because it helps to reduce the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.
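To see why analyzing inflected forms as a single item matters, here is a minimal sketch in plain Python. It does not use spaCy's lemmatizer: `LEMMA_TABLE` is a hypothetical, hand-written lookup used only for illustration.

```python
from collections import Counter

# Hypothetical, hand-written lookup table for illustration only;
# spaCy's lemmatizer is not used here.
LEMMA_TABLE = {
    "organizes": "organize",
    "organized": "organize",
    "organizing": "organize",
}

words = ["organizes", "organized", "organizing", "organize", "team"]

# Counting surface forms treats every inflection as a different item
surface_counts = Counter(words)

# Counting lemmas groups the inflected forms under one entry
lemma_counts = Counter(LEMMA_TABLE.get(word, word) for word in words)

print(surface_counts["organize"])  # 1
print(lemma_counts["organize"])    # 4
```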

In [7]:
doc

Apple isn't looking at buying U.K startup for $1 billion

In [8]:
for token in doc:
    print(token.text, token.lemma_)

Apple Apple
is be
n't not
looking look
at at
buying buy
U.K U.K
startup startup
for for
$ $
1 1
billion billion


Format the output into aligned columns for readability.

In [9]:
for token in doc:
    print(f"{token.text:{15}} {token.lemma_:{15}}")

Apple           Apple          
is              be             
n't             not            
looking         look           
at              at             
buying          buy            
U.K             U.K            
startup         startup        
for             for            
$               $              
1               1              
billion         billion        


## Part-of-speech tagging [POS]
Part of speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence.

In [10]:
for token in doc:
    print(f"{token.text:{15}} {token.lemma_:{15}} {token.pos_:{10}}")

Apple           Apple           PROPN     
is              be              AUX       
n't             not             PART      
looking         look            VERB      
at              at              ADP       
buying          buy             VERB      
U.K             U.K             PROPN     
startup         startup         NOUN      
for             for             ADP       
$               $               SYM       
1               1               NUM       
billion         billion         NUM       


## Stopwords
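Stop words are very common words, such as "is", "at", and "for", that are often filtered out before further analysis. In the output below, the last column is `token.is_stop`:

 - 0 : False
 - 1 : True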

In [11]:
for token in doc:
    print(f"{token.text:{15}} {token.lemma_:{15}} {token.pos_:{10}} {token.is_stop:{10}}")

Apple           Apple           PROPN               0
is              be              AUX                 1
n't             not             PART                1
looking         look            VERB                0
at              at              ADP                 1
buying          buy             VERB                0
U.K             U.K             PROPN               0
startup         startup         NOUN                0
for             for             ADP                 1
$               $               SYM                 0
1               1               NUM                 0
billion         billion         NUM                 0
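The `0` and `1` in the last column come from Python's string formatting, not from spaCy: a `bool` formatted with a width is rendered like an integer. The sketch below uses a plain `True` as a stand-in for `token.is_stop` and shows one way to print `True`/`False` instead; the variable names are illustrative.

```python
is_stop = True  # stand-in for token.is_stop, which is a plain Python bool

# With a width in the format spec, a bool is formatted like an int
as_int = f"{is_stop:{10}}"

# Converting to str first keeps the True/False text and pads on the right
as_text = f"{str(is_stop):{10}}"

print(repr(as_int))   # '         1'
print(repr(as_text))  # 'True      '
```

In the loop above, writing `{str(token.is_stop):{10}}` would therefore print `True` or `False` in that column.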


## Dependency Parsing
Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents. The head of a sentence has no dependency and is called the root of the sentence. The verb is usually the head of the sentence. All other words are linked to the headword.
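The head/dependent structure can be sketched in plain Python. The annotations below are written by hand for the fragment "Apple is looking" and are not parser output; the sketch follows the convention spaCy uses, where the root token's `head` is the token itself.

```python
# Hand-written annotations for "Apple is looking" (not parser output).
# heads[i] is the index of the token that token i depends on.
words = ["Apple", "is", "looking"]
deps = ["nsubj", "aux", "ROOT"]
heads = [2, 2, 2]

# The root is the token that is its own head
root = next(i for i, head in enumerate(heads) if head == i)

# Dependents of the root are all other tokens pointing at it
children = [words[i] for i, head in enumerate(heads) if head == root and i != root]

print(words[root], deps[root])  # looking ROOT
print(children)                 # ['Apple', 'is']
```

With a parsed `Doc`, the same information is available per token as `token.head`, `token.dep_`, and `token.children`.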

Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. To get the noun chunks in a document, iterate over `Doc.noun_chunks`.

In [12]:
for chunk in doc.noun_chunks:
    print(f"{chunk.text:{15}} {chunk.root.text:{15}} {chunk.root.dep_}")

Apple           Apple           nsubj
U.K startup     startup         dobj


## Named Entity Recognition
Named Entity Recognition (NER) is the process of locating named entities in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.

NER can be used, for example, to tag a set of documents and improve keyword search. Named entities are available as the `ents` property of a `Doc`.

In [13]:
for ent in doc.ents:
    print(f"{ent.text:{15}} {ent.label_:{15}}")

Apple           ORG            
U.K             ORG            
$1 billion      MONEY          
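In the output above, `U.K` is labelled `ORG`, whereas countries normally fall under spaCy's `GPE` (geopolitical entity) label. Statistical models make such mistakes, and entities can also be set by hand. The following is a minimal sketch, assuming a blank English pipeline (`nlp_blank`, not part of the original notebook) so that no trained NER component is involved.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only, no trained NER component
text = "Apple isn't looking at buying U.K startup for $1 billion"
doc = nlp_blank(text)

# Build a span from character offsets and give it the GPE label
start = text.index("U.K")
span = doc.char_span(start, start + len("U.K"), label="GPE")

# Assigning to doc.ents replaces the document's entities
doc.ents = [span]

for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```

`Doc.char_span` returns `None` when the offsets do not line up with token boundaries, so the result is worth checking before assigning it in real code.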


## Sentence Segmentation
Sentence segmentation is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. spaCy uses the dependency parse to determine sentence boundaries. In spaCy, the `sents` property is used to extract sentences.

In [14]:
for sent in doc.sents:
    print(sent)

Apple isn't looking at buying U.K startup for $1 billion


In [15]:
doc_1 = nlp("Welcome to the world. Thanks for being here. Please love and prosper")
doc_1

Welcome to the world. Thanks for being here. Please love and prosper

In [16]:
for sent in doc_1.sents:
    print(sent)

Welcome to the world.
Thanks for being here.
Please love and prosper
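Sentence boundaries do not have to come from the dependency parse. As a hedged alternative, spaCy also ships a rule-based `sentencizer` component that splits on punctuation. The sketch below assumes a blank English pipeline (`nlp_rules`, an illustrative name) and branches on the installed version, because the call for adding a built-in component differs between spaCy v2 and v3.

```python
import spacy

nlp_rules = spacy.blank("en")  # no parser in this pipeline

# The call for adding a built-in component differs between major versions
if spacy.__version__.startswith("2."):
    nlp_rules.add_pipe(nlp_rules.create_pipe("sentencizer"))
else:
    nlp_rules.add_pipe("sentencizer")

doc_rules = nlp_rules("Welcome to the world. Thanks for being here. Please love and prosper")
sentences = [sent.text for sent in doc_rules.sents]
print(sentences)
```

The rule-based component is faster than parsing, but it only looks at punctuation, so it can be less accurate on text where periods do not end sentences.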


spaCy segments sentences with its default pipeline behavior. You can define your own rules if the defaults don't work for your text.

As you can see below, spaCy does not split the following statement at the first ellipsis.

In [17]:
doc_1 = nlp("Welcome to the...world...Thanks for being here.")
doc_1

Welcome to the...world...Thanks for being here.

In [18]:
for sent in doc_1.sents:
    print(sent)

Welcome to the...world...
Thanks for being here.


#### Defining a custom rule

In [19]:
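def set_rule(doc):
    # Mark the token after every "..." as the start of a new sentence.
    # doc[:-1] skips the last token, so doc[token.i + 1] always exists.
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc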

In [20]:
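# Run the rule before the parser so the parser respects these boundaries.
# Passing the function object directly is the spaCy v2 call style.
nlp.add_pipe(set_rule, before="parser")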
doc_1 = nlp("Welcome to the...world...Thanks for being here.")

In [21]:
for sent in doc_1.sents:
    print(sent)

Welcome to the...
world...
Thanks for being here.


In [22]:
# Remove the component before re-running the cells above;
# otherwise add_pipe fails because the name "set_rule" is already in the pipeline
nlp.remove_pipe("set_rule")

('set_rule', <function __main__.set_rule(doc)>)
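The cells above pass the function object to `add_pipe`, which is the spaCy v2 call style. In spaCy v3, a custom component has to be registered under a name and added by that name. The sketch below is v3-only and uses a blank English pipeline (`nlp_v3`, an illustrative name) to stay self-contained; it lists the tokens flagged as sentence starts rather than iterating over `doc.sents`.

```python
import spacy
from spacy.language import Language

# spaCy v3 only: register the function under a component name
@Language.component("set_rule")
def set_rule(doc):
    # Mark the token after every "..." as the start of a new sentence
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

# Blank pipeline to keep the sketch self-contained; with the trained model
# the call would be nlp.add_pipe("set_rule", before="parser").
nlp_v3 = spacy.blank("en")
nlp_v3.add_pipe("set_rule")

doc_v3 = nlp_v3("Welcome to the...world...Thanks for being here.")
starts = [token.text for token in doc_v3 if token.is_sent_start]
print(starts)
```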

## Visualization
SpaCy comes with a built-in visualizer called `displaCy`. We can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

You can pass a Doc or a list of Doc objects to displaCy and run `displacy.serve` to run the web server, or `displacy.render` to generate the raw markup.
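Outside a notebook, the raw markup can be captured as a string. The sketch below uses displaCy's manual mode, where the input is a hand-written dictionary instead of a `Doc`, so it needs no trained model; the example text and the variable names are illustrative.

```python
from spacy import displacy

# Hand-written entity data in displaCy's "manual" input format;
# start and end are character offsets into the text.
example = {
    "text": "Apple is looking at buying a startup",
    "ents": [{"start": 0, "end": 5, "label": "ORG"}],
    "title": None,
}

# jupyter=False makes render return the markup instead of displaying it
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(len(html) > 0)
```

The returned string can be written to an `.html` file or embedded in a web page.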

#### Visualizing the dependency parse
The dependency visualizer, `dep`, shows part-of-speech tags and syntactic dependencies.
The argument `options` lets you specify a dictionary of settings to customize the layout.

In [23]:
from spacy import displacy

In [24]:
doc

Apple isn't looking at buying U.K startup for $1 billion

In [25]:
displacy.render(doc, style="dep")

In [26]:
displacy.render(doc, style="dep", options={"compact": True, "distance": 100})

#### Visualizing the entity recognizer
The entity visualizer, `ent`, highlights named entities and their labels in a text.

In [27]:
displacy.render(doc, style="ent")