## Introduction to spaCy

The spaCy library is one of the most popular NLP libraries, along with NLTK. The basic difference between the two is that NLTK offers a wide variety of algorithms for a given problem, whereas spaCy typically ships a single, carefully chosen implementation for each task.

NLTK was released back in 2001, while spaCy is relatively new and was first released in 2015. In this series of articles on NLP, we will mostly be dealing with spaCy, owing to its state-of-the-art nature. However, we will also touch on NLTK when a task is easier to perform with NLTK than with spaCy.

In [1]:
!pip install -U spacy

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.2.4)


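Once you have installed spaCy, the next step is to download a language model. We will be using the English language model. The language model is used to perform a variety of NLP tasks, which we will see in a later section. Note that the en shortcut used below works with spaCy 2.x (the version installed above); from spaCy 3 onward the shortcut is no longer supported and the full model name, en_core_web_sm, must be used instead.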

In [2]:
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


## Basic Functionality

In [0]:
import spacy
sp = spacy.load('en_core_web_sm')

In [4]:
sp

<spacy.lang.en.English at 0x7f587c4d8a20>

Let's now create a small document using this model. A document can be a sentence or a group of sentences and can have unlimited length. The following script creates a simple spaCy document.

In [0]:
sentence = sp(u'Manchester United is looking to sign a forward for $90 million')

In [6]:
sentence

Manchester United is looking to sign a forward for $90 million

spaCy automatically breaks a document into tokens when the document is created using the model.

A token is an individual unit of the text, such as a word, a punctuation mark, or a number. Let's see what tokens we have in our document:

In [7]:
for word in sentence:
    print(word.text)

Manchester
United
is
looking
to
sign
a
forward
for
$
90
million


These are the tokens in our document. We can also see the part of speech of each token using the .pos_ attribute, as shown below:

In [8]:
for word in sentence:
    print(word.text,  word.pos_)

Manchester PROPN
United PROPN
is AUX
looking VERB
to PART
sign VERB
a DET
forward NOUN
for ADP
$ SYM
90 NUM
million NUM


Finally, in addition to the parts of speech, we can also see the dependencies.

In [0]:
sentence2 = sp(u"Manchester United isn't looking to sign any forward.")

In [10]:
for word in sentence2:
    print(word.text,  word.pos_, word.dep_)

Manchester PROPN compound
United PROPN nsubj
is AUX aux
n't PART neg
looking VERB ROOT
to PART aux
sign VERB xcomp
any DET advmod
forward ADV advmod
. PUNCT punct


From the output, you can see that spaCy is able to find the dependencies between the tokens. For instance, the sentence contains the word isn't. The tokenizer has split it into two tokens, is and n't, and the dependency parser has labeled n't as neg, that is, a negation modifier.
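To make the labels concrete, here is a minimal sketch that hard-codes the (text, POS, dependency) triples printed above and picks out the root of the sentence and any negation markers. It does not run spaCy at all; the `PARSED` list and the `tokens_with_dep` helper are illustrative names introduced here, not part of the library.

```python
# (text, pos, dep) triples copied from the output above; no parser is run here.
PARSED = [
    ("Manchester", "PROPN", "compound"),
    ("United", "PROPN", "nsubj"),
    ("is", "AUX", "aux"),
    ("n't", "PART", "neg"),
    ("looking", "VERB", "ROOT"),
    ("to", "PART", "aux"),
    ("sign", "VERB", "xcomp"),
    ("any", "DET", "advmod"),
    ("forward", "ADV", "advmod"),
    (".", "PUNCT", "punct"),
]


def tokens_with_dep(parsed, dep_label):
    """Return the text of every token carrying the given dependency label."""
    return [text for text, _pos, dep in parsed if dep == dep_label]


root = tokens_with_dep(PARSED, "ROOT")
negations = tokens_with_dep(PARSED, "neg")
is_negated = bool(negations)  # True when at least one negation marker exists

print(root, negations, is_negated)
```

On a real spaCy document the same filter can be written directly over the tokens, for example `[t.text for t in sentence2 if t.dep_ == "neg"]`.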

In addition to printing the words, you can also print sentences from a document.

In [0]:
document = sp(u'Hello from Stackabuse. The site with the best Python Tutorials. What are you looking for?')

In [12]:
for sentence in document.sents:
    print(sentence)

Hello from Stackabuse.
The site with the best Python Tutorials.
What are you looking for?


In [13]:
document[4].is_sent_start

True

The token at index 4 is "The", the first token of the second sentence, which is why is_sent_start returns True.

## Tokenization

As explained earlier, tokenization is the process of breaking a document down into words, punctuation marks, numeric digits, etc.

In [14]:
sentence3 = sp(u'"They\'re leaving U.K. for U.S.A."')
print(sentence3)

"They're leaving U.K. for U.S.A."


In [15]:
for word in sentence3:
    print(word.text)

"
They
're
leaving
U.K.
for
U.S.A.
"


In the output, you can see that spaCy has tokenized the opening and closing double quotes. However, it is intelligent enough not to split on the periods inside abbreviations such as U.K. and U.S.A.
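The behaviour described above can be imitated with a toy tokenizer. This is only a rough sketch of the general idea (peel punctuation off each whitespace-separated chunk, but respect a list of exceptions); it is not spaCy's actual tokenizer, and the names `ABBREVIATIONS`, `CONTRACTION` and `toy_tokenize` are hypothetical.

```python
import re

# Hand-picked exceptions for this sketch; a real tokenizer ships far larger lists.
ABBREVIATIONS = {"U.K.", "U.S.A."}
OPENING = "\"'("
CLOSING = "\"').,!?"
# Splits "They're" into "They" + "'re" and "isn't" into "is" + "n't".
CONTRACTION = re.compile(r"(\w+)('re|'s|'ll|'ve|'d|n't)")


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        # Peel opening punctuation (such as a quote) off the front.
        while chunk and chunk[0] in OPENING:
            leading.append(chunk[0])
            chunk = chunk[1:]
        # Peel closing punctuation off the end, unless what is left
        # is a known abbreviation whose final period must be kept.
        while chunk and chunk not in ABBREVIATIONS and chunk[-1] in CLOSING:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        match = CONTRACTION.fullmatch(chunk)
        if match:
            tokens.extend(match.groups())
        elif chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


print(toy_tokenize('"They\'re leaving U.K. for U.S.A."'))
```

The exception set is what keeps U.K. and U.S.A. intact: without it, the trailing-punctuation loop would strip their final periods.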

In [16]:
sentence4 = sp(u"Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com")
print(sentence4)

Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com


In [17]:
for word in sentence4:
    print(word.text)

Hello
,
I
am
non
-
vegetarian
,
email
me
the
menu
at
abc-xyz@gmai.com


It is evident from the output that spaCy detected the email address and kept it as a single token, despite the "-" it contains. On the other hand, the word "non-vegetarian" was split into three tokens.
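One way to picture this difference is a rule that splits on hyphens between letters unless the whole chunk looks like an email address. The sketch below is a simplified illustration under that assumption, not spaCy's implementation; `EMAIL_LIKE`, `INFIX_HYPHEN` and `split_hyphens` are hypothetical names, and the email pattern is deliberately loose.

```python
import re

# Deliberately loose pattern for "something@domain.tld"; an assumption for this sketch.
EMAIL_LIKE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# A hyphen sandwiched between letters; the capturing group keeps the hyphen in the result.
INFIX_HYPHEN = re.compile(r"(?<=[A-Za-z])(-)(?=[A-Za-z])")


def split_hyphens(chunk):
    """Split a chunk on letter-hyphen-letter boundaries, leaving email-like chunks whole."""
    if EMAIL_LIKE.fullmatch(chunk):
        return [chunk]
    return [piece for piece in INFIX_HYPHEN.split(chunk) if piece]


print(split_hyphens("non-vegetarian"))
print(split_hyphens("abc-xyz@gmai.com"))
```

On real spaCy tokens, the `like_email` attribute reports whether a token resembles an email address, which can be handy for filtering such tokens later.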

In [18]:
len(sentence4)

14

The output is 14, which is the number of tokens in sentence4.

## Detecting Entities

Let's see a simple example of named entity recognition:

In [0]:
sentence5 = sp(u'Manchester United is looking to sign Harry Kane for $90 million')  

In [20]:
for word in sentence5:
    print(word.text)

Manchester
United
is
looking
to
sign
Harry
Kane
for
$
90
million


We know that "Manchester United" refers to a single entity, so it should ideally be treated as one unit rather than as two separate words. Similarly, "Harry Kane" is the name of a person, and "$90 million" is a currency value. These should be treated as single units as well.

This is where named entity recognition comes into play. To get the named entities from a document, you have to use the ents attribute. Let's retrieve the named entities from the above sentence. Execute the following script:

In [22]:
for entity in sentence5.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Manchester United - PERSON - People, including fictional
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit


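You can see that spaCy's named entity recognizer has correctly recognized "Harry Kane" as a person and "$90 million" as a monetary value. Note, however, that in this run it labeled "Manchester United" as PERSON rather than as an organization (ORG), which shows that the statistical model can make mistakes.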
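Named entities do not change the tokenization; each entity is a span over the existing tokens. The following sketch illustrates that idea with hard-coded data copied from the outputs above (including the labels exactly as the model printed them). It does not run spaCy, and `TOKENS`, `ENTITY_SPANS` and `span_text` are illustrative names only.

```python
# (text, trailing_whitespace) pairs mirroring the tokens printed above.
TOKENS = [
    ("Manchester", " "), ("United", " "), ("is", " "), ("looking", " "),
    ("to", " "), ("sign", " "), ("Harry", " "), ("Kane", " "),
    ("for", " "), ("$", ""), ("90", " "), ("million", ""),
]

# (start, end, label) token spans, end exclusive, matching the entities printed above.
ENTITY_SPANS = [(0, 2, "PERSON"), (6, 8, "PERSON"), (9, 12, "MONEY")]


def span_text(tokens, start, end):
    """Join tokens[start:end] using each token's own trailing whitespace."""
    return "".join(text + space for text, space in tokens[start:end]).strip()


entities = [(span_text(TOKENS, start, end), label) for start, end, label in ENTITY_SPANS]

for text, label in entities:
    print(text, "-", label)
```

In spaCy itself, each item in `ents` exposes `start` and `end` token indices that play the same role as the hard-coded spans here.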

## Detecting Nouns

In addition to named entities, you can retrieve the noun phrases in a document using the noun_chunks attribute:

In [0]:
sentence5 = sp(u'Latest Rumours: Manchester United is looking to sign Harry Kane for $90 million')

In [24]:
for noun in sentence5.noun_chunks:
    print(noun.text)

Manchester United
Harry Kane


## Stemming

Stemming refers to reducing a word to its root form. While performing natural language processing tasks, you will encounter various scenarios where different words share the same root, for instance compute, computer, computing, and computed. You may want to reduce such words to their root form for the sake of uniformity. This is where stemming comes into play.

NLTK provides several stemmers. Here we look at two of them: the Porter stemmer and the Snowball stemmer. They are implemented using different algorithms.

## Porter Stemmer

In [0]:
import nltk

from nltk.stem.porter import PorterStemmer

In [0]:
stemmer = PorterStemmer()

In [0]:
tokens = ['compute', 'computer', 'computed', 'computing']

In [28]:
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

compute --> comput
computer --> comput
computed --> comput
computing --> comput


You can see that all four words have been reduced to "comput", which isn't actually a word at all.
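To see why a stemmer can produce a non-word like "comput", consider a toy suffix stripper. This sketch is not the Porter algorithm, which applies a much richer set of ordered rules; the `SUFFIXES` tuple and the `toy_stem` function are hypothetical and exist only to show the principle of chopping endings without consulting a dictionary.

```python
# Suffixes are tried in order; longer endings come first so "ing" wins over "e".
SUFFIXES = ("ing", "ed", "er", "e")


def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: len(word) - len(suffix)]
    return word


for token in ["compute", "computer", "computed", "computing"]:
    print(token + " --> " + toy_stem(token))
```

Because the function never checks whether the result is a real word, all four inputs collapse to the same truncated string.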

## Snowball Stemmer

In [29]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

tokens = ['compute', 'computer', 'computed', 'computing']

for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

compute --> comput
computer --> comput
computed --> comput
computing --> comput


## Lemmatization

In [0]:
sentence6 = sp(u'compute computer computed computing')

In [31]:
for word in sentence6:
    print(word.text,  word.lemma_)

compute compute
computer computer
computed compute
computing computing


Lemmatization reduces inflected forms of a word, such as the past tense or past participle of a verb, to its base (dictionary) form. Look at the following example:

In [32]:
sentence7 = sp(u'A letter has been written, asking him to be released')

for word in sentence7:
    print(word.text + '  ===>', word.lemma_)

A  ===> a
letter  ===> letter
has  ===> have
been  ===> be
written  ===> write
,  ===> ,
asking  ===> ask
him  ===> -PRON-
to  ===> to
be  ===> be
released  ===> release
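
The contrast with stemming can be sketched with a toy lemmatizer that only ever returns words from a known vocabulary. This is an illustration under simplifying assumptions, not how spaCy computes lemmas; `IRREGULAR`, `VOCABULARY`, `SUFFIX_RULES` and `toy_lemmatize` are hypothetical names, and the tiny word lists cover only the example sentence.

```python
# Irregular forms are looked up directly.
IRREGULAR = {"has": "have", "been": "be", "written": "write"}
# Known base forms; a candidate lemma is accepted only if it appears here.
VOCABULARY = {"a", "letter", "ask", "to", "be", "release"}
# (suffix, replacement) pairs tried in order.
SUFFIX_RULES = [("ing", ""), ("ing", "e"), ("ed", ""), ("ed", "e"), ("s", "")]


def toy_lemmatize(word):
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in VOCABULARY:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    # Unknown words are returned unchanged rather than truncated.
    return word


words = ["A", "letter", "has", "been", "written", "asking", "to", "be", "released"]
for w in words:
    print(w + "  ===>", toy_lemmatize(w))
```

Unlike the stemmers shown earlier, this approach maps "released" to "release" rather than to a truncated string, because every candidate is checked against the vocabulary.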
