# Introduction to **spaCy**

If you want to do natural language processing (NLP) in Python, then look no further than **spaCy**, a free and open-source library with a lot of built-in capabilities. It’s becoming increasingly popular for processing and analyzing data in the field of NLP.

This chapter is an introduction to various aspects of the **spaCy** package.  It is based on the following tutorial: https://realpython.com/natural-language-processing-spacy-python/

## Importing the Package and Language Model

There are various **spaCy** models for different languages. The default model for the English language is designated as `en_core_web_sm`. Since the models are quite large, it’s best to install them separately — including all languages in one package would make the download too massive.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
nlp

<spacy.lang.en.English>
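Because the model is a separate package, `spacy.load` fails with an `OSError` when it has not been installed (typically via `python -m spacy download en_core_web_sm`). The following is a minimal sketch of a defensive loader; the helper name `load_english` is hypothetical, and the fallback to a blank pipeline is only an assumption about what you might want, since a blank pipeline tokenizes text but has no tagger, parser, or entity recognizer.

```python
import spacy

MODEL_NAME = "en_core_web_sm"


def load_english(model_name=MODEL_NAME):
    """Return the trained pipeline if installed, else a blank English pipeline.

    Hypothetical helper for illustration only.
    """
    try:
        return spacy.load(model_name)
    except OSError:
        # Raised by spacy.load when the model package cannot be found.
        # A blank pipeline only provides the tokenizer and lexical attributes.
        return spacy.blank("en")


nlp = load_english()
print(nlp.lang, nlp.pipe_names)
```

Printing `nlp.pipe_names` is a quick way to see which components are actually available: an empty list means you are running on the blank fallback.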

## The `Doc` Object for Processed Text

In this section, you’ll use **spaCy** to deconstruct a given input string.

To start processing your input, you construct a `Doc` object. A `Doc` object is a sequence of `Token` objects, each representing a lexical token. Each `Token` object has information about a particular piece of text, typically one word. You can instantiate a `Doc` object by calling the `Language` object with the input string as an argument:

In [None]:
introduction_doc = nlp(
    "This tutorial is about Natural Language Processing in spaCy."
)

We can check the type of the `Doc` object.

In [None]:
type(introduction_doc)

spacy.tokens.doc.Doc

We can use a `list` comprehension to see all the tokens in the `Doc` object.

In [None]:
[token.text for token in introduction_doc]

['This',
 'tutorial',
 'is',
 'about',
 'Natural',
 'Language',
 'Processing',
 'in',
 'spaCy',
 '.']
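Since a `Doc` is a sequence, it also supports `len()`, indexing, and slicing. The sketch below uses `spacy.blank("en")`, which provides only the English tokenizer, so that it runs without a downloaded model; with `en_core_web_sm` the tokens should be the same because tokenization is rule-based.

```python
import spacy

# Tokenizer-only pipeline (assumption: no trained model is needed here)
nlp = spacy.blank("en")
doc = nlp("This tutorial is about Natural Language Processing in spaCy.")

first_token = doc[0]   # indexing returns a Token
phrase = doc[4:7]      # slicing returns a Span

print(len(doc))
print(first_token.text)
print(phrase.text)
print(type(first_token).__name__, type(phrase).__name__)
```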

## Sentence Detection

*Sentence detection* is the process of locating where sentences start and end in a given text. This allows you to divide a text into linguistically meaningful units. You’ll use these units when you’re processing your text to perform tasks such as part-of-speech (POS) tagging and named-entity recognition, which you’ll come to later in the tutorial.

In **spaCy**, the `.sents` property is used to extract sentences from the `Doc` object. Here’s how you would extract the total number of sentences and the sentences themselves for a given input.

Let's start with a simple two-sentence piece of text.

In [None]:
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)

We can create a `Doc` object from `about_text` and then extract the sentences from it.

In [None]:
about_doc = nlp(about_text)
sentences = list(about_doc.sents)
len(sentences)

2

Let's print out the beginning of each sentence.

In [None]:
for sentence in sentences:
    print(f'{sentence[:5]}...')

Gus Proto is a Python...
He is interested in learning...


Each sentence in the list is a `Span` object, that is, a slice of the `Doc`.

In [None]:
type(sentences[0])

spacy.tokens.span.Span
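Sentence boundaries do not have to come from a trained model. As a rough sketch, the built-in rule-based `sentencizer` component can be added to a blank pipeline; this assumes the spaCy v3 API, where `add_pipe` takes the component name as a string. It splits on sentence-final punctuation only, so expect it to be less accurate than the trained pipeline on messy text.

```python
import spacy

# Blank pipeline plus the rule-based sentencizer (spaCy v3 string-name API)
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Gus Proto is a Python developer. He is interested in music.")
sentences = list(doc.sents)

for sentence in sentences:
    # .start and .end are token offsets into the parent Doc
    print(sentence.start, sentence.end, sentence.text)
```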

## Tokens in **spaCy**

Building the `Doc` container involves tokenizing the text. The process of tokenization breaks a text down into its basic units — or tokens — which are represented in **spaCy** as `Token` objects.

As you’ve already seen, with **spaCy**, you can print the tokens by iterating over the `Doc` object. But `Token` objects also have other attributes available for exploration. For instance, the character offset at which the token starts in the original string is available as the `.idx` attribute on `Token`.

Let's begin with our same two-sentence piece of text, create a `Doc` from it, and then print each token along with its index position.

In [None]:
about_text

'Gus Proto is a Python developer currently working for a London-based Fintech company. He is interested in learning Natural Language Processing.'

In [None]:
about_doc = nlp(about_text)
for token in about_doc:
    print(token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142
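One way to convince yourself that `.idx` is a character offset is to slice the original string with it. The sketch below does this on a shorter sentence with a tokenizer-only pipeline (`spacy.blank("en")`, chosen here only so the example runs without a model download); `len(token)` gives the number of characters in the token.

```python
import spacy

nlp = spacy.blank("en")
text = "Gus Proto is a Python developer."
doc = nlp(text)

recovered = []
for token in doc:
    start = token.idx           # character offset of the token's first character
    end = start + len(token)    # len(token) is the token's length in characters
    recovered.append(text[start:end])
    print(f"{token.text!r:12} chars {start}-{end}")
```

Each slice reproduces the token text exactly, which is what makes `.idx` useful for highlighting tokens in the original string.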


There are many other pieces of information that can be gleaned from tokens.  In the code below, we use `list` comprehensions and a **pandas** `DataFrame` to display some of this other information.

- `.text_with_ws` gives the token text along with any trailing space, if present.
- `.is_alpha` indicates whether the token consists of alphabetic characters or not.
- `.is_punct` indicates whether the token is a punctuation symbol or not.
- `.is_stop` indicates whether the token is a stop word or not. We'll be covering stop words a bit later in this tutorial.

In [None]:
import pandas as pd

# One row per token, one column per lexical attribute
pd.DataFrame({
    'text_whitespace': [token.text_with_ws for token in about_doc],
    'alphabetic': [token.is_alpha for token in about_doc],
    'punctuation': [token.is_punct for token in about_doc],
    'stop_word': [token.is_stop for token in about_doc],
})

Unnamed: 0,text_whitespace,alphabetic,punctuation,stop_word
0,Gus,True,False,False
1,Proto,True,False,False
2,is,True,False,True
3,a,True,False,True
4,Python,True,False,False
5,developer,True,False,False
6,currently,True,False,False
7,working,True,False,False
8,for,True,False,True
9,a,True,False,True


## Stop Words

*Stop words* are typically defined as the most common words in a language. In English, some examples of stop words are *the*, *are*, *but*, and *they*. Most sentences need to contain stop words in order to be full sentences that make grammatical sense.

With NLP, stop words are often removed because they carry little meaning on their own and can dominate a word-frequency analysis. **spaCy** stores a `set` of stop words for the English language:

In [None]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

In [None]:
type(spacy_stopwords)

set

In [None]:
len(spacy_stopwords)

326

Let's observe a few of them.

In [None]:
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

‘re
amongst
sometime
thus
‘ll
here
rather
never
during
‘m


As we can see below, we can remove stop words from text by making use of the `.is_stop` attribute of each token.

In [None]:
custom_about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
about_doc = nlp(custom_about_text)

Let's use a `list` comprehension with an `if` clause to produce a `list` of all the tokens in `custom_about_text` that are not stop words.

In [None]:
print([token for token in about_doc if not token.is_stop])

[Gus, Proto, Python, developer, currently, working, London, -, based, Fintech, company, ., interested, learning, Natural, Language, Processing, .]


## Lemmatization

*Lemmatization* is the process of reducing inflected forms of a word while still ensuring that the reduced form belongs to the language. This reduced form, or root word, is called a *lemma*.

For example, *organizes*, *organized*, and *organizing* are all forms of *organize*. Here, *organize* is the lemma. The inflection of a word allows you to express different grammatical categories, like tense (*organized* vs. *organize*), number (*trains* vs. *train*), and so on. Lemmatization is useful because it lets you analyze the inflected forms of a word as a single item. It can also help you *normalize* the text.
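To make the idea concrete, here is a deliberately tiny toy lemmatizer built from a hand-written lookup table. It is not how **spaCy** computes lemmas; the table `LEMMA_TABLE` and the function `toy_lemma` are hypothetical names used only to show how collapsing inflected forms changes what gets counted together.

```python
from collections import Counter

# Hand-written lookup table: inflected form -> lemma (illustration only)
LEMMA_TABLE = {
    "organizes": "organize",
    "organized": "organize",
    "organizing": "organize",
    "trains": "train",
}


def toy_lemma(word):
    """Return the lemma from the table, or the lowercased word if unknown."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


words = ["organizes", "organized", "organizing", "organize", "trains", "train"]
counts = Counter(toy_lemma(word) for word in words)
print(counts)
```

Six distinct surface forms collapse into two items, which is the effect lemmatization has on any downstream frequency analysis.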

**spaCy** puts a `lemma_` attribute on the `Token` class. This attribute has the lemmatized form of the token:

In [None]:
conference_help_text = (
    "Gus is helping organize a developer"
    " conference on Applications of Natural Language"
    " Processing. He keeps organizing local Python meetups"
    " and several internal talks at his workplace."
)
conference_help_doc = nlp(conference_help_text)

Let's print out all the words that are different from their lemma.

In [None]:
for token in conference_help_doc:
    if str(token) != str(token.lemma_):
        print(f'{str(token)} : {str(token.lemma_)}')

is : be
He : he
keeps : keep
organizing : organize
meetups : meetup
talks : talk


Notice that the lemmatizer is not perfect: *helping* is not lemmatized to *help*.

## Word Frequency

We can now convert a given text into tokens and perform statistical analysis on it. This analysis can give you various insights, such as common words or unique words in the text.  In this section we'll do a simple word-frequency analysis.

In [None]:
complete_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech company. He is"
    " interested in learning Natural Language Processing."
    " There is a developer conference happening on 21 July"
    ' 2019 in London. It is titled "Applications of Natural'
    ' Language Processing". There is a helpline number'
    " available at +44-1234567891. Gus is helping organize it."
    " He keeps organizing local Python meetups and several"
    " internal talks at his workplace. Gus is also presenting"
    ' a talk. The talk will introduce the reader about "Use'
    ' cases of Natural Language Processing in Fintech".'
    " Apart from his work, he is very passionate about music."
    " Gus is learning to play the Piano. He has enrolled"
    " himself in the weekend batch of Great Piano Academy."
    " Great Piano Academy is situated in Mayfair or the City"
    " of London and has world-class piano instructors."
)
complete_doc = nlp(complete_text)

Let's first try counting word frequencies without removing stop words. Notice the prominence of uninformative words such as *is* and *a*.

In [None]:
from collections import Counter
Counter(
    [token.text for token in complete_doc if not token.is_punct]
).most_common(5)

[('is', 10), ('a', 5), ('in', 5), ('Gus', 4), ('of', 4)]

So next, let's remove the stop words with a `list` comprehension.

In [None]:
words = [
    token.text
    for token in complete_doc
    if not token.is_stop and not token.is_punct
]
print(words)

['Gus', 'Proto', 'Python', 'developer', 'currently', 'working', 'London', 'based', 'Fintech', 'company', 'interested', 'learning', 'Natural', 'Language', 'Processing', 'developer', 'conference', 'happening', '21', 'July', '2019', 'London', 'titled', 'Applications', 'Natural', 'Language', 'Processing', 'helpline', 'number', 'available', '+44', '1234567891', 'Gus', 'helping', 'organize', 'keeps', 'organizing', 'local', 'Python', 'meetups', 'internal', 'talks', 'workplace', 'Gus', 'presenting', 'talk', 'talk', 'introduce', 'reader', 'Use', 'cases', 'Natural', 'Language', 'Processing', 'Fintech', 'Apart', 'work', 'passionate', 'music', 'Gus', 'learning', 'play', 'Piano', 'enrolled', 'weekend', 'batch', 'Great', 'Piano', 'Academy', 'Great', 'Piano', 'Academy', 'situated', 'Mayfair', 'City', 'London', 'world', 'class', 'piano', 'instructors']


Now, counting words is much more meaningful.  We can guess that this text has a lot to do with Gus and natural language processing.  Of course, this is a crude analysis and a lot of context is being excluded.

In [None]:
Counter(words).most_common(5)

[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]
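Note that the counts above are case-sensitive, so *Piano* and *piano* are tallied separately. A minimal sketch of case-insensitive counting with relative frequencies follows; the short `words` list is a hand-picked excerpt of the filtered tokens, used here so the example runs on its own.

```python
from collections import Counter

# Hand-picked excerpt of the filtered tokens (illustration only)
words = [
    "Gus", "Piano", "Great", "Piano", "Academy",
    "Great", "Piano", "Academy", "piano", "instructors",
]

case_sensitive = Counter(words)
case_insensitive = Counter(word.lower() for word in words)

# Convert the top counts to shares of all counted words
total = sum(case_insensitive.values())
relative = {
    word: count / total
    for word, count in case_insensitive.most_common(3)
}

print(case_sensitive["Piano"], case_insensitive["piano"])
print(relative)
```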

## Part-of-Speech Tagging

*Part of speech* or *POS* is a grammatical role that explains how a particular word is used in a sentence. There are typically eight parts of speech:

1. Noun
1. Pronoun
1. Adjective
1. Verb
1. Adverb
1. Preposition
1. Conjunction
1. Interjection

Part-of-speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category like noun or verb to each word.

In **spaCy**, POS tags are available as an attribute on the `Token` object:

In [None]:
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
about_doc = nlp(about_text)

In the code below, two attributes of the `Token` class are accessed and displayed using a **pandas** `DataFrame` and `list` comprehensions:

1. `.tag_` displays a fine-grained tag.
1. `.pos_` displays a coarse-grained tag, which is a reduced version of the fine-grained tags.
   
We also use `spacy.explain()` to give descriptive details about a particular POS tag, which can be a valuable reference tool.

In [None]:
pd.DataFrame({
    'token': [token.text for token in about_doc],
    'tag': [token.tag_ for token in about_doc],
    'part_of_speech': [token.pos_ for token in about_doc],
    'explanation': [spacy.explain(token.tag_) for token in about_doc],
})

Unnamed: 0,token,tag,part_of_speech,explanation
0,Gus,NNP,PROPN,"noun, proper singular"
1,Proto,NNP,PROPN,"noun, proper singular"
2,is,VBZ,AUX,"verb, 3rd person singular present"
3,a,DT,DET,determiner
4,Python,NNP,PROPN,"noun, proper singular"
5,developer,NN,NOUN,"noun, singular or mass"
6,currently,RB,ADV,adverb
7,working,VBG,VERB,"verb, gerund or present participle"
8,for,IN,ADP,"conjunction, subordinating or preposition"
9,a,DT,DET,determiner


By using POS tags, you can extract a particular category of words.  You can use this type of word classification to derive insights. For instance, you could gauge sentiment by analyzing which adjectives are most commonly used alongside nouns.

In [None]:
nouns = []
adjectives = []
for token in about_doc:
    if token.pos_ == "NOUN":
        nouns.append(token)
    if token.pos_ == "ADJ":
        adjectives.append(token)

In [None]:
nouns

[developer, company]

In [None]:
adjectives

[interested]
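The same idea generalizes from two hard-coded lists to one bucket per tag. The following sketch groups words by their coarse-grained tag with a `defaultdict`; to keep it runnable without a model, the `(text, pos)` pairs are copied by hand from the first ten rows of the table above, whereas in the notebook you would build them from `about_doc`.

```python
from collections import defaultdict

# (text, pos) pairs copied from the table above; in practice you would use
# [(token.text, token.pos_) for token in about_doc]
tagged = [
    ("Gus", "PROPN"), ("Proto", "PROPN"), ("is", "AUX"), ("a", "DET"),
    ("Python", "PROPN"), ("developer", "NOUN"), ("currently", "ADV"),
    ("working", "VERB"), ("for", "ADP"), ("a", "DET"),
]

words_by_pos = defaultdict(list)
for text, pos in tagged:
    words_by_pos[pos].append(text)

for pos, words in words_by_pos.items():
    print(f"{pos:6} {words}")
```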

## Preprocessing Functions

To bring your text into a format ideal for analysis, you can write preprocessing functions to encapsulate your cleaning process. For example, in this section, you’ll create a preprocessor that applies the following operations:

- Lowercases the text
- Lemmatizes each token
- Removes punctuation symbols
- Removes stop words

A preprocessing function converts text to an analyzable format. It’s typical for most NLP tasks. Here’s an example:

In [None]:
complete_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech company. He is"
    " interested in learning Natural Language Processing."
    " There is a developer conference happening on 21 July"
    ' 2019 in London. It is titled "Applications of Natural'
    ' Language Processing". There is a helpline number'
    " available at +44-1234567891. Gus is helping organize it."
    " He keeps organizing local Python meetups and several"
    " internal talks at his workplace. Gus is also presenting"
    ' a talk. The talk will introduce the reader about "Use'
    ' cases of Natural Language Processing in Fintech".'
    " Apart from his work, he is very passionate about music."
    " Gus is learning to play the Piano. He has enrolled"
    " himself in the weekend batch of Great Piano Academy."
    " Great Piano Academy is situated in Mayfair or the City"
    " of London and has world-class piano instructors."
)
complete_doc = nlp(complete_text)

Next, we create two functions: the first decides whether a token should be kept, and the second lemmatizes and lowercases a single token.

In [None]:
def is_token_allowed(token):
    return bool(
        str(token).strip()
        and not token.is_stop
        and not token.is_punct
    )

In [None]:
def preprocess_token(token):
    return token.lemma_.strip().lower()

Now we can use a `list` comprehension to filter and process the tokens in `complete_doc`.

In [None]:
complete_filtered_tokens = [
    preprocess_token(token)
    for token in complete_doc
    if is_token_allowed(token)
]

Note that `complete_filtered_tokens` doesn’t contain any stop words or punctuation symbols, and it consists purely of lemmatized lowercase tokens.

In [None]:
print(complete_filtered_tokens)

['gus', 'proto', 'python', 'developer', 'currently', 'work', 'london', 'base', 'fintech', 'company', 'interested', 'learn', 'natural', 'language', 'processing', 'developer', 'conference', 'happen', '21', 'july', '2019', 'london', 'title', 'application', 'natural', 'language', 'processing', 'helpline', 'number', 'available', '+44', '1234567891', 'gus', 'helping', 'organize', 'keep', 'organize', 'local', 'python', 'meetup', 'internal', 'talk', 'workplace', 'gus', 'present', 'talk', 'talk', 'introduce', 'reader', 'use', 'case', 'natural', 'language', 'processing', 'fintech', 'apart', 'work', 'passionate', 'music', 'gus', 'learn', 'play', 'piano', 'enrol', 'weekend', 'batch', 'great', 'piano', 'academy', 'great', 'piano', 'academy', 'situate', 'mayfair', 'city', 'london', 'world', 'class', 'piano', 'instructor']
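When you have many texts rather than one, the two helpers can be wrapped in a single function that streams documents through `nlp.pipe`. This is a sketch under stated assumptions: `preprocess_texts` is a hypothetical name, and the example uses `spacy.blank("en")`, which has no lemmatizer, so `preprocess_token` falls back to the token text when `lemma_` is empty. Pass the loaded `en_core_web_sm` pipeline instead to get real lemmas.

```python
import spacy


def is_token_allowed(token):
    """Keep tokens that are non-empty, not stop words, and not punctuation."""
    return bool(token.text.strip()) and not token.is_stop and not token.is_punct


def preprocess_token(token):
    # Fall back to the surface text if the pipeline produced no lemma
    # (assumption: this happens in a blank pipeline without a lemmatizer)
    return (token.lemma_ or token.text).strip().lower()


def preprocess_texts(texts, nlp):
    """Return one list of cleaned tokens per input text (hypothetical helper)."""
    return [
        [preprocess_token(token) for token in doc if is_token_allowed(token)]
        for doc in nlp.pipe(texts)
    ]


nlp = spacy.blank("en")  # swap in spacy.load("en_core_web_sm") for real lemmas
texts = [
    "Gus is learning to play the Piano.",
    "He keeps organizing local Python meetups.",
]
cleaned = preprocess_texts(texts, nlp)
for tokens in cleaned:
    print(tokens)
```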


## Named Entity Recognition

*Named-entity recognition* (NER) is the process of locating *named entities* in unstructured text and then classifying them into predefined categories, such as person names, organizations, locations, monetary values, percentages, and time expressions.

You can use NER to learn more about the meaning of your text. For example, you could use it to populate tags for a set of documents in order to improve the keyword search. You could also use it to categorize customer support tickets into relevant categories.

**spaCy** has the property `.ents` on `Doc` objects. You can use it to extract named entities:

In [None]:
piano_class_text = (
    "Great Piano Academy is situated"
    " in Mayfair or the City of London and has"
    " world-class piano instructors."
)
piano_class_doc = nlp(piano_class_text)

Each element of `.ents` is a `Span` object. In the code below, `ent` refers to one such entity, and we use two of its attributes:

- `.text` gives the Unicode text representation of the entity.
- `.label_` gives the label of the entity.

`spacy.explain` gives descriptive details about each entity label.

In [None]:
pd.DataFrame({
    'entity': [ent.text for ent in piano_class_doc.ents],
    'label': [ent.label_ for ent in piano_class_doc.ents],
    'explanation': [spacy.explain(ent.label_) for ent in piano_class_doc.ents],
})

Unnamed: 0,entity,label,explanation
0,Great Piano Academy,ORG,"Companies, agencies, institutions, etc."
1,Mayfair,GPE,"Countries, cities, states"
2,the City of London,GPE,"Countries, cities, states"
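Because entities are ordinary `Span` objects, they also carry positional information such as `.start_char` and `.end_char`. To show this structure without depending on a model's predictions, the sketch below annotates two entities by hand on a tokenizer-only pipeline; the spans and labels are assigned manually for illustration and are not output from the NER component.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; no entity recognizer
doc = nlp("Great Piano Academy is situated in Mayfair.")

# Hand-made annotations: token offsets [0, 3) and [6, 7)
doc.ents = [
    Span(doc, 0, 3, label="ORG"),
    Span(doc, 6, 7, label="GPE"),
]

for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```

The character offsets make it straightforward to highlight each entity in the original string.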
