## Importing spaCy

spaCy is a library for many NLP tasks, including text preprocessing. It combines rule-based components with pretrained models, for example to split a text into sentences or to find the named entities it contains.

To take advantage of it, you need to install the library and download a model, as shown in the following cell. For English, a common choice is `en_core_web_sm`, which is lightweight and fast; `en_core_web_lg` is more accurate but considerably larger and slower.

In [2]:
# !conda install -c conda-forge spacy
# !python -m spacy download en_core_web_sm
# The small model above is recommended to start with: the large one below is more accurate but less efficient
# !python -m spacy download en_core_web_lg

In [3]:
import spacy


nlp = spacy.load("en_core_web_sm")
# You can also load en_core_web_lg, which has a higher accuracy but is less efficient
# nlp = spacy.load("en_core_web_lg")



Running `nlp.pipeline` shows the processing steps that spaCy applies automatically, in order.

In [4]:
print(nlp.pipeline)

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x000002034992FFA0>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x000002034992F160>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x00000203499B0040>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x00000203499C38C0>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x00000203499B6C40>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x00000203499B0190>)]
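For a more readable view than the raw object list, `nlp.pipe_names` returns only the component names. The sketch below assumes spaCy v3 and uses `spacy.blank("en")`, an empty English pipeline chosen here only so that the example runs without downloading a model; the name `blank_nlp` is ours. Adding the rule-based `sentencizer` shows that a pipeline is simply an ordered list of named components.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
blank_nlp = spacy.blank("en")
names_before = list(blank_nlp.pipe_names)

# Add a rule-based sentence splitter as the only component.
blank_nlp.add_pipe("sentencizer")
names_after = list(blank_nlp.pipe_names)

print(names_before)
print(names_after)
```

With a trained model such as `en_core_web_sm`, the same attribute should list the names seen in the output above (`tok2vec`, `tagger`, and so on).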


To use it, once the model is loaded in a variable named `nlp` as above (`nlp = spacy.load("en_core_web_sm")`), just pass the text you want to process as the argument:

In [5]:
# Process the text 'Hello, world. Antonio is learning Python.' using spaCy
doc = nlp("Hello, world. Antonio is learning Python.")

By doing so, a lot of things happened! You can iterate over `doc` to get the tokens of the text:

In [6]:
for token in doc:
    print(token.text)

Hello
,
world
.
Antonio
is
learning
Python
.


In [7]:
# Get first token of the processed document
token = doc[0]
print(token)

Hello
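Each token carries more than its text. As a small sketch (again on a blank English pipeline, which is our assumption so that no model is required), the snippet below collects the token index `i`, the character offset `idx` and the `is_punct` flag for every token.

```python
import spacy

# Assumption: tokenizer-only pipeline; these attributes do not need a trained model.
blank_nlp = spacy.blank("en")
doc = blank_nlp("Hello, world.")

# (position in the doc, text, character offset in the original string, punctuation flag)
rows = [(token.i, token.text, token.idx, token.is_punct) for token in doc]
for row in rows:
    print(row)
```

The character offsets make it possible to map every token back to its position in the original string.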


The spaCy model automatically divides the text into sentences:

In [8]:
# Print sentences (one sentence per line)
for sent in doc.sents:
    print(sent)

Hello, world.
Antonio is learning Python.


It could look like a trivial task, just a fancy way of doing `text.split(".")`. However, how would you split the text "Im antonio im learning python" into sentences?

We know that "im" is a misspelling of "I'm" and, since there are two verbs in that text, there are also two sentences. Let's see how spaCy performs:

In [9]:
doc = nlp("Im antonio im learning python")
for sentence in doc.sents:
    print(sentence)

Im antonio im learning python


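In the output recorded above, spaCy returns the whole text as a single sentence: without punctuation or capitalization, the model has little to go on. Keep this in mind, because there are tons of more complicated sentences that can easily be misinterpreted.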
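To make the comparison with `text.split(".")` concrete, here is a minimal sketch. It uses spaCy's rule-based `sentencizer` on a blank pipeline (our assumption, to avoid a model download), not the parser-based splitting of `en_core_web_sm` used in the rest of this notebook.

```python
import spacy

text = "Hello, world. Antonio is learning Python."

# Naive approach: drops the dots, keeps leading spaces and leaves an empty string at the end.
naive = text.split(".")

# Rule-based sentence splitting on a blank pipeline (assumption: no trained model).
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")
sentences = [sent.text for sent in blank_nlp(text).sents]

print(naive)
print(sentences)
```

Since the `sentencizer` relies on punctuation, it cannot split a text like "Im antonio im learning python" either; splitting such a text requires a statistical model, and even then the result is not guaranteed.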

In the example below, you can see that spaCy is smart enough to treat `N.Y.` as a single token instead of splitting it at the dots:

In [10]:
tokens = nlp("Let's go to N.Y.!")

In [11]:
for token in tokens:
    print(token.text)

Let
's
go
to
N.Y.
!


As you have seen, calling `nlp` (the object returned by `spacy.load("en_core_web_sm")`) gives you the tokenized version of the sentence. If you only want the instance of the `Tokenizer` class, you can run:

In [12]:
tokenizer = nlp.tokenizer
type(tokenizer)

spacy.tokenizer.Tokenizer

If you want to instantiate a custom one, to which you can pass your own rules, prefixes and so on:

In [26]:
from spacy.tokenizer import Tokenizer
tokenizer = Tokenizer(vocab=nlp.vocab)

The tokenizer defined above receives only the vocabulary: it has no exception rules, prefixes, suffixes or infixes, so it splits on whitespace only.
Let's test it on "Let's go to N.Y.!"

In [27]:
tokens = tokenizer("Let's go to N.Y.!")
for token in tokens:
    print(token)



Let's
go
to
N.Y.!


As you can see, without those rules it handles neither the contraction `Let's` nor the punctuation: `N.Y.!` stays glued together as a single token.
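A custom `Tokenizer` becomes useful once it receives rules. The sketch below, assuming spaCy v3, passes the default English exceptions, prefixes, suffixes and infixes explicitly; the names `blank_nlp` and `custom_tokenizer` are ours, and in practice you would swap some of these arguments for your own rules.

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

# Assumption: a blank English pipeline, used only to get the vocab and the language defaults.
blank_nlp = spacy.blank("en")
defaults = blank_nlp.Defaults

custom_tokenizer = Tokenizer(
    blank_nlp.vocab,
    rules=defaults.tokenizer_exceptions,  # special cases such as contractions
    prefix_search=compile_prefix_regex(defaults.prefixes).search,
    suffix_search=compile_suffix_regex(defaults.suffixes).search,
    infix_finditer=compile_infix_regex(defaults.infixes).finditer,
)

text = "Let's go to N.Y.!"
custom_tokens = [token.text for token in custom_tokenizer(text)]
default_tokens = [token.text for token in blank_nlp.tokenizer(text)]

print(custom_tokens)
print(default_tokens)
```

On this sentence the two tokenizers should agree, because the custom one was built from the same defaults.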

Looking at the output of `nlp.pipeline` above, we can see that there are a tagger, a dependency parser and an entity recognizer. Let's check the entities of the following sentence:

In [15]:
doc = nlp("Apple is a $1000b company.")


In [16]:
for token in doc:
    print(token)

Apple
is
a
$
1000b
company
.


In [17]:
for ent in doc.ents:
    print(ent, ent.label_)

Apple ORG
1000b DATE


## Removing stop words

In general, it's convenient to remove the stop words, *i.e. very common words in a language*, because they contribute little to many NLP problems, such as semantic analysis.

In [18]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
print("Number of stop words: %d" % len(spacy_stopwords))
print("First ten stop words: %s" % list(spacy_stopwords)[:10])

Number of stop words: 326
First ten stop words: ['say', 'this', 'no', 'herein', 'enough', 'there', 'and', 'thereafter', 'really', 'although']


To remove them:

In [19]:
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

doc = nlp(text)

tokens = [token.text for token in doc if not token.is_stop]
for token in tokens:
    print(token)

determined
drop
litigation
monastry
,
relinguish
claims
wood
-
cuting


fishery
rihgts
.
ready
becuase
rights
valuable
,


vaguest
idea
wood
river
question
.


To add custom stop words:

In [20]:
customize_stop_words = ["computing", "filtered"]
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True
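To check that the custom stop words are actually filtered, a quick sketch follows. It runs on a blank English pipeline (our assumption; `is_stop` is a lexical attribute and does not need a trained model), and the sample sentence is made up for illustration.

```python
import spacy

# Assumption: blank English pipeline, enough for the is_stop flag.
blank_nlp = spacy.blank("en")

custom_stop_words = ["computing", "filtered"]
for word in custom_stop_words:
    # Flag the vocabulary entry for this exact string as a stop word.
    blank_nlp.vocab[word].is_stop = True

doc = blank_nlp("Cloud computing output was filtered")
kept = [token.text for token in doc if not token.is_stop]
print(kept)
```

Note that the flag is set on the exact string passed to `vocab`, so a differently cased form such as "Computing" may need to be added separately.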

## Stemming and Lemmatization

In most natural languages, a root word can have many variants. For example, the word ‘play’ can be used as ‘playing’, ‘played’, ‘plays’, etc. You can think of similar examples (and there are plenty).

**Stemming**

Let’s first understand stemming:

Stemming is a text normalization technique that cuts off the end or the beginning of a word, based on a list of common prefixes and suffixes that the word may contain.
It is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
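To make the idea concrete, here is a deliberately naive suffix-stripping function. It is a toy written for this explanation, not a real stemming algorithm: the function name, the suffix list (to which we added "ed") and the minimum stem length are all our assumptions.

```python
# Toy suffix list: the suffixes mentioned above plus "ed" (our addition).
SUFFIXES = ("ing", "ly", "es", "ed", "s")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, unless the remaining stem would be too short."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


words = ["playing", "played", "plays", "quickly", "studies", "was"]
stems = [naive_stem(word) for word in words]
print(stems)
# ['play', 'play', 'play', 'quick', 'studi', 'was']
```

The result "studi" shows the typical drawback: a stem is not necessarily a real word.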
 

**Lemmatization**

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It relies on a vocabulary (a dictionary of known word forms) and on morphological analysis (word structure and grammatical relations).

A stemming algorithm only cuts a suffix or prefix off the word. Lemmatization is more powerful because it takes the morphological analysis of the word into account.

Lemmatization returns the lemma, which is the base form shared by all the inflected forms of a word.

In short, stemming is a quick and dirty way of chopping words down to an approximate root, while lemmatization is a more informed operation that uses dictionaries built with in-depth linguistic knowledge. Hence, lemmatization helps in forming better features.

In [22]:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities")
# not using merge_noun_chunks
doc = nlp(
    """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""
)

lemma_word1 = []
for token in doc:
    if token.is_stop:
        continue
    lemma_word1.append(token.lemma_)
lemma_word1

['determine',
 'drop',
 'litigation',
 'monastry',
 ',',
 'relinguish',
 'claim',
 'wood',
 '-',
 'cut',
 '\n',
 'fishery',
 'rihgts',
 '.',
 'ready',
 'becuase',
 'right',
 'valuable',
 ',',
 '\n',
 'vague',
 'idea',
 'wood',
 'river',
 'question',
 '.']

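## Removing the punctuation

A simple way to strip punctuation is to drop every character listed in `string.punctuation` before passing the text to spaCy: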



In [24]:
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""


import string

text_no_punct = "".join([char for char in text if char not in string.punctuation])

text_no_punct

'He determined to drop his litigation with the monastry and relinguish his claims to the woodcuting and \nfishery rihgts at once He was the more ready to do this becuase the rights had become much less valuable and he had \nindeed the vaguest idea where the wood and river in question were'

In [None]:
doc = nlp(text_no_punct)
for token in doc:
    print(token)
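An equivalent and usually faster way to drop punctuation characters is `str.translate` with a deletion table. This is a sketch of an alternative, not a change to the cell above; the shortened sample sentence is ours.

```python
import string

sample = "He determined to drop his litigation with the monastry, and relinguish his claims."

# Character-by-character filtering, as in the cell above.
joined = "".join([char for char in sample if char not in string.punctuation])

# Same result with a translation table that deletes all punctuation characters.
table = str.maketrans("", "", string.punctuation)
translated = sample.translate(table)

print(translated)
```

Both approaches also glue hyphenated words together ("wood-cuting" becomes "woodcuting"), so replacing punctuation with a space may be preferable for some tasks.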

For text extracted from dialogues or chats, it is convenient to condense runs of the same character into one or two occurrences, and then use a spell checker to find the correct form of the word.

One way to do that is to replace every run of repeated characters with a single character and then apply a spell checker: "hhheeelllllooo hoooowww areee youuu?" becomes "helo how are you?", which the spell checker should then turn into "hello how are you?".




In [23]:
import re

st = "hhheeeLLLLooo hoooowww areee youuu?????"
# Replace every run of a repeated character with a single occurrence
text = re.sub(r"(.)\1+", r"\1", st)
text

'heLo how are you?'
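The text above mentions condensing repetitions into "one or two" characters. The following sketch compares both variants on the lowercase example; the variable names are ours.

```python
import re

st = "hhheeelllllooo hoooowww areee youuu?"

# Collapse every run of a repeated character to a single occurrence.
at_most_one = re.sub(r"(.)\1+", r"\1", st)

# Collapse only runs of three or more characters, down to two occurrences.
at_most_two = re.sub(r"(.)\1{2,}", r"\1\1", st)

print(at_most_one)  # helo how are you?
print(at_most_two)  # hheelloo hooww aree youu?
```

Collapsing to one character breaks legitimate double letters ("hello" becomes "helo"), while collapsing to two preserves them but leaves more noise; in both cases a spell checker is still needed afterwards.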

In [None]:
from spellchecker import SpellChecker

text = nlp(text)
spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown([token.text for token in text])


for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

It didn't find any misspelled word (even though there was "helo"). Try another spell checker:

https://github.com/fsondej/autocorrect

In [None]:
from autocorrect import Speller

spell = Speller()

spell(text.text)

As you can see, it doesn't always work properly! However, overall it should improve your text.

If you want to create a separate lemmatizer instead of having it in the pipeline:

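**For spaCy before v3** (the `spacy.lemmatizer` module imported below was removed in v3, where the lemmatizer is a pipeline component)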

In [None]:
from spacy.lemmatizer import Lemmatizer, ADJ, NOUN, VERB

lemmatizer = nlp.vocab.morphology.lemmatizer
print(lemmatizer("studying", VERB))
print(lemmatizer("studying", NOUN))
print(lemmatizer("studying", ADJ))

# or, alternatively

print(lemmatizer.verb("studying"))
print(lemmatizer.noun("studying"))
print(lemmatizer.adj("studying"))

spaCy has no built-in stemming! However, lemmatization is enough for most tasks. As an alternative, you can use the [NLTK library](https://www.nltk.org).
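If you do need stemming, a minimal sketch with NLTK's `PorterStemmer` could look like the following. It assumes that the `nltk` package is installed; this stemmer needs no additional data download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["playing", "played", "plays"]
# Each inflected form is reduced to the same stem.
stems = [stemmer.stem(word) for word in words]
print(stems)
```

NLTK ships other stemmers as well, so the Porter algorithm is only one possible choice.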

## Part of Speech (POS) Tagging

Part-of-speech tagging simply means assigning a part of speech to each word in a sentence. Unlike phrase matching, which works at the sentence or multi-word level, POS tagging is performed at the token level.

In [None]:
sentence = nlp("Antonio is learning Python in Strive School.")

for token in sentence:
    print(token.pos_)

The `.pos_` attribute gives the *coarse-grained* POS tag. To inspect the *fine-grained* POS tags we can use the `.tag_` attribute:

In [None]:
sentence = nlp("Antonio is learning Python in Strive School.")

for token in sentence:
    print(token.tag_)

While the output of the `.pos_` attribute is easy to decipher (`PROPN`: proper noun, `AUX`: auxiliary verb, `VERB`: verb, `ADP`: adposition, `PUNCT`: punctuation), the output of `.tag_` is more cryptic. You can use the `spacy.explain()` function to get a description of each tag:


In [None]:
for token in sentence:
    print(spacy.explain(token.tag_))

Go and dig up your primary school grammar book!

Let's put everything together:

In [None]:
for token in sentence:
    print(f'{token.text:{12}} {token.pos_:{10}} {token.tag_:{8}} {spacy.explain(token.tag_)}')


(The numbers in curly brackets set the minimum width of each column, for better alignment.)
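The width specifiers work on any string, independently of spaCy. Here is a small sketch with made-up rows in place of real tokens:

```python
# Hypothetical (text, pos) pairs standing in for token.text and token.pos_.
rows = [("Antonio", "PROPN"), ("is", "AUX")]

# {text:{12}} pads the value to at least 12 characters, left-aligned for strings.
# Writing {text:12} is equivalent.
lines = [f"{text:{12}} {pos:{10}}|" for text, pos in rows]

for line in lines:
    print(line)
```

Both lines end up with the same length, which is what keeps the columns aligned.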

You can count the number of occurrences of each POS tag by calling the `count_by` method of the `Doc`.

You need to pass `spacy.attrs.POS` as the argument of the method:

In [None]:
sentence = nlp("Antonio is learning Python Programming Language")

num_pos = sentence.count_by(spacy.attrs.POS)
num_pos

The keys of the dictionary are the IDs of the POS tags, and the values are their frequencies. To retrieve the POS tag for a given ID, you can do as follows:

In [None]:
sentence.vocab[96].text

where 96 is the ID of the tag. Printing all together:

In [None]:
for ID, frequency in num_pos.items():
    print(f"{ID} stands for {sentence.vocab[ID].text:{8}}: {frequency}")
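The same ID-to-string mechanism applies to other attributes. As a sketch that runs without a trained model (our assumption: a blank English pipeline, which has no POS tagger), the snippet below counts tokens by their text with `spacy.attrs.ORTH` and maps the integer keys back to strings.

```python
import spacy
from spacy.attrs import ORTH

# Assumption: blank pipeline, so we count by token text (ORTH) instead of POS.
blank_nlp = spacy.blank("en")
doc = blank_nlp("the cat and the dog")

# Keys are integer IDs, values are frequencies.
counts = doc.count_by(ORTH)

# Map every ID back to its string through the shared string store.
readable = {doc.vocab.strings[key]: frequency for key, frequency in counts.items()}
print(readable)
```

With a trained pipeline, replacing `ORTH` with `spacy.attrs.POS` gives the per-tag counts described above.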

## Named Entity Recognition

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the `ents` property of a `Doc`.


Example:



In [None]:
doc = nlp("Antonio works at Strive School.")

In [None]:
from spacy import displacy

displacy.render(doc, style="ent")

In [None]:
doc = nlp("Rome is a big city.")

In [None]:
displacy.render(doc, style="ent")

ORG stands for organization and GPE for geopolitical entity. Other common labels include PERSON, LOC, DATE and MONEY.
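If a label is unfamiliar, `spacy.explain()` returns a short description of it. The sketch below looks up a handful of labels; the selection is ours and is not the full label set of any model.

```python
import spacy

# A few entity labels picked for illustration.
labels = ["ORG", "GPE", "PERSON", "MONEY", "DATE"]

# spacy.explain looks the label up in spaCy's glossary; no model is required.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:{8}} {description}")
```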

In spaCy you can list the entities by doing:
    

In [None]:
doc = nlp('Manchester United is looking to sign Harry Kane for $90 million')

In [None]:
doc.ents

We can access each entity's text and label by doing:

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Even if the entities are self-explanatory for this example, you can use `spacy.explain()` for a detailed description.

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_, spacy.explain(ent.label_))
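When the statistical model misses an entity that matters for your use case, one possible form of tuning is a rule-based `entity_ruler` component. The sketch below assumes spaCy v3 and runs on a blank pipeline for simplicity; the pattern is a hypothetical example written for this notebook.

```python
import spacy

# Assumption: blank English pipeline with only a rule-based entity matcher.
blank_nlp = spacy.blank("en")
ruler = blank_nlp.add_pipe("entity_ruler")

# Hypothetical pattern: always label the exact phrase "Strive School" as an organization.
ruler.add_patterns([{"label": "ORG", "pattern": "Strive School"}])

doc = blank_nlp("Antonio works at Strive School.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

In a trained pipeline, the ruler can be added before or after the `ner` component, depending on which one should take precedence.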

In [None]:
from spacy import displacy

displacy.render(doc, style="ent")

We can also filter which entity to display:

In [None]:
sentence = nlp('Manchester United is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')
displacy.render(sentence, style='ent', jupyter=True)

In [None]:
# Avoid shadowing the built-in `filter`
options = {'ents': ['ORG']}
displacy.render(sentence, style='ent', jupyter=True, options=options)

## Preprocessing

To deal with text, it's often convenient to wrap all the preprocessing you want to do in a single function. Let's define one that:

- splits the text into tokens
- removes stop words
- removes punctuation
- makes everything lowercase
- lemmatizes each token

In [None]:
def preprocessing(sentence):
    """
    Tokenize, drop stop words and punctuation, lowercase and lemmatize.

    param sentence: a str containing the sentence we want to preprocess
    return: the list of lowercased lemmas
    """
    doc = nlp(sentence)
    tokens = [token.lemma_.lower() for token in doc if not token.is_punct and not token.is_stop]
    return tokens

In [None]:
preprocessing("This is a sentence I'm going to preprocess")

As you can see, removing stop words can be brutal, because you lose a lot of words that contribute to the meaning of the sentence. For example, we lost the word "I". For this reason, it is sometimes better to specify your own list of words that you think won't be meaningful for the task you're going to perform. A good way to find them is to check the most frequent words in your corpus.

Let's say you have this cell as text:

In [None]:
text = """As you can see, removing "stopwords" can be brutal, cause you will be missing a lot of important words that helps for the meaning of the sentence. For example, we missed the word "I". For this reason, sometimes is better to specify a list of words that you think won't be meaningful for the task you're going to perform. A good way to find them, is by checking the most frequent words in the corpus you have.

Let's say you have this cell as text:"""

In [None]:
from collections import Counter

counter = Counter()
counter.update(text.split(" "))

In [None]:
counter
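Splitting on spaces keeps punctuation attached to the words and treats "The" and "the" as different entries. The sketch below shows one way to normalize before counting, on a short made-up text; the regular expression is a simplification that only handles unaccented English letters and apostrophes.

```python
import re
from collections import Counter

sample = "The cat saw the dog. The dog, however, ignored the cat!"

# Naive counting: "dog." and "dog," are different keys, and so are "The" and "the".
naive = Counter(sample.split(" "))

# Lowercase first, then keep only runs of letters and apostrophes.
words = re.findall(r"[a-z']+", sample.lower())
normalized = Counter(words)

print(naive["the"], naive["dog"])
print(normalized.most_common(3))
```

After normalization the counts for "the", "cat" and "dog" reflect how often each word really occurs.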

By typing:

In [None]:
counter.most_common(10)

we get the 10 most common words. Let's say that I'm trying to get the gist of the text: then I could decide to get rid of the words "the" and "a".

In that case, my preprocessing function becomes something like:


In [None]:
STOPWORDS = {"the", "a"}

def preprocessing(sentence):
    """
    param sentence: a str containing the sentence we want to preprocess
    return: the list of lowercased lemmas
    """
    doc = nlp(sentence)
    tokens = [
        token.lemma_.lower()
        for token in doc
        if not token.is_punct and token.lemma_.lower() not in STOPWORDS
    ]
    return tokens

where you see I replaced the check `not token.is_stop` with a membership test against `STOPWORDS`.

As usual, being aware of the task you are going to perform matters more than blindly applying a fixed set of steps, and it improves your chances of achieving high accuracy.