## Importing spaCy

In [1]:
# !conda install -c conda-forge spacy
# !python -m spacy download en_core_web_sm
# The small model above is recommended: the large one below is more accurate but less efficient
# !python -m spacy download en_core_web_lg

In [1]:
import spacy


nlp = spacy.load("en_core_web_sm")
# You can also load en_core_web_lg, which is more accurate but less efficient
# nlp = spacy.load("en_core_web_lg")



In [2]:
print(nlp.pipeline)

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7f7ba335fae0>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7f7ba33404f0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f7ba3551f40>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7f7ba3698ee0>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7f7ba3330980>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7f7ba331ffc0>)]


In [3]:
# Process the text "Hello, world. Antonio is learning Python." with spaCy
doc = nlp("Hello, world. Antonio is learning Python.")

In [4]:
for token in doc:
    print(token.text)

Hello
,
world
.
Antonio
is
learning
Python
.


In [6]:
# Get first token of the processed document
token = doc[0]
print(token)

# Print sentences (one sentence per line)
for sent in doc.sents:
    print(sent)

Hello
Hello, world.
Antonio is learning Python.


In [7]:
tokens = nlp("Let's go to N.Y.!")

In [8]:
for token in tokens:
    print(token.text)

Let
's
go
to
N.Y.
!


As you have seen, calling `nlp` (the object returned by `spacy.load("en_core_web_sm")`) on a sentence gives you its tokenized version. If you only need the `Tokenizer` instance, you can get it with:

In [9]:
tokenizer = nlp.tokenizer
type(tokenizer)

spacy.tokenizer.Tokenizer

If you want to instantiate a custom one, with your own rules, prefixes, and so on:

In [10]:
from spacy.tokenizer import Tokenizer

tokenizer = Tokenizer(vocab=nlp.vocab)

The tokenizer defined above is built from the vocabulary only: it has no prefix, suffix, or exception rules, so it splits on whitespace.
Let's test it on "Let's go to N.Y.!"

In [11]:
tokens = tokenizer("Let's go to N.Y.!")
for token in tokens:
    print(token)



Let's
go
to
N.Y.!


As you can see, it splits neither the contraction nor the trailing punctuation. So we can add rules for this!

In [15]:
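prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes)
# Note: the suffix regex below is also built from the prefix list, which is what produced
# the outputs shown in this section. For the full English suffix rules, pass
# nlp.Defaults.suffixes instead (the tokenization may then differ, e.g. for "Let's").
suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.prefixes)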

In [16]:
tokenizer = Tokenizer(
    vocab=nlp.vocab, prefix_search=prefix_re.search, suffix_search=suffix_re.search
)

In [17]:
tokens = tokenizer("Let's go to N.Y.!")
for token in tokens:
    print(token)

Let's
go
to
N.Y.
!


You can also check the exceptions the tokenizer can handle:

In [18]:
from spacy.lang.en.tokenizer_exceptions import TOKENIZER_EXCEPTIONS

TOKENIZER_EXCEPTIONS.values()

dict_values([[{65: ' '}], [{65: '\t'}], [{65: '\\t'}], [{65: '\n'}], [{65: '\\n'}], [{65: '—'}], [{65: '\xa0', 67: '  '}], [{65: "'"}], [{65: '\\")'}], [{65: '<space>'}], [{65: "''"}], [{65: 'C++'}], [{65: 'a.'}], [{65: 'b.'}], [{65: 'c.'}], [{65: 'd.'}], [{65: 'e.'}], [{65: 'f.'}], [{65: 'g.'}], [{65: 'h.'}], [{65: 'i.'}], [{65: 'j.'}], [{65: 'k.'}], [{65: 'l.'}], [{65: 'm.'}], [{65: 'n.'}], [{65: 'o.'}], [{65: 'p.'}], [{65: 'q.'}], [{65: 'r.'}], [{65: 's.'}], [{65: 't.'}], [{65: 'u.'}], [{65: 'v.'}], [{65: 'w.'}], [{65: 'x.'}], [{65: 'y.'}], [{65: 'z.'}], [{65: 'ä.'}], [{65: 'ö.'}], [{65: 'ü.'}], [{65: 'O.O'}], [{65: 'XDD'}], [{65: '(-_-)'}], [{65: '=|'}], [{65: 'xDD'}], [{65: '(>_<)'}], [{65: 'ಠ_ಠ'}], [{65: ':-O'}], [{65: ':-)'}], [{65: '^___^'}], [{65: 'ಠ︵ಠ'}], [{65: ':-(('}], [{65: ':-p'}], [{65: ':-((('}], [{65: ':('}], [{65: ':x'}], [{65: '<3'}], [{65: ')-:'}], [{65: '(ಠ_ಠ)'}], [{65: ':-x'}], [{65: '[-:'}], [{65: ';-)'}], [{65: ':-o'}], [{65: ';-D'}], [{65: ':3'}], [{65: '(╯°□°）

In [19]:
tokens = tokenizer("This is a $STOCK.")
for token in tokens:
    print(token)

This
is
a
$
STOCK.


You can add custom prefix rules as regular expressions:

In [22]:
custom_prefixes = nlp.Defaults.prefixes + [r"\$[a-zA-Z]+"]

In [23]:
prefix_re = spacy.util.compile_prefix_regex(custom_prefixes)



In [24]:
import re

# Replace the prefix rules with the custom pattern only, anchored at the start of the token
prefix_re = re.compile(r"^\$[a-zA-Z]+")
tokenizer = Tokenizer(
    nlp.vocab, prefix_search=prefix_re.search, suffix_search=suffix_re.search
)

tokens = tokenizer("This is a $STOCK.")
for token in tokens:
    print(token)

This
is
a
$STOCK
.
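One plausible reason the cell above swaps in a standalone regex: assuming `compile_prefix_regex` joins all patterns into a single alternation (as current spaCy versions appear to do), a pattern appended after the default `\$` rule never gets a chance to match, because the first alternative that matches wins. The sketch below reproduces that effect with plain `re`; `compile_prefixes` and the two-entry `default_like` list are hypothetical stand-ins, not spaCy code.

```python
import re

def compile_prefixes(patterns):
    # Toy stand-in: anchor each pattern and join them into one alternation
    return re.compile("|".join("^" + p for p in patterns))

default_like = [r"\(", r"\$"]   # hypothetical, much shorter than the real defaults
custom = r"\$[a-zA-Z]+"

appended = compile_prefixes(default_like + [custom])
prepended = compile_prefixes([custom] + default_like)

# The generic "$" rule comes first, so it wins over the custom pattern
print(appended.search("$STOCK.").group())
# The custom pattern comes first, so the whole ticker is matched
print(prepended.search("$STOCK.").group())
```

Under that assumption, putting the custom pattern before the defaults would be an alternative to replacing the prefix rules entirely.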


You can also add special-case tokenization rules. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guides on [language data](https://spacy.io/usage/linguistic-features#language-data) and [tokenizer special cases](https://spacy.io/usage/linguistic-features#special-cases) for more details and examples.

In [26]:
from spacy.attrs import ORTH, NORM

dont_case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
# The ORTH values must join back into the original string ("gim" + "me" == "gimme").
# Splitting into "gi" + "me" raises:
# ValueError: [E997] Tokenizer special cases are not allowed to modify the text.
gimme_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me", NORM: "me"}]
tokenizer.add_special_case("don't", dont_case)
tokenizer.add_special_case("gimme", gimme_case)
tokens = tokenizer("Yo! gimme five!")
for token in tokens:
    print(token.norm_)
tokens = tokenizer("You don't do that")
for token in tokens:
    print(token.norm_)
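Special cases are not allowed to change the text: the `ORTH` pieces have to join back into the original string. Here is a minimal plain-Python check of that constraint; the helper name is hypothetical, and the string key `"ORTH"` stands in for `spacy.attrs.ORTH`.

```python
def is_valid_special_case(string, pieces):
    """True when the "ORTH" pieces join back into the original string."""
    return "".join(piece["ORTH"] for piece in pieces) == string

# "gi" + "me" gives "gime", which is not "gimme"
print(is_valid_special_case("gimme", [{"ORTH": "gi"}, {"ORTH": "me"}]))
# "gim" + "me" gives "gimme"
print(is_valid_special_case("gimme", [{"ORTH": "gim"}, {"ORTH": "me"}]))
# "do" + "n't" gives "don't"
print(is_valid_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't"}]))
```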

When you load a model with pretrained NER (Named Entity Recognition), like `en_core_web_sm`, you can merge the tokens of each entity it finds into a single token. Let's check what is inside the pipeline run by `nlp`:


In [27]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f7ba335fae0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f7ba33404f0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f7ba3551f40>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f7ba3698ee0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f7ba3330980>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f7ba331ffc0>)]

Among other components, there are a tagger, a dependency parser, and an entity recognizer. Let's check the entities of the following sentence:

In [28]:
doc = nlp("Apple is a $1000b company.")


In [29]:
for token in doc:
    print(token)

Apple
is
a
$
1000b
company
.


In [30]:
for ent in doc.ents:
    print(ent, ent.label_)

Apple ORG


In [31]:
doc = nlp(
    "This is Strive School. It's worthy to merge 'Strive School' as a single token instead of two"
)

for token in doc:
    print(token)

This
is
Strive
School
.
It
's
worthy
to
merge
'
Strive
School
'
as
a
single
token
instead
of
two


In [32]:
for ent in doc.ents:
    print(ent, ent.label_)

Strive School ORG
two CARDINAL


Let's add "merge_entities" to the pipeline (this only works if the pipeline includes the entity recognizer):

In [34]:
nlp.add_pipe("merge_entities")

<function spacy.pipeline.functions.merge_entities(doc: spacy.tokens.doc.Doc)>

In [35]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f7ba335fae0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f7ba33404f0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f7ba3551f40>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f7ba3698ee0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f7ba3330980>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f7ba331ffc0>),
 ('merge_entities',
  <function spacy.pipeline.functions.merge_entities(doc: spacy.tokens.doc.Doc)>)]

In [36]:
doc = nlp(
    "This is Strive School. It's worthy to merge 'Strive School' as a single token instead of two"
)

for token in doc:
    print(token)

This
is
Strive School
.
It
's
worthy
to
merge
'
Strive
School
'
as
a
single
token
instead
of
two
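To illustrate the idea behind merging, here is a toy plain-Python version that collapses given token spans into single tokens. It is only a sketch: `merge_spans` is a hypothetical helper, the span indices are written by hand rather than predicted by a model, and joining with a space simplifies how spaCy actually keeps the original whitespace.

```python
def merge_spans(tokens, spans):
    """Merge each (start, end) span of tokens into one space-joined token."""
    ends = {start: end for start, end in spans}
    merged, i = [], 0
    while i < len(tokens):
        if i in ends:
            # Collapse tokens[i:end] into a single token and jump past the span
            merged.append(" ".join(tokens[i:ends[i]]))
            i = ends[i]
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["This", "is", "Strive", "School", "."]
entities = [(2, 4)]  # hypothetical entity span covering "Strive" and "School"
print(merge_spans(tokens, entities))
```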


In [37]:
TEXTS = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]


In [38]:
for sentence in nlp.pipe(TEXTS):
    for token in sentence:
        print(token)
    print("------------------")

Net
income
was
$9.4 million
compared
to
the prior year
of
$2.7 million
.
------------------
Revenue
exceeded
twelve billion dollars
,
with
a
loss
of
$
1b
.
------------------


It's also possible to merge each noun chunk into a single token:

In [39]:
nlp.add_pipe("merge_noun_chunks")

<function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

In [40]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f7ba335fae0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f7ba33404f0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f7ba3551f40>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f7ba3698ee0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f7ba3330980>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f7ba331ffc0>),
 ('merge_entities',
  <function spacy.pipeline.functions.merge_entities(doc: spacy.tokens.doc.Doc)>),
 ('merge_noun_chunks',
  <function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>)]

In [41]:
doc = nlp("Hello, I'm Antonio Marsella, nice to meet you.")
for token in doc:
    print(token)

Hello
,
I
'm
Antonio Marsella
,
nice
to
meet
you
.


## Removing stop words

In general, it's convenient to remove all the stop words, *i.e. very common words in a language*, because they contribute little to most NLP problems, such as semantic analysis.

In [42]:
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords
print("Number of stop words: %d" % len(spacy_stopwords))
print("First ten stop words: %s" % list(spacy_stopwords)[:10])

Number of stop words: 326
First ten stop words: ['thence', 'whence', 'any', 'eleven', 'between', 'been', 'beforehand', 'whoever', 'however', 'another']


To remove them:

In [43]:
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

doc = nlp(text)

tokens = [token.text for token in doc if not token.is_stop]
for token in tokens:
    print(token)

determined
drop
his litigation
the monastry
,
relinguish
his claims
the wood-cuting


fishery rihgts
.
ready
becuase
the rights
valuable
,


indeed the vaguest idea
the wood
river
question
.
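To make the filtering step concrete without spaCy, here is a minimal plain-Python sketch. The tiny `STOPWORDS` set and the `remove_stop_words` helper are hypothetical stand-ins, not spaCy's list or API.

```python
# Hypothetical, hand-picked stop words (spaCy's English list is far longer)
STOPWORDS = {"he", "to", "his", "with", "the", "and", "at"}

def remove_stop_words(tokens):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "He determined to drop his litigation with the monastry".split()
print(remove_stop_words(tokens))
```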


To add custom stop words:

In [44]:
customize_stop_words = ["computing", "filtered"]
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

## Stemming and Lemmatization

In most natural languages, a root word can have many variants. For example, the word ‘play’ can be used as ‘playing’, ‘played’, ‘plays’, etc. You can think of similar examples (and there are plenty).

**Stemming**

Let’s first understand stemming:

Stemming is a text normalization technique that cuts off the end or the beginning of a word, based on a list of common prefixes and suffixes that could be found in that word. It is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
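As a rough illustration of suffix stripping, here is a deliberately naive stemmer. The suffix list and the `min_stem` guard are assumptions made for this sketch; real stemmers such as Porter's use many more rules.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ing", "ly", "es", "ed", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for word in ["play", "playing", "played", "plays", "quickly", "studies"]:
    print(word, "->", naive_stem(word))
```

Note that "studies" becomes "studi", which is not a real word: stems only need to be consistent, not valid dictionary entries.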
 

**Lemmatization**

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of a vocabulary (a dictionary of words) and morphological analysis (word structure and grammar relations).

A stemming algorithm works by cutting the suffix or prefix off the word. Lemmatization is a more powerful operation because it takes the morphological analysis of the word into account.

Lemmatization returns the lemma, which is the root word of all its inflected forms.

We can say that stemming is a quick and dirty method of chopping words down to their root form, while lemmatization is an intelligent operation that uses dictionaries built with in-depth linguistic knowledge. Hence, lemmatization helps in forming better features.
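The dictionary side of lemmatization can be sketched with a lookup table. The table below is a hypothetical miniature written for this example; it only shows why irregular forms such as "were" or "mice" need a dictionary rather than suffix stripping.

```python
# Hypothetical miniature lemma dictionary; real lemmatizers ship far larger tables and rules
LEMMA_TABLE = {
    "were": "be",
    "was": "be",
    "better": "good",
    "mice": "mouse",
    "rights": "right",
}

def lookup_lemma(word):
    """Return the dictionary lemma, or the lowercased word itself when it is unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

for word in ["were", "better", "mice", "rights", "river"]:
    print(word, "->", lookup_lemma(word))
```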

In [45]:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities")
# not using merge_noun_chunks
doc = nlp(
    u"""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""
)

lemma_word1 = []
for token in doc:
    if token.is_stop:
        continue
    lemma_word1.append(token.lemma_)
lemma_word1

['determine',
 'drop',
 'litigation',
 'monastry',
 ',',
 'relinguish',
 'claim',
 'wood',
 '-',
 'cuting',
 '\n',
 'fishery',
 'rihgts',
 '.',
 'ready',
 'becuase',
 'right',
 'valuable',
 ',',
 '\n',
 'vague',
 'idea',
 'wood',
 'river',
 'question',
 '.']

## Removing the punctuation



In [46]:
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""


import string

text_no_punct = "".join([char for char in text if char not in string.punctuation])

text_no_punct

'He determined to drop his litigation with the monastry and relinguish his claims to the woodcuting and \nfishery rihgts at once He was the more ready to do this becuase the rights had become much less valuable and he had \nindeed the vaguest idea where the wood and river in question were'
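A common alternative to the character-by-character join is `str.translate` with a deletion table. The sketch below uses a shorter, made-up sentence and checks that both approaches agree.

```python
import string

sample = "He was ready, indeed; the wood-cutting rights (and more) were gone."

# Build a translation table that deletes every ASCII punctuation character
table = str.maketrans("", "", string.punctuation)
no_punct_translate = sample.translate(table)

# Same filtering, written as the join used in the cell above
no_punct_join = "".join(ch for ch in sample if ch not in string.punctuation)

print(no_punct_translate)
print(no_punct_translate == no_punct_join)
```

Both variants rely on `string.punctuation`, which only covers ASCII, so characters such as curly quotes are left in place.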

In [47]:
doc = nlp(text_no_punct)
for token in doc:
    print(token)

He
determined
to
drop
his
litigation
with
the
monastry
and
relinguish
his
claims
to
the
woodcuting
and


fishery
rihgts
at
once
He
was
the
more
ready
to
do
this
becuase
the
rights
had
become
much
less
valuable
and
he
had


indeed
the
vaguest
idea
where
the
wood
and
river
in
question
were


For text extracted from dialogues or chats, it is convenient to preprocess the text so that repeated occurrences of the same character are condensed into one or two, and then use a spell checker to find the correct form of each word.

One way to do that is to replace every run of a repeated character with a single one and then run a spell checker: "hhheeelllllooo hoooowww areee youuu?" becomes "helo how are you?", which the spell checker would ideally turn into "hello how are you?".




In [48]:
st = "hhheeeLLLLooo hoooowww areee youuu?????"
text = re.sub(r"(.)\1+", r"\1", st)
text

'heLo how are you?'
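The cell above collapses every run to a single character. A small variant, sketched here with a hypothetical `squeeze_repeats` helper, lets you choose how many repetitions to keep, which covers the "one or two" option mentioned earlier.

```python
import re

def squeeze_repeats(text, keep=2):
    """Shorten every run of more than `keep` identical characters to exactly `keep`."""
    pattern = re.compile(r"(.)\1{%d,}" % keep)
    return pattern.sub(lambda m: m.group(1) * keep, text)

noisy = "hhheeelllllooo hoooowww areee youuu?"
print(squeeze_repeats(noisy, keep=2))
print(squeeze_repeats(noisy, keep=1))
```

Keeping two characters preserves legitimate doubles such as the "ll" in "hello", at the cost of leaving more work for the spell checker.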

In [53]:
!pip install autocorrect



In [51]:
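# `SpellChecker` is provided by the `pyspellchecker` package (`pip install pyspellchecker`),
# which is imported as `spellchecker`, not by `autocorrect`.
# If the import fails with "No module named 'indexer'", a different package
# named `spellchecker` is likely installed instead.
from spellchecker import SpellChecker

doc = nlp(text)
spell = SpellChecker()

# Find the words that may be misspelled
misspelled = spell.unknown([token.text for token in doc])

for word in misspelled:
    # Get the single most likely correction
    print(spell.correction(word))

    # Get the set of likely candidates
    print(spell.candidates(word))

ModuleNotFoundError: No module named 'indexer'

This did not correct "helo". Let's try another spell checker: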

https://github.com/fsondej/autocorrect

In [55]:
from autocorrect import Speller

spell = Speller()
text = nlp(text)
spell(text.text)

'hero how are you?'

As you can see, it doesn't always work properly! Overall, however, it should improve your text.
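To see what a spell checker does in principle, here is a minimal sketch based on the standard library's `difflib`. The four-word vocabulary and the `correct_word` helper are hypothetical; real spell checkers rely on large frequency dictionaries, which is also why they can prefer a word like "hero" over "hello".

```python
import difflib

# Hypothetical mini vocabulary; a real spell checker uses a large frequency dictionary
VOCAB = ["hello", "how", "are", "you"]

def correct_word(word, vocab=VOCAB, cutoff=0.6):
    """Return the closest vocabulary word, or the word unchanged if nothing is close enough."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(" ".join(correct_word(w) for w in "helo how are you".split()))
```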

If you want to create a separate lemmatizer instead of using the one in the pipeline (the import below is the spaCy v2 API, which no longer exists in spaCy v3, hence the error):

In [63]:
from spacy.lemmatizer import Lemmatizer, ADJ, NOUN, VERB

lemmatizer = nlp.vocab.morphology.lemmatizer
print(lemmatizer("studying", VERB))
print(lemmatizer("studying", NOUN))
print(lemmatizer("studying", ADJ))

ModuleNotFoundError: No module named 'spacy.lemmatizer'

In [64]:
nlp.vocab.lookups.tables

['lexeme_norm']

spaCy has no built-in stemming! However, lemmatization is enough for most tasks. As an alternative, you can use the [NLTK library](https://www.nltk.org).

## Named Entity Recognition

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the `ents` property of a `Doc`.


Example:



In [65]:
doc = nlp("Antonio works at Strive School.")

In [66]:
from spacy import displacy

displacy.render(doc, style="ent")

In [67]:
doc = nlp("Rome is a big city.")

In [68]:
displacy.render(doc, style="ent")

ORG stands for organization, GPE stands for Geopolitical Entity. Some other tags are: