# **NLTK vs SpaCy**

NLTK and SpaCy are both designed to be efficient libraries for natural language processing, and both offer a range of optimized algorithms to help speed up processing.

However, the two libraries differ in speed and performance on specific tasks and operations.

At times, one may be faster than the other. For example, SpaCy is designed to be particularly fast at tasks like tokenization and part-of-speech tagging, and it uses optimized Cython-based implementations of many of its core algorithms to achieve this performance.

On the other hand, NLTK offers a broader and more flexible toolkit, including modules for language modeling and machine learning, and for some of these tasks it may be the faster option.

It is also important to note that there are various other libraries you can use to perform the common pre-processing steps we have covered in the last few classes. For example, Gensim, CoreNLP, and Scikit-learn also have some preprocessing support. With so many options, the choice might come down to what you want to do downstream of preprocessing and whether the library has support for that task.

Let's cover some examples where you might see evidence that SpaCy and NLTK are different.

## **SpaCy and NLTK Tokenization**

In [4]:
##NLTK
import nltk
#if you need to, download the following if missing
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('brown')

import string
from nltk.corpus import brown
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

#accessing the stop words
stopwords_nltk = stopwords.words('english')

corpus = brown.words(categories=['hobbies'])
corpus = [''.join(c for c in word if c.isalpha()) for word in corpus]
corpus = [word for word in corpus if word !='']
corpus = " ".join(corpus)

#tokenized text
tokenized_corpus = word_tokenize(corpus)

# How many stop words in NLTK?
print("Number of Stopwords: ", len(stopwords_nltk))

print("Corpus words: ", len(tokenized_corpus))
#removal of stopwords (filter the token list, not the joined string; the stop word list is lowercase)
nltk_without_stopwords = [word for word in tokenized_corpus if word.lower() not in stopwords_nltk]
print("Corpus words (Stopwords removed): ", len(nltk_without_stopwords))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


Number of Stopwords:  179


In [5]:
##SPACY
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

nlp = English()
tokenizer = Tokenizer(nlp.vocab)
tokens_spacy = tokenizer(corpus)

stopwords_spacy = STOP_WORDS
# How many stopwords in SpaCy?
print("Number of Stopwords: ", len(stopwords_spacy))

#removal of stopwords (filter the tokens, not the characters of the joined string)
spacy_without_stopwords = [token.text for token in tokens_spacy if token.text.lower() not in stopwords_spacy]


print("Corpus words (Stopwords removed): ", len(spacy_without_stopwords))

Number of Stopwords:  326
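
The two libraries ship stop-word lists of different sizes (179 and 326 entries above), so the same tokens can shrink by different amounts depending on which list is used. The sketch below illustrates this with two tiny, hypothetical stop-word sets standing in for the real lists; `remove_stopwords` is an illustrative helper, not part of either library. Note that the filter runs over a list of tokens: iterating over a plain string would test single characters instead of words.

```python
# Toy stop-word sets; hypothetical stand-ins for the NLTK and SpaCy lists.
small_stoplist = {"the", "is", "a"}
large_stoplist = small_stoplist | {"of", "one", "among"}

def remove_stopwords(tokens, stoplist):
    """Keep tokens whose lowercase form is not in the stop-word set."""
    return [tok for tok in tokens if tok.lower() not in stoplist]

# A list of word tokens (a simple whitespace split is enough for this sketch).
tokens = "Running is one of the gaits of terrestrial locomotion".split()

kept_small = remove_stopwords(tokens, small_stoplist)
kept_large = remove_stopwords(tokens, large_stoplist)

print(len(tokens), len(kept_small), len(kept_large))
print(kept_large)
```

Using a `set` for the stop words keeps each membership test constant-time, which matters on a corpus-sized token list.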


## **Lemmatization and POS tagging between NLTK and SpaCy**

In [14]:
text = 'Running is one of the gaits of terrestrial locomotion among legged animals'
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

for token in doc:
    print(token.text + "-->" + token.lemma_)

for token in doc:
    print(token.text + "-->" + token.pos_)

Running-->run
is-->be
one-->one
of-->of
the-->the
gaits-->gait
of-->of
terrestrial-->terrestrial
locomotion-->locomotion
among-->among
legged-->legged
animals-->animal
Running-->VERB
is-->AUX
one-->NUM
of-->ADP
the-->DET
gaits-->NOUN
of-->ADP
terrestrial-->ADJ
locomotion-->NOUN
among-->ADP
legged-->ADJ
animals-->NOUN


In [16]:
# Lemmatize the same sentence with NLTK's WordNet lemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

tokens = word_tokenize(text)

# lemmatize() treats every word as a noun unless a POS is given,
# so 'is' is explicitly lemmatized as a verb here
nltk_lemmas = [lemmatizer.lemmatize(word, pos='v') if word == 'is' else lemmatizer.lemmatize(word) for word in tokens]

for token, lemma in zip(tokens, nltk_lemmas):
    print(f"{token} --> {lemma}")

Running --> Running
is --> be
one --> one
of --> of
the --> the
gaits --> gait
of --> of
terrestrial --> terrestrial
locomotion --> locomotion
among --> among
legged --> legged
animals --> animal


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
# Is there a difference between lemmatization in SpaCy and NLTK?
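
One visible difference above is "Running": SpaCy returns "run", while NLTK leaves it unchanged because `WordNetLemmatizer.lemmatize` assumes a noun unless a `pos` argument is passed. A common way to narrow the gap is to derive that argument from a part-of-speech tag. The sketch below shows only the tag mapping, in plain Python; `penn_to_wordnet` is a hypothetical helper name, and the tagged tokens are written by hand for illustration rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# Hand-written (word, tag) pairs for illustration; not real tagger output.
tagged = [("Running", "VBG"), ("is", "VBZ"), ("gaits", "NNS"), ("legged", "JJ")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)

# With the NLTK data installed, each pair could then be passed on as:
#   lemmatizer.lemmatize(word, pos=pos)
```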

## **Execution Time**

In [None]:
import time
import nltk
import spacy
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

# Load the text data
with open('plots.txt', 'r') as f:
  text = f.read()

# Tokenize the text using NLTK
st = time.time()
tokens_nltk = nltk.word_tokenize(text)
et = time.time()
time_nltk = et - st

# Tokenize the same text using SpaCy (build the tokenizer before starting the clock)
spacy_model = English()
tokenizer = Tokenizer(spacy_model.vocab)
st = time.time()
tokens = tokenizer(text)
tokens_spacy = [token.text for token in tokens]
et = time.time()
time_spacy = et - st

# Print the results
print(f'NLTK: {time_nltk:.4f} seconds')
print(f'SpaCy: {time_spacy:.4f} seconds')

NLTK: 0.1357 seconds


In [None]:
#nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

# Tag the text using NLTK
st = time.time()
tokens = nltk.word_tokenize(text)
tags_nltk = nltk.pos_tag(tokens)
et = time.time()
time_nltk = et - st

# Tag the text using SpaCy
nlp = spacy.load('en_core_web_sm', disable=["ner", "parser", "textcat", "entity_linker", "sentencizer"])
st = time.time()
doc = nlp(text)
#tags = [(token.text, token.pos_) for token in doc]
et = time.time()
time_spacy = et - st

# Print the results
print(f'NLTK: {time_nltk:.4f} seconds')
print(f'SpaCy: {time_spacy:.4f} seconds')

NLTK: 0.8632 seconds
SpaCy: 1.7484 seconds


**Exercise:** Try changing this code to time sentence tokenization.

Keep in mind that the performance of these functions may vary depending on the specific hardware and software environment, meaning the results of the benchmark may not be directly comparable between different systems.
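
Timings also vary from run to run on the same machine, so a single measurement can be misleading. A minimal sketch of one way to reduce that noise is shown below: repeat the call several times with `time.perf_counter` and keep the fastest run. The helper name `best_of` and the use of `str.split` as a stand-in for a real tokenizer are assumptions made for illustration.

```python
import time

def best_of(func, *args, repeats=5):
    """Run func(*args) several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

# A synthetic text; str.split stands in for a tokenizer such as nltk.word_tokenize.
sample = "Running is one of the gaits of terrestrial locomotion . " * 10_000
elapsed = best_of(str.split, sample)
print(f"str.split: {elapsed:.4f} seconds")
```

The same helper could be called as `best_of(nltk.word_tokenize, text)` or `best_of(tokenizer, text)` so that both libraries are timed on identical input.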


## **For your interest: Creating your own exceptions in SpaCy**

Sometimes, different aspects of these libraries do not work quite the way you would like them to. Both NLTK and SpaCy offer options to add your own rules and exceptions to handle text in a customized way. Some examples are shown below (note that there are far more customization options than I show here). SpaCy has good documentation on the different possibilities for customization; see, for example, https://spacy.io/usage/rule-based-matching, or https://spacy.io/api/tokenizer for more information on how to use the special-case attributes.

**Tokenization**

In [None]:
from spacy.attrs import ORTH, NORM

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer.add_special_case("U.S.", [{ORTH: "U.S.", NORM: "U.S."}])

test_text = "We are going to the U.S."

doc = nlp(test_text)
print([token.text for token in doc])

['We', 'are', 'going', 'to', 'the', 'U.S.']


ORTH is the exact verbatim text of a token. NORM is the normalized form of the text.

An example that is already built into SpaCy's standard English tokenizer exceptions is the handling of "don't", which is split into two tokens described by {ORTH: "do"} and {ORTH: "n't", NORM: "not"}.
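
To make the ORTH/NORM idea concrete, here is a toy, pure-Python sketch of a special-case lookup. It is not SpaCy's implementation: plain string keys stand in for the `ORTH` and `NORM` attribute IDs, and `tokenize` and `check_special_case` are hypothetical helpers. It does mirror one real constraint, namely that the ORTH pieces of a special case must concatenate back to the original string.

```python
# Plain-string keys stand in for spaCy's ORTH and NORM attribute IDs.
special_cases = {
    "don't": [{"ORTH": "do"}, {"ORTH": "n't", "NORM": "not"}],
    "U.S.": [{"ORTH": "U.S.", "NORM": "U.S."}],
}

def check_special_case(string, pieces):
    """Return True when the ORTH pieces concatenate back to the original string."""
    return "".join(piece["ORTH"] for piece in pieces) == string

def tokenize(text):
    """Whitespace-split, then expand any chunk that has a special case."""
    tokens = []
    for chunk in text.split():
        pieces = special_cases.get(chunk, [{"ORTH": chunk}])
        # In this sketch, NORM falls back to the verbatim text when not given.
        tokens.extend((p["ORTH"], p.get("NORM", p["ORTH"])) for p in pieces)
    return tokens

# Every special case must reproduce its own key from its ORTH pieces.
assert all(check_special_case(s, p) for s, p in special_cases.items())

# Each token is printed as a (verbatim text, normalized form) pair.
print(tokenize("We don't live in the U.S."))
```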

As another example, let's say we have the name/title of someone that we want to be considered a single entity, for example "Alexander the Great". This could be useful for organizations as well since these are not always properly recognized.

**Named Entity Matching**

In [None]:
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

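# Define a token pattern for the phrase (LOWER makes the match case-insensitive)
pattern = [{'LOWER': 'alexander'}, {'LOWER': 'the'}, {'LOWER': 'great'}]

# Register the pattern with a Matcher under the key "PERSON"
matcher = Matcher(nlp.vocab)
matcher.add("PERSON", [pattern])

In [None]:
doc = nlp("Alexander the Great was a king of the ancient Greek kingdom of Macedon.")
# matches is a list of (match_id, start, end) tuples for the pattern above
matches = matcher(doc)

# doc.ents holds the entities found by the model's NER component, not the matcher results
for ent in doc.ents:
    print(ent.text, ent.label_)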

Alexander the Great PERSON
Greek NORP
Macedon GPE


**Additional stop words**

In [22]:
import spacy
from spacy.lang.en import STOP_WORDS

nlp = spacy.load("en_core_web_sm")

# add new stop words
custom_stop_words = {"custom", "stop", "words"}

# add them to the list of stop words in this spaCy model
for word in custom_stop_words:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True

sentences = [
    "This is a custom sentence.",
    "The stop sign is red.",
    "These words are important."
]

for sentence in sentences:
    doc = nlp(sentence)
    print(f"Original sentence: {sentence}")
    print("Stop words in the sentence:")
    for token in doc:
        if token.is_stop:
            print(token.text)
    print("----")


Original sentence: This is a custom sentence.
Stop words in the sentence:
This
is
a
custom
----
Original sentence: The stop sign is red.
Stop words in the sentence:
The
stop
is
----
Original sentence: These words are important.
Stop words in the sentence:
These
words
are
----
