<font>
<div dir=ltr align=center>
<img src='https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png' width=150 height=150> <br>
<font color=0F5298 size=6>
Natural Language Processing<br>
<font color=2565AE size=4>
Computer Engineering Department<br>
Spring 2025<br>
<font color=3C99D size=4>
Workshop 1 - NLP Frameworks - NLTK<br>
<font color=696880 size=3>
<a href='https://language.ml'>https://language.ml</a><br>
info [AT] language [dot] ml

# 📖 Part 1: Introduction

## ❓ What is NLTK? ([NLTK Official Website](https://www.nltk.org/))
The Natural Language Toolkit (NLTK) is a popular Python library for working with human language data. It provides easy access to a wide range of text processing tools such as **tokenization**, **stemming**, **lemmatization**, **part-of-speech tagging**, and more. NLTK also includes various corpora and lexical resources to support NLP tasks.

It’s often used for:
- Teaching NLP concepts
- Research prototyping
- Basic language analysis tasks

---

## ✅ When to Use NLTK
- **Educational Purposes:** It is ideal for learning fundamental NLP concepts such as tokenization, POS tagging, parsing, etc., because of its transparent and modular design.
- **Prototyping:** When you want to quickly try out an idea without needing large-scale data or high performance.
- **Rule-Based NLP:** For projects that involve hand-crafted rules or custom tokenization.
- **Working with Classical NLP Tasks:** Especially if you want to explore linguistic structure using treebanks, grammars, or symbolic approaches.
- **Lightweight Projects:** If you don’t need deep learning models or large-scale pipelines, NLTK provides a rich set of tools out of the box.

---

## 🚫 NLTK is not recommended for:
- **Large-scale production systems** (use spaCy or Hugging Face instead)
- **Neural network-based NLP** (use PyTorch, TensorFlow, or Hugging Face Transformers)

## ⚙️ Installation & Setup

In [18]:
!pip install nltk



In [19]:
import nltk

# ✂️ Part 2: Basic Text Preprocessing

Text preprocessing is usually the first step in an NLP pipeline, and later steps depend heavily on its quality. It cleans and structures raw text into a format that can be used for further analysis or modeling.

## 📦 Downloading Resources

In [20]:
nltk.download('punkt_tab')  # For tokenization, NLTK 3.8.2 or newer
nltk.download('punkt')  # For tokenization, NLTK 3.8.1 or older
nltk.download('stopwords')  # For stopword removal
nltk.download('wordnet')  # For lemmatization (WordNetLemmatizer); the Porter stemmer needs no download

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/AmirMohammad/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/AmirMohammad/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/AmirMohammad/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/AmirMohammad/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True
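
Calling `nltk.download()` on every run is harmless but noisy. A minimal sketch of downloading a package only when it is missing follows. The helper name `ensure_resource` is hypothetical, and the resource paths in the usage comments assume NLTK's usual `nltk_data` layout.

```python
import nltk


def ensure_resource(resource_path: str, package: str) -> bool:
    """Return True if the resource was already installed, False if a download was triggered."""
    try:
        # nltk.data.find raises LookupError when the resource is not installed
        nltk.data.find(resource_path)
        return True
    except LookupError:
        nltk.download(package, quiet=True)
        return False


# Example usage (paths assume the standard nltk_data layout):
# ensure_resource('corpora/stopwords', 'stopwords')
# ensure_resource('tokenizers/punkt_tab', 'punkt_tab')
```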

## 🔹 2.1 Tokenization
Tokenization is the process of breaking down text into smaller units like **words** or **sentences**.

**⚠️ Note: Persian is not natively supported by NLTK.**

### ✅ Sentence Tokenization

In [21]:
from nltk.tokenize import sent_tokenize

text = "The children were playing in the garden. They had a great time."
sentences = sent_tokenize(text)
print(sentences)

['The children were playing in the garden.', 'They had a great time.']


### ✅ Word Tokenization

In [22]:
from nltk.tokenize import word_tokenize

words = word_tokenize(text)
print(words)

['The', 'children', 'were', 'playing', 'in', 'the', 'garden', '.', 'They', 'had', 'a', 'great', 'time', '.']
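
`word_tokenize` keeps punctuation marks as separate tokens and relies on the downloaded Punkt data. As a hedged alternative sketch, NLTK's `RegexpTokenizer` splits on a pattern you choose and needs no downloaded resources. The pattern `\w+` is an illustrative choice, not the tokenizer used elsewhere in this notebook.

```python
from nltk.tokenize import RegexpTokenizer

text = "The children were playing in the garden. They had a great time."

# Keep runs of word characters only, which drops punctuation tokens
tokenizer = RegexpTokenizer(r"\w+")
word_only_tokens = tokenizer.tokenize(text)
print(word_only_tokens)
```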


## 🔹 2.2 Stopword Removal

In [23]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

removed = [word for word in words if word.lower() in stop_words]
filtered_words = [word for word in words if word.lower() not in stop_words]

print('Removed stopwords:', removed)
print('Filtered words:', filtered_words)

Removed stopwords: ['The', 'were', 'in', 'the', 'They', 'had', 'a']
Filtered words: ['children', 'playing', 'garden', '.', 'great', 'time', '.']
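
The filtered list above still contains `.` tokens, because punctuation is not part of the stopword list. The sketch below drops both in one pass. The small `DEMO_STOPWORDS` set is a hand-written stand-in (an assumption for illustration) for `stopwords.words('english')`.

```python
# Hand-written stand-in for stopwords.words('english'); illustration only
DEMO_STOPWORDS = {"the", "were", "in", "they", "had", "a"}

words = ['The', 'children', 'were', 'playing', 'in', 'the', 'garden', '.',
         'They', 'had', 'a', 'great', 'time', '.']

# Keep alphabetic tokens that are not stopwords (comparison is case-insensitive)
content_words = [w for w in words if w.isalpha() and w.lower() not in DEMO_STOPWORDS]
print(content_words)
```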


## 🔹 2.3 Stemming

In [24]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print(stemmed_words)

['children', 'play', 'garden', '.', 'great', 'time', '.']
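
Stemming is purely rule-based, so a stem is not guaranteed to be a dictionary word. A short sketch using words from the exercise text in Part 7 shows this. `PorterStemmer` needs no downloaded resources.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
samples = ['jumping', 'happily', 'lazy', 'dogs']

# Suffix-stripping rules can produce non-words such as 'happili'
stems = [stemmer.stem(w) for w in samples]
print(stems)
```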


## 🔹 2.4 Lemmatization

In [25]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print(lemmatized_words)

['child', 'playing', 'garden', '.', 'great', 'time', '.']
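
In the output above, `playing` is left unchanged because `lemmatize()` treats every word as a noun unless a `pos` argument is given. One common workaround is to map a Penn Treebank tag (introduced in Part 3) to the single-letter WordNet POS codes. The helper name `penn_to_wordnet` is hypothetical, and the mapping is a simplified sketch.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank POS tag to a WordNet POS code ('n', 'v', 'a', or 'r')."""
    if tag.startswith('VB'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('JJ'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS
    return 'n'      # default to noun, matching lemmatize()'s own default


print(penn_to_wordnet('VBG'))
```

The result can be passed as the `pos` argument, for example `lemmatizer.lemmatize('playing', pos=penn_to_wordnet('VBG'))`, which should yield `play` once the `wordnet` resource is installed.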


# 🧩 Part 3: POS Tagging (Part-of-Speech Tagging)

**POS Tagging** is the process of labeling each word in a sentence with its corresponding part of speech, such as **noun**, **verb**, **adjective**, etc.

This helps in understanding the **grammatical structure** of a sentence and is used in many NLP tasks like **parsing**, **named entity recognition**, and **question answering**.

## 📦 Downloading Resources

In [26]:
nltk.download('averaged_perceptron_tagger_eng')  # POS tagger model

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/AmirMohammad/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

## 🔹 3.1 Basic POS Tagging

In [27]:
from nltk import pos_tag, word_tokenize

text = 'In October 2024, Dr. Emily Smith of MIT and Professor Brian Lee from Stanford University visited the Googleplex in Mountain View, California to present their NLP research at the ACL workshop.'

tokens = word_tokenize(text)
tagged = pos_tag(tokens)

print('Tagged:', tagged)

Tagged: [('In', 'IN'), ('October', 'NNP'), ('2024', 'CD'), (',', ','), ('Dr.', 'NNP'), ('Emily', 'NNP'), ('Smith', 'NNP'), ('of', 'IN'), ('MIT', 'NNP'), ('and', 'CC'), ('Professor', 'NNP'), ('Brian', 'NNP'), ('Lee', 'NNP'), ('from', 'IN'), ('Stanford', 'NNP'), ('University', 'NNP'), ('visited', 'VBD'), ('the', 'DT'), ('Googleplex', 'NNP'), ('in', 'IN'), ('Mountain', 'NNP'), ('View', 'NNP'), (',', ','), ('California', 'NNP'), ('to', 'TO'), ('present', 'VB'), ('their', 'PRP$'), ('NLP', 'NNP'), ('research', 'NN'), ('at', 'IN'), ('the', 'DT'), ('ACL', 'NNP'), ('workshop', 'NN'), ('.', '.')]
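
Since `pos_tag()` returns plain `(word, tag)` tuples, standard Python tools can summarize them. A minimal sketch that counts tags and collects proper nouns, using the first few pairs from the output above:

```python
from collections import Counter

tagged_sample = [('In', 'IN'), ('October', 'NNP'), ('2024', 'CD'), (',', ','),
                 ('Dr.', 'NNP'), ('Emily', 'NNP'), ('Smith', 'NNP'),
                 ('of', 'IN'), ('MIT', 'NNP')]

# Count how often each tag occurs and pick out the proper nouns
tag_counts = Counter(tag for _, tag in tagged_sample)
proper_nouns = [word for word, tag in tagged_sample if tag == 'NNP']

print(tag_counts.most_common(2))
print(proper_nouns)
```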


## 🔹 3.2 POS Tag Explanation

The tags returned by `pos_tag()` follow the **Penn Treebank** tag set. Here are a few common ones:

| Tag | Meaning                                  | Example     |
|-----|------------------------------------------|-------------|
| NN  | Noun, singular or mass                   | cat, car    |
| NNS | Noun, plural                             | cats, cars  |
| VB  | Verb, base form                          | run         |
| VBD | Verb, past tense                         | ran         |
| JJ  | Adjective                                | happy       |
| RB  | Adverb                                   | quickly     |
| IN  | Preposition or subordinating conjunction | in, on      |
| DT  | Determiner                               | the, a      |
| PRP | Personal pronoun                         | he, she, it |

In [28]:
nltk.download('tagsets_json')

[nltk_data] Downloading package tagsets_json to
[nltk_data]     /Users/AmirMohammad/nltk_data...
[nltk_data]   Package tagsets_json is already up-to-date!


True

In [29]:
nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


In [30]:
nltk.help.upenn_tagset('DT')

DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those


# 🧾 Part 4: Named Entity Recognition (NER)

**Named Entity Recognition (NER)** is the process of identifying and classifying named entities in text into predefined categories such as **person names**, **organizations**, **locations**, **dates**, etc.

## 📦 Downloading Resources

In [31]:
nltk.download('maxent_ne_chunker_tab')  # NER chunker
nltk.download('words')  # Required for NE chunking

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /Users/AmirMohammad/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/AmirMohammad/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

## 🔹 4.1 Basic NER with `ne_chunk()`

In [32]:
from nltk import word_tokenize, pos_tag, ne_chunk

text = 'In October 2024, Dr. Emily Smith of MIT and Professor Brian Lee from Stanford University visited the Googleplex in Mountain View, California to present their NLP research at the ACL workshop.'

# Tokenize, tag, and chunk
tokens = word_tokenize(text)
tags = pos_tag(tokens)
named_entities = ne_chunk(tags)

print(named_entities)

(S
  In/IN
  October/NNP
  2024/CD
  ,/,
  Dr./NNP
  (PERSON Emily/NNP Smith/NNP)
  of/IN
  (ORGANIZATION MIT/NNP)
  and/CC
  (ORGANIZATION Professor/NNP Brian/NNP Lee/NNP)
  from/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  visited/VBD
  the/DT
  (ORGANIZATION Googleplex/NNP)
  in/IN
  (GPE Mountain/NNP View/NNP)
  ,/,
  (GPE California/NNP)
  to/TO
  present/VB
  their/PRP$
  (ORGANIZATION NLP/NNP)
  research/NN
  at/IN
  the/DT
  (ORGANIZATION ACL/NNP)
  workshop/NN
  ./.)


> 📌 Note: `ne_chunk` is far from perfect. In the output above it leaves the date `October 2024` unlabeled and tags both `Professor Brian Lee` and `NLP` as `ORGANIZATION`. For higher accuracy, neural NER tools such as spaCy or Hugging Face Transformers are preferred; they are covered later in the workshop.

## 🔹 4.2 Display Named Entities in a More Readable Way

You can extract only the named entities and display them clearly:

In [33]:
from nltk.tree import Tree


def extract_named_entities(chunked_tree):
    named_entities = []

    for subtree in chunked_tree:
        if isinstance(subtree, Tree):
            entity = ' '.join([token for token, tag in subtree.leaves()])
            entity_type = subtree.label()
            named_entities.append((entity, entity_type))

    return named_entities


entities = extract_named_entities(named_entities)
print('Named Entities:', entities)

Named Entities: [('Emily Smith', 'PERSON'), ('MIT', 'ORGANIZATION'), ('Professor Brian Lee', 'ORGANIZATION'), ('Stanford University', 'ORGANIZATION'), ('Googleplex', 'ORGANIZATION'), ('Mountain View', 'GPE'), ('California', 'GPE'), ('NLP', 'ORGANIZATION'), ('ACL', 'ORGANIZATION')]
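
Many NER datasets and tools use flat IOB tags rather than trees. As a sketch, `nltk.chunk.tree2conlltags` converts a chunk tree into `(word, POS, IOB)` triples. To avoid depending on the downloaded models, the small tree here is built by hand from a fragment of the output in section 4.1.

```python
from nltk.chunk import tree2conlltags
from nltk.tree import Tree

# Hand-built fragment mirroring part of the ne_chunk output shown earlier
fragment = Tree('S', [
    ('Dr.', 'NNP'),
    Tree('PERSON', [('Emily', 'NNP'), ('Smith', 'NNP')]),
    ('of', 'IN'),
    Tree('ORGANIZATION', [('MIT', 'NNP')]),
])

# B- marks the first token of an entity, I- a continuation, O a token outside any entity
iob_tags = tree2conlltags(fragment)
for word, pos, iob in iob_tags:
    print(f'{word}\t{pos}\t{iob}')
```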


## 🔹 4.3 Common Named Entity Types

The table below summarizes the most common named entity labels produced by NLTK’s `ne_chunk()` function. Its bundled model is an ACE-trained chunker; the Penn Treebank tag set applies only to the POS tags, not to these labels.

| Label            | Meaning                                          |
|------------------|--------------------------------------------------|
| **PERSON**       | Individual people                                |
| **ORGANIZATION** | Companies, institutions, agencies                |
| **GPE**          | Geopolitical entities: countries, cities, states |
| **FACILITY**     | Physical facilities: buildings, airports, labs   |
| **LOCATION**     | Natural locations: rivers, mountains, regions    |
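
Once entities are available as `(text, label)` pairs, grouping them by label makes the result easier to scan. A minimal sketch using a few pairs from the section 4.2 output (the variable names are illustrative):

```python
from collections import defaultdict

sample_entities = [('Emily Smith', 'PERSON'), ('MIT', 'ORGANIZATION'),
                   ('Stanford University', 'ORGANIZATION'),
                   ('Mountain View', 'GPE'), ('California', 'GPE')]

# Collect entity strings under their label
by_label = defaultdict(list)
for entity_text, label in sample_entities:
    by_label[label].append(entity_text)

for label, names in by_label.items():
    print(f'{label}: {names}')
```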

# 🗂️ Part 5: Text Classification (Optional)

Text classification is the task of assigning a category or label to a given piece of text. Examples include:
- Spam detection
- Sentiment analysis
- News categorization

NLTK offers tools for training basic classifiers using classical machine learning (not deep learning).

## 📦 Downloading Resources

In [34]:
nltk.download('movie_reviews')  # Sample dataset

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/AmirMohammad/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

## 🔹 5.1 Load the Dataset

We’ll use the movie reviews corpus included in NLTK. It contains 2,000 reviews: 1,000 labeled positive and 1,000 labeled negative.

In [35]:
from nltk.corpus import movie_reviews
import random

In [36]:
# Set random seed for reproducibility
random.seed(42)

In [37]:
print('Categories:', movie_reviews.categories())
print('Number of reviews:', len(movie_reviews.fileids()))

print('Positive reviews:', len(movie_reviews.fileids('pos')))
print('Negative reviews:', len(movie_reviews.fileids('neg')))

Categories: ['neg', 'pos']
Number of reviews: 2000
Positive reviews: 1000
Negative reviews: 1000


In [38]:
# Create list of (words, category) pairs
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))

# Shuffle documents for randomness
random.shuffle(documents)

In [39]:
print(documents[0])

(['mr', '.', 'bean', ',', 'a', 'bumbling', 'security', 'guard', 'from', 'england', 'is', 'sent', 'to', 'la', 'to', 'help', 'with', 'the', 'grandiose', 'homecoming', 'of', 'a', 'masterpiece', 'american', 'painting', '.', 'the', 'first', 'two', 'words', 'should', 'have', 'said', 'enough', 'to', 'let', 'you', 'know', 'what', 'occurs', 'during', 'bean', "'", 's', 'trip', 'to', 'la', ',', 'but', 'if', 'they', 'didn', "'", 't', 'look', 'out', 'because', 'you', 'are', 'in', 'for', 'a', 'rather', 'interesting', 'if', 'not', 'odd', 'ride', '.', 'heck', 'depending', 'on', 'your', 'humor', 'you', 'might', 'end', 'up', 'laughing', 'through', 'the', 'whole', 'flick', '.', 'either', 'way', 'look', 'out', 'america', 'bean', 'is', 'coming', '.', 'well', ',', 'what', 'can', 'really', 'be', 'said', 'about', 'this', 'movie', ',', 'there', 'is', 'very', 'little', 'discernible', 'plot', '.', 'that', 'much', 'is', 'not', 'hard', 'to', 'grapple', 'with', 'for', 'it', 'is', 'a', 'slapstick', 'comedy', '.', 'i

## 🔹 5.2 Feature Extraction

We’ll define features as whether a given word is present in a document.

In [40]:
from nltk import FreqDist

all_words = FreqDist(w.lower() for w in movie_reviews.words())
all_words

FreqDist({',': 77717, 'the': 76529, '.': 65876, 'a': 38106, 'and': 35576, 'of': 34123, 'to': 31937, "'": 30585, 'is': 25195, 'in': 21822, ...})

In [41]:
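# Iterating over a FreqDist yields words from most to least frequent,
# so this keeps the 2,000 most common words
word_features = list(all_words)[:2000]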
print('Top 20 features:', word_features[:20])

Top 20 features: [',', 'the', '.', 'a', 'and', 'of', 'to', "'", 'is', 'in', 's', '"', 'it', 'that', '-', ')', '(', 'as', 'with', 'for']
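
The top features above are mostly punctuation and stopwords, which carry little sentiment. One possible refinement, sketched here on a toy token list, is to filter them out before counting. `Counter` is used so the snippet runs without NLTK; `FreqDist` is a `Counter` subclass and offers the same `most_common()` method. The stopword set is a hand-written assumption.

```python
from collections import Counter

toy_tokens = ['the', 'film', ',', 'the', 'plot', '.', 'a', 'great', 'film', '.']
toy_stopwords = {'the', 'a'}  # stand-in for stopwords.words('english')

# Count only alphabetic, non-stopword tokens
filtered_counts = Counter(t for t in toy_tokens if t.isalpha() and t not in toy_stopwords)
top_feature = filtered_counts.most_common(1)[0]
print(top_feature)
```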


In [42]:
def document_features(document):
    document_words = set(document)
    features = {}

    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features


# Apply to the first document, documents[0][0] is list of words
features = document_features(documents[0][0])
print('Length of features:', len(features))
print('Features:', features)

Length of features: 2000
Features: {'contains(,)': True, 'contains(the)': True, 'contains(.)': True, 'contains(a)': True, 'contains(and)': True, 'contains(of)': True, 'contains(to)': True, "contains(')": True, 'contains(is)': True, 'contains(in)': True, 'contains(s)': True, 'contains(")': False, 'contains(it)': True, 'contains(that)': True, 'contains(-)': True, 'contains())': False, 'contains(()': False, 'contains(as)': True, 'contains(with)': True, 'contains(for)': True, 'contains(his)': False, 'contains(this)': True, 'contains(film)': False, 'contains(i)': True, 'contains(he)': False, 'contains(but)': True, 'contains(on)': True, 'contains(are)': True, 'contains(t)': True, 'contains(by)': True, 'contains(be)': True, 'contains(one)': True, 'contains(movie)': True, 'contains(an)': False, 'contains(who)': False, 'contains(not)': True, 'contains(you)': True, 'contains(from)': True, 'contains(at)': True, 'contains(was)': True, 'contains(have)': True, 'contains(they)': True, 'contains(has)'

## 🔹 5.3 Train a Classifier

In [43]:
from nltk import classify, NaiveBayesClassifier

# Create the feature sets
feature_sets = [(document_features(d), c) for (d, c) in documents]

# Train/test split: the first 800 feature sets for training, the remaining 1,200 for testing
train_set, test_set = feature_sets[:800], feature_sets[800:]

# Train Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_set)
print('Accuracy:', classify.accuracy(classifier, test_set))

Accuracy: 0.8133333333333334
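
To see what the trained classifier does with a single feature dictionary, here is a toy sketch with two hand-made features. The training examples are invented for illustration and are unrelated to the movie review corpus.

```python
from nltk import NaiveBayesClassifier, classify

toy_train = [
    ({'contains(great)': True, 'contains(awful)': False}, 'pos'),
    ({'contains(great)': True, 'contains(awful)': False}, 'pos'),
    ({'contains(great)': False, 'contains(awful)': True}, 'neg'),
    ({'contains(great)': False, 'contains(awful)': True}, 'neg'),
]
toy_test = [
    ({'contains(great)': True, 'contains(awful)': False}, 'pos'),
    ({'contains(great)': False, 'contains(awful)': True}, 'neg'),
]

toy_classifier = NaiveBayesClassifier.train(toy_train)

# classify() takes a feature dictionary and returns the most probable label
prediction = toy_classifier.classify({'contains(great)': True, 'contains(awful)': False})

# accuracy() is the fraction of test examples whose predicted label matches the gold label
toy_accuracy = classify.accuracy(toy_classifier, toy_test)
print(prediction, toy_accuracy)
```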


## 🔹 5.4 Most Informative Features

In [44]:
classifier.show_most_informative_features(10)

Most Informative Features
   contains(wonderfully) = True              pos : neg    =     10.9 : 1.0
   contains(outstanding) = True              pos : neg    =      8.7 : 1.0
         contains(awful) = True              neg : pos    =      7.0 : 1.0
         contains(damon) = True              pos : neg    =      6.8 : 1.0
       contains(patrick) = True              pos : neg    =      6.6 : 1.0
         contains(waste) = True              neg : pos    =      6.5 : 1.0
         contains(worst) = True              neg : pos    =      6.5 : 1.0
          contains(lame) = True              neg : pos    =      6.4 : 1.0
      contains(godzilla) = True              neg : pos    =      6.4 : 1.0
       contains(unfunny) = True              neg : pos    =      5.8 : 1.0


### 🧠 How to Interpret This

Each line tells you:
- A specific **word** that appears in a review
- Its predicted **sentiment direction** (positive or negative)
- And a **ratio** showing how strongly it correlates with that sentiment

---

🔎 Example Breakdown:
- `contains(wonderfully) = True pos : neg = 10.9 : 1.0`
    - → In the training data, "wonderfully" is about 10.9 times as likely to appear in a positive review as in a negative one.
    - 💬 Interpretation: "wonderfully" is a strong indicator of a positive review.
- `contains(awful) = True neg : pos = 7.0 : 1.0`
    - → "awful" is about 7 times as likely to appear in a negative review as in a positive one.
    - 💬 "awful" is clearly associated with negative reviews.
- `contains(damon) = True pos : neg = 6.8 : 1.0`
    - → "damon" appears more often in positive reviews, maybe because reviewers like Matt Damon!
- `contains(godzilla) = True neg : pos = 6.4 : 1.0`
    - → In this dataset, "godzilla" is more often mentioned in negative reviews. (Maybe the movie was bad 🙃)
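
Each ratio compares how likely a feature value is under one label versus the other. A sketch of that arithmetic follows, with invented document counts and add-0.5 smoothing (an assumption chosen to resemble the kind of smoothed estimate a Naive Bayes classifier uses, not a reproduction of NLTK's internals).

```python
def smoothed_prob(docs_with_word: int, total_docs: int, bins: int = 2) -> float:
    """Estimate P(contains(word) = True | label) with add-0.5 smoothing."""
    return (docs_with_word + 0.5) / (total_docs + 0.5 * bins)


# Hypothetical counts: the word appears in 21 of 400 positive
# and 1 of 400 negative training reviews
p_pos = smoothed_prob(21, 400)
p_neg = smoothed_prob(1, 400)

# The reported ratio is the larger probability divided by the smaller one
ratio = p_pos / p_neg
print(f'pos : neg = {ratio:.1f} : 1.0')
```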

## ⚠️ Notes
- These are not causal relationships — just correlations from the training data.
- The features are based on word presence only (`contains(word) = True`), so no context or syntax is considered.
- Words like ‘patrick’, ‘damon’, or ‘godzilla’ could be highly dataset-specific.
- This is a very basic method and doesn’t handle word order, syntax, or semantics. In real NLP systems, deep learning models (e.g., BERT) are typically used.

# 🔚 Part 6: Summary

Now that we’ve explored NLTK, it’s helpful to summarize what it’s good for and where other tools like spaCy might offer advantages.

## ✅ What We've Learned in This NLTK Section:

| Topic                   | What We Did                                                                       |
|-------------------------|-----------------------------------------------------------------------------------|
| **Tokenization**        | Split text into words and sentences using `word_tokenize()` and `sent_tokenize()` |
| **Stopword Removal**    | Removed common words using `nltk.corpus.stopwords`                                |
| **Stemming**            | Reduced words to their root using `PorterStemmer`                                 |
| **Lemmatization**       | Reduced words to dictionary form using `WordNetLemmatizer`                        |
| **POS Tagging**         | Labeled each word with its part of speech using `pos_tag()`                       |
| **NER**                 | Identified named entities with `ne_chunk()`                                       |
| **Text Classification** | Built a Naive Bayes classifier using word features                                |


# 🧪 Part 7: Exercises

These exercises help you apply the concepts you've learned in this notebook.

## 🔹 7.1 Write `preprocess_nltk`

In [2]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()
PUNCTUATION = set(string.punctuation)


def preprocess_nltk(text: str, use_lemmatizer: bool = True) -> list[str]:
    """
    1. Tokenize text into words.
    2. Remove punctuation tokens.
    3. Remove stopwords and non-alphabetic tokens.
    4. POS-tag and keep only NOUN, VERB, ADJ, ADV.
    5. Lemmatize (or stem) and lowercase.
    """
    # your code here
    pass

In [3]:
text = "Hello, world! The quick (brown) foxes—are jumping happily over lazy dogs."
print('Lemmatized:', preprocess_nltk(text, use_lemmatizer=True))
print('Stemmed:   ', preprocess_nltk(text, use_lemmatizer=False))
# Expected:
# Lemmatized: ['world', 'quick', 'jumping', 'happily', 'lazy', 'dog']
# Stemmed:    ['world', 'quick', 'jump', 'happili', 'lazi', 'dog']

Lemmatized: None
Stemmed:    None


### ✅ Answer

In [4]:
import string

import nltk  # needed for nltk.pos_tag below
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()
PUNCTUATION = set(string.punctuation)


def preprocess_nltk(text: str, use_lemmatizer: bool = True) -> list[str]:
    """
    1. Tokenize text into words.
    2. Remove punctuation tokens.
    3. Remove stopwords and non-alphabetic tokens.
    4. POS-tag and keep only NOUN, VERB, ADJ, ADV.
    5. Lemmatize (or stem) and lowercase.
    """
    # 1. Tokenize
    tokens = word_tokenize(text)

    # 2. Remove pure punctuation tokens
    tokens = [t for t in tokens if t not in PUNCTUATION]

    # 3. Filter out non-alpha and stopwords
    tokens = [t for t in tokens if t.isalpha() and t.lower() not in STOPWORDS]

    # 4. POS-tag and filter by allowed tags
    tagged = nltk.pos_tag(tokens)
    # Penn Treebank tags for common nouns, verbs, adjectives, and adverbs
    # (proper nouns, NNP/NNPS, are not in the set)
    allowed = {
        'NN', 'NNS',
        'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ',
        'JJ', 'JJR', 'JJS',
        'RB', 'RBR', 'RBS'
    }
    tokens = [word for word, tag in tagged if tag in allowed]

    # 5. Normalize
    if use_lemmatizer:
        tokens = [LEMMATIZER.lemmatize(t) for t in tokens]
    else:
        tokens = [STEMMER.stem(t) for t in tokens]

    # Lowercase
    return [t.lower() for t in tokens]

In [5]:
text = "Hello, world! The quick (brown) foxes—are jumping happily over lazy dogs."
print('Lemmatized:', preprocess_nltk(text, use_lemmatizer=True))
print('Stemmed:   ', preprocess_nltk(text, use_lemmatizer=False))

## 🔹 7.2 Create an NLTK “Pipeline” Wrapper
If you want to apply this to many documents, wrap it in a simple class:

In [6]:
class NLTKPreprocessor:
    def __init__(self, use_lemmatizer: bool = True):
        self.use_lemmatizer = use_lemmatizer

    def __call__(self, text: str) -> list[str]:
        return preprocess_nltk(text, self.use_lemmatizer)


# Instantiate and test (preprocess_nltk calls nltk.pos_tag, so `import nltk` must have run first)
pp = NLTKPreprocessor(use_lemmatizer=True)
print(pp("NLTK pipelines can be neatly organized into functions."))

## 🔹 7.3 Performance Comparison
Measure how long it takes to preprocess a batch of texts with the lemmatizer versus the stemmer:

In [9]:
import time

docs = [
    # The trailing space keeps repeated sentences from fusing into tokens like "fun.Natural"
    "Natural Language Processing with NLTK is fun. " * 50
    for _ in range(500)
]


def time_it(pp, docs):
    start = time.perf_counter()
    [pp(d) for d in docs]
    return time.perf_counter() - start


pp_lemma = NLTKPreprocessor(use_lemmatizer=True)
pp_stem = NLTKPreprocessor(use_lemmatizer=False)

t_lemma = time_it(pp_lemma, docs)
t_stem = time_it(pp_stem, docs)

print(f'Lemmatizer time: {t_lemma:.2f}s')
print(f'Stemmer time:    {t_stem:.2f}s')

Lemmatizer time: 2.59s
Stemmer time:    1.59s
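
A single timing run can be noisy; for instance, one-time setup costs may inflate whichever variant runs first. A hedged sketch of a best-of-N timer follows. `best_of` is a hypothetical helper, and the workload is a stand-in rather than the preprocessors above.

```python
import time


def best_of(func, repeats: int = 3) -> float:
    """Run func several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; in the notebook you could pass lambda: [pp_lemma(d) for d in docs]
elapsed = best_of(lambda: sum(i * i for i in range(10_000)))
print(f'Best of 3: {elapsed:.6f}s')
```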
