# NLTK vs spaCy: Comprehensive Code Comparison

This guide offers side-by-side code comparisons between NLTK and spaCy for a wide range of NLP tasks, from basic to advanced.

## Table of Contents

1. [Setup and Installation](#setup-and-installation)
2. [Basic NLP Tasks](#basic-nlp-tasks)
   - [Tokenization](#1-tokenization)
   - [Part-of-Speech Tagging](#2-part-of-speech-tagging)
   - [Lemmatization](#3-lemmatization)
   - [Stemming](#4-stemming)
   - [Stop Words Removal](#5-stop-words-removal)
3. [Intermediate NLP Tasks](#intermediate-nlp-tasks)
   - [Named Entity Recognition](#6-named-entity-recognition)
   - [Dependency Parsing](#7-dependency-parsing)
   - [Sentence Segmentation](#8-sentence-segmentation)
   - [Text Preprocessing Pipeline](#9-text-preprocessing-pipeline)
   - [N-grams Generation](#10-n-grams-generation)
4. [Advanced NLP Tasks](#advanced-nlp-tasks)
   - [Word Embeddings](#11-word-embeddings)
   - [Text Classification](#12-text-classification)
   - [Sentiment Analysis](#13-sentiment-analysis)
   - [Topic Modeling](#14-topic-modeling)
   - [Text Summarization](#15-text-summarization)
   - [Language Detection](#16-language-detection)
   - [Chunking and Shallow Parsing](#17-chunking-and-shallow-parsing)
   - [Custom NLP Pipeline Components](#18-custom-nlp-pipeline-components)
5. [Practical Use Cases](#practical-use-cases)
   - [Text Similarity Comparison](#19-text-similarity-comparison)
   - [Keyword Extraction](#20-keyword-extraction)
   - [Question Answering](#21-question-answering)
6. [Performance Benchmarks](#performance-benchmarks)
7. [When to Use Which Library](#when-to-use-which-library)

## Setup and Installation

Let's start with how to set up each library:

### Core NLP libraries
```bash
uv add nltk spacy
```

### Scikit-learn for ML components
```bash
uv add scikit-learn
```

### Additional processing libraries
```bash
uv add gensim textblob pandas numpy
```

### Visualization
```bash
uv add matplotlib seaborn
```
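After installing, it can help to confirm that each package is importable before running the cells below. The following is a minimal sketch using only the standard library. The helper name `check_installed` is hypothetical, and the module names listed are the usual import names for these packages (for example, `scikit-learn` is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names, which can differ from the distribution names passed to the
# installer (scikit-learn is imported as sklearn).
REQUIRED_MODULES = [
    "nltk", "spacy", "sklearn", "gensim", "textblob",
    "pandas", "numpy", "matplotlib", "seaborn",
]


def check_installed(modules):
    """Return {module_name: bool} without actually importing the modules."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in check_installed(REQUIRED_MODULES).items():
        print(f"{name:<12} {'ok' if ok else 'MISSING'}")
```

Anything reported as `MISSING` needs to be added to the environment before the later sections will run.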

In [4]:
# Import nltk and spacy
import nltk
import spacy

### Download spaCy language models

In [14]:
# small model --> faster
!python -m spacy download en_core_web_sm

# medium model --> includes word vectors
# !python -m spacy download en_core_web_md

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core

### Download NLTK data

In [17]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to C:\Users\Cikal
[nltk_data]     Merdeka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Cikal
[nltk_data]     Merdeka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Cikal Merdeka\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Cikal Merdeka\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to C:\Users\Cikal
[nltk_data]     Merdeka\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Cikal
[nltk_data]     Merdeka\A

True

## Basic NLP Tasks

### Tokenization

In [12]:
def tokenize_comparison(text):
    """Compare tokenization between NLTK and spaCy"""

    # Show the input text
    print(f"Original text: {text}")

    # NLTK tokenization
    from nltk import word_tokenize
    nltk_tokens = word_tokenize(text)
    print(f"\nNLTK tokens: {nltk_tokens}")
    print(f"Number of tokens: {len(nltk_tokens)}")

    # spaCy tokenization
    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    spacy_tokens = [token.text for token in doc]
    print(f"\nspaCy tokens: {spacy_tokens}")
    print(f"Number of tokens: {len(spacy_tokens)}")
    
    # Compare tokenization results
    print("\nTokenization Comparison:")
    nltk_set = set(nltk_tokens)
    spacy_set = set(spacy_tokens)
    print(f"NLTK unique tokens: {nltk_set}")
    print(f"spaCy unique tokens: {spacy_set}")
    print(f"\nTokens in NLTK but not in spaCy: {nltk_set - spacy_set}")
    print(f"\nTokens in spaCy but not in NLTK: {spacy_set - nltk_set}")

In [13]:
# Example usage
text = "Mr. Smith paid $30.00 for the U.S.A. trip-package. It's worth it!"
tokenize_comparison(text)

Original text: Mr. Smith paid $30.00 for the U.S.A. trip-package. It's worth it!

NLTK tokens: ['Mr.', 'Smith', 'paid', '$', '30.00', 'for', 'the', 'U.S.A.', 'trip-package', '.', 'It', "'s", 'worth', 'it', '!']
Number of tokens: 15

spaCy tokens: ['Mr.', 'Smith', 'paid', '$', '30.00', 'for', 'the', 'U.S.A.', 'trip', '-', 'package', '.', 'It', "'s", 'worth', 'it', '!']
Number of tokens: 17

Tokenization Comparison:
NLTK unique tokens: {'worth', '$', '30.00', 'paid', 'for', '!', 'Mr.', "'s", 'trip-package', '.', 'the', 'It', 'it', 'U.S.A.', 'Smith'}
spaCy unique tokens: {'worth', 'package', '$', '30.00', '-', 'paid', 'for', '!', 'trip', 'Mr.', "'s", 'it', '.', 'the', 'It', 'U.S.A.', 'Smith'}

Tokens in NLTK but not in spaCy: {'trip-package'}

Tokens in spaCy but not in NLTK: {'trip', 'package', '-'}
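The set-based comparison above discards token order and duplicates. As an alternative, the two token sequences can be aligned with the standard library's `difflib.SequenceMatcher`, which reports where they diverge in place. This is only a sketch: `token_diff` is a hypothetical helper, and the token lists are copied from the output above rather than recomputed.

```python
from difflib import SequenceMatcher


def token_diff(tokens_a, tokens_b):
    """Return (operation, a_span, b_span) for each region where the sequences differ."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (op, tokens_a[i1:i2], tokens_b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Token lists copied from the output above
nltk_tokens = ['Mr.', 'Smith', 'paid', '$', '30.00', 'for', 'the', 'U.S.A.',
               'trip-package', '.', 'It', "'s", 'worth', 'it', '!']
spacy_tokens = ['Mr.', 'Smith', 'paid', '$', '30.00', 'for', 'the', 'U.S.A.',
                'trip', '-', 'package', '.', 'It', "'s", 'worth', 'it', '!']

for op, a_span, b_span in token_diff(nltk_tokens, spacy_tokens):
    print(f"{op}: NLTK {a_span} vs spaCy {b_span}")
# replace: NLTK ['trip-package'] vs spaCy ['trip', '-', 'package']
```

For this sentence the only divergence is the hyphenated compound, which NLTK keeps as one token and spaCy splits into three.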


### Part-of-Speech (POS) Tagging

#### POS Tag Explanation:

- **NLTK POS Tags (Penn Treebank tagset):**
    - DT: Determiner (e.g., the, a, an, this, that)
    - JJ: Adjective (e.g., quick, lazy, beautiful)
    - NN: Noun, singular or mass (e.g., fox, dog, house)
    - VBZ: Verb, 3rd person singular present (e.g., jumps, runs, eats)
    - IN: Preposition or subordinating conjunction (e.g., over, in, on, by)
    - VB: Verb, base form (e.g., go, run, eat)
    - VBD: Verb, past tense (e.g., went, ran, ate)
    - VBG: Verb, gerund or present participle (e.g., going, running, eating)
    - VBN: Verb, past participle (e.g., gone, run, eaten)
    - VBP: Verb, non-3rd person singular present (e.g., go, run, eat)
    - NNS: Noun, plural (e.g., foxes, dogs, houses)
    - NNP: Proper noun, singular (e.g., John, London, Microsoft)
    - NNPS: Proper noun, plural (e.g., Americans, Romans)
    - RB: Adverb (e.g., quickly, silently, well)
    - PRP: Personal pronoun (e.g., I, you, he, she, it)
    - PRP$: Possessive pronoun (e.g., my, your, his)
    - CC: Coordinating conjunction (e.g., and, but, or)
    - CD: Cardinal number (e.g., one, two, three)

- **spaCy POS Tags (Universal Dependencies tagset):**
    - DET: Determiner (corresponds to DT in Penn Treebank)
    - ADJ: Adjective (corresponds to JJ in Penn Treebank)
    - NOUN: Noun (corresponds to NN, NNS in Penn Treebank)
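    - VERB: Verb (corresponds to VB, VBD, VBG, VBN, VBP, VBZ in Penn Treebank; auxiliaries such as "is" or "will" are tagged AUX instead)
    - ADP: Adposition (preposition or postposition; corresponds to IN in Penn Treebank when IN marks a preposition, while subordinating conjunctions are tagged SCONJ)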
    - ADV: Adverb (corresponds to RB in Penn Treebank)
    - PRON: Pronoun (corresponds to PRP, PRP$ in Penn Treebank)
    - PROPN: Proper noun (corresponds to NNP, NNPS in Penn Treebank)
    - CCONJ: Coordinating conjunction (corresponds to CC in Penn Treebank)
    - NUM: Numeral (corresponds to CD in Penn Treebank)
    - PUNCT: Punctuation
    - SYM: Symbol
    - X: Other

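Note: spaCy exposes two tags per token: a coarse-grained Universal POS tag (`token.pos_`) and a fine-grained tag (`token.tag_`).
For the English models, the fine-grained tags follow the Penn Treebank scheme, so they are usually directly comparable with the tags NLTK produces.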
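The correspondence between the two tagsets listed above can be sketched as a lookup table. This is a simplified illustration that covers only the tags in the lists; `PTB_TO_UPOS` and `to_coarse` are hypothetical names, and spaCy's own mapping is more nuanced (for instance, it distinguishes auxiliaries and subordinating conjunctions).

```python
# Simplified Penn Treebank -> Universal POS mapping, limited to the tags
# listed above. Not spaCy's internal mapping.
PTB_TO_UPOS = {
    "DT": "DET",
    "JJ": "ADJ",
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "IN": "ADP",
    "RB": "ADV",
    "PRP": "PRON", "PRP$": "PRON",
    "CC": "CCONJ",
    "CD": "NUM",
}


def to_coarse(ptb_tag):
    """Map a Penn Treebank tag to a coarse tag; unknown tags fall back to 'X'."""
    return PTB_TO_UPOS.get(ptb_tag, "X")


# Hand-written (token, tag) pairs for illustration only
example = [("The", "DT"), ("fox", "NN"), ("jumps", "VBZ")]
print([(word, to_coarse(tag)) for word, tag in example])
# [('The', 'DET'), ('fox', 'NOUN'), ('jumps', 'VERB')]
```

A table like this makes NLTK's output roughly comparable with spaCy's `pos_` values when only coarse categories matter.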

In [18]:
def pos_tagging_comparison(text):
    """Compare part-of speech tagging between NLTK and spaCy"""

    # Sample text
    print(f"Original text: {text}")

    # NLTK POS tagging
    from nltk import word_tokenize
    from nltk import pos_tag

    nltk_tokens = word_tokenize(text)
    nltk_po_tags = pos_tag(nltk_tokens)

    print("\nNLTK POS Tags:")
    for token, tag in nltk_po_tags:
        print(f"{token}: {tag}")


    # spaCy POS tagging
    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)

    print("\nspaCy POS Tags:")
    for token in doc:
        # Print both simple POS and detailed POS
        print(f"{token.text}: {token.pos_} (fine-grained: {token.tag_})")

    # Note: NLTK uses the Penn Treebank tagset; spaCy's pos_ uses Universal POS tags,
    # while tag_ is Penn Treebank-style for the English models

In [19]:
# Example usage
text = "The quick brown fox jumps over the lazy dog"
pos_tagging_comparison(text)

Original text: The quick brown fox jumps over the lazy dog

NLTK POS Tags:
The: DT
quick: JJ
brown: NN
fox: NN
jumps: VBZ
over: IN
the: DT
lazy: JJ
dog: NN

spaCy POS Tags:
The: DET (fine-grained: DT)
quick: ADJ (fine-grained: JJ)
brown: ADJ (fine-grained: JJ)
fox: NOUN (fine-grained: NN)
jumps: VERB (fine-grained: VBZ)
over: ADP (fine-grained: IN)
the: DET (fine-grained: DT)
lazy: ADJ (fine-grained: JJ)
dog: NOUN (fine-grained: NN)
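In the output above, the two libraries agree on every fine-grained tag except one. A small sketch that surfaces such disagreements programmatically is shown below. `tag_disagreements` is a hypothetical helper, the tag lists are copied from the output above, and the approach assumes both libraries tokenized the text identically (true for this sentence, but not in general).

```python
def tag_disagreements(tagged_a, tagged_b):
    """Return (token, tag_a, tag_b) for positions where two aligned taggings differ."""
    if [tok for tok, _ in tagged_a] != [tok for tok, _ in tagged_b]:
        raise ValueError("token sequences differ; align them before comparing tags")
    return [
        (tok, tag_a, tag_b)
        for (tok, tag_a), (_, tag_b) in zip(tagged_a, tagged_b)
        if tag_a != tag_b
    ]


# (token, tag) pairs copied from the output above; spaCy tags are the
# fine-grained tag_ values
nltk_tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
               ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
               ("dog", "NN")]
spacy_tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
                ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
                ("dog", "NN")]

print(tag_disagreements(nltk_tagged, spacy_tagged))
# [('brown', 'NN', 'JJ')]
```

Here NLTK tags "brown" as a noun (NN), whereas spaCy tags it as an adjective (JJ).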


### Lemmatization

In [15]:
def lemmatization_comparison(text):
    """Compare lemmatization between NLTK and spaCy"""
  
    print(f"Original text: {text}")
  
    # NLTK Lemmatization
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from nltk import pos_tag
  
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
  
    # NLTK requires POS information for better lemmatization
    # We need to convert Penn Treebank tags to WordNet tags
    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return 'a'  # adjective
        elif treebank_tag.startswith('V'):
            return 'v'  # verb
        elif treebank_tag.startswith('N'):
            return 'n'  # noun
        elif treebank_tag.startswith('R'):
            return 'r'  # adverb
        else:
            return 'n'  # default to noun
  
    # Tokenize and get POS tags
    nltk_tokens = word_tokenize(text)
    nltk_pos = pos_tag(nltk_tokens)
  
    # Lemmatize with POS tags
    nltk_lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) 
                   for word, pos in nltk_pos]
  
    # Simple lemmatization (without POS, defaults to nouns)
    nltk_simple_lemmas = [lemmatizer.lemmatize(word) for word in nltk_tokens]
  
    print("\nNLTK lemmas (with POS):")
    for original, lemma in zip(nltk_tokens, nltk_lemmas):
        print(f"{original} -> {lemma}")
  
    print("\nNLTK simple lemmas (without POS):")
    for original, lemma in zip(nltk_tokens, nltk_simple_lemmas):
        print(f"{original} -> {lemma}")
  
    # spaCy Lemmatization
    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
  
    print("\nspaCy lemmas:")
    for token in doc:
        print(f"{token.text} -> {token.lemma_}")

In [16]:
# Example usage
text = "The cats are running and jumping over many boxes."
lemmatization_comparison(text)

Original text: The cats are running and jumping over many boxes.

NLTK lemmas (with POS):
The -> The
cats -> cat
are -> be
running -> run
and -> and
jumping -> jump
over -> over
many -> many
boxes -> box
. -> .

NLTK simple lemmas (without POS):
The -> The
cats -> cat
are -> are
running -> running
and -> and
jumping -> jumping
over -> over
many -> many
boxes -> box
. -> .

spaCy lemmas:
The -> the
cats -> cat
are -> be
running -> run
and -> and
jumping -> jump
over -> over
many -> many
boxes -> box
. -> .
