# Production-Grade NLP Preprocessing with SpaCy

**SpaCy is the industry standard for production NLP** - 10-100x faster than NLTK.

## Why SpaCy Over NLTK?

| Feature | SpaCy ✅ | NLTK ❌ |
|---------|---------|--------|
| **Speed** | 10-100x faster | Slow |
| **Design** | Production-ready | Research/Education |
| **API** | Clean, consistent | Fragmented |
| **Pre-trained Models** | State-of-the-art | Minimal |
| **NER** | Excellent, built-in | Basic |
| **Dependency Parsing** | Built-in | Limited |
| **Companies Using** | Google, Meta, Microsoft | Academic mainly |
| **Memory** | Optimized | Higher usage |

## Table of Contents
1. Installation & Setup
2. Basic Usage
3. Tokenization
4. Lemmatization (No Stemming in SpaCy)
5. POS Tagging
6. Named Entity Recognition (NER)
7. Stopwords & Punctuation Removal
8. Dependency Parsing
9. Production-Ready Preprocessing Pipeline
10. Batch Processing
11. Disabling Unused Components
12. SpaCy vs NLTK: Direct Comparison
13. Real-World Use Cases
14. Production Best Practices
15. Key Takeaways

## 1. Installation & Setup

In [None]:
# Install SpaCy
# !pip install spacy

# Download models (choose based on your needs)
# !python -m spacy download en_core_web_sm   # 12MB - Fast, good for most tasks ⭐
# !python -m spacy download en_core_web_md   # 40MB - Includes word vectors
# !python -m spacy download en_core_web_lg   # 560MB - Better accuracy
# !python -m spacy download en_core_web_trf  # 400MB - Best accuracy (transformer)

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import time

# Load model
nlp = spacy.load("en_core_web_sm")

print(f"✓ SpaCy version: {spacy.__version__}")
print(f"✓ Model loaded: en_core_web_sm")
print(f"\nPipeline components: {nlp.pipe_names}")
print("\nDefault pipeline stages:")
for name, component in nlp.pipeline:
    print(f"  {name:15s} → {type(component).__name__}")

✓ SpaCy version: 3.8.11
✓ Model loaded: en_core_web_sm


## 2. Basic Usage - One Line Does Everything!

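Unlike NLTK, where you chain a separate function for each step, SpaCy runs its whole pipeline (tokenization, tagging, parsing, NER, lemmatization) in **a single call** and returns one `Doc` object.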

In [5]:
# Sample text
text = """
Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976. 
The company is now worth over $2.5 trillion! It's revolutionizing technology.
Email: contact@apple.com | Website: https://www.apple.com
"""

# ONE call processes everything!
doc = nlp(text)

print("✓ Text processed!\n")
print(doc)
print(f"Total tokens: {len(doc)}")
print(f"Total sentences: {len(list(doc.sents))}")
print(f"Total entities: {len(doc.ents)}")
print(f"\nType of result: {type(doc)}")
print("Doc objects contain ALL linguistic information!")

✓ Text processed!


Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976. 
The company is now worth over $2.5 trillion! It's revolutionizing technology.
Email: contact@apple.com | Website: https://www.apple.com

Total tokens: 43
Total sentences: 5
Total entities: 6

Type of result: <class 'spacy.tokens.doc.Doc'>
Doc objects contain ALL linguistic information!


In [7]:
for sent in doc.sents:
    print(sent)


Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976. 

The company is now worth over $2.5 trillion!
It's revolutionizing technology.

Email: contact@apple.com
| Website: https://www.apple.com



In [8]:
for ent in doc.ents:
    print(ent)

Apple Inc.
Steve Jobs
Cupertino
California
April 1, 1976
over $2.5 trillion


## 3. Tokenization - Context-Aware & Smart

In [9]:
# SpaCy handles edge cases intelligently
test_texts = [
    "Dr. Smith isn't here. He's at N.Y.U.",
    "Email: test@email.com",
    "Price: $100.50",
    "Website: https://example.com",
    "don't won't can't"
]

print("SpaCy Tokenization (Context-Aware):\n")
for text in test_texts:
    doc = nlp(text)
    tokens = [token.text for token in doc]
    print(f"Text:   {text}")
    print(f"Tokens: {tokens}\n")

# Token attributes
doc = nlp("The quick brown fox jumps over 123 dogs!")
print("\nToken Attributes:")
print(f"{'Token':<10} {'Is Alpha':<10} {'Is Digit':<10} {'Is Stop':<10} {'Is Punct':<10}")
print("-" * 55)
for token in doc:
    print(f"{token.text:<10} {str(token.is_alpha):<10} {str(token.is_digit):<10} {str(token.is_stop):<10} {str(token.is_punct):<10}")

SpaCy Tokenization (Context-Aware):

Text:   Dr. Smith isn't here. He's at N.Y.U.
Tokens: ['Dr.', 'Smith', 'is', "n't", 'here', '.', 'He', "'s", 'at', 'N.Y.U.']

Text:   Email: test@email.com
Tokens: ['Email', ':', 'test@email.com']

Text:   Price: $100.50
Tokens: ['Price', ':', '$', '100.50']

Text:   Website: https://example.com
Tokens: ['Website', ':', 'https://example.com']

Text:   don't won't can't
Tokens: ['do', "n't", 'wo', "n't", 'ca', "n't"]


Token Attributes:
Token      Is Alpha   Is Digit   Is Stop    Is Punct  
-------------------------------------------------------
The        True       False      True       False     
quick      True       False      False      False     
brown      True       False      False      False     
fox        True       False      False      False     
jumps      True       False      False      False     
over       True       False      True       False     
123        False      True       False      False     
dogs       True       False 
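
The flags in the table above (`is_alpha`, `is_digit`, `is_stop`, `is_punct`) are lexical attributes: they come from the tokenizer and vocabulary, not from a trained component. A minimal sketch, assuming only that spaCy itself is installed: `spacy.blank("en")` builds a tokenizer-only English pipeline, so no model download is needed. The variable names and the sample sentence are illustrative.

```python
import spacy

# Tokenizer-only English pipeline: no tagger, parser or NER is loaded
nlp_blank = spacy.blank("en")

doc = nlp_blank("Mail test@email.com or visit https://example.com for 3 demos")

# Collect tokens by lexical flag; none of these need a trained model
emails = [t.text for t in doc if t.like_email]
urls = [t.text for t in doc if t.like_url]
numbers = [t.text for t in doc if t.like_num]

print(f"Pipeline components: {nlp_blank.pipe_names}")
print(f"Emails:  {emails}")
print(f"URLs:    {urls}")
print(f"Numbers: {numbers}")
```

Because these checks are cheap, they are a reasonable first filter before running heavier components.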

## 4. Lemmatization - No Stemming in SpaCy!

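SpaCy ships a **lemmatizer but no stemmer**. Lemmas are dictionary forms rather than truncated stems, which usually makes them more accurate. The English lemmatizer depends on the predicted POS tag, so a context-free word list can produce errors: in the output below, `geese` is mis-tagged as ADJ and left unchanged.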

In [10]:
# Test words
test_words = [
    "running", "runs", "ran", "runner",
    "better", "best", "good",
    "studies", "studying", "studied",
    "mice", "geese", "feet", "children"
]

doc = nlp(" ".join(test_words))

print("Lemmatization Results:")
print(f"{'Original':<15} {'Lemma':<15} {'POS':<10}")
print("-" * 45)
for token in doc:
    print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10}")

# Apply to sentence
sentence = "The runners were running faster than they ran yesterday"
doc = nlp(sentence)
lemmatized = [token.lemma_ for token in doc]

print(f"\nOriginal:   {sentence}")
print(f"Lemmatized: {' '.join(lemmatized)}")

Lemmatization Results:
Original        Lemma           POS       
---------------------------------------------
running         run             VERB      
runs            run             NOUN      
ran             run             VERB      
runner          runner          NOUN      
better          well            ADV       
best            good            ADJ       
good            good            ADJ       
studies         study           NOUN      
studying        study           VERB      
studied         study           VERB      
mice            mouse           NOUN      
geese           geese           ADJ       
feet            foot            NOUN      
children        child           NOUN      

Original:   The runners were running faster than they ran yesterday
Lemmatized: the runner be run fast than they run yesterday


## 5. Part-of-Speech (POS) Tagging - Highly Accurate

In [None]:
sentence = "The quick brown fox jumps over the lazy dog"
doc = nlp(sentence)

print("POS Tagging:")
print(f"{'Token':<12} {'POS':<8} {'Tag':<8} {'Description':<30}")
print("-" * 65)

for token in doc:
    # spacy.explain() returns None for unknown labels, so fall back to an empty string
    print(f"{token.text:<12} {token.pos_:<8} {token.tag_:<8} {spacy.explain(token.tag_) or '':<30}")

# Extract by POS
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
verbs = [token.text for token in doc if token.pos_ == "VERB"]
adjectives = [token.text for token in doc if token.pos_ == "ADJ"]

print(f"\nNouns:      {nouns}")
print(f"Verbs:      {verbs}")
print(f"Adjectives: {adjectives}")

## 6. Named Entity Recognition (NER) - Production Quality

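SpaCy's NER is **generally stronger** than NLTK's built-in `ne_chunk`: the English pipelines are trained on OntoNotes 5 and predict 18 entity types, including the `ORG`, `PERSON`, `GPE`, `DATE`, and `MONEY` labels used below.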

In [None]:
text = """
Apple Inc. CEO Tim Cook announced in Cupertino that the company earned $365.8 billion in 2021.
Microsoft and Google are also based in the United States. 
On January 15, 2024, the European Union imposed new regulations.
"""

doc = nlp(text)

print("Named Entities Detected:\n")
print(f"{'Entity':<30} {'Label':<15} {'Description':<40}")
print("-" * 90)

for ent in doc.ents:
    # spacy.explain() returns None for unknown labels, so fall back to an empty string
    print(f"{ent.text:<30} {ent.label_:<15} {spacy.explain(ent.label_) or '':<40}")

# Extract by entity type
organizations = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
locations = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
money = [ent.text for ent in doc.ents if ent.label_ == "MONEY"]
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

print(f"\nOrganizations: {organizations}")
print(f"People:        {people}")
print(f"Locations:     {locations}")
print(f"Dates:         {dates}")
print(f"Money:         {money}")

## 7. Stopwords & Punctuation Removal

In [None]:
print(f"Total stopwords in SpaCy: {len(STOP_WORDS)}")
print(f"Sample stopwords: {list(STOP_WORDS)[:20]}\n")

text = "The quick brown fox jumps over the lazy dog! It's amazing!!!"
doc = nlp(text)

print(f"Original: {text}\n")

# Remove stopwords
without_stopwords = [token.text for token in doc if not token.is_stop]
print(f"Without stopwords: {' '.join(without_stopwords)}")

# Remove punctuation
without_punct = [token.text for token in doc if not token.is_punct]
print(f"Without punct:     {' '.join(without_punct)}")

# Remove both
cleaned = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(f"Both removed:      {' '.join(cleaned)}")

# Only alphabetic
alpha_only = [token.text for token in doc if token.is_alpha]
print(f"Only alphabetic:   {' '.join(alpha_only)}")

# Content words only
content = [token.text for token in doc 
           if not token.is_stop and not token.is_punct and token.is_alpha]
print(f"Content words:     {' '.join(content)}")
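
Default stop word lists are not always appropriate. SpaCy's English list includes negations such as "not", which matter for sentiment-style tasks. The following is a hedged sketch of clearing the stop-word flag for one word on the vocabulary. It uses `spacy.blank("en")` so that it runs without a downloaded model, and the sample sentence is illustrative. The flag is stored per lexeme, so other casings (for example "Not") would need the same treatment.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer-only pipeline, no model download

text = "The service was not good"

# Default behaviour: "not" is flagged as a stop word and filtered out
before = [t.text for t in nlp_blank(text) if not t.is_stop]

# Clear the stop-word flag on the lexeme for the exact string "not"
nlp_blank.vocab["not"].is_stop = False

after = [t.text for t in nlp_blank(text) if not t.is_stop]

print(f"Before: {before}")
print(f"After:  {after}")
```

The same assignment with `True` flags a domain-specific word as a stop word.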

## 8. Dependency Parsing - Understand Grammar

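This is where SpaCy has a **clear advantage** over NLTK: every pretrained pipeline includes a dependency parser, so each token exposes its syntactic head (`token.head`), relation label (`token.dep_`), and children. NLTK ships no pretrained dependency parser and relies on external tools for this.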

In [None]:
sentence = "The CEO of Apple announced new products yesterday"
doc = nlp(sentence)

print("Dependency Parsing:\n")
print(f"{'Token':<12} {'Dependency':<12} {'Head':<12} {'Children':<30}")
print("-" * 70)

for token in doc:
    children = [child.text for child in token.children]
    print(f"{token.text:<12} {token.dep_:<12} {token.head.text:<12} {', '.join(children):<30}")

# Extract Subject-Verb-Object
print("\nSubject-Verb-Object Extraction:")
for token in doc:
    if token.dep_ == "ROOT":  # Root of the parse (usually the main verb)
        subject = [child.text for child in token.children if child.dep_ == "nsubj"]
        obj = [child.text for child in token.children if child.dep_ == "dobj"]
        print(f"Subject: {subject}")
        print(f"Verb:    {token.text}")
        print(f"Object:  {obj}")

## 9. Production-Ready Preprocessing Pipeline

In [None]:
def preprocess_spacy(text, 
                     lowercase=True,
                     remove_stopwords=True,
                     remove_punct=True,
                     remove_emails=True,
                     remove_urls=True,
                     lemmatize=True,
                     only_alpha=True,
                     min_token_len=2):
    """
    Production-grade text preprocessing with SpaCy
    
    Parameters:
    -----------
    text : str - Input text
    lowercase : bool - Convert to lowercase
    remove_stopwords : bool - Remove stopwords
    remove_punct : bool - Remove punctuation
    remove_emails : bool - Remove emails
    remove_urls : bool - Remove URLs
    lemmatize : bool - Lemmatize tokens
    only_alpha : bool - Keep only alphabetic
    min_token_len : int - Minimum token length
    
    Returns:
    --------
    list : Processed tokens
    """
    # Run the pipeline on the original casing: lowercasing first degrades
    # tagging and lemmatization, so case is normalized on output instead
    doc = nlp(text)
    
    tokens = []
    for token in doc:
        # Skip based on filters
        if remove_stopwords and token.is_stop:
            continue
        if remove_punct and token.is_punct:
            continue
        if remove_urls and token.like_url:
            continue
        if remove_emails and token.like_email:
            continue
        if only_alpha and not token.is_alpha:
            continue
        if len(token.text) < min_token_len:
            continue
        
        # Add lemma or original text, lowercased on request
        out = token.lemma_ if lemmatize else token.text
        tokens.append(out.lower() if lowercase else out)
    
    return tokens


# Test
test_text = """
Natural Language Processing (NLP) is AMAZING!!! 
Visit https://spacy.io for more info. Contact: test@email.com
The researchers are studying advanced AI techniques.
"""

print("Original:")
print(test_text)
print("\n" + "="*70 + "\n")

print("Full preprocessing:")
result = preprocess_spacy(test_text)
print(result)
print(f"Token count: {len(result)}")

print("\n" + "="*70 + "\n")

print("Minimal (keep stopwords):")
result_min = preprocess_spacy(test_text, remove_stopwords=False)
print(result_min)
print(f"Token count: {len(result_min)}")

## 10. Batch Processing - 10x Faster!

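For processing multiple documents, use `nlp.pipe()`: it batches texts internally and is usually much faster than calling `nlp()` in a loop. The benchmark below measures the actual speedup on your machine.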

In [None]:
# Sample documents
documents = [
    "Apple is releasing new products",
    "Microsoft announced quarterly earnings",
    "Google's AI research is advancing",
    "Amazon dominates cloud computing",
    "Tesla's stock price increased"
] * 100  # 500 documents

print(f"Processing {len(documents)} documents...\n")

# BAD: One-by-one (slow)
start = time.perf_counter()
docs_slow = [nlp(text) for text in documents]
time_slow = time.perf_counter() - start
print(f"❌ One-by-one:  {time_slow:.3f} seconds")

# GOOD: Batch with nlp.pipe()
start = time.perf_counter()
docs_fast = list(nlp.pipe(documents, batch_size=50))
time_fast = time.perf_counter() - start
print(f"✅ Batch (pipe): {time_fast:.3f} seconds")
print(f"\n🚀 Speedup: {time_slow/time_fast:.1f}x faster!")

# Extract entities efficiently
all_entities = []
for doc in nlp.pipe(documents[:10], batch_size=5):
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    all_entities.extend(entities)

print(f"\nEntities found: {len(all_entities)}")
print(f"Sample: {all_entities[:5]}")
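
When documents carry IDs or other metadata, `nlp.pipe()` can keep them aligned with the results through `as_tuples=True`, which accepts `(text, context)` pairs and yields `(doc, context)` pairs in the same order. A minimal sketch: the record IDs are made up, and `spacy.blank("en")` stands in for a trained pipeline so the example runs without a model download.

```python
import spacy

nlp_blank = spacy.blank("en")  # stand-in for spacy.load("en_core_web_sm")

# Hypothetical records: (text, metadata) pairs
records = [
    ("Apple is releasing new products", {"id": 101}),
    ("Microsoft announced quarterly earnings", {"id": 102}),
    ("Tesla's stock price increased", {"id": 103}),
]

results = []
# as_tuples=True passes the second element of each pair through untouched
for doc, context in nlp_blank.pipe(records, as_tuples=True, batch_size=2):
    results.append({"id": context["id"], "n_tokens": len(doc)})

for row in results:
    print(row)
```

This avoids keeping a separate index to match results back to their source rows.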

## 11. Disable Unused Components - Optimize Performance

In [None]:
# Full pipeline (slower)
nlp_full = spacy.load("en_core_web_sm")
print(f"Full pipeline: {nlp_full.pipe_names}")

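# Lighter pipeline (faster) - parser and NER disabled; tagger and lemmatizer still run
nlp_minimal = spacy.load("en_core_web_sm", disable=["parser", "ner"])
print(f"Minimal: {nlp_minimal.pipe_names}")

# NER only - also disable attribute_ruler and lemmatizer, which depend on the tagger
nlp_ner = spacy.load("en_core_web_sm", disable=["tagger", "parser", "attribute_ruler", "lemmatizer"])
print(f"NER only: {nlp_ner.pipe_names}")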

# Benchmark
text = "Apple Inc. is a technology company. " * 100  # trailing period and space keep the sentences separate

start = time.time()
doc = nlp_full(text)
full_time = time.time() - start

start = time.time()
doc = nlp_minimal(text)
minimal_time = time.time() - start

print(f"\nPerformance:")
print(f"Full pipeline:    {full_time:.4f}s")
print(f"Minimal pipeline: {minimal_time:.4f}s")
print(f"Speedup:          {full_time/minimal_time:.2f}x")
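
A single `time.time()` measurement like the one above is sensitive to warm-up effects and background load. A small helper that repeats the call and keeps the fastest run gives steadier numbers. This is a sketch using only the standard library: `best_of` is a hypothetical helper name, and the commented usage assumes the `nlp_full` and `nlp_minimal` objects defined in this section.

```python
import time

def best_of(func, *args, repeats=5):
    """Return the fastest wall-clock time (seconds) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload so the sketch runs on its own
elapsed = best_of(sorted, list(range(100_000, 0, -1)), repeats=3)
print(f"Best of 3: {elapsed:.4f}s")

# With the pipelines from this section, usage would look like:
# full_time = best_of(nlp_full, text)
# minimal_time = best_of(nlp_minimal, text)
```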

## 12. SpaCy vs NLTK - Direct Comparison

In [None]:
text = "The researchers are studying NLP techniques in California"

print("=" * 70)
print("NLTK Way (Multiple steps, complex):")
print("=" * 70)
print("""
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Step 1: Tokenize
tokens = word_tokenize(text.lower())

# Step 2: Remove stopwords
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w not in stop_words and w.isalpha()]

# Step 3: Lemmatize
lemmatizer = WordNetLemmatizer()
result = [lemmatizer.lemmatize(w) for w in filtered]
""")

print("\n" + "=" * 70)
print("SpaCy Way (One line, simple):")
print("=" * 70)

# SpaCy: ONE line!
doc = nlp(text.lower())
spacy_result = [token.lemma_ for token in doc 
                if not token.is_stop and not token.is_punct and token.is_alpha]

print(f"\nResult: {spacy_result}")

print("\n" + "=" * 70)
print("SpaCy Bonus (already on the same Doc, no extra calls):")
print("=" * 70)
print(f"Entities: {[(ent.text, ent.label_) for ent in doc.ents]}")
print(f"POS tags: {[(token.text, token.pos_) for token in doc][:5]}")
print(f"Dependencies: {[(token.text, token.dep_) for token in doc][:5]}")

## 13. Real-World Use Cases

### Use Case 1: Text Classification

In [None]:
def preprocess_for_classification(text):
    """Preprocess for ML classification"""
    doc = nlp(text.lower())
    return [token.lemma_ for token in doc 
            if not token.is_stop 
            and not token.is_punct 
            and token.is_alpha
            and len(token) > 2]

reviews = [
    "This product is amazing! Best purchase ever.",
    "Terrible quality. Very disappointed.",
    "Average product, nothing special."
]

print("Text Classification Preprocessing:\n")
for review in reviews:
    processed = preprocess_for_classification(review)
    print(f"Original:  {review}")
    print(f"Processed: {processed}\n")
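
The token lists produced by a function like `preprocess_for_classification` still have to become numeric features before a classifier can use them. One simple option is a bag-of-words count. The following is a minimal standard-library sketch: the token lists are hand-written placeholders rather than actual spaCy output, and `build_vocab` and `vectorize` are illustrative names.

```python
from collections import Counter

# Placeholder token lists (hand-written, not real model output)
processed_docs = [
    ["product", "amazing", "good", "purchase"],
    ["terrible", "quality", "disappointed"],
    ["average", "product", "special"],
]

def build_vocab(docs):
    """Map each distinct token to a column index, in sorted order."""
    distinct = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(distinct)}

def vectorize(tokens, vocab):
    """Count vector for one document; tokens outside the vocab are ignored."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]

vocab = build_vocab(processed_docs)
vectors = [vectorize(doc, vocab) for doc in processed_docs]

print(f"Vocabulary size: {len(vocab)}")
for vec in vectors:
    print(vec)
```

In practice a library vectorizer would usually replace these helpers; the sketch only shows what the preprocessing output feeds into.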

### Use Case 2: Information Extraction

In [None]:
def extract_key_info(text):
    """Extract structured info from text"""
    doc = nlp(text)
    
    return {
        'organizations': [ent.text for ent in doc.ents if ent.label_ == "ORG"],
        'people': [ent.text for ent in doc.ents if ent.label_ == "PERSON"],
        'locations': [ent.text for ent in doc.ents if ent.label_ == "GPE"],
        'dates': [ent.text for ent in doc.ents if ent.label_ == "DATE"],
        'money': [ent.text for ent in doc.ents if ent.label_ == "MONEY"]
    }

article = """
Tesla CEO Elon Musk announced on March 15, 2024 that the company 
will invest $10 billion in a new factory in Austin, Texas.
"""

info = extract_key_info(article)
print("Extracted Information:\n")
for key, values in info.items():
    if values:
        print(f"{key.capitalize()}: {values}")

### Use Case 3: Search/Retrieval

In [None]:
def preprocess_for_search(query, document):
    """Preprocess query and document for search"""
    def process(text):
        doc = nlp(text.lower())
        # Keep content words (nouns, verbs, adjectives)
        return [token.lemma_ for token in doc 
                if token.pos_ in ["NOUN", "VERB", "ADJ", "PROPN"]
                and not token.is_stop]
    
    return {
        'query': process(query),
        'document': process(document)
    }

query = "best machine learning courses"
document = "Learn machine learning with our comprehensive courses. The best way to master ML."

result = preprocess_for_search(query, document)
print("Search Preprocessing:")
print(f"Query:    {result['query']}")
print(f"Document: {result['document']}")

# Calculate overlap
overlap = set(result['query']) & set(result['document'])
print(f"\nMatching: {overlap}")
score = len(overlap) / len(result['query']) if result['query'] else 0.0  # guard against an empty query
print(f"Score: {score:.2f}")
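
The score above is the fraction of query terms found in the document, so it ignores how long the document is. A symmetric alternative is Jaccard similarity, sketched here over plain token lists. The sample lists are illustrative rather than spaCy output, and `jaccard` is a hypothetical helper name.

```python
def jaccard(tokens_a, tokens_b):
    """Size of the intersection divided by size of the union of two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    if not union:
        return 0.0  # both inputs empty: define similarity as zero
    return len(a & b) / len(union)

# Illustrative lemma lists
query_tokens = ["good", "machine", "learning", "course"]
doc_tokens = ["learn", "machine", "learning", "comprehensive",
              "course", "good", "way", "master"]

print(f"Jaccard: {jaccard(query_tokens, doc_tokens):.2f}")
```

Query coverage favors long documents that mention everything, while Jaccard penalizes extra unrelated terms. Which one fits depends on the retrieval task.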

## 14. Production Best Practices

### ✅ DO's

1. **Use `nlp.pipe()` for batches** - usually much faster than calling `nlp()` in a loop
2. **Disable unused components** - saves memory/time
3. **Choose right model**:
   - `sm`: Fast, most tasks
   - `md`: Word vectors needed
   - `lg`: Better accuracy
   - `trf`: Best accuracy (slower)
4. **Cache nlp object** - load once, reuse
5. **Work with Doc/Token objects** - don't convert to strings
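
Item 4 can be enforced with a small cached loader, so that repeated calls reuse one pipeline object instead of reloading the model. A sketch using `functools.lru_cache`: `make_cached_loader` is a hypothetical helper, and a dummy loader stands in for `spacy.load` so the example runs without a model installed.

```python
from functools import lru_cache

def make_cached_loader(load_fn):
    """Wrap a loader so each model name is loaded at most once."""
    @lru_cache(maxsize=None)
    def get(name):
        return load_fn(name)
    return get

# Dummy loader that records its calls. In real code, pass spacy.load instead:
#   get_nlp = make_cached_loader(spacy.load)
calls = []

def fake_load(name):
    calls.append(name)
    return object()

get_nlp = make_cached_loader(fake_load)
first = get_nlp("en_core_web_sm")
second = get_nlp("en_core_web_sm")

print(f"Same object reused: {first is second}")
print(f"Loader calls: {len(calls)}")
```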

### ❌ DON'Ts

1. **Don't process one-by-one** - use `nlp.pipe()`
2. **Don't load model in loops** - load once
3. **Don't use string operations** - use token attributes
4. **Don't convert to strings unnecessarily**
5. **Don't load full pipeline if not needed**

### Task-Specific Configurations

| Task | Model | Components |
|------|-------|------------|
| Classification | sm/md | tok2vec, tagger, lemmatizer |
| NER | lg/trf | tok2vec, ner |
| Sentiment | md | tok2vec, tagger, lemmatizer |
| Parsing | lg | all |
| Tokenization | sm | tokenizer only |
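
The table can be turned into a small lookup that computes which components to disable for a task. This is a sketch that assumes the pipeline exposes the component names of `en_core_web_sm` v3 (`tok2vec`, `tagger`, `parser`, `attribute_ruler`, `lemmatizer`, `ner`). `TASK_COMPONENTS` and `components_to_disable` are illustrative names. The English lemmatizer relies on POS information that passes through `attribute_ruler`, so that component is kept alongside `tagger` here.

```python
# Components to keep per task (assumed names from en_core_web_sm v3)
TASK_COMPONENTS = {
    "classification": ["tok2vec", "tagger", "attribute_ruler", "lemmatizer"],
    "ner": ["tok2vec", "ner"],
    "tokenization": [],
}

def components_to_disable(pipe_names, task):
    """Return the pipeline components that the given task does not need."""
    keep = set(TASK_COMPONENTS[task])
    return [name for name in pipe_names if name not in keep]

pipe_names = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]
for task in TASK_COMPONENTS:
    print(f"{task:15s} disable={components_to_disable(pipe_names, task)}")

# Usage, assuming the model is installed:
# nlp = spacy.load("en_core_web_sm", disable=components_to_disable(pipe_names, "ner"))
```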

## 15. Key Takeaways

### Why SpaCy is Industry Standard

1. **Speed**: 10-100x faster than NLTK
2. **Accuracy**: State-of-the-art models
3. **Simplicity**: One call does everything
4. **Features**: NER, parsing, vectors built-in
5. **Production**: Used by Google, Meta, Microsoft

### Migration from NLTK

```python
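# OLD (NLTK) - Multiple steps
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))  # build the set once, not per token
tokens = word_tokenize(text.lower())
tokens = [w for w in tokens if w not in stop_words]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(w) for w in tokens]

# NEW (SpaCy) - One pipeline call
doc = nlp(text.lower())
tokens = [token.lemma_ for token in doc if not token.is_stop]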
```

### Resources

- Documentation: https://spacy.io
- Free Course: https://course.spacy.io
- Models: https://spacy.io/models
- GitHub: https://github.com/explosion/spaCy

### Next Steps

1. Install SpaCy and download models
2. Replace NLTK code with SpaCy
3. Optimize with `nlp.pipe()` and component disabling
4. Explore word vectors and similarity
5. Deploy to production! 🚀