<font>
<div dir=ltr align=center>
<img src='https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png' width=150 height=150> <br>
<font color=0F5298 size=6>
Natural Language Processing<br>
<font color=2565AE size=4>
Computer Engineering Department<br>
Spring 2025<br>
<font color=3C99D size=4>
Workshop 1 - NLP Frameworks - spaCy<br>
<font color=696880 size=3>
<a href='https://language.ml'>https://language.ml</a><br>
info [AT] language [dot] ml

# 📖 Part 1: Introduction

## ❓ What is spaCy? ([spaCy Official Website](https://spacy.io))

**spaCy** is a modern, fast, and industrial-strength NLP library written in Python and Cython. It provides high-performance pipelines and pretrained models for many key NLP tasks.

---

## ✅ When to Use spaCy

- **Production Applications:** Optimized for speed and scalability.
- **Pretrained Models:** Ready-to-use, high-accuracy models for tokenization, tagging, parsing, NER, and more.
- **Deep Learning Integration:** Seamlessly integrates with transformer libraries (e.g. via `spacy-transformers`).
- **Modular Pipelines:** Customize and extend via simple APIs.

---

## 🚫 spaCy might not be ideal for:

- **Teaching Classical NLP:** NLTK is better for learning symbolic NLP tasks like treebanks or CFG parsing.
- **Highly Custom Tokenization or Parsing Logic:** spaCy offers custom rules, but for very experimental or rule-heavy systems, NLTK may still be better.
- **Resource-Constrained Environments:** Its models can be heavier than NLTK's rule-based ones.

## ⚙️ Installation & Setup

In [90]:
# Install spaCy and download the small and medium English models
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 4.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.5/33.5 MB 2.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [91]:
import spacy

In [92]:
# Same download as the shell command above, but triggered from Python
spacy.cli.download('en_core_web_md')

Collecting en-core-web-md==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# 💡 Part 2: Primer — What are `nlp` and `doc`

Before we explore spaCy’s features, let’s clarify two core concepts:

## 🔹 `nlp`: The Processing Pipeline

- `nlp` is the `Language` object returned by `spacy.load(...)`: a **pretrained English pipeline** with tokenization, tagging, parsing, NER, etc.
- `nlp` is a **callable pipeline**: pass text in, get a processed `Doc`.

In [93]:
nlp = spacy.load('en_core_web_sm')

## 🔹 `doc`: The Processed Text

- A `doc` is a `spacy.tokens.Doc` object, representing the full processed text.
- It contains:
    - Tokens: `doc[i]`
    - Sentences: `list(doc.sents)`
    - Entities: `doc.ents`
    - Annotations: POS tags, lemmas, dependencies, etc.

In [94]:
doc = nlp("Apple is looking at buying a startup in the UK.")

## 🔍 Quick Check

In [95]:
# 1. Tokens
print('Example token [doc[0]]:', doc[0], '→', type(doc[0]))

# 2. Sentences
print('Sentences [list(doc.sents)]:', list(doc.sents))

# 3. Entities
print('Entities [doc.ents]:', [(ent.text, ent.label_) for ent in doc.ents])

# 4. Annotations (POS, Lemma, Dependency)
print('\nAnnotations:')
for token in doc:
    print(f'{token.text:12} POS: {token.pos_:6}  Lemma: {token.lemma_:10}  Dep: {token.dep_:6}')

Example token [doc[0]]: Apple → <class 'spacy.tokens.token.Token'>
Sentences [list(doc.sents)]: [Apple is looking at buying a startup in the UK.]
Entities [doc.ents]: [('Apple', 'ORG'), ('UK', 'GPE')]

Annotations:
Apple        POS: PROPN   Lemma: Apple       Dep: nsubj 
is           POS: AUX     Lemma: be          Dep: aux   
looking      POS: VERB    Lemma: look        Dep: ROOT  
at           POS: ADP     Lemma: at          Dep: prep  
buying       POS: VERB    Lemma: buy         Dep: pcomp 
a            POS: DET     Lemma: a           Dep: det   
startup      POS: NOUN    Lemma: startup     Dep: dobj  
in           POS: ADP     Lemma: in          Dep: prep  
the          POS: DET     Lemma: the         Dep: det   
UK           POS: PROPN   Lemma: UK          Dep: pobj  
.            POS: PUNCT   Lemma: .           Dep: punct 


In [96]:
spacy.explain('GPE'), spacy.explain('PROPN'), spacy.explain('nsubj'), spacy.explain('aux')

('Countries, cities, states', 'proper noun', 'nominal subject', 'auxiliary')

# 🧠 Part 3: Core Linguistic Features

## 🔹 3.1 Tokenization & Sentence Segmentation

In [97]:
text = "SpaCy is a powerful NLP library. It’s designed for production use."
doc = nlp(text)

# Sentences
print('Sentences:', [sent.text for sent in doc.sents])

# Word tokens
print('Tokens:', [token.text for token in doc])

Sentences: ['SpaCy is a powerful NLP library.', 'It’s designed for production use.']
Tokens: ['SpaCy', 'is', 'a', 'powerful', 'NLP', 'library', '.', 'It', '’s', 'designed', 'for', 'production', 'use', '.']


## 🔹 3.2 POS Tagging & Morphology

### 🆚 `token.pos_` vs `token.tag_` — What's the Difference?

spaCy provides **two levels of part-of-speech (POS) tagging** for each token:

#### 🔹 `token.pos_`: Universal POS Tag
- This gives a **coarse-grained, language-independent** part of speech.
- Comes from the **Universal POS tag set** (used across many languages).
- Examples:
  - `'NOUN'`, `'VERB'`, `'ADJ'`, `'ADV'`, `'PROPN'`, `'AUX'`, `'DET'`, `'ADP'`

📌 Use this when you want a **high-level, consistent POS tag** (e.g., when training multilingual models).

#### 🔹 `token.tag_`: Language-Specific Detailed Tag
- This gives a **fine-grained, language-specific** POS tag.
- In English models, it uses the **Penn Treebank tag set**.
- Examples:
  - `'NN'` (singular noun), `'NNS'` (plural noun), `'VBZ'` (3rd person singular present verb), `'VBN'` (past participle), `'JJ'` (adjective)

📌 Use this when you need **grammatical detail** (e.g., distinguishing singular vs. plural, tense, voice).

In [98]:
sample = "She walks."
sample_doc = nlp(sample)
sample_token = sample_doc[1]  # "walks"

print('sample_token.text:', sample_token.text)
print('sample_token.pos_:', sample_token.pos_)
print('sample_token.tag_:', sample_token.tag_)

sample_token.text: walks
sample_token.pos_: VERB
sample_token.tag_: VBZ


### 📖 Understanding Morphological Features

In spaCy, the `token.morph` attribute provides **morphological information** about each token—i.e., the grammatical features that describe how a word is inflected. These features include:

- **Number** (e.g., `Number=Sing` for singular, `Number=Plur` for plural)
- **Tense** (e.g., `Tense=Pres` for present, `Tense=Past` for past)
- **Mood** (e.g., `Mood=Ind` for indicative, `Mood=Imp` for imperative)
- **Person** (e.g., `Person=3` for third person)
- **Aspect** (e.g., `Aspect=Perf` for perfect)
- **Degree** (for adjectives/adverbs, e.g., `Degree=Pos` for positive, `Degree=Cmp` for comparative)
- **Definiteness** and **Pronoun Types** (for determiners/pronouns)
- **Verb Form** (e.g., `VerbForm=Fin` for finite forms, `VerbForm=Part` for participles)
- **PunctType** (for punctuation, e.g., `PunctType=Peri` for period)

`token.morph` is a `MorphAnalysis` object; printing it (or calling `str()` on it) gives the features as a pipe-separated string of `Feature=Value` pairs.
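To make the `Feature=Value|Feature=Value` format concrete, here is a minimal pure-Python sketch that splits such a string into a dictionary. The helper name `parse_morph` is hypothetical and only for illustration; in spaCy itself, `token.morph.to_dict()` and `token.morph.get("Number")` give the same information without manual parsing.

```python
def parse_morph(morph_string: str) -> dict[str, str]:
    """Split 'Number=Sing|Person=3' into {'Number': 'Sing', 'Person': '3'}."""
    if not morph_string:
        return {}  # tokens such as "for" carry no morphological features
    return dict(pair.split("=", 1) for pair in morph_string.split("|"))


features = parse_morph("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
print(features["Tense"])
print(features["Number"])
print(parse_morph(""))
```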

In [99]:
for token in doc:
    print(f'{token.text:12} POS: {token.pos_:8} Tag: {token.tag_:6} Morph: {token.morph}')

SpaCy        POS: PROPN    Tag: NNP    Morph: Number=Sing
is           POS: AUX      Tag: VBZ    Morph: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
a            POS: DET      Tag: DT     Morph: Definite=Ind|PronType=Art
powerful     POS: ADJ      Tag: JJ     Morph: Degree=Pos
NLP          POS: PROPN    Tag: NNP    Morph: Number=Sing
library      POS: NOUN     Tag: NN     Morph: Number=Sing
.            POS: PUNCT    Tag: .      Morph: PunctType=Peri
It           POS: PRON     Tag: PRP    Morph: Gender=Neut|Number=Sing|Person=3|PronType=Prs
’s           POS: AUX      Tag: VBD    Morph: Tense=Past|VerbForm=Fin
designed     POS: VERB     Tag: VBN    Morph: Aspect=Perf|Tense=Past|VerbForm=Part
for          POS: ADP      Tag: IN     Morph: 
production   POS: NOUN     Tag: NN     Morph: Number=Sing
use          POS: NOUN     Tag: NN     Morph: Number=Sing
.            POS: PUNCT    Tag: .      Morph: PunctType=Peri


In [100]:
spacy.explain('AUX'), spacy.explain('VBZ')

('auxiliary', 'verb, 3rd person singular present')

## 🔹 3.3 Lemmatization

In [101]:
for token in doc:
    print(f'{token.text:12} Lemma: {token.lemma_}')

SpaCy        Lemma: SpaCy
is           Lemma: be
a            Lemma: a
powerful     Lemma: powerful
NLP          Lemma: NLP
library      Lemma: library
.            Lemma: .
It           Lemma: it
’s           Lemma: ’s
designed     Lemma: design
for          Lemma: for
production   Lemma: production
use          Lemma: use
.            Lemma: .


## 🔹 3.4 Dependency Parsing

In [102]:
from spacy import displacy

# Visualize dependency tree in the notebook
displacy.render(doc, style='dep', jupyter=True)

In [103]:
for token in doc:
    print(f'{token.text:12} Head: {token.head.text:12} Dep: {token.dep_}')

SpaCy        Head: is           Dep: nsubj
is           Head: is           Dep: ROOT
a            Head: library      Dep: det
powerful     Head: library      Dep: amod
NLP          Head: library      Dep: compound
library      Head: is           Dep: attr
.            Head: is           Dep: punct
It           Head: designed     Dep: nsubjpass
’s           Head: designed     Dep: auxpass
designed     Head: designed     Dep: ROOT
for          Head: designed     Dep: prep
production   Head: use          Dep: compound
use          Head: for          Dep: pobj
.            Head: designed     Dep: punct
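The head/label table above describes a tree, and spaCy lets you walk it with `token.children` and `token.subtree`. The sketch below rebuilds the first sentence by hand so it runs without a trained model; the variable names and the hand-written `heads`/`deps` lists (copied from the table) are assumptions made for illustration, not model output.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Words, heads, and labels copied from the first sentence in the table above.
words = ["SpaCy", "is", "a", "powerful", "NLP", "library", "."]
heads = [1, 1, 5, 5, 5, 1, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "amod", "compound", "attr", "punct"]

doc_manual = Doc(Vocab(), words=words, heads=heads, deps=deps)

# The root is the token labelled ROOT (its head is itself).
root = next(tok for tok in doc_manual if tok.dep_ == "ROOT")
root_children = [child.text for child in root.children]
library_subtree = [tok.text for tok in doc_manual[5].subtree]

print("Root:", root.text)
print("Children of root:", root_children)
print("Subtree of 'library':", library_subtree)
```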


## 🔹 3.5 Named Entity Recognition (NER)

In [104]:
for ent in doc.ents:
    print(f'{ent.text:20} Label: {ent.label_}')

SpaCy                Label: PERSON
NLP                  Label: ORG


In [105]:
# style (str): Visualisation style, 'dep' or 'ent'.
displacy.render(doc, style='ent', jupyter=True)

## 🔹 3.6 Sentence Segmentation

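By default, the pretrained pipelines set sentence boundaries with the dependency parser rather than with punctuation rules alone. A rule-based `sentencizer` component is available as a faster alternative.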

In [106]:
text = "SpaCy is amazing. It can detect sentence boundaries accurately—even with abbreviations like U.S.A."
doc = nlp(text)

# Print each sentence separately
for i, sent in enumerate(doc.sents, 1):
    print(f"Sentence {i}:", sent.text)

Sentence 1: SpaCy is amazing.
Sentence 2: It can detect sentence boundaries accurately—even with abbreviations like U.S.A.
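For comparison, here is a minimal sketch of purely rule-based sentence splitting with spaCy's built-in `sentencizer` component on a blank pipeline, so no model download is needed. The name `nlp_rules` and the sample text are illustrative assumptions.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")  # rule-based: splits on sentence-final punctuation

doc_rules = nlp_rules("This is one sentence. Is this another? Yes!")
sentences = [sent.text for sent in doc_rules.sents]
print(sentences)
```

Because the `sentencizer` only looks at punctuation, it is fast but can mis-split text where punctuation does not mark a sentence end.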


## 🔹 3.7 Vectors & Similarity
> Requires a model with word vectors, e.g. `en_core_web_md` or `en_core_web_lg`.

In [107]:
doc1 = nlp("NLP")
doc2 = nlp("natural language processing")

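# Compute the similarity between the two Docs.
# Note: en_core_web_sm ships without word vectors, so spaCy emits a UserWarning here and the score is not reliable.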
print(f"Similarity: {doc1.similarity(doc2):.2f}")

Similarity: 0.22




In [108]:
nlp_md = spacy.load('en_core_web_md')

doc1 = nlp_md("NLP")
doc2 = nlp_md("natural language processing")

# Compute cosine similarity between the two Doc vectors
print(f"Similarity: {doc1.similarity(doc2):.2f}")

Similarity: 0.09
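The comments in the cells above describe the score as cosine similarity. As a minimal pure-Python sketch of that measure: the helper `cosine_similarity` and the toy vectors below are illustrative assumptions, while spaCy by default applies the same idea to `doc.vector`, the average of the token vectors.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


same = cosine_similarity([1.0, 0.0], [1.0, 0.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(same, orthogonal, opposite)
```

The score depends only on direction, not on vector length, which is why averaging token vectors still gives a usable document-level comparison.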


# 🤖 Part 4: Rule-based Matching

spaCy allows you to find words, patterns, and structures in text using rule-based matchers. These are very fast and useful when:
- You don’t need machine learning
- You want precise, custom control over matching behavior

In [109]:
nlp = spacy.load('en_core_web_sm')

## 🔹 4.1 What is `nlp.vocab`?

When you load a model with `nlp = spacy.load('en_core_web_sm')`, spaCy creates a shared vocabulary object:

In [110]:
vocab = nlp.vocab
vocab

<spacy.vocab.Vocab at 0x1496d5f30>

This object, of type `spacy.vocab.Vocab`, contains all language-specific data used across the NLP pipeline. It is:

- **Shared by all components** (tokenizer, parser, tagger, matcher, etc.)
- **Memory-efficient:** words are mapped to IDs instead of being repeated as strings

### 📦 `nlp.vocab` includes:

- **Lexemes** – Basic word representations (with spelling, shape, etc.)
- **StringStore** – Maps strings to IDs and vice versa
- **Word Vectors** – If available, dense embeddings for similarity
- **Lookups** – Custom mappings (e.g., for lemmatization)

In [111]:
lex = nlp_md.vocab['apple']

# lex.is_alpha -> Is this word made of alphabetic characters only?
# lex.shape_ -> The shape of the word (e.g., "Xxxxx" for "Apple")
print(lex.text, lex.is_alpha, lex.shape_)
print(lex.vector)

print(nlp.vocab.strings['dog'])

apple True xxxx
[-0.6334     0.18981   -0.53544   -0.52658   -0.30001    0.30559
 -0.49303    0.14636    0.012273   0.96802    0.0040354  0.25234
 -0.29864   -0.014646  -0.24905   -0.67125   -0.053366   0.59426
 -0.068034   0.10315    0.66759    0.024617  -0.37548    0.52557
  0.054449  -0.36748   -0.28013    0.090898  -0.025687  -0.5947
 -0.24269    0.28603    0.686      0.29737    0.30422    0.69032
  0.042784   0.023701  -0.57165    0.70581   -0.20813   -0.03204
 -0.12494   -0.42933    0.31271    0.30352    0.09421   -0.15493
  0.071356   0.15022   -0.41792    0.066394  -0.034546  -0.45772
  0.57177   -0.82755   -0.27885    0.71801   -0.12425    0.18551
  0.41342   -0.53997    0.55864   -0.015805  -0.1074    -0.29981
 -0.17271    0.27066    0.043996   0.60107   -0.353      0.6831
  0.20703    0.12068    0.24852   -0.15605    0.25812    0.007004
 -0.10741   -0.097053   0.085628   0.096307   0.20857   -0.23338
 -0.077905  -0.030906   1.0494     0.55368   -0.10703    0.052234
  0.43407

In [112]:
print(nlp.vocab.strings[7562983679033046312])

dog
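The two lookups above go through the vocabulary's `StringStore`. Here is a small standalone sketch of the same round trip using `spacy.strings.StringStore` directly, so no model is required; the variable names are illustrative.

```python
from spacy.strings import StringStore

store = StringStore()
dog_id = store.add("dog")  # returns the 64-bit hash used as the ID

print(dog_id)
print(store[dog_id])           # ID -> string
print(store["dog"] == dog_id)  # string -> ID

# The hash depends only on the string, so a separate store yields the same ID.
other_store = StringStore(["dog"])
print(other_store["dog"] == dog_id)
```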


In [113]:
print(nlp.vocab['Apple'].shape_)
print(nlp.vocab['NLP'].shape_)
print(nlp.vocab['2023'].shape_)

Xxxxx
XXX
dddd
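The shapes printed above follow a simple rule: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, and runs of the same shape character are cut off after four. The function below is a pure-Python approximation written for illustration (the name `word_shape` and the details are assumptions, not spaCy's internal code).

```python
def word_shape(text: str) -> str:
    """Approximate a spaCy-style word shape, e.g. 'Apple' -> 'Xxxxx'."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as-is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= 4:  # keep at most four repeats of the same shape character
            shape.append(mapped)
    return "".join(shape)


for word in ["Apple", "NLP", "2023", "apple", "U.K."]:
    print(word, "->", word_shape(word))
```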


## 🔹 4.2 Token Pattern Matching with `Matcher`

In [114]:
from spacy.matcher import Matcher

text = "Apple is looking to buy a U.K. startup for $1 billion."
doc = nlp(text)

# Match pattern: a proper noun followed by an auxiliary and then a verb
pattern = [
    {"POS": "PROPN"},
    {"POS": "AUX"},
    {"POS": "VERB"}
]

matcher = Matcher(nlp.vocab)
matcher.add("PROPN_VERB_PATTERN", [pattern])
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]
    print("Match found:", span.text)

Match found: Apple is looking


You can see available token attributes [here](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes) and [here](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes-extended).

In [115]:
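# A single pattern: "hello", any punctuation token, then "world" (case-insensitive)
p1 = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
# A list of alternative patterns, in the form expected by matcher.add(name, patterns)
p2 = [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
    [{"LOWER": "hello"}, {"LOWER": "world"}]
]
# A single-token pattern: any token whose fine-grained tag starts with "V"
p3 = [{"TAG": {"REGEX": "^V"}}]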
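The patterns above are only defined, not run. As a sketch of how the first two could be used, the example below registers both alternatives under one rule name on a blank pipeline; `LOWER` and `IS_PUNCT` are lexical attributes, so no trained model is needed. The names `nlp_blank`, `hello_matcher`, and `HELLO_WORLD` are illustrative, and the `TAG` regex pattern is left out because it requires a tagger.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # tokenizer only
hello_matcher = Matcher(nlp_blank.vocab)

# Two alternative patterns registered under a single rule name.
hello_matcher.add("HELLO_WORLD", [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
    [{"LOWER": "hello"}, {"LOWER": "world"}],
])

doc_hello = nlp_blank("Hello, world! Then hello world again.")
found = [doc_hello[start:end].text for _, start, end in hello_matcher(doc_hello)]
print(found)
```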

## 🔹 4.3 Phrase Matching with `PhraseMatcher`

In [116]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
phrases = ["natural language processing", "deep learning", "machine learning"]
patterns = [nlp.make_doc(phrase) for phrase in phrases]  # tokenize only; the full pipeline is not needed for patterns

matcher.add("AI_PHRASES", patterns)

doc = nlp("I love natural language processing and machine learning.")

matches = matcher(doc)
for match_id, start, end in matches:
    print("Phrase match:", doc[start:end].text)

Phrase match: natural language processing
Phrase match: machine learning
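By default, `PhraseMatcher` compares the exact token text, so `Machine Learning` would not match the lowercase phrase. A minimal sketch of case-insensitive matching, assuming a blank pipeline and illustrative names (`nlp_blank`, `ci_matcher`, `terms`), passes `attr="LOWER"` when creating the matcher:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")  # tokenizer only; enough for phrase matching

# attr="LOWER" compares lowercased token text instead of the exact text.
ci_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
terms = ["machine learning", "deep learning"]
ci_matcher.add("AI_PHRASES", [nlp_blank.make_doc(term) for term in terms])

doc_ci = nlp_blank.make_doc("Machine Learning and DEEP learning are related.")
hits = [doc_ci[start:end].text for _, start, end in ci_matcher(doc_ci)]
print(hits)
```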


## 🔹 4.4 Rule-based NER with `EntityRuler`

You can use an `EntityRuler` to create **custom named entities** using patterns.

In [117]:
patterns = [
    {"label": "SOFTWARE", "pattern": "spaCy"},
    {"label": "ORG", "pattern": [{"LOWER": "openai"}]},
]

ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns(patterns)

doc = nlp("I use spaCy and OpenAI tools.")
for ent in doc.ents:
    print(ent.text, "→", ent.label_)

spaCy → SOFTWARE
OpenAI → ORG


# 🔄 Part 5: Processing Pipelines

spaCy’s processing flow is modular and transparent. The `nlp` object you’ve been using is actually a **pipeline of components**: the tokenizer runs first and creates the `Doc`, then each component listed in the pipeline performs a task like tagging, parsing, or NER.

You can inspect, modify, and even extend this pipeline to add your own logic.
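Conceptually, calling `nlp(text)` tokenizes the text and then hands the resulting document to each component in order. The toy sketch below mimics that flow in plain Python; every name in it (`ToyDoc`, `tokenize`, `run_pipeline`, and the two components) is hypothetical and stands in for spaCy's real classes.

```python
from typing import Callable

# A stand-in for spaCy's Doc: just a dict that components annotate.
ToyDoc = dict
Component = Callable[[ToyDoc], ToyDoc]


def tokenize(text: str) -> ToyDoc:
    return {"text": text, "tokens": text.split()}


def add_lengths(doc: ToyDoc) -> ToyDoc:
    doc["lengths"] = [len(tok) for tok in doc["tokens"]]
    return doc


def add_uppercase_flags(doc: ToyDoc) -> ToyDoc:
    doc["is_upper"] = [tok.isupper() for tok in doc["tokens"]]
    return doc


def run_pipeline(text: str, pipeline: list[tuple[str, Component]]) -> ToyDoc:
    doc = tokenize(text)               # the tokenizer creates the document
    for _name, component in pipeline:  # each component receives and returns it
        doc = component(doc)
    return doc


toy_pipeline = [("lengths", add_lengths), ("upper", add_uppercase_flags)]
result = run_pipeline("spaCy runs NLP pipelines", toy_pipeline)
print(result["lengths"])
print(result["is_upper"])
```

The `(name, component)` pairs mirror what `nlp.pipeline` returns, which is why components can be inspected, reordered, or disabled by name.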

## 🔹 5.1 View the Pipeline

In [118]:
# Print the names and order of pipeline components
print(nlp.pipe_names)
print(nlp.pipeline)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'entity_ruler', 'ner']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x13d747a00>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x13d747dc0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x13ee78190>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x13eabae80>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x118996c00>), ('entity_ruler', <spacy.pipeline.entityruler.EntityRuler object at 0x11ab0ce40>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x13ee78120>)]


## 🔹 5.2 Run Individual Components

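Each component in the pipeline can be retrieved by name with `nlp.get_pipe()` and called directly on a `Doc`.

In [119]:
doc = nlp("OpenAI develops advanced AI models.")  # the full pipeline has already run here
tagger = nlp.get_pipe("tagger")
tagger(doc)  # re-applies only the tagger to the same Doc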

print([(token.text, token.pos_) for token in doc])

[('OpenAI', 'PROPN'), ('develops', 'VERB'), ('advanced', 'ADJ'), ('AI', 'PROPN'), ('models', 'NOUN'), ('.', 'PUNCT')]


## 🔹 5.3 Reordering Components

In [120]:
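nlp2 = spacy.load('en_core_web_sm')
source_nlp = spacy.load('en_core_web_sm')  # second copy to take the trained component from

nlp2.remove_pipe('ner')
# add_pipe('ner') alone would create a new, untrained component,
# so copy the trained one from the source pipeline instead
nlp2.add_pipe('ner', source=source_nlp, after='parser')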

print(nlp2.pipe_names)

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']


## 🔹 5.4 Disable Components Temporarily

To speed things up or test specific behavior, you can disable parts of the pipeline:

In [121]:
print('NLP pipeline before disabling:', nlp.pipe_names)

with nlp.select_pipes(disable=["ner", "parser"]):
    print('NLP pipeline after disabling:', nlp.pipe_names)

    doc = nlp("Google is based in California.")
    print([(token.text, token.pos_) for token in doc])  # No NER or parsing

NLP pipeline before disabling: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'entity_ruler', 'ner']
NLP pipeline after disabling: ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'entity_ruler']
[('Google', 'PROPN'), ('is', 'AUX'), ('based', 'VERB'), ('in', 'ADP'), ('California', 'PROPN'), ('.', 'PUNCT')]


## 🔹 5.5 Add a Custom Component

You can insert your own logic into the pipeline using the `@Language.component` decorator.

In [122]:
from spacy.language import Language


@Language.component("print_token_count")
def print_token_count(doc):
    print(f"Processed {len(doc)} tokens.")
    return doc


# Add it to the end of the pipeline
# Position it with first=True, last=True, before='ner', after='parser', etc.
nlp.add_pipe("print_token_count", last=True)

# Now every time you run `nlp`, the component is triggered
doc = nlp("This sentence will trigger our custom component.")

Processed 8 tokens.


# 🔚 Part 6: Summary & Comparison


## ✅ What We’ve Learned in This spaCy Section

| Topic                           | What We Did                                                                                  |
|---------------------------------|----------------------------------------------------------------------------------------------|
| **Tokenization & Segmentation** | Used `doc = nlp(text)` and `doc.sents` to extract tokens and sentences                       |
| **POS Tagging & Morphology**    | Accessed `token.pos_`, `token.tag_`, and `token.morph`                                       |
| **Lemmatization**               | Retrieved `token.lemma_`                                                                     |
| **Dependency Parsing**          | Visualized with `displacy.render(..., style='dep')` and inspected `token.dep_`, `token.head` |
| **Named Entity Recognition**    | Extracted `doc.ents` and visualized with `displacy.render(..., style='ent')`                 |
| **Rule-based Matching**         | Matched patterns using `Matcher`, `PhraseMatcher`, and `EntityRuler`                         |
| **Processing Pipelines**        | Inspected `nlp.pipe_names`, disabled components, and added a custom component                |

## 🔁 Comparison: NLTK vs. spaCy

| Feature                       | NLTK                                            | spaCy                                         |
|-------------------------------|-------------------------------------------------|-----------------------------------------------|
| **Language support**          | English; limited others                         | Multilingual; many high-quality models        |
| **Tokenization**              | Manual with `punkt`; slower                     | Fast, built-in tokenizer                      |
| **POS tagging**               | Classical tagger (`averaged_perceptron_tagger`) | Statistical models; Universal & detailed tags |
| **Morphology**                | Basic lemmatization; no morph features          | Rich `.morph` attribute                       |
| **Dependency parsing**        | Wrappers for external parsers (e.g. MaltParser) | Fast, accurate parser                         |
| **Named Entity Recognition**  | Classifier-based `ne_chunk`; limited accuracy   | ML-based NER + `EntityRuler`                  |
| **Rule-based matching**       | Regex chunking only (`RegexpParser`)            | `Matcher`, `PhraseMatcher`, `EntityRuler`     |
| **Processing pipeline**       | Not modular by default                          | Modular `nlp` pipeline; easy to extend        |
| **Word vectors & similarity** | None                                            | Supported in medium/large models              |
| **Ease of use**               | Great for learning                              | Great for production and prototyping          |

> 📌 **TL;DR:**
> - **NLTK** is ideal for learning classical NLP concepts and quick prototyping.
> - **spaCy** excels in performance, modularity, and modern, production-ready NLP workflows.

# 🧪 Part 7: Exercises

## 🔹 7.1 Write a `preprocess` function

Create a Python function with this signature:

In [1]:
from spacy.language import Language


def preprocess(text: str, nlp: Language) -> list[str]:
    """
    1. Run the text through nlp to get a Doc.
    2. For each token, keep only alphabetic tokens that are not stopwords
       and whose POS is one of NOUN, VERB, ADJ, or ADV.
    3. Return a list of token.lemma_.lower() values.
    """
    # Your code here
    pass

In [124]:
text = "SpaCy pipelines make it easy to build robust NLP systems."
print(preprocess(text, nlp))
# Expected: ['spacy', 'pipeline', 'easy', 'build', 'robust', 'system']

None


### ✅ Answer

In [125]:
from spacy.language import Language


def preprocess(text: str, nlp: Language) -> list[str]:
    """
    1. Run the text through nlp to get a Doc.
    2. Keep only alphabetic tokens that are not stopwords
       and whose POS is one of NOUN, VERB, ADJ, or ADV.
    3. Return a list of token.lemma_.lower() values.
    """
    doc = nlp(text)

    my_list = []
    for token in doc:
        if token.is_alpha and not token.is_stop and token.pos_ in {'NOUN', 'VERB', 'ADJ', 'ADV'}:
            my_list.append(token.lemma_.lower())

    return my_list

In [126]:
text = "SpaCy pipelines make it easy to build robust NLP systems."
print(preprocess(text, nlp))

Processed 11 tokens.
['spacy', 'pipeline', 'easy', 'build', 'robust', 'system']


## 🔹 7.2 Turn `preprocess` into a spaCy component
Use the `@Language.component` decorator to wrap your function so it can be inserted into `nlp.pipeline`:

In [127]:
nlp = spacy.load("en_core_web_sm")

In [128]:
from spacy.language import Language
from spacy.tokens import Doc


@Language.component("simple_preprocessor")
def simple_preprocessor(doc: Doc) -> Doc:
    """
    1. Apply your preprocessing logic to doc
    2. Store the result on the Doc, e.g. doc.user_data["tokens"] = [...]
    """
    return doc


# Add it to the pipeline before "ner"
nlp.add_pipe("simple_preprocessor", before="ner")

<function __main__.simple_preprocessor(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

In [129]:
text = "SpaCy pipelines make it easy to build robust NLP systems."
doc = nlp(text)
print(doc.user_data["tokens"] if "tokens" in doc.user_data else "No tokens found.")
# Expected: ['spacy', 'pipeline', 'easy', 'build', 'robust', 'system']

No tokens found.


### ✅ Answer

In [130]:
from spacy.language import Language


@Language.component('simple_preprocessor')
def simple_preprocessor(doc):
    """
    1. Apply your preprocessing logic to doc
    2. Store the result on the Doc, e.g. doc.user_data["tokens"] = [...]
    """

    my_list = []
    for token in doc:
        if token.is_alpha and not token.is_stop and token.pos_ in {'NOUN', 'VERB', 'ADJ', 'ADV'}:
            my_list.append(token.lemma_.lower())

    doc.user_data['tokens'] = my_list
    return doc


# Remove the stub version added earlier, then add the completed component before "ner"
nlp.remove_pipe("simple_preprocessor")
nlp.add_pipe("simple_preprocessor", before="ner")

<function __main__.simple_preprocessor(doc)>

In [131]:
text = "SpaCy pipelines make it easy to build robust NLP systems."
doc = nlp(text)
print(doc.user_data["tokens"])

['spacy', 'pipeline', 'easy', 'build', 'robust', 'system']


## 🔹 7.3 Speed it up by disabling components

For pure preprocessing you don’t need parsing or NER. Use `select_pipes` to disable them:

In [132]:
# Your code here

### ✅ Answer

In [133]:
with nlp.select_pipes(disable=['parser', 'ner']):
    doc_fast = nlp(text)
    fast_tokens = doc_fast.user_data['tokens']

print(fast_tokens)

['spacy', 'pipeline', 'easy', 'build', 'robust', 'system']


## 🔹 7.4 Challenge

- Extend your preprocessor to also remove tokens shorter than 3 characters.
- Measure the time difference between running with and without disabling “parser” and “ner.”
- Add an optional argument to your component so users can choose which POS tags to keep.

### ✅ Answer

In [134]:
@Language.component('simple_preprocessor2')
def simple_preprocessor2(doc):
    my_list = []
    for token in doc:
        if token.is_alpha and not token.is_stop and token.pos_ in {'NOUN', 'VERB', 'ADJ', 'ADV'} and len(token) >= 3:
            my_list.append(token.lemma_.lower())

    doc.user_data['tokens2'] = my_list
    return doc


# Replace the first preprocessor
nlp.remove_pipe('simple_preprocessor')
nlp.add_pipe('simple_preprocessor2', before='ner')

# Test
doc2 = nlp(text)
print(doc2.user_data['tokens2'])

['spacy', 'pipeline', 'easy', 'build', 'robust', 'system']


In [135]:
import time

# Full pipeline
start_full = time.perf_counter()
doc_full = nlp(text)
full_tokens = doc_full.user_data['tokens2']
time_full = time.perf_counter() - start_full
print("Full pipeline tokens:", full_tokens)

# Disable parser + ner
start_fast = time.perf_counter()
with nlp.select_pipes(disable=['parser', 'ner']):
    doc_fast = nlp(text)
    fast_tokens = doc_fast.user_data['tokens2']
time_fast = time.perf_counter() - start_fast
print("Fast pipeline tokens:", fast_tokens)

print(f'Full pipeline: {time_full:.6f}s, Without parser+ner: {time_fast:.6f}s')

Full pipeline tokens: ['spacy', 'pipeline', 'easy', 'build', 'robust', 'system']
Fast pipeline tokens: ['spacy', 'pipeline', 'easy', 'build', 'robust', 'system']
Full pipeline: 0.003256s, Without parser+ner: 0.001307s
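A single timed call like the one above is sensitive to noise such as caching and other processes. One common way to get a steadier number is to repeat the call and keep the fastest run. The helper below is a hypothetical sketch (`best_of` is not part of spaCy), shown with a stand-in workload so it runs on its own.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 20) -> float:
    """Return the fastest wall-clock time (in seconds) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; in the notebook this could be `lambda: nlp(text)`.
elapsed = best_of(lambda: sum(range(10_000)))
print(f"Best of 20 runs: {elapsed:.6f}s")
```

In the notebook, the same helper could be called once normally and once inside `with nlp.select_pipes(disable=['parser', 'ner']):` to compare the two settings.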


In [136]:
from spacy.language import Language


# A factory function must accept the named arguments `nlp` and `name`;
# spaCy raises ValueError [E964] if either is missing.
@Language.factory(
    'advanced_preprocessor',
    default_config={'pos_to_keep': ['NOUN', 'VERB'], 'min_length': 4}
)
def create_advanced_preprocessor(nlp, name, pos_to_keep, min_length):
    pos_set = set(pos_to_keep)  # build the lookup set once, not once per token

    def advanced_preprocessor(doc):
        my_list = []
        for token in doc:
            if token.is_alpha and not token.is_stop and token.pos_ in pos_set and len(token) >= min_length:
                my_list.append(token.lemma_.lower())

        doc.user_data['advanced_tokens'] = my_list
        return doc

    return advanced_preprocessor


# Add to the pipeline with the default config; pass config={...} to add_pipe to override it
nlp.add_pipe('advanced_preprocessor', before='ner')

# Test
doc3 = nlp(text)
print(doc3.user_data['advanced_tokens'])

['pipeline', 'build', 'system']
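The output above uses the factory's default settings. To show how a user could override them, here is a reduced sketch on a blank pipeline: the `length_filter` factory and its `min_length` setting are hypothetical and filter by length only, since a blank pipeline has no POS tags. The override itself goes through the `config` argument of `nlp.add_pipe`.

```python
import spacy
from spacy.language import Language


@Language.factory("length_filter", default_config={"min_length": 3})
def create_length_filter(nlp, name, min_length: int):
    def length_filter(doc):
        # Keep alphabetic tokens that are at least `min_length` characters long.
        doc.user_data["long_tokens"] = [
            tok.text for tok in doc if tok.is_alpha and len(tok) >= min_length
        ]
        return doc

    return length_filter


nlp_blank = spacy.blank("en")
# Override the default min_length of 3 for this pipeline instance.
nlp_blank.add_pipe("length_filter", config={"min_length": 5})

doc_filtered = nlp_blank("Pipelines make it easy to build robust systems.")
long_tokens = doc_filtered.user_data["long_tokens"]
print(long_tokens)
```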
