<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>spaCy Pipelines</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Natural Language Processing with spaCy)</span></div>

## Table of Contents
1. [spaCy Pipeline Basics](#section-1)
2. [Pipeline Structure and Components](#section-2)
3. [Adding Pipes and Optimization](#section-3)
4. [Analyzing Pipeline Components](#section-4)
5. [The spaCy EntityRuler](#section-5)
6. [Adding EntityRuler to the Pipeline](#section-6)
7. [EntityRuler in Action: Integration Strategies](#section-7)
8. [Regular Expressions (RegEx) with spaCy](#section-8)
9. [RegEx Implementation in spaCy](#section-9)
10. [The spaCy Matcher](#section-10)
11. [The spaCy PhraseMatcher](#section-11)
12. [Conclusion](#section-12)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. spaCy Pipeline Basics</span><br>

spaCy is a powerful library for Natural Language Processing (NLP). At its core, spaCy operates using a processing pipeline.

### How it works
1.  **Tokenization**: spaCy first tokenizes the text to produce a `Doc` object.
2.  **Processing**: The `Doc` is then processed in several different steps by the processing pipeline.

To use spaCy, we typically load a pre-trained model (like `en_core_web_sm`) which contains the pipeline definitions.



In [None]:
# Import the spaCy library
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Define example text
example_text = "Apple is looking at buying U.K. startup for $1 billion"

# Process the text to create a Doc object
doc = nlp(example_text)

# Verify the object type
print(type(doc))
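
To look at the tokenization step in isolation, here is a minimal sketch. It assumes a blank English pipeline (chosen here so that no model download is needed) and uses `nlp.make_doc`, which runs only the tokenizer.

```python
import spacy

# A blank English pipeline has a tokenizer but no other components
nlp = spacy.blank("en")

# make_doc() tokenizes the text and returns a Doc without running any pipes
doc = nlp.make_doc("Apple is looking at buying U.K. startup for $1 billion")

tokens = [token.text for token in doc]
print(tokens)

# No components have been added yet, so the list of pipe names is empty
print(nlp.pipe_names)
```

Note that the tokenizer keeps `U.K.` together as one token and splits `$1` into `$` and `1`.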



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Pipeline Structure and Components</span><br>

A **pipeline** is a sequence of pipes, or "actors," that perform operations on the data.

### The NER Pipeline
A typical spaCy Named Entity Recognition (NER) pipeline involves the following steps:
1.  **Tokenization**: Splitting text into words/punctuation.
2.  **Named entity identification**: Locating the entities.
3.  **Named entity classification**: Assigning labels (e.g., ORG, GPE) to entities.

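Conceptually, a pipeline built from an `EntityRuler` and an `EntityLinker` flows like this (pre-trained models such as `en_core_web_sm` use a statistical `ner` component for the entity steps instead):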
`Input text` -> `Tokenizer` -> `EntityRuler` -> `EntityLinker` -> `Doc with annotated entities`
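
The idea of a pipeline as an ordered sequence of pipes can be sketched as follows. The example assumes a blank English pipeline and two built-in components picked purely for illustration; `first=True` is the `add_pipe` argument that places a component at the start.

```python
import spacy

nlp = spacy.blank("en")

# Components run in the order in which they sit in the pipeline
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler", first=True)  # insert at the front instead of the end

# nlp.pipeline is a list of (name, component) tuples in execution order
for position, (name, component) in enumerate(nlp.pipeline):
    print(position, name, type(component).__name__)

pipe_names = nlp.pipe_names
print(pipe_names)  # ['entity_ruler', 'sentencizer']
```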



In [None]:
import spacy

# Load model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Accessing named entities processed by the pipeline
# We use list comprehension to extract text from doc.ents
print([ent.text for ent in doc.ents])



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. Adding Pipes and Optimization</span><br>

We can customize pipelines by adding specific components. One such component is the `sentencizer`, which performs sentence segmentation.

### Performance Comparison
Below, we compare the processing time of a full pipeline versus a blank pipeline with only a `sentencizer`.

#### 1. Using the Full Model (`en_core_web_sm`)
This model runs a tagger, a parser, and an NER component (among others) on every text, so processing takes longer.



In [None]:
import spacy
import time

# Create a large text dataset
text = " ".join(["This is a test sentence."]*10000)

# Load the full model
en_core_sm_nlp = spacy.load("en_core_web_sm")

# Start timer
start_time = time.time()

# Process text
doc = en_core_sm_nlp(text)

# Calculate duration
duration = (time.time() - start_time) / 60.0

print(f"Finished processing with en_core_web_sm model in {round(duration, 5)} minutes")



#### 2. Using a Blank Model with `sentencizer`
If we only need sentence segmentation, we can create a blank model and add the `sentencizer` pipe. This is significantly faster.



In [None]:
import spacy
import time

# Create a large text dataset
text = " ".join(["This is a test sentence."]*10000)

# Create a blank English model
blank_nlp = spacy.blank("en")

# Add the sentencizer pipe
blank_nlp.add_pipe("sentencizer")

# Start timer
start_time = time.time()

# Process text
doc = blank_nlp(text)

# Calculate duration
duration = (time.time() - start_time) / 60.0

print(f"Finished processing with blank model in {round(duration, 5)} minutes")
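
To confirm that the blank pipeline still delivers sentence boundaries, the following sketch counts the sentences produced by the `sentencizer`. The smaller input size and the use of `time.perf_counter` are choices made for this example, not part of the comparison above.

```python
import time
import spacy

# Blank pipeline with rule-based sentence segmentation only
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# A smaller dataset than above keeps the example quick
text = " ".join(["This is a test sentence."] * 100)

start = time.perf_counter()
doc = nlp(text)
elapsed = time.perf_counter() - start

# doc.sents is a generator of Span objects, one per sentence
sentences = list(doc.sents)
print(f"Found {len(sentences)} sentences in {elapsed:.4f} seconds")
print(sentences[0].text)
```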



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> Using a blank model with specific pipes is highly efficient when you do not need the full capabilities (like dependency parsing or NER) of the pre-trained models. </div>

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Analyzing Pipeline Components</span><br>

To understand what is happening inside a pipeline, spaCy provides the `nlp.analyze_pipes()` method.

This method reports:
*   The attributes that each pipeline component sets.
*   The scores that each component produces during training.
*   Whether every attribute a component requires is set by an earlier component.

Setting `pretty=True` prints a readable table.



In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Analyze the pipeline and print the results
analysis = nlp.analyze_pipes(pretty=True)



### Pipeline Overview Table
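The output depends on the components in the pipeline. A representative table is shown below; the `entity_linker` row appears only when that component has been added, since `en_core_web_sm` does not ship with one: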

| # | Component | Assigns | Requires | Scores | Retokenizes |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | tok2vec | doc.tensor | | | False |
| 1 | tagger | token.tag | | tag_acc | False |
| 2 | parser | token.dep, token.head, token.is_sent_start, doc.sents | | dep_uas, dep_las, sents_p, sents_r, sents_f | False |
| 3 | attribute_ruler | | | | False |
| 4 | lemmatizer | token.lemma | | lemma_acc | False |
| 5 | ner | doc.ents, token.ent_iob, token.ent_type | | ents_f, ents_p, ents_r | False |
| 6 | entity_linker | token.ent_kb_id | doc.ents, doc.sents, token.ent_iob, token.ent_type | nel_micro_f, nel_micro_r, nel_micro_p | False |
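
Without `pretty=True`, `analyze_pipes()` prints nothing and returns the same information as a dictionary. The sketch below assumes a blank pipeline with two built-in components, chosen only so that the example runs without a downloaded model.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# With the default pretty=False, the analysis is returned as a dict
analysis = nlp.analyze_pipes()

# "summary" maps each component name to what it assigns, requires, and scores
for name, info in analysis["summary"].items():
    print(name, "| assigns:", info["assigns"], "| requires:", info["requires"])

# "problems" maps each component name to the required attributes that no
# earlier component sets (an empty list means no problem was found)
print(analysis["problems"])
```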

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. The spaCy EntityRuler</span><br>

The `EntityRuler` is a component that adds named entities to a `Doc` container based on pattern matching. It can be used on its own or combined with the statistical `EntityRecognizer`.

### Types of Patterns

1.  **Phrase entity patterns**: For exact string matches.


In [None]:
    {"label": "ORG", "pattern": "Microsoft"}



2.  **Token entity patterns**: A list of dictionaries, where each dictionary describes one token.


In [None]:
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
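
The practical difference between the two pattern types shows up with casing. In the sketch below (the sentence is made up for illustration), the phrase pattern matches only the exact string, while the token pattern with `LOWER` ignores case.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: matches the exact text "Microsoft" only
    {"label": "ORG", "pattern": "Microsoft"},
    # Token pattern: compares the lowercase form of each token
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("microsoft and Microsoft both hire in SAN FRANCISCO today")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('Microsoft', 'ORG'), ('SAN FRANCISCO', 'GPE')]
```

The lowercase `microsoft` is skipped by the phrase pattern, whereas `SAN FRANCISCO` is still found by the token pattern.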



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Adding EntityRuler to the Pipeline</span><br>

To use the EntityRuler, we add it to the pipeline using `.add_pipe()` and then add patterns using `.add_patterns()`.



In [None]:
import spacy

# Create a blank English model
nlp = spacy.blank("en")

# Add the EntityRuler pipe
entity_ruler = nlp.add_pipe("entity_ruler")

# Define patterns
patterns = [
    {"label": "ORG", "pattern": "Microsoft"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
]

# Add patterns to the ruler
entity_ruler.add_patterns(patterns)

# Process text
doc = nlp("Microsoft is hiring software developer in San Francisco.")

# Print entities found
print([(ent.text, ent.label_) for ent in doc.ents])



**Expected Output:**


```
[('Microsoft', 'ORG'), ('San Francisco', 'GPE')]
```



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. EntityRuler in Action: Integration Strategies</span><br>

The `EntityRuler` integrates with other spaCy components. A key consideration is **where** in the pipeline you place the ruler.

### 1. Default Model Behavior (No EntityRuler)
Without the ruler, the statistical model might misclassify entities based on context.



In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Manhattan associates is a company in the U.S.")

# The model might mistake 'Manhattan' for a GPE (Location) instead of part of an ORG
print([(ent.text, ent.label_) for ent in doc.ents])
# Typical output (may vary with the model version): [('Manhattan', 'GPE'), ('U.S.', 'GPE')]



### 2. EntityRuler AFTER existing NER
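If added with `after='ner'`, the ruler runs once the NER model has already set its entities. By default the ruler does not overwrite existing entities, so it only adds matches that do not overlap with spans the NER model has already labeled.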



In [None]:
nlp = spacy.load("en_core_web_sm")

# Add ruler after NER
ruler = nlp.add_pipe("entity_ruler", after='ner')

patterns = [{"label": "ORG", "pattern": [{"LOWER": "manhattan"}, {"LOWER": "associates"}]}]
ruler.add_patterns(patterns)

doc = nlp("Manhattan associates is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Output might still be GPE because NER ran first and labeled 'Manhattan'
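
The same ordering effect can be reproduced without a statistical model. In the sketch below, a first `entity_ruler` stands in for the NER component (an assumption made purely for illustration), and the `overwrite_ents` setting of the second ruler decides whether it may replace entities that are already set. The helper function and component names are hypothetical.

```python
import spacy

TEXT = "Manhattan associates is a company in the U.S."

def entities_with_second_ruler(overwrite_ents):
    """Run two rulers in sequence and return the final entities."""
    nlp = spacy.blank("en")

    # Stand-in for NER: labels "Manhattan" as GPE before the second ruler runs
    first = nlp.add_pipe("entity_ruler", name="first_ruler")
    first.add_patterns([{"label": "GPE", "pattern": "Manhattan"}])

    # Second ruler runs afterwards, like a ruler added with after="ner"
    second = nlp.add_pipe(
        "entity_ruler",
        name="second_ruler",
        config={"overwrite_ents": overwrite_ents},
    )
    second.add_patterns([
        {"label": "ORG", "pattern": [{"LOWER": "manhattan"}, {"LOWER": "associates"}]}
    ])

    doc = nlp(TEXT)
    return [(ent.text, ent.label_) for ent in doc.ents]

kept = entities_with_second_ruler(False)
replaced = entities_with_second_ruler(True)
print(kept)      # [('Manhattan', 'GPE')]
print(replaced)  # [('Manhattan associates', 'ORG')]
```

Setting `overwrite_ents` to `True` is therefore an alternative to moving the ruler in front of the NER component, at the cost of discarding overlapping entities found earlier.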



### 3. EntityRuler BEFORE existing NER
If added `before='ner'`, the ruler's matches take precedence. The NER model will respect the existing entities found by the ruler.



In [None]:
nlp = spacy.load("en_core_web_sm")

# Add ruler BEFORE NER
ruler = nlp.add_pipe("entity_ruler", before='ner')

patterns = [{"label": "ORG", "pattern": [{"LOWER": "manhattan"}, {"LOWER": "associates"}]}]
ruler.add_patterns(patterns)

doc = nlp("Manhattan associates is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Typical output (may vary with the model version): [('Manhattan associates', 'ORG'), ('U.S.', 'GPE')]



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. Regular Expressions (RegEx) with spaCy</span><br>

Regular Expressions (RegEx) are used for complex string matching patterns. They are useful for rule-based information extraction.

### Strengths and Weaknesses

| Pros | Cons |
| :--- | :--- |
| Enables writing robust rules to retrieve information | Syntax is challenging for beginners |
| Can find many types of variance in strings | Requires knowledge of all ways a pattern may be mentioned |
| Runs fast | |
| Supported by most programming languages | |

### RegEx in Standard Python
Python uses the `re` library.



In [None]:
import re

# Define a pattern for phone numbers (XXX-XXX-XXXX)
pattern = r"\d{3}-\d{3}-\d{4}"
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."

# Find matches
iter_matches = re.finditer(pattern, text)

for match in iter_matches:
    start_char = match.start()
    end_char = match.end()
    print("Start character:", start_char, "| End character:", end_char, 
          "| Matching text:", text[start_char:end_char])

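The drawback listed in the table, that a pattern must anticipate every way something can be written, can be illustrated with a small variation. The sketch below uses only the standard `re` module; the sample text and the set of accepted separators are assumptions chosen for this example.

```python
import re

# Accept a hyphen, a dot, or a space between the digit groups;
# \b keeps the pattern from matching inside longer digit runs
pattern = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")

text = "Call 832-123-5555, 425.123.4567 or 212 555 0199."

matches = [match.group() for match in pattern.finditer(text)]
print(matches)  # ['832-123-5555', '425.123.4567', '212 555 0199']
```

A format not covered by the character class, such as `(832) 123-5555`, would still be missed until the pattern is extended.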


---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 9. RegEx Implementation in spaCy</span><br>

spaCy offers RegEx-like capabilities through its rule-based matching tools: the `Matcher` and `PhraseMatcher` classes and the `EntityRuler` pipeline component. Token attributes such as `SHAPE` can mimic RegEx behavior.



In [None]:
import spacy

text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
nlp = spacy.blank("en")

# Define patterns using token attributes
# "ddd" represents 3 digits, "dddd" represents 4 digits
patterns = [{
    "label": "PHONE_NUMBER",
    "pattern": [
        {"SHAPE": "ddd"},
        {"ORTH": "-"},
        {"SHAPE": "ddd"},
        {"ORTH": "-"},
        {"SHAPE": "dddd"}
    ]
}]

# Add EntityRuler
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Process text
doc = nlp(text)

# Print results
print([(ent.text, ent.label_) for ent in doc.ents])
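
Besides `SHAPE`, token patterns also accept a `REGEX` operator that applies a regular expression to a single token's attribute value. The following sketch is a variation on the example above under that assumption; note that each expression is checked against one token at a time, not against the whole text.

```python
import spacy

text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
nlp = spacy.blank("en")

# Each dictionary describes one token; REGEX is matched against the token text
patterns = [{
    "label": "PHONE_NUMBER",
    "pattern": [
        {"TEXT": {"REGEX": r"^\d{3}$"}},
        {"ORTH": "-"},
        {"TEXT": {"REGEX": r"^\d{3}$"}},
        {"ORTH": "-"},
        {"TEXT": {"REGEX": r"^\d{4}$"}},
    ],
}]

ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

doc = nlp(text)
phone_numbers = [(ent.text, ent.label_) for ent in doc.ents]
print(phone_numbers)
```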



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 10. The spaCy Matcher</span><br>

The `Matcher` class provides a readable, production-level alternative to complex RegEx strings. It matches sequences of tokens based on pattern dictionaries.

### Basic Matching


In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Good morning, this is our first day on campus.")

matcher = Matcher(nlp.vocab)

# Pattern: Case-insensitive match for "good" followed by "morning"
pattern = [{"LOWER": "good"}, {"LOWER": "morning"}]
matcher.add("morning_greeting", [pattern])

matches = matcher(doc)

for match_id, start, end in matches:
    print("Start token:", start, "| End token:", end, "| Matched text:", doc[start:end].text)
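
The `match_id` returned by the matcher is a hash of the rule name. A minimal sketch of recovering the readable name through `nlp.vocab.strings`, assuming a blank pipeline so that no model is required:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Good morning, this is our first day on campus.")

matcher = Matcher(nlp.vocab)
matcher.add("morning_greeting", [[{"LOWER": "good"}, {"LOWER": "morning"}]])

# Look up the rule name for each match_id in the vocabulary's string store
results = [
    (nlp.vocab.strings[match_id], start, end, doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(results)  # [('morning_greeting', 0, 2, 'Good morning')]
```

This is useful when several rules are registered on one matcher and the results need to be told apart.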



### Extended Syntax Support
The Matcher supports operators similar to Python's `in`, `not in`, and comparison operators.

| Operator | Value Type | Description |
| :--- | :--- | :--- |
| `IN` | any type | Attribute value is a member of a list |
| `NOT_IN` | any type | Attribute value is *not* a member of a list |
| `==`, `>=`, `<=`, `>`, `<` | int, float | Comparison operators for equality or inequality checks |

#### Example: Using the `IN` Operator
Matching both "Good morning" and "Good evening".



In [None]:
doc = nlp("Good morning and good evening.")
matcher = Matcher(nlp.vocab)

# Pattern using IN operator
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]

matcher.add("morning_greeting", [pattern])
matches = matcher(doc)

for match_id, start, end in matches:
    print("Start token:", start, "| End token:", end, "| Matched text:", doc[start:end].text)
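
The comparison operators from the table work on numeric token attributes such as `LENGTH`. The sketch below assumes a made-up sentence and an arbitrary threshold of ten characters.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Tokenization precedes classification.")

matcher = Matcher(nlp.vocab)

# Match any single token whose text is at least 10 characters long
matcher.add("long_token", [[{"LENGTH": {">=": 10}}]])

# Sort by start position so the result order is explicit
matches = sorted(matcher(doc), key=lambda match: match[1])
long_tokens = [doc[start:end].text for _, start, end in matches]
print(long_tokens)  # ['Tokenization', 'classification']
```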



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 11. The spaCy PhraseMatcher</span><br>

The `PhraseMatcher` is optimized for matching large lists of phrases in a text. It is generally more efficient than `Matcher` when dealing with exact string matches.

### Basic Phrase Matching


In [None]:
from spacy.matcher import PhraseMatcher
import spacy

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

# List of terms to match
terms = ["Bill Gates", "John Smith"]

# Convert terms to Doc objects
patterns = [nlp.make_doc(term) for term in terms]

# Add patterns
matcher.add("PeopleOfInterest", patterns)

doc = nlp("Bill Gates met John Smith for an important discussion regarding importance of AI.")

matches = matcher(doc)

for match_id, start, end in matches:
    print("Start token:", start, "| End token:", end, "| Matched text:", doc[start:end].text)
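
When the matched text is needed rather than the token offsets, the matcher can return `Span` objects directly via `as_spans=True` (available in spaCy v3). A short sketch, assuming a blank pipeline and a shortened sentence:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

terms = ["Bill Gates", "John Smith"]
matcher.add("PeopleOfInterest", [nlp.make_doc(term) for term in terms])

doc = nlp("Bill Gates met John Smith today.")

# as_spans=True returns Span objects labeled with the rule name;
# sorting by start position makes the order explicit
spans = sorted(matcher(doc, as_spans=True), key=lambda span: span.start)
people = [(span.text, span.label_) for span in spans]
print(people)  # [('Bill Gates', 'PeopleOfInterest'), ('John Smith', 'PeopleOfInterest')]
```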



### Using the `attr` Argument
We can configure the `PhraseMatcher` to match on specific token attributes, such as `LOWER` (case-insensitive) or `SHAPE`.



In [None]:
# Example 1: Case-insensitive matching
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["Government", "Investment"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("InvestmentTerms", patterns)

doc = nlp("It was interesting to the investment division of the government.")

# Matches "investment" and "government" despite the case differences
for match_id, start, end in matcher(doc):
    print("Start token:", start, "| End token:", end, "| Matched text:", doc[start:end].text)

# Example 2: Shape matching (IP addresses)
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
terms = ["110.0.0.0", "101.243.0.0"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("IPAddresses", patterns)

doc = nlp("The tracked IP address was 234.135.0.0.")

# Matches "234.135.0.0" because it has the same shape (ddd.ddd.d.d) as the second term
for match_id, start, end in matcher(doc):
    print("Start token:", start, "| End token:", end, "| Matched text:", doc[start:end].text)



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 12. Conclusion</span><br>

In this notebook, we explored the powerful capabilities of **spaCy pipelines**.

**Key Takeaways:**
1.  **Pipelines**: spaCy processes text through a sequence of pipes. We can analyze these using `nlp.analyze_pipes()`.
2.  **Customization**: We can add custom components like `sentencizer` or `EntityRuler` to blank or existing models.
3.  **EntityRuler**: This component allows for rule-based entity recognition. Its placement (`before` or `after` NER) is critical for determining which entities take precedence.
4.  **Pattern Matching**:
    *   **RegEx**: Useful but complex; spaCy supports RegEx-like logic via token attributes.
    *   **Matcher**: A readable, token-based matching engine supporting extended syntax (`IN`, comparison).
    *   **PhraseMatcher**: Highly efficient for matching large lists of exact phrases.

**Next Steps:**
*   Experiment with combining `EntityRuler` and statistical models to improve accuracy on domain-specific data.
*   Utilize `PhraseMatcher` for large-scale keyword extraction tasks.
*   Explore custom pipeline components to perform specialized text processing tasks.
