# Processing pipelines

This notebook covers spaCy's processing pipelines. A pipeline is a series of functions applied to a doc to add attributes like part-of-speech tags, dependency labels or named entities.

What does spaCy do when you call nlp on a string of text?

doc = nlp("this is a test sentence.")
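
As a rough sketch of what that call does: the tokenizer first turns the text into a Doc, and each pipeline component is then applied to that Doc in order. The snippet below reproduces those steps by hand. It assumes a blank English pipeline with only the built-in "sentencizer" component, chosen here so that no trained model needs to be installed; the example text is illustrative.

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based sentencizer,
# used only so the sketch runs without a trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a test sentence. Here is another one."

# Step 1: the tokenizer creates the Doc
manual_doc = nlp.make_doc(text)
# Step 2: each component receives the Doc and returns it, in pipeline order
for name, component in nlp.pipeline:
    manual_doc = component(manual_doc)

# Calling nlp directly should give the same sentence boundaries
doc = nlp(text)
manual_sents = [sent.text for sent in manual_doc.sents]
direct_sents = [sent.text for sent in doc.sents]
print(manual_sents)
print(direct_sents)
```

With a trained pipeline such as en_core_web_sm, the same loop would run the tagger, parser and the other components listed in nlp.pipe_names.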

## Inspecting the pipeline

Let’s inspect the small English pipeline!

* Print the names of the pipeline components using nlp.pipe_names.
* Print the full pipeline of (name, component) tuples using nlp.pipeline.

In [None]:
import spacy

# Load the en_core_web_sm pipeline
nlp = ____

# Print the names of the pipeline components
print(____.____)

# Print the full pipeline of (name, component) tuples
print(____.____)

In [1]:
import spacy

# Load the en_core_web_sm pipeline
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7ae28fe2fca0>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7ae28fe2fbe0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7ae34064f450>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7ae28fdbbb00>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7ae2900d6b40>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7ae340cbaff0>)]


## Custom pipeline components

Now that you know how spaCy's pipeline works, let's take a look at another very powerful feature: custom pipeline components.

Custom pipeline components let you add your own function to the spaCy pipeline that is executed when you call the nlp object on a text – for example, to modify the doc and add more data to it.



## Simple components

The example shows a custom component that prints the number of tokens in a document. Can you complete it?

* Complete the component function with the doc’s length.
* Add the "length_component" to the existing pipeline as the first component.
* Try out the new pipeline and process any text with the nlp object – for example “This is a sentence.”.

In [None]:
import spacy
from spacy.language import Language

# Define the custom component
@Language.component("length_component")
def length_component_function(doc):
    # Get the doc's length
    doc_length = ____
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    ____


# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
____.____(____, ____=____)
print(nlp.pipe_names)

# Process a text
doc = ____

In [None]:
import spacy
from spacy.language import Language

# Define the custom component
@Language.component("length_component")
def length_component_function(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc


# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("This is a sentence.")
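
The position of a component is controlled by keyword arguments of nlp.add_pipe: first, last, before and after. The following minimal sketch shows two of them. It assumes a blank English pipeline with the built-in "sentencizer"; the component names "demo_first" and "demo_middle" are hypothetical and exist only for this illustration.

```python
import spacy
from spacy.language import Language


# Hypothetical no-op components, used only to show placement
@Language.component("demo_first")
def demo_first(doc):
    return doc


@Language.component("demo_middle")
def demo_middle(doc):
    return doc


# Assumption: a blank pipeline so no trained model is required
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# first=True puts the component at the start of the pipeline
nlp.add_pipe("demo_first", first=True)
# after="..." places the component right after the named one
nlp.add_pipe("demo_middle", after="demo_first")

pipe_order = list(nlp.pipe_names)
print(pipe_order)
```

Without any of these arguments, add_pipe appends the component at the end of the pipeline.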

## Complex components

In this exercise, you’ll be writing a custom component that uses the PhraseMatcher to find animal names in the document and adds the matched spans to the doc.ents. A PhraseMatcher with the animal patterns has already been created as the variable matcher.

* Define the custom component and apply the matcher to the doc.
* Create a Span for each match, assign the label ID for "ANIMAL" and overwrite the doc.ents with the new spans.
* Add the new component to the pipeline after the "ner" component.
* Process the text and print the entity text and entity label for the entities in doc.ents.

In [None]:
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component_function(doc):
    # Apply the matcher to the doc
    matches = ____
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(____, ____, ___, label=____) for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


In [None]:
# Add the component to the pipeline after the "ner" component
____.____(____, ____=____)
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(____, ____) for ent in ____])

In [2]:
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component_function(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
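
Note that assigning to doc.ents replaces whatever entities the "ner" component predicted. If the existing entities should be kept, one possible approach is to combine both lists and resolve overlaps with spacy.util.filter_spans. The sketch below is an illustration, not part of the exercise: it uses a blank pipeline, sets a "PERSON" entity by hand as a stand-in for a model prediction, and the function name add_animals_keep_existing is hypothetical.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: blank pipeline, so entities are set manually below
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", [nlp.make_doc(text) for text in ["Golden Retriever", "cat"]])


def add_animals_keep_existing(doc):
    # Find animal names and wrap them in labelled spans
    matches = matcher(doc)
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Merge with the existing entities; filter_spans drops overlapping spans,
    # preferring the longest one
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc


doc = nlp("Anna has a cat and a Golden Retriever")
# Stand-in for an entity that a trained "ner" component might have predicted
doc.ents = [Span(doc, 0, 1, label="PERSON")]
doc = add_animals_keep_existing(doc)

result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```

The filtering step matters because doc.ents does not accept overlapping spans.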


## Extension attributes

In this lesson, you'll learn how to add custom attributes to the Doc, Token and Span objects to store custom data.

## Step 1

* Use Token.set_extension to register "is_country" (default False).
* Update it for "Pakistan" and print it for all tokens.

In [None]:
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Register the Token extension attribute "is_country" with the default value False
____.____(____, ____=____)

# Process the text and set the is_country attribute to True for the token "Pakistan"
doc = nlp("I live in Pakistan.")
____ = True

# Print the token text and the is_country attribute for all tokens
print([(____, ____) for token in doc])

In [1]:
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Register the Token extension attribute "is_country" with the default value False
Token.set_extension("is_country", default=False)

# Process the text and set the is_country attribute to True for the token "Pakistan"
doc = nlp("I live in Pakistan.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Pakistan', True), ('.', False)]
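
In a notebook, re-running a cell that registers an extension raises an error, because the name is already registered on the class. Two common ways around this are sketched below: checking with has_extension first, or passing force=True to overwrite the existing registration. Both are shown together only for illustration; one of them is enough in practice.

```python
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Option 1: only register the extension if it does not exist yet
if not Token.has_extension("is_country"):
    Token.set_extension("is_country", default=False)

# Option 2: overwrite any existing registration with the same name
Token.set_extension("is_country", default=False, force=True)

doc = nlp("I live in Pakistan.")
doc[3]._.is_country = True
flags = [(token.text, token._.is_country) for token in doc]
print(flags)
```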


## Step 2
* Use Token.set_extension to register "reversed" (getter function get_reversed).
* Print its value for each token.

In [None]:
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]


# Register the Token property extension "reversed" with the getter get_reversed
____.____(____, ____=____)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for ____ in ____:
    print("reversed:", ____)

In [2]:
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]


# Register the Token property extension "reversed" with the getter get_reversed
Token.set_extension("reversed", getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print("reversed:", token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


## Setting extension attributes - Complex Case

Let’s try setting some more complex attributes using getters and method extensions.

## Part 1
* Complete the get_has_number function.
* Use Doc.set_extension to register "has_number" (getter get_has_number) and print its value.



In [None]:
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Define the getter function
def get_has_number(doc):
    # Return True if token.like_num is True for any token in the doc
    return any(____ for token in doc)


# Register the Doc property extension "has_number" with the getter get_has_number
____.____(____, ____=____)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2024.")
print("has_number:", ____)

In [3]:
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Define the getter function
def get_has_number(doc):
    # Return True if token.like_num is True for any token in the doc
    return any(token.like_num for token in doc)


# Register the Doc property extension "has_number" with the getter get_has_number
Doc.set_extension("has_number", getter=get_has_number)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2024.")
print("has_number:", doc._.has_number)

has_number: True


## Part 2
* Use Span.set_extension to register "to_html" (method to_html).
* Call it on doc[0:2] with the tag "strong".

In [None]:
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Define the method
def to_html(span, tag):
    # Wrap the span text in an HTML tag and return it
    return f"<{tag}>{span.text}</{tag}>"


# Register the Span method extension "to_html" with the method to_html
____.____(____, ____=____)

# Process the text and call the to_html method on the span with the tag name "strong"
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(____)

In [4]:
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Define the method
def to_html(span, tag):
    # Wrap the span text in an HTML tag and return it
    return f"<{tag}>{span.text}</{tag}>"


# Register the Span method extension "to_html" with the method to_html
Span.set_extension("to_html", method=to_html)

# Process the text and call the to_html method on the span with the tag name "strong"
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html("strong"))

<strong>Hello world</strong>


## Entities and extension

In this exercise, you’ll combine custom extension attributes with the statistical predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.

* Complete the get_wikipedia_url getter so it only returns the URL if the span’s label is in the list of labels.
* Set the Span extension "wikipedia_url" using the getter get_wikipedia_url.
* Iterate over the entities in the doc and output their Wikipedia URL.

In [None]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if ____ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension wikipedia_url using the getter get_wikipedia_url
____.____(____, ____=____)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(____, ____)

In [6]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

fifty years None
first None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie


## Components with extensions

Extension attributes are especially powerful if they’re combined with custom pipeline components. In this exercise, you’ll write a pipeline component that finds country names and a custom extension attribute that returns a country’s capital, if available.

A phrase matcher with all countries is available as the variable matcher. A dictionary of countries mapped to their capital cities is available as the variable CAPITALS.

* Complete the countries_component_function and create a Span with the label "GPE" (geopolitical entity) for all matches.
* Add the component to the pipeline.
* Register the Span extension attribute "capital" with the getter get_capital.
* Process the text and print the entity text, entity label and entity capital for each entity span in doc.ents.

In [1]:
import json
import spacy
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

with open("countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())

with open("capitals.json", encoding="utf8") as f:
    CAPITALS = json.loads(f.read())

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))


@Language.component("countries_component")
def countries_component_function(doc):
    # Create an entity Span with the label "GPE" for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc


# Add the component to the pipeline
nlp.add_pipe("countries_component")
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute "capital" with the getter get_capital
Span.set_extension("capital", getter=get_capital)

# Process the text and print the entity text, label and capital attributes

doc = nlp("Pakistan may help China to expand its global out-reach")

print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

['countries_component']
[('Pakistan', 'GPE', 'Islamabad'), ('China', 'GPE', 'Beijing')]
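
The cell above depends on countries.json and capitals.json. For readers without those files, here is a self-contained sketch of the same idea. The two-entry COUNTRIES list and CAPITALS dictionary are hypothetical stand-ins for the file contents, and the component name "demo_countries_component" is chosen only to avoid clashing with the one registered above.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Hypothetical stand-ins for the contents of countries.json and capitals.json
COUNTRIES = ["Pakistan", "China"]
CAPITALS = {"Pakistan": "Islamabad", "China": "Beijing"}

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))


@Language.component("demo_countries_component")
def demo_countries_component(doc):
    # Create an entity Span with the label "GPE" for every country match
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc


nlp.add_pipe("demo_countries_component")


def get_capital(span):
    # Look up the span text; returns None for unknown countries
    return CAPITALS.get(span.text)


# force=True keeps the cell re-runnable if "capital" is already registered
Span.set_extension("capital", getter=get_capital, force=True)

doc = nlp("Pakistan may help China to expand its global out-reach")
result = [(ent.text, ent.label_, ent._.capital) for ent in doc.ents]
print(result)
```

Because the capital is computed by a getter, it is looked up only when ent._.capital is accessed, not when the component runs.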


## Processing stream

In this exercise, you’ll be using nlp.pipe for more efficient text processing. The nlp object has already been created for you. A list of tweets about a popular American fast food chain is available as the variable TEXTS.

## Part 1
* Rewrite the example to use nlp.pipe. Instead of iterating over the texts and processing them, iterate over the doc objects yielded by nlp.pipe.

In [None]:
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the adjectives
for text in TEXTS:
    doc = nlp(text)
    print([token.text for token in doc if token.pos_ == "ADJ"])

In [2]:
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the adjectives
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == "ADJ"])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible']


## Part 2
* Rewrite the example to use nlp.pipe. Don’t forget to call list() around the result to turn it into a list.

In [3]:
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts with nlp.pipe and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) () (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) () (This morning,)
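
To make the relationship between the two approaches concrete, the sketch below checks that nlp.pipe gives the same tokens as calling nlp on each text in a loop. nlp.pipe processes the texts as a stream in batches, which is usually faster on larger collections. The texts here are hypothetical placeholders for TEXTS, a blank pipeline is assumed so no model is needed, and batch_size=2 is an arbitrary value for illustration.

```python
import spacy

# Assumption: blank pipeline and made-up texts, for illustration only
nlp = spacy.blank("en")
texts = ["First short text.", "Second short text.", "A third one."]

# nlp.pipe yields Doc objects; list() collects them
docs = list(nlp.pipe(texts, batch_size=2))
piped_tokens = [[token.text for token in doc] for doc in docs]

# Processing the texts one by one gives the same tokens
looped_tokens = [[token.text for token in nlp(text)] for text in texts]
print(piped_tokens == looped_tokens)
```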


## Selective processing

In this exercise, you’ll use the nlp.make_doc and nlp.select_pipes methods to only run selected components when processing a text.

## Part 1
* Rewrite the code to only tokenize the text using nlp.make_doc.

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Only tokenize the text
doc = nlp.make_doc(text)
print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


## Part 2
* Disable the tagger and lemmatizer using the nlp.select_pipes method.
* Process the text and print all entities in the doc.

In [11]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Disable the tagger and lemmatizer
with nlp.select_pipes(disable=["tagger", "lemmatizer"]):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

(Chick, American, College Park, Georgia)
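
The effect of nlp.select_pipes is limited to the with block: once the block ends, the disabled components are restored. The method also accepts an enable argument that keeps only the listed components active. The sketch below illustrates both on a blank pipeline; the "demo_noop" component is hypothetical and does nothing.

```python
import spacy
from spacy.language import Language


# Hypothetical component that returns the doc unchanged
@Language.component("demo_noop")
def demo_noop(doc):
    return doc


# Assumption: blank pipeline so no trained model is required
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("demo_noop")

# Disable one component by name
with nlp.select_pipes(disable=["sentencizer"]):
    names_disabled = list(nlp.pipe_names)
    doc = nlp("First sentence. Second sentence.")
    # No sentence boundaries are set because the sentencizer did not run
    has_sents = doc.has_annotation("SENT_START")

# Keep only the listed component active
with nlp.select_pipes(enable=["sentencizer"]):
    names_enabled = list(nlp.pipe_names)

# Outside the blocks the full pipeline is active again
names_after = list(nlp.pipe_names)
print(names_disabled, names_enabled, names_after)
```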
