# Pipelines

When `nlp` is called on a string, e.g. `doc = nlp("This is a sentence.")`, the string is first tokenized into a `Doc` object. spaCy then applies each component in the pipeline to the `Doc`, in order, e.g. tagger, then parser, then ner.


| Name      | Description             | Creates                                                   |
| --------- | ----------------------- | --------------------------------------------------------- |
| `tagger`  | POS tagger              | `Token.tag`                                               |
| `parser`  | dependency parser       | `Token.dep`, `Token.head`, `Doc.sents`, `Doc.noun_chunks` |
| `ner`     | named entity recognizer | `Doc.ents`, `Token.ent_iob`, `Token.ent_type`             |
| `textcat` | text classifier         | `Doc.cats`                                                |

In [14]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

print(nlp.pipe_names)

['tagger', 'parser', 'ner']


In [3]:
print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x117b8b438>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x1193fcee8>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x1193fcf48>)]
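
The in-order application described above can be sketched in plain Python. This is a toy illustration of the control flow only, not spaCy's implementation: `run_pipeline`, `toy_tokenizer`, and `make_component` are hypothetical names, and a dict stands in for the `Doc`.

```python
# Toy sketch of "tokenize, then apply each component in order".
# NOT spaCy's code: all names here are hypothetical stand-ins.

def toy_tokenizer(text):
    # A plain dict stands in for spaCy's Doc (assumption for illustration).
    return {"tokens": text.split(), "applied": []}


def make_component(name):
    # Build a component that records its own name on the doc and returns the doc.
    def component(doc):
        doc["applied"].append(name)
        return doc
    return component


def run_pipeline(text, tokenizer, components):
    # Step 1: the string becomes a doc.
    doc = tokenizer(text)
    # Step 2: each (name, component) pair is applied in list order.
    for _name, component in components:
        doc = component(doc)
    return doc


components = [(n, make_component(n)) for n in ("tagger", "parser", "ner")]
doc = run_pipeline("This is a sentence.", toy_tokenizer, components)
print(doc["applied"])  # ['tagger', 'parser', 'ner']
```

Each component receives the doc returned by the previous one, which is why a component must always return the doc.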


## Custom pipeline components

Custom pipeline components let you add your own function to the spaCy pipeline. The function is executed when you call the `nlp` object on a text, for example to modify the `Doc` and add more data to it.

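A component is a function that takes a `Doc`, optionally modifies it, and returns it. In spaCy v2, which these examples use, you pass the function to `nlp.add_pipe`: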

```
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)
```

To control where the component is placed in the pipeline, pass one of the following keyword arguments (by default, the component is added last):

| Argument | Description          | Example                                   |
| -------- | -------------------- | ----------------------------------------- |
| `last`   | If True, add last    | `nlp.add_pipe(component, last=True)`      |
| `first`  | If True, add first   | `nlp.add_pipe(component, first=True)`     |
| `before` | Add before component | `nlp.add_pipe(component, before='ner')`   |
| `after`  | Add after component  | `nlp.add_pipe(component, after='tagger')` |
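
To make the four placement options in the table concrete, here is a small sketch that works on a plain list of component names. `insert_component` is a hypothetical helper written for this note, not part of spaCy; it only mimics where the new name would end up.

```python
# Toy model of the placement arguments, operating on a list of names.
# `insert_component` is a hypothetical helper, not a spaCy function.

def insert_component(pipe_names, name, first=False, last=False, before=None, after=None):
    # This sketch allows at most one placement argument at a time.
    given = sum([bool(first), bool(last), before is not None, after is not None])
    if given > 1:
        raise ValueError("Use only one of: first, last, before, after")

    names = list(pipe_names)  # copy, so the input list is left unchanged
    if first:
        names.insert(0, name)
    elif before is not None:
        names.insert(names.index(before), name)
    elif after is not None:
        names.insert(names.index(after) + 1, name)
    else:
        # `last=True` and "no argument given" both append to the end.
        names.append(name)
    return names


base = ["tagger", "parser", "ner"]
print(insert_component(base, "component", first=True))
print(insert_component(base, "component", before="ner"))
print(insert_component(base, "component", after="tagger"))
print(insert_component(base, "component"))
```

The four calls place `component` at the start, directly before `ner`, directly after `tagger`, and at the end, respectively.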

In [21]:
# Example of adding custom component

# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print(f'Doc length: {len(doc)}')
    
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

Pipeline: ['custom_component', 'tagger', 'parser', 'ner']


In [22]:
doc = nlp("Hello world!")

Doc length: 3


## Simple Example - Doc length

In [20]:
import spacy

# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc


# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("Oh, hello there!")

['length_component', 'tagger', 'parser', 'ner']
This document is 5 tokens long.


## Complex Example - `PhraseMatcher` + add to doc.ents

Write a custom component that uses the `PhraseMatcher` to find animal names in the document and adds the matched spans to `doc.ents`.

Steps:

1. Define the custom component and apply the matcher to the doc.
2. Create a `Span` for each match with the label `'ANIMAL'`, and overwrite `doc.ents` with the new spans.
3. Add the new component to the pipeline after the `'ner'` component.
4. Process the text and print the entity text and entity label for each entity in `doc.ents`.

In [25]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)

matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]


In [26]:
# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(animal_component, after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
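
Because `animal_component` overwrites `doc.ents`, any entities set earlier by `ner` are discarded. If you wanted to keep them, one option is to merge the two lists, bearing in mind that `doc.ents` does not accept overlapping spans. The sketch below shows that merge logic on plain `(start, end, label)` tuples rather than real `Span` objects; `overlaps` and `merge_entities` are hypothetical helpers, and the "existing" entities are made-up values, not model output.

```python
# Toy merge of entity spans represented as (start, end, label) tuples,
# with token offsets where `end` is exclusive. Hypothetical helpers,
# not spaCy functions; a real component would build Span objects instead.

def overlaps(a, b):
    # Two half-open ranges overlap if each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]


def merge_entities(existing, new):
    # New spans take priority: drop any existing span that overlaps one of them.
    kept = [e for e in existing if not any(overlaps(e, n) for n in new)]
    # Return the combined spans in document order.
    return sorted(kept + list(new))


# Made-up values for "I have a cat and a Golden Retriever in Berlin".
existing = [(6, 8, "ORG"), (9, 10, "GPE")]       # pretend earlier entities
new = [(3, 4, "ANIMAL"), (6, 8, "ANIMAL")]       # pretend matcher results

merged = merge_entities(existing, new)
print(merged)  # [(3, 4, 'ANIMAL'), (6, 8, 'ANIMAL'), (9, 10, 'GPE')]
```

Here the overlapping `ORG` span is replaced by the `ANIMAL` match, while the non-overlapping `GPE` span is kept.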
