### Processing Pipelines

spaCy ships with the following built-in pipeline components:
1. The part-of-speech tagger sets the `token.tag` attribute.
2. The dependency parser sets the `token.dep` and `token.head` attributes. It is also responsible for detecting sentences and base noun phrases, also known as noun chunks.
3. The named entity recognizer adds the detected entities to the `doc.ents` property. It also sets entity type attributes on the tokens that indicate whether a token is part of an entity.
4. The text classifier sets category labels that apply to the whole text and adds them to the `doc.cats` property. Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default, but you can use it to train your own system.

In [2]:
# Import spaCy and load the small English model (this creates the nlp object)
import spacy
from spacy.lang.en import English

nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("This is a table.")

print(nlp.pipe_names)

['tagger', 'parser', 'ner']


In [3]:
print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x126a2fc90>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x1062f47c0>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x1269b60c0>)]


In [9]:
for token in doc:
    print(token.text, token.pos_)

This DET
is VERB
a DET
table NOUN
. PUNCT


#### Custom Pipeline Components

Custom pipeline components let you add your own function to the spaCy pipeline. The function is executed automatically whenever you call the nlp object on a text, for example to modify the Doc and add more data to it. \
Fundamentally, a pipeline component is a function or callable that takes a doc, modifies it and returns it, so it can be processed by the next component in the pipeline.
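
The "takes a doc, modifies it and returns it" contract can be illustrated without spaCy at all. The following is a minimal sketch in plain Python, not spaCy's implementation: the doc is assumed to be a plain dict, and `tokenize`, `length_component`, `is_short_component` and `run_pipeline` are hypothetical names made up for this illustration.

```python
# Toy stand-in for a processing pipeline. This is NOT how spaCy is implemented;
# a "doc" here is just a dict, whereas real components receive a spaCy Doc.

def tokenize(text):
    """Create a toy doc from raw text by splitting on whitespace."""
    return {"text": text, "tokens": text.split()}

def length_component(doc):
    """Store the number of tokens on the doc, then hand the doc back."""
    doc["length"] = len(doc["tokens"])
    return doc

def is_short_component(doc):
    """Rely on the value set by the previous component."""
    doc["is_short"] = doc["length"] < 5
    return doc

# The pipeline is an ordered list of (name, callable) pairs.
pipeline = [("length", length_component), ("is_short", is_short_component)]

def run_pipeline(text, pipeline):
    """Tokenize the text, then pass the doc through each component in order."""
    doc = tokenize(text)
    for name, component in pipeline:
        doc = component(doc)
    return doc

result = run_pipeline("Hello big world", pipeline)
print(result["length"])    # 3
print(result["is_short"])  # True
```

Because every component returns the doc, the second component can read what the first one wrote. Swapping the order of the two entries would raise a `KeyError`, which is why the position of a component in the pipeline matters.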

In [12]:
# Example: add a pipeline component that prints the number of tokens in the doc

def custom_component(doc):
    print('Doc Length: ', len(doc))
    return doc

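# Add the component first in the pipeline
# (spaCy v2 API: the function itself is passed; spaCy v3 expects a registered component name)
nlp.add_pipe(custom_component, first=True)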

print(nlp.pipe_names)

['custom_component', 'tagger', 'parser', 'ner']


In [13]:
doc = nlp("Hello world!")

Doc Length:  3


In [14]:
# Example: Length of Doc

def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc


# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("This is a sentence.")

['length_component', 'tagger', 'parser', 'ner']
This document is 5 tokens long.
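
The pipe names printed above suggest two things: the component is listed under its function name, and `first=True` places it ahead of the built-in components. A rough sketch of that behaviour in plain Python follows. It is an assumption-laden illustration rather than spaCy's actual `add_pipe`: `add_pipe_sketch` and the placeholder components are hypothetical, and only the `first` option is modelled.

```python
# Toy model of adding a component to an ordered pipeline.
# NOT spaCy's add_pipe; only the first=True behaviour is imitated here.

def tagger(doc):
    """Placeholder standing in for a built-in component."""
    return doc

def parser(doc):
    """Placeholder standing in for a built-in component."""
    return doc

def length_component(doc):
    """Custom component: report the doc length and return the doc."""
    print("This document is {} tokens long.".format(len(doc)))
    return doc

def add_pipe_sketch(pipeline, component, first=False):
    """Register the component under its function name, at the start or the end."""
    entry = (component.__name__, component)
    if first:
        pipeline.insert(0, entry)
    else:
        pipeline.append(entry)

pipeline = [("tagger", tagger), ("parser", parser)]
add_pipe_sketch(pipeline, length_component, first=True)

pipe_names = [name for name, _ in pipeline]
print(pipe_names)  # ['length_component', 'tagger', 'parser']
```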
