# [Chapter 3 (Processing Pipelines)](https://course.spacy.io/en/chapter3)
These are my notes for the third chapter of the advanced NLP [course](https://course.spacy.io/en/) provided by spaCy. 

In [2]:
import spacy

This chapter contains:
- What happens under the hood when you process text
- How to write your own components and add them to the pipeline
- How to add custom attributes

### 3.1: Processing Pipelines
A processing pipeline is a series of functions applied to a document to add attributes like part-of-speech tags, dependency labels, or named entities. 

#### What happens when you call nlp?
First, the tokenizer is applied to turn the string of text into a `Doc` object. Next, a series of pipeline components is applied to the `Doc` in order: for example, the tagger, then the parser, and then the entity recognizer. Finally, the processed `Doc` is returned.
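
As a rough illustration of these steps, the sketch below reproduces by hand what calling `nlp` on a text does: tokenize with `nlp.make_doc`, then call each component in order. It assumes a blank English pipeline with only the rule-based `sentencizer`, so that it runs without a downloaded model; that setup and the example text are my own choices, not part of the course.

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based sentencizer,
# chosen so the sketch runs without downloading a trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."

# Step 1: the tokenizer turns the string into a Doc
doc = nlp.make_doc(text)

# Step 2: every pipeline component is called on the Doc, in order
for name, component in nlp.pipeline:
    doc = component(doc)

manual_sentences = [sent.text for sent in doc.sents]
auto_sentences = [sent.text for sent in nlp(text).sents]
print(manual_sentences == auto_sentences)
```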

The pipeline for each model is defined in the model's `config.cfg` file. This file tells spaCy which components to instantiate and how to configure them. The built-in components that make predictions also need binary data, which is included in the pipeline package and is loaded into the component when you load the pipeline.
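
The same configuration can also be inspected on a loaded pipeline through `nlp.config`. The sketch below is a minimal example, again assuming a blank English pipeline with a `sentencizer` rather than a trained model, so the config it shows is much shorter than the one shipped with `en_core_web_md`.

```python
import spacy

# Assumption: blank pipeline with one rule-based component, so no model download is needed
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

config = nlp.config

# The [nlp] section lists the component names in pipeline order
pipeline_names = config["nlp"]["pipeline"]

# The [components] section records which factory creates each component
sentencizer_factory = config["components"]["sentencizer"]["factory"]

print(pipeline_names)
print(sentencizer_factory)
```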

In [6]:
nlp = spacy.load("en_core_web_md")
print(nlp.pipe_names) # the names of the components, in pipeline order
print(nlp.pipeline) # a list of (name, component) tuples; each component is the callable applied to the doc

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x000001B48EAF5040>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x000001B48EAF53A0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x000001B48E969D60>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x000001B48EB6B080>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x000001B48EB50E40>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x000001B48E2B70B0>)]


### 3.4: Custom Pipeline Components
Custom pipeline components let you add your own function to the spaCy pipeline and add more data to the `Doc` object. You can use your own custom functions/components to add custom data to the document and its tokens, or to update built-in attributes. 

Fundamentally, a component is a function/callable that takes a `Doc`, modifies it and returns it to be processed by the next component.

In [None]:
from spacy.language import Language

@Language.component("custom_component") # register it using this decorator
def custom_component_function(doc): # take the doc as an argument
    # Change/process the document 
    return doc # make sure to return the doc, so it's processed by the next component

nlp.add_pipe("custom_component") # add the component to the pipeline
# the above method takes at least one arg - the component's name. 
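
To make the template above concrete, here is a small sketch of a component that does something observable: it prints the number of tokens and passes the `Doc` on. The component name `length_logger` and the blank English pipeline are assumptions made for the example.

```python
import spacy
from spacy.language import Language


# "length_logger" is a hypothetical name picked for this example
@Language.component("length_logger")
def length_logger(doc):
    # Report how many tokens the doc contains
    print(f"Doc length: {len(doc)}")
    # Return the doc so the next component can process it
    return doc


# Assumption: a blank pipeline, so the example runs without a trained model
nlp = spacy.blank("en")
nlp.add_pipe("length_logger")

doc = nlp("Hello world!")
```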

In the `add_pipe` method, you can specify where in the pipeline the component should be added.

In [None]:
# Pick one of these; adding a component under a name that's already in the pipeline raises an error
nlp.add_pipe("custom_component", last=True) # add the component to the end - default behavior
nlp.add_pipe("custom_component", first=True) # add the component to the front
nlp.add_pipe("custom_component", before="ner") # add the component before another component - ner in this case
nlp.add_pipe("custom_component", after="tagger") # add the component after another component - tagger in this case
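
The effect of the positioning arguments can be checked by looking at `nlp.pipe_names` afterwards. The sketch below assumes a blank English pipeline with a `sentencizer` as the reference component, and a do-nothing component called `noop_demo` (a hypothetical name). It uses the `name` argument of `add_pipe` to add the same component several times under different names.

```python
import spacy
from spacy.language import Language


# "noop_demo" is a hypothetical component that leaves the doc unchanged
@Language.component("noop_demo")
def noop_demo(doc):
    return doc


# Assumption: blank pipeline with a sentencizer to position against
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

nlp.add_pipe("noop_demo", name="at_start", first=True)
nlp.add_pipe("noop_demo", name="after_sentencizer", after="sentencizer")
nlp.add_pipe("noop_demo", name="before_sentencizer", before="sentencizer")

order = list(nlp.pipe_names)
print(order)
# ['at_start', 'before_sentencizer', 'sentencizer', 'after_sentencizer']

# Re-using a name that is already in the pipeline is rejected
duplicate_rejected = False
try:
    nlp.add_pipe("noop_demo", name="at_start")
except ValueError:
    duplicate_rejected = True
```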

Note that a custom component can only modify the `Doc` it receives; you can't use it to tweak the other components in the pipeline.

### 3.8: Extension Attributes
You can set custom attributes on the `Doc`, `Token`, and `Span` objects. A value can be set once or computed dynamically. Custom attributes are available via the `._` property, which makes it clear that they are custom.

In [7]:
from spacy.tokens import Doc, Token, Span

# First, register the extension on the class it belongs to (Doc, Token or Span).
# The first arg is the attribute name; the keyword args define the default value
# or how the value is computed.
Doc.set_extension("title", default=None)
Token.set_extension("is_color", default=False)
Span.set_extension("has_color", default=False)

In [None]:
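# You can then set (and read) them via the ._ property:
doc = nlp("The sky is blue.")
token = doc[3] # "blue"
span = doc[1:4] # "sky is blue"
doc._.title = "My document"
token._.is_color = True
span._.has_color = False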

Three types of extensions:
1. Attribute Extensions
2. Property Extensions
3. Method Extensions

#### Attribute Extensions
Attribute extensions set a default value that can be overwritten.

In [9]:
from spacy.tokens import Token
# Set extension on the Token with default value
Token.set_extension("is_color", default=False, force=True) # force was added to override the previous extension

doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True

#### Property Extensions
Property extensions work like properties in Python: they define a getter function and an optional setter. The getter function is only called when you retrieve the attribute, so the value can be computed dynamically and can take other custom attributes into account. A getter takes one argument: the object (here, the token). You pass the function via the `getter` keyword when registering the extension.

In [11]:
# Define a getter function:
def get_is_color(token): # take the object as an argument
    colors = ["red", "yellow", "blue"]
    return token.text in colors

Token.set_extension("is_color", getter=get_is_color, force=True) # pass the function as a getter

doc = nlp("The sky is blue")
doc[3]._.is_color # access the custom attribute via _

True
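
The optional setter is not shown in the example above, so here is a minimal sketch of a property extension with both a getter and a setter. The setter takes the object and the new value. The attribute name `clean_title` and the choice to store the value in `doc.user_data` are assumptions made for this example, not something the course prescribes.

```python
import spacy
from spacy.tokens import Doc


def get_clean_title(doc):
    # Read the stored value, falling back to an empty string
    return doc.user_data.get("clean_title", "")


def set_clean_title(doc, value):
    # Normalize the value before storing it on the doc
    doc.user_data["clean_title"] = value.strip().title()


# "clean_title" is a hypothetical attribute name
Doc.set_extension("clean_title", getter=get_clean_title, setter=set_clean_title, force=True)

# Assumption: a blank pipeline is enough, since no predictions are needed
nlp = spacy.blank("en")
doc = nlp("The sky is blue")

doc._.clean_title = "  my document "
print(doc._.clean_title)  # My Document
```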

`Span` extensions should almost always use a getter. Otherwise, you would need to update every possible span in the document. Since you can have a lot of spans, it's best to calculate these things at runtime, so use a property extension.
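
A sketch of such a `Span` property extension is given below: the getter checks whether any token in the span is a color, so nothing has to be stored per span. The color list mirrors the token example above; the blank English pipeline is an assumption to keep the example self-contained.

```python
import spacy
from spacy.tokens import Span


def get_has_color(span):
    # True if at least one token in the span is a known color
    colors = ["red", "yellow", "blue"]
    return any(token.text in colors for token in span)


# force=True overrides an earlier "has_color" registration, if there is one
Span.set_extension("has_color", getter=get_has_color, force=True)

# Assumption: a blank pipeline is enough, since only tokenization is needed
nlp = spacy.blank("en")
doc = nlp("The sky is blue.")

print(doc[1:4]._.has_color, "-", doc[1:4].text)  # True - sky is blue
print(doc[0:2]._.has_color, "-", doc[0:2].text)  # False - The sky
```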


#### Method Extensions
Method extensions make the extension attribute a callable method, allowing you to pass one or more arguments to it and compute attribute values dynamically, based on a certain argument/setting. 

In [12]:
def has_token(doc, token_text): # first argument is the object still, and other args are also possible
    in_doc = token_text in [token.text for token in doc]
    return in_doc

Doc.set_extension("has_token", method=has_token) # register the extension, and pass the method via the method arg

doc = nlp("The sky is blue")
print(doc._.has_token("blue")) # it's a callable, and you only need to pass the other arguments, not the object
print(doc._.has_token("cloud"))

True
False
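
Method extensions work the same way on `Span` and `Token`. As a further sketch, the hypothetical `to_html` method below wraps a span's text in an HTML tag passed as the argument; the blank English pipeline is again an assumption to avoid needing a trained model.

```python
import spacy
from spacy.tokens import Span


def to_html(span, tag):
    # Wrap the span text in the given HTML tag
    return f"<{tag}>{span.text}</{tag}>"


# "to_html" is a hypothetical extension name; force=True allows re-running the cell
Span.set_extension("to_html", method=to_html, force=True)

nlp = spacy.blank("en")
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]

html = span._.to_html("strong")
print(html)  # <strong>Hello world</strong>
```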


### 3.13: Scaling and Performance
There are some tricks that can be used to optimize the performance of the components in the pipeline. 

If you need to process a lot of texts and create a lot of `Doc` objects in a row, the `nlp.pipe` method can speed this up significantly. It processes the texts as a stream and yields `Doc` objects, and it's much faster than calling `nlp` on each text because it batches the texts. Since `nlp.pipe` is a generator, wrap it in `list()` if you need a list of `Doc` objects.

In [None]:
# BAD:
docs = [nlp(text) for text in texts]
# GOOD:
docs = list(nlp.pipe(texts))
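
The streaming behavior can be seen by pulling docs from `nlp.pipe` one at a time. The sketch below assumes a blank English pipeline and three made-up texts; `batch_size` is an optional argument that controls how many texts are buffered per batch, and the value 2 is arbitrary.

```python
import spacy

# Assumption: a blank pipeline and made-up example texts
nlp = spacy.blank("en")
texts = ["First text.", "Second text.", "Third text."]

# nlp.pipe returns a generator; nothing is processed until it is consumed
stream = nlp.pipe(texts, batch_size=2)

first_doc = next(stream)   # pull a single Doc from the stream
remaining = list(stream)   # consume the rest into a list

print(first_doc.text)
print([doc.text for doc in remaining])
```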

`nlp.pipe` also supports passing metadata along with each text: if you set `as_tuples=True`, you can pass in `(text, context)` tuples, and it yields `(doc, context)` tuples:

In [13]:
data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context["page_number"], context["id"])

This is a text 15 1
And another text 16 2


This context/metadata can even be stored in custom attributes:

In [14]:
from spacy.tokens import Doc

Doc.set_extension("id", default=None)
Doc.set_extension("page_number", default=None)

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]
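
Once the metadata lives on the docs, it can be used later, for example to filter them. The sketch below repeats the loop but keeps the docs in a list; it assumes a blank English pipeline and registers the extensions with `force=True` so it can be re-run in the same session.

```python
import spacy
from spacy.tokens import Doc

# force=True avoids an error if the extensions were already registered
Doc.set_extension("id", default=None, force=True)
Doc.set_extension("page_number", default=None, force=True)

# Assumption: a blank pipeline is enough for this illustration
nlp = spacy.blank("en")

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

docs = []
for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]
    docs.append(doc)

# The metadata now travels with each doc
page_15 = [doc.text for doc in docs if doc._.page_number == 15]
print(page_15)  # ['This is a text']
```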

#### Using only the Tokenizer

Sometimes you already have a model loaded for other processing, but you only need the tokenizer for one particular text. Running the entire pipeline is unnecessarily slow, because you'd be getting predictions from the model that you don't need.

In [15]:
# BAD:
doc = nlp("Hello World!")
# GOOD:
doc = nlp.make_doc("Hello World!") # this will return a tokenized doc, but won't apply any other components
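
To check that `nlp.make_doc` really skips the components, you can compare the annotations on both docs. The sketch below assumes a blank English pipeline with a `sentencizer` standing in for the heavier components of a trained model, and uses `Doc.has_annotation` to test whether sentence boundaries were set.

```python
import spacy

# Assumption: the sentencizer stands in for the components of a trained pipeline
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "Hello World! How are you?"

tokens_only = nlp.make_doc(text)  # tokenizer only
full = nlp(text)                  # tokenizer + all components

# Same tokens, but only the full pipeline has set sentence boundaries
same_tokens = [t.text for t in tokens_only] == [t.text for t in full]
print(same_tokens)
print(tokens_only.has_annotation("SENT_START"))
print(full.has_annotation("SENT_START"))
```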

#### Disabling Pipeline Components
You can disable certain components of the pipeline in case you don't need them. 

In [None]:
# Disable tagger and parser
with nlp.select_pipes(disable=["tagger", "parser"]):
    doc = nlp(text)
    print(doc.ents)
# The components are only disabled inside the with block; they are re-enabled automatically afterwards
# select_pipes also accepts the enable keyword to keep only the listed components enabled
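
The sketch below shows the effect of the context manager on `nlp.pipe_names`, for both the `disable` and the `enable` keyword. It assumes a blank English pipeline with a `sentencizer` and a do-nothing component named `noop_select_demo` (a hypothetical name), instead of the tagger and parser of a trained model.

```python
import spacy
from spacy.language import Language


# "noop_select_demo" is a hypothetical component that leaves the doc unchanged
@Language.component("noop_select_demo")
def noop_select_demo(doc):
    return doc


# Assumption: blank pipeline with two lightweight components
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("noop_select_demo")

# disable: switch off the listed components inside the block
with nlp.select_pipes(disable=["noop_select_demo"]):
    names_inside = list(nlp.pipe_names)

# After the block, everything is enabled again
names_after = list(nlp.pipe_names)

# enable: keep only the listed components and disable the rest
with nlp.select_pipes(enable=["noop_select_demo"]):
    names_enable_only = list(nlp.pipe_names)

print(names_inside, names_after, names_enable_only)
```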