# <center> Processing Pipelines

A processing pipeline is a series of functions applied to a Doc to add attributes such as part-of-speech tags, dependency labels, or named entities.

spaCy provides several built-in pipeline components, summarized in the table below.

#### What happens when you call nlp?
    
> doc = nlp('This is a sentence.') 
    
- The tokenizer is applied to turn the string of text into a Doc object.
- A series of pipeline components is then applied to the Doc in order (tagger -> parser -> ner), and the processed Doc is returned.
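
As a rough mental model, calling nlp can be pictured as a tokenizer followed by a loop over the components. The sketch below is plain Python with hypothetical names (`ToyDoc`, `ToyLanguage`, `toy_tokenizer`, `toy_tagger`, `toy_ner`); it only illustrates the control flow and is not spaCy's implementation.

```python
# Toy stand-ins (hypothetical names); this is NOT how spaCy is implemented.
class ToyDoc:
    def __init__(self, words):
        self.words = words
        self.tags = []
        self.ents = []


def toy_tokenizer(text):
    # Naive tokenizer: split on whitespace and detach full stops
    return ToyDoc(text.replace('.', ' .').split())


def toy_tagger(doc):
    # Assign a trivial tag to every word
    doc.tags = ['PUNCT' if word == '.' else 'WORD' for word in doc.words]
    return doc


def toy_ner(doc):
    # Pretend every capitalized word is an entity
    doc.ents = [word for word in doc.words if word[0].isupper()]
    return doc


class ToyLanguage:
    def __init__(self, tokenizer, pipeline):
        self.tokenizer = tokenizer
        self.pipeline = pipeline  # list of (name, component) tuples

    def __call__(self, text):
        # Step 1: the tokenizer turns the string into a doc
        doc = self.tokenizer(text)
        # Step 2: each component receives the doc, modifies it and returns it
        for name, component in self.pipeline:
            doc = component(doc)
        return doc


toy_nlp = ToyLanguage(toy_tokenizer, [('tagger', toy_tagger), ('ner', toy_ner)])
doc = toy_nlp('This is a sentence.')
print(doc.words)
print(doc.tags)
```

Each component sees the result of the previous one, which is why the order of the pipeline matters.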
    
<img src="https://d33wubrfki0l68.cloudfront.net/16b2ccafeefd6d547171afa23f9ac62f159e353d/48b91/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg" width="800" height="800">    

|Name|Description|Creates|Notes|
|---|---|---|---|
|tagger|Part-of-speech tagger|Token.tag||
|parser|Dependency parser|Token.dep, Token.head, Doc.sents, Doc.noun_chunks|Responsible for detecting sentences and base noun phrases|
|ner|Named entity recognizer|Doc.ents, Token.ent_iob, Token.ent_type|Adds detected entities to the doc and indicates whether a token is part of an entity|
|textcat|Text classifier|Doc.cats|Not included in any of the pre-trained models by default|
    
   
    
#### Under the hood
- The pipeline is defined, in order, in the model's meta.json
- Built-in components need binary data to make predictions
    
#### Pipeline attributes
- nlp.pipe_names : list of pipeline component names
- nlp.pipeline : list of (name, component) tuples

In [1]:
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
nlp.pipeline

['tagger', 'parser', 'ner']


[('tagger', <spacy.pipeline.pipes.Tagger at 0x25c1ba74308>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x25c1a6fffa8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x25c1ba79048>)]

# <center> Custom pipeline components

Custom components let you add your own function to the spaCy pipeline; the function is executed when you call the nlp object on a text.

#### Why custom components?
    
- Make a function execute automatically when you call nlp
- Add your own metadata to documents and tokens
- Update built-in attributes like doc.ents

#### Anatomy of a component

- A function that takes a doc, modifies it and returns it
- Can be added using the nlp.add_pipe method

|Argument|Description|Example|
|---|---|---|
|last|If True, add last|nlp.add_pipe(component, last=True)|
|first|If True, add first|nlp.add_pipe(component, first=True)|
|before|Add before component|nlp.add_pipe(component, before='ner')|
|after|Add after component|nlp.add_pipe(component, after='tagger')|

Example of a simple component:

In [2]:
# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc

In [3]:
# Load the small English model
nlp = spacy.load('en_core_web_sm')
  
# Add the component first in the pipeline and print the pipe names
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

['length_component', 'tagger', 'parser', 'ner']


In [4]:
# Process a text
doc = nlp('This is a sentence')

This document is 4 tokens long.
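
The effect of the positional arguments in the table above can be pictured as plain list insertion. The helper `insert_component` below is a hypothetical, simplified sketch operating on a list of names. It does not validate conflicting arguments and is not the code spaCy runs.

```python
def insert_component(pipe_names, name, first=False, last=False,
                     before=None, after=None):
    """Return a new list of names with `name` inserted (toy sketch)."""
    names = list(pipe_names)
    if first:
        names.insert(0, name)
    elif before is not None:
        # Insert directly in front of the named component
        names.insert(names.index(before), name)
    elif after is not None:
        # Insert directly behind the named component
        names.insert(names.index(after) + 1, name)
    else:
        # Default (and last=True): append at the end
        names.append(name)
    return names


base = ['tagger', 'parser', 'ner']
print(insert_component(base, 'custom', first=True))
print(insert_component(base, 'custom', before='ner'))
print(insert_component(base, 'custom', after='tagger'))
print(insert_component(base, 'custom'))
```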


#### Complex components
Use PhraseMatcher to find animal names in the document and add the matched spans to doc.ents.

In [5]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

matcher = PhraseMatcher(nlp.vocab)
ANIMALS=['cat', 'dog', 'Golden Retriever', 'turtle', 'tortoise','rabbit']

patterns = list(nlp.pipe(ANIMALS))
matcher.add('ANIMALS', None, *patterns)

This document is 1 tokens long.
This document is 1 tokens long.
This document is 2 tokens long.
This document is 1 tokens long.
This document is 1 tokens long.
This document is 1 tokens long.


In [6]:
# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label='ANIMAL')
             for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc

# Add the component to the pipeline after the 'ner' component 
nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

['length_component', 'tagger', 'parser', 'ner', 'animal_component']


In [7]:
# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

This document is 8 tokens long.
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
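
To make the `(start, end)` offsets used for the spans more concrete, here is a minimal pure-Python sketch of exact phrase matching over a token list. The function `find_phrases` is hypothetical and far simpler than PhraseMatcher: it assumes whitespace tokenization and exact, case-sensitive matches.

```python
def find_phrases(tokens, phrases):
    """Return sorted (start, end) token offsets of exact phrase matches."""
    matches = []
    for phrase in phrases:
        length = len(phrase)
        # Slide a window of the phrase's length over the tokens
        for start in range(len(tokens) - length + 1):
            if tokens[start:start + length] == phrase:
                matches.append((start, start + length))
    return sorted(matches)


tokens = 'I have a cat and a Golden Retriever'.split()
phrases = [['cat'], ['Golden', 'Retriever']]
matches = find_phrases(tokens, phrases)
print(matches)
print([' '.join(tokens[start:end]) for start, end in matches])
```

As in the component above, `end` is exclusive, so `tokens[start:end]` recovers the matched phrase.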


# <center> Extension attributes
    
- Add custom metadata to documents, tokens and spans
- Accessible via the ._ property
- Registered on the global Doc, Token or Span class using the set_extension method
    
#### Extension attribute types
    
1. Attribute extensions
    - Set a default value that can be overwritten

In [8]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Register the Token extension attribute 'is_country' with the default value False
Token.set_extension('is_country', default=False)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

This document is 5 tokens long.
[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


2. Property extensions
    - Work like properties in Python
    - Define a getter and an optional setter function
    - Getter only called when you retrieve the attribute value
    - Span extensions should almost always use a getter

In [9]:
# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]
  
# Register the Token property extension 'reversed' with the getter get_reversed
Token.set_extension('reversed', getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print('reversed:', token._.reversed)

This document is 9 tokens long.
reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


In [10]:
# Define the getter function
def get_has_number(doc):
    # Return if any of the tokens in the doc return True for token.like_num
    return any(token.like_num for token in doc)

# Register the Doc property extension 'has_number' with the getter get_has_number
Doc.set_extension('has_number', getter=get_has_number)

# Process the text and check the custom has_number attribute 
doc = nlp("The museum closed for five years in 2012.")
print('has_number:', doc._.has_number)

This document is 9 tokens long.
has_number: True
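
Because property extensions behave like Python properties, a plain-Python analogy may help. The class `ToyToken` below is hypothetical and unrelated to spaCy's internals. It only shows that a getter runs on each access and therefore always reflects the current state of the object.

```python
class ToyToken:
    def __init__(self, text):
        self.text = text
        self.getter_calls = 0  # counts how often the getter ran

    @property
    def reversed(self):
        # Computed on access, never stored
        self.getter_calls += 1
        return self.text[::-1]


token = ToyToken('Spain')
calls_before = token.getter_calls  # nothing computed yet
first = token.reversed             # getter runs now
token.text = 'Chile'
second = token.reversed            # getter runs again with the new text
print(calls_before, first, second, token.getter_calls)
```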


3. Method extensions
    - Make an extension attribute callable
    - Assign a function that becomes available as an object method
    - Lets you pass arguments to the extension function

In [11]:
# Define the method
def to_html(span, tag):
    # Wrap the span text in a HTML tag and return it
    return '<{tag}>{text}</{tag}>'.format(tag=tag, text=span.text)

# Register the Span method extension 'to_html' with the method to_html
Span.set_extension('to_html', method=to_html)

# Process the text and call the to_html method on the span with the tag name 'strong'
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html('strong'))

This document is 8 tokens long.
<strong>Hello world</strong>


#### Entities and extensions
Combine custom extension attributes with the model's predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.

In [12]:
# Load the small English model
nlp = spacy.load('en_core_web_sm')

In [13]:
def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ('PERSON', 'ORG', 'GPE', 'LOC'):
        entity_text = span.text.replace(' ', '_')
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url)

doc = nlp("In over fifty years from his very first recordings right through to his last album, David Bowie was at the vanguard of contemporary culture.")
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

fifty years None
first None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie
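
Entity texts can contain characters that are not safe in a URL, such as accented letters. One possible variant, sketched here under the assumption that the same search endpoint is used, builds the query string with the standard library instead of replacing spaces by hand. The helper name `build_wikipedia_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_wikipedia_search_url(entity_text):
    # urlencode percent-encodes unsafe characters and turns spaces into '+'
    query = urlencode({'search': entity_text})
    return 'https://en.wikipedia.org/w/index.php?' + query


print(build_wikipedia_search_url('David Bowie'))
print(build_wikipedia_search_url('Saint Barthélemy'))
```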


#### Components with extensions
Extension attributes are especially powerful when combined with custom pipeline components. The example below defines a pipeline component that finds country names and a custom extension attribute that returns a country's capital, if available.

In [14]:
COUNTRIES=['Afghanistan', 'Åland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia (Plurinational State of)', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'United States Minor Outlying Islands', 'Virgin Islands (British)', 'Virgin Islands (U.S.)', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cabo Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo (Democratic Republic of the)', 'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Holy See', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', "Côte d'Ivoire", 'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia (the former Yugoslav Republic of)', 'Madagascar', 'Malawi', 
'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia (Federated States of)', 'Moldova (Republic of)', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', "Korea (Democratic People's Republic of)", 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine, State of', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Kosovo', 'Réunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and the South Sandwich Islands', 'Korea (Republic of)', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-Leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom of Great Britain and Northern Ireland', 'United States of America', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']

In [15]:
import json
# Load the country-to-capital dictionary from a JSON file
with open('datasets/capitals.json') as json_file:
    capitals=json.load(json_file)
capitals['Czech Republic']

'Prague'

In [16]:
matcher = PhraseMatcher(nlp.vocab)

patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRIES', None, *patterns)

In [17]:
def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    doc.ents = [Span(doc, start, end, label='GPE')
                for match_id, start, end in matcher(doc)]
    return doc

# Add the component to the pipeline
nlp.add_pipe(countries_component)
print(nlp.pipe_names)

['tagger', 'parser', 'ner', 'countries_component']


In [18]:
# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: capitals.get(span.text)


# Register the Span extension attribute 'capital' with the getter get_capital 
Span.set_extension('capital',getter=get_capital)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]


# <center> Scaling and performance
    
#### Processing large volumes of text
- Use the nlp.pipe method
- Processes texts as a stream and yields Doc objects
- Much faster than calling nlp on each text because it batches up the texts
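
The idea of streaming with batching can be sketched in plain Python. The generator `toy_pipe` and the batch function `upper_batch` are hypothetical stand-ins. They only illustrate how texts are consumed lazily in groups and are not how nlp.pipe is implemented.

```python
from itertools import islice


def toy_pipe(texts, process_batch, batch_size=2):
    """Yield results one by one while processing the input in batches."""
    iterator = iter(texts)
    while True:
        # Pull the next batch lazily from the input stream
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        for result in process_batch(batch):
            yield result


batch_sizes = []


def upper_batch(batch):
    # Record the batch size so the grouping is visible
    batch_sizes.append(len(batch))
    return [text.upper() for text in batch]


results = list(toy_pipe(['a', 'b', 'c', 'd', 'e'], upper_batch, batch_size=2))
print(results)
print(batch_sizes)
```

Five texts are handled in three calls instead of five, which is where the saving comes from when each call has a fixed overhead.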

    BAD:
>docs = [nlp(text) for text in LOTS_OF_TEXTS]

    GOOD:
>docs = list(nlp.pipe(LOTS_OF_TEXTS))
    
#### Passing in context
- Setting as_tuples=True on nlp.pipe lets you pass in (text, context) tuples
- Yields (doc, context) tuples
- Useful for associating metadata with the doc
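
The tuple-passing behaviour can be pictured with a small generator. `toy_pipe_with_context` is a hypothetical sketch in which `str.split` stands in for the processing step. It is not spaCy code.

```python
def toy_pipe_with_context(pairs, process):
    """Process the text of each (text, context) pair, pass context through."""
    for text, context in pairs:
        yield process(text), context


data = [
    ('It was the best of times.', {'book': 'A Tale of Two Cities'}),
    ('Call me Ishmael.', {'book': 'Moby-Dick'}),
]

results = list(toy_pipe_with_context(data, str.split))
for tokens, context in results:
    print(tokens, context['book'])
```

The context object is never inspected by the processing step, so it can hold any metadata you need later.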
    
#### Using only the tokenizer
- Don't run the whole pipeline!
- Use nlp.make_doc to turn a text into a Doc object

    BAD:
>doc = nlp("Hello world")

    GOOD:
>doc = nlp.make_doc("Hello world!")
     
#### Disabling pipeline components
- Use nlp.disable_pipes to temporarily disable one or more pipes
    
```
    # Disable the tagger and parser
    with nlp.disable_pipes('tagger', 'parser'):
        # Process the text and print the entities
        doc = nlp(text)
        print(doc.ents)
```    
    
- Restores the disabled pipes after the with block
- Only runs the remaining components inside the block
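
The disable-then-restore behaviour follows the usual context manager pattern. The class `ToyPipeline` below is a hypothetical, minimal sketch of that pattern using a list of names. It is not spaCy's implementation.

```python
from contextlib import contextmanager


class ToyPipeline:
    def __init__(self, names):
        self.pipe_names = list(names)

    @contextmanager
    def disable_pipes(self, *names):
        # Remember the full pipeline, then drop the disabled names
        original = list(self.pipe_names)
        self.pipe_names = [name for name in original if name not in names]
        try:
            yield self
        finally:
            # Restore the pipeline even if the block raised an error
            self.pipe_names = original


toy = ToyPipeline(['tagger', 'parser', 'ner'])
with toy.disable_pipes('tagger', 'parser'):
    inside = list(toy.pipe_names)
after = list(toy.pipe_names)
print(inside)
print(after)
```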

In [19]:
TWEETS=['McDonalds is my favorite restaurant.',
 'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..',
 'People really still eat McDonalds :(',
 'The McDonalds in Spain has chicken wings. My heart is so happy ',
 '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P',
 'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D',
 'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it']

In [20]:
# Load the small English model
nlp = spacy.load('en_core_web_sm')

#### Examples without nlp.pipe

In [21]:
print('BAD PERFORMANCE')
# Process the texts and print the adjectives
for text in TWEETS:
    doc = nlp(text)
    print([token.text for token in doc if token.pos_ == 'ADJ'])

BAD PERFORMANCE
['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
['BAD']
['terrible', 'payin']


In [22]:
# Process the texts and print the entities
docs = [nlp(text) for text in TWEETS]
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) (@McDonalds,) (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) () ()


In [23]:
people = ['David Bowie', 'Angela Merkel', 'Lady Gaga']

# Create a list of patterns for the PhraseMatcher
patterns = [nlp(person) for person in people]
patterns

[David Bowie, Angela Merkel, Lady Gaga]

#### Examples using nlp.pipe

In [24]:
print('CLEAN PERFORMANCE')
# Process the texts and print the adjectives
for doc in nlp.pipe(TWEETS):
    print([token.text for token in doc if token.pos_ == 'ADJ'])

CLEAN PERFORMANCE
['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
['BAD']
['terrible', 'payin']


In [25]:
# Process the texts and print the entities
docs = list(nlp.pipe(TWEETS))
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) (@McDonalds,) (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) () ()


In [26]:
people = ['David Bowie', 'Angela Merkel', 'Lady Gaga']

# Create a list of patterns for the PhraseMatcher
patterns = list(nlp.pipe(people))
patterns

[David Bowie, Angela Merkel, Lady Gaga]

#### Processing data with context

In [27]:
DATA=[('One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.',
  {'author': 'Franz Kafka', 'book': 'Metamorphosis'}),
 ("I know not all that may be coming, but be it what it will, I'll go to it laughing.",
  {'author': 'Herman Melville', 'book': 'Moby-Dick or, The Whale'}),
 ('It was the best of times, it was the worst of times.',
  {'author': 'Charles Dickens', 'book': 'A Tale of Two Cities'}),
 ('The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.',
  {'author': 'Jack Kerouac', 'book': 'On the Road'}),
 ('It was a bright cold day in April, and the clocks were striking thirteen.',
  {'author': 'George Orwell', 'book': '1984'}),
 ('Nowadays people know the price of everything and the value of nothing.',
  {'author': 'Oscar Wilde', 'book': 'The Picture Of Dorian Gray'})]

In [28]:
# Import the Doc class and register the extensions 'author' and 'book'
from spacy.tokens import Doc

# Register the Doc extension 'author' (default None)
Doc.set_extension('author',default=None)

# Register the Doc extension 'book' (default None)
Doc.set_extension('book',default=None)

for doc, context in nlp.pipe(DATA,as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']
    
    # Print the text and custom attribute data
    print(doc.text, '\n', "— '{}' by {}".format(doc._.book, doc._.author), '\n')

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

Nowadays people know the price of everything and the value of nothing. 
 — 'The Picture Of Dorian Gray' by Oscar Wilde 



#### Selective processing
Use the nlp.make_doc and nlp.disable_pipes methods to run only selected components when processing a text.

In [29]:
text = "Chick-fil-A is an American fast food restaurant chain headquartered in the city of College Park, Georgia, specializing in chicken sandwiches."

# Only tokenize the text
doc = nlp.make_doc(text)

print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


In [30]:
# Disable the tagger and parser
with nlp.disable_pipes('tagger','parser'):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

(Chick-fil-A, American, College Park, Georgia)
