## Processing pipelines

In [35]:
import spacy
import random

In [2]:
print(spacy.__version__)

2.1.0


### Inspecting the pipeline

In [2]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

In [5]:
# Print the names of the pipeline components
print(nlp.pipe_names)

['tagger', 'parser', 'ner']


In [6]:
# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f1563938438>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f15623b8b88>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f15623b8be8>)]
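As a rough mental model, calling `nlp` on a text tokenizes it and then passes the result through each `(name, component)` pair in order. The plain-Python sketch below only illustrates that ordering and is not spaCy's actual implementation; `tokenize`, `make_component` and `run_pipeline` are hypothetical names, and the "doc" is just a dict so the sketch runs without a model installed.

```python
# Simplified mental model of a processing pipeline; this is not spaCy's code.

def tokenize(text):
    # Stand-in for the tokenizer: split on whitespace
    return {'tokens': text.split(), 'applied': []}

def make_component(name):
    # Build a toy component that records its own name on the doc
    def component(doc):
        doc['applied'].append(name)
        return doc  # every component must return the doc
    return component

pipeline = [(name, make_component(name)) for name in ('tagger', 'parser', 'ner')]

def run_pipeline(text, pipeline):
    # Tokenize first, then apply each component in pipeline order
    doc = tokenize(text)
    for name, component in pipeline:
        doc = component(doc)
    return doc

doc = run_pipeline('This is a sentence', pipeline)
print(doc['applied'])
```

The key point is that each component receives the object returned by the previous one, which is why a custom component must always return the doc.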


### Custom components

Custom components are useful for adding custom values to documents, tokens and spans, and for customizing `doc.ents`.

In [9]:
# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)

    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc

# Load the small English model
nlp = spacy.load('en_core_web_sm')
# Add the component first in the pipeline
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

In [14]:
doc = nlp('This is a sentence')

This document is 4 tokens long.


In [12]:
# Write a custom component that uses the PhraseMatcher to find animal names
# in the document and adds the matched spans to doc.ents
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Load the small English model
nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Docs and add them to the matcher (spaCy v2 signature)
animal_patterns = list(nlp.pipe(['Golden Retriever', 'cat', 'turtle', 'Rattus norvegicus']))
matcher.add('ANIMAL', None, *animal_patterns)

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)

    # Create a Span for each match and assign the label 'ANIMAL'
    doc.ents = [Span(doc, start, end, label='ANIMAL')
                for match_id, start, end in matches]
    return doc

# Add the component to the pipeline after the 'ner' component 
nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp('I have a cat and a Golden Retriever')
print([(ent.text, ent.label_) for ent in doc.ents])

['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
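Note that assigning to `doc.ents` replaces whatever the `ner` component predicted. One possible way to keep both is to merge the two lists and drop conflicting candidates, since entity spans may not overlap. The sketch below works on plain `(start, end, label)` token-offset tuples rather than `Span` objects; `merge_entities` is a hypothetical helper, the sample predictions are made up for illustration, and the rule that existing entities win on conflict is an assumption rather than spaCy behaviour.

```python
def merge_entities(existing, new):
    """Return existing entities plus every new entity that overlaps none of them.

    Entities are (start, end, label) tuples with an exclusive end offset.
    """
    merged = list(existing)
    for start, end, label in new:
        overlaps = any(start < e_end and e_start < end
                       for e_start, e_end, _ in merged)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)

# Token offsets for "Anna has a cat and a Golden Retriever" (hypothetical predictions)
model_ents = [(0, 1, 'PERSON')]
matcher_ents = [(3, 4, 'ANIMAL'), (6, 8, 'ANIMAL')]
print(merge_entities(model_ents, matcher_ents))
```

Inside a real component, the same idea would amount to filtering the new spans and then assigning `doc.ents = list(doc.ents) + new_spans`.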


### Extension attributes

We can add custom metadata to documents, tokens and spans. Custom attributes are accessible via the `._` property and must be registered on the global `Doc`, `Token` or `Span` class using the `set_extension` method.

There are three types of extensions:

1. Attribute extensions - set a default value that can be overwritten.
2. Property extensions - define a getter and an optional setter. The getter is only called when you *retrieve* the attribute value. Span extensions should almost always use a getter.
3. Method extensions - assign a **function** that becomes available as an object method, which lets you pass **arguments** to the extension function.
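The three extension types map loosely onto familiar Python constructs: a plain attribute with a default, a property and a method. The class below is only an analogy in plain Python (`TokenLike` is a hypothetical stand-in, not a spaCy class); it also counts getter calls to show that a property is computed only when it is retrieved.

```python
class TokenLike:
    """Hypothetical stand-in for a token; not a spaCy class."""

    getter_calls = 0  # counts how often the property getter runs

    def __init__(self, text):
        self.text = text
        self.is_country = False  # 1. attribute: default value, can be overwritten

    @property
    def reversed(self):
        # 2. property: computed only when the value is retrieved
        TokenLike.getter_calls += 1
        return self.text[::-1]

    def to_html(self, tag):
        # 3. method: accepts arguments
        return '<{tag}>{text}</{tag}>'.format(tag=tag, text=self.text)

token = TokenLike('Spain')
token.is_country = True            # overwrite the default
print(TokenLike.getter_calls)      # the getter has not run yet
print(token.reversed, TokenLike.getter_calls)
print(token.to_html('strong'))
```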

In [27]:
name = 'Mani'
print(''.join(reversed(name)))

inaM


In [30]:
from spacy.tokens import Token

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    # Alternative: return token.text[::-1]
    return ''.join(reversed(token.text))

# Register the Token extension attribute 'is_country' with the default value False
Token.set_extension('is_country', default=False, force=True)
# Register the Token property extension 'reversed' with the getter get_reversed
Token.set_extension('reversed', getter=get_reversed, force=True)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp('I live in Spain.')
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

# Process the text
doc = nlp("All generalizations are false, including this one.")

# Print the reversed attribute for each token
for token in doc:
    print('reversed:', token._.reversed)


[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]
reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


In [33]:
from spacy.tokens import Token, Doc, Span

# Define the getter function
def get_has_number(doc):
    # Return True if any token in the doc returns True for token.like_num
    return any(token.like_num for token in doc)

# Register the Doc property extension 'has_number' with the getter get_has_number
Doc.set_extension('has_number', getter=get_has_number, force=True)

# Process the text and check the custom has_number attribute 
doc = nlp("The museum closed for five years in 2012.")
print('has_number:', doc._.has_number)

has_number: True


In [34]:
from spacy.tokens import Token, Doc, Span

# Define the method
def to_html(span, tag):
    # Wrap the span text in an HTML tag and return it
    return '<{tag}>{text}</{tag}>'.format(tag=tag, text=span.text)

# Register the Span method extension 'to_html' with the method to_html
# (force=True allows the cell to be re-run without an error)
Span.set_extension('to_html', method=to_html, force=True)

# Process the text and call the to_html method on the span with the tag name 'strong'
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html('strong'))


<strong>Hello world</strong>


### Entities and extensions

Combine custom extension attributes with the model's predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.

In [37]:
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')

# Define the getter method
def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    # ('LOC' is the location label used by the English models)
    if span.label_ in ('PERSON', 'ORG', 'GPE', 'LOC'):
        entity_text = span.text.replace(' ', '_')
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url, force=True)

doc = nlp("In over fifty years from his very first recordings right through to his last album, David Bowie was at the vanguard of contemporary culture.")
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

over fifty years None
first None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie
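The getter above builds the URL by replacing spaces with underscores, which is enough for names like "David Bowie". For entity texts that contain characters such as `&` or `?`, one option is to let the standard library encode the query string. This is a sketch of that alternative, not the method used above; `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

WIKIPEDIA_SEARCH = 'https://en.wikipedia.org/w/index.php'

def build_search_url(entity_text):
    # urlencode escapes reserved characters and encodes spaces as '+'
    return WIKIPEDIA_SEARCH + '?' + urlencode({'search': entity_text})

print(build_search_url('David Bowie'))
print(build_search_url('AT&T'))
```

The function could be used as the body of the getter by passing it `span.text`.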


### Components with extensions

Below is an example of adding structured data to the spaCy pipeline.

In [40]:
COUNTRIES = ['Afghanistan', 'Åland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia (Plurinational State of)', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'United States Minor Outlying Islands', 'Virgin Islands (British)', 'Virgin Islands (U.S.)', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cabo Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo (Democratic Republic of the)', 'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Holy See', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', "Côte d'Ivoire", 'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia (the former Yugoslav Republic of)', 'Madagascar', 'Malawi', 
'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia (Federated States of)', 'Moldova (Republic of)', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', "Korea (Democratic People's Republic of)", 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine, State of', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Kosovo', 'Réunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and the South Sandwich Islands', 'Korea (Republic of)', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-Leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom of Great Britain and Northern Ireland', 'United States of America', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']

In [41]:
global COUNTRIES

In [43]:
capitals = {'Åland Islands': 'Mariehamn', 'Lesotho': 'Maseru', 'Trinidad and Tobago': 'Port of Spain', 'Yemen': "Sana'a", 'Christmas Island': 'Flying Fish Cove', 'Bouvet Island': '', 'Kyrgyzstan': 'Bishkek', 'New Caledonia': 'Nouméa', 'Hong Kong': 'City of Victoria', 'Uzbekistan': 'Tashkent', 'Cabo Verde': 'Praia', 'Bulgaria': 'Sofia', 'Bahrain': 'Manama', 'Kuwait': 'Kuwait City', 'Bhutan': 'Thimphu', 'Niue': 'Alofi', 'Mauritania': 'Nouakchott', 'French Polynesia': 'Papeetē', 'Sweden': 'Stockholm', 'Latvia': 'Riga', 'Marshall Islands': 'Majuro', 'Sierra Leone': 'Freetown', 'Senegal': 'Dakar', 'Cambodia': 'Phnom Penh', 'Kiribati': 'South Tarawa', 'Kazakhstan': 'Astana', 'San Marino': 'City of San Marino', 'Madagascar': 'Antananarivo', 'Virgin Islands (U.S.)': 'Charlotte Amalie', 'Bermuda': 'Hamilton', 'South Sudan': 'Juba', 'Turkmenistan': 'Ashgabat', "Côte d'Ivoire": 'Yamoussoukro', 'Eritrea': 'Asmara', 'Samoa': 'Apia', 'Puerto Rico': 'San Juan', 'Zambia': 'Lusaka', 'Israel': 'Jerusalem', 'Somalia': 'Mogadishu', 'Timor-Leste': 'Dili', 'Central African Republic': 'Bangui', 'Peru': 'Lima', 'Guinea': 'Conakry', 'Palestine, State of': 'Ramallah', 'Mozambique': 'Maputo', 'Guam': 'Hagåtña', 'Isle of Man': 'Douglas', 'Czech Republic': 'Prague', 'Angola': 'Luanda', 'Luxembourg': 'Luxembourg', 'Nicaragua': 'Managua', 'Liberia': 'Monrovia', 'Bangladesh': 'Dhaka', 'Jordan': 'Amman', 'Romania': 'Bucharest', 'Slovakia': 'Bratislava', 'Egypt': 'Cairo', 'French Southern Territories': 'Port-aux-Français', 'Maldives': 'Malé', 'Chile': 'Santiago', 'Uganda': 'Kampala', 'Anguilla': 'The Valley', 'Bonaire, Sint Eustatius and Saba': 'Kralendijk', 'Syrian Arab Republic': 'Damascus', 'Ethiopia': 'Addis Ababa', 'Paraguay': 'Asunción', 'American Samoa': 'Pago Pago', 'Iraq': 'Baghdad', 'Slovenia': 'Ljubljana', 'Malta': 'Valletta', 'Norfolk Island': 'Kingston', 'Finland': 'Helsinki', 'Tajikistan': 'Dushanbe', 'Palau': 'Ngerulmud', 'Nigeria': 'Abuja', 'Fiji': 'Suva', 'Honduras': 'Tegucigalpa', 
'Qatar': 'Doha', 'New Zealand': 'Wellington', 'Grenada': "St. George's", 'Kenya': 'Nairobi', 'Mauritius': 'Port Louis', 'Cocos (Keeling) Islands': 'West Island', 'Spain': 'Madrid', 'Curaçao': 'Willemstad', 'Guinea-Bissau': 'Bissau', 'Republic of Kosovo': 'Pristina', 'Macedonia (the former Yugoslav Republic of)': 'Skopje', 'Iran (Islamic Republic of)': 'Tehran', 'Poland': 'Warsaw', 'Falkland Islands (Malvinas)': 'Stanley', 'Gibraltar': 'Gibraltar', 'United Kingdom of Great Britain and Northern Ireland': 'London', 'China': 'Beijing', 'Ireland': 'Dublin', 'Sint Maarten (Dutch part)': 'Philipsburg', 'Greenland': 'Nuuk', 'United Arab Emirates': 'Abu Dhabi', 'Nauru': 'Yaren', 'Italy': 'Rome', 'Congo': 'Brazzaville', 'Zimbabwe': 'Harare', 'Réunion': 'Saint-Denis', 'Tanzania, United Republic of': 'Dodoma', "Lao People's Democratic Republic": 'Vientiane', 'Svalbard and Jan Mayen': 'Longyearbyen', 'Brazil': 'Brasília', 'Western Sahara': 'El Aaiún', 'Taiwan': 'Taipei', 'Heard Island and McDonald Islands': '', 'Liechtenstein': 'Vaduz', 'Burkina Faso': 'Ouagadougou', 'Philippines': 'Manila', 'Togo': 'Lomé', 'Singapore': 'Singapore', 'Cayman Islands': 'George Town', 'Saint Lucia': 'Castries', 'Saudi Arabia': 'Riyadh', 'Monaco': 'Monaco', 'Cameroon': 'Yaoundé', 'Wallis and Futuna': 'Mata-Utu', 'South Africa': 'Pretoria', 'Costa Rica': 'San José', 'Mexico': 'Mexico City', 'Guadeloupe': 'Basse-Terre', 'Serbia': 'Belgrade', 'Saint Vincent and the Grenadines': 'Kingstown', 'Papua New Guinea': 'Port Moresby', 'United States Minor Outlying Islands': '', 'Rwanda': 'Kigali', 'Suriname': 'Paramaribo', 'Russian Federation': 'Moscow', 'Lebanon': 'Beirut', 'Saint Helena, Ascension and Tristan da Cunha': 'Jamestown', 'Malawi': 'Lilongwe', 'Pakistan': 'Islamabad', 'Namibia': 'Windhoek', 'Niger': 'Niamey', 'France': 'Paris', 'Solomon Islands': 'Honiara', 'Ghana': 'Accra', 'Sudan': 'Khartoum', 'United States of America': 'Washington, D.C.', 'Greece': 'Athens', 'Botswana': 'Gaborone', 'Belgium': 
'Brussels', 'Faroe Islands': 'Tórshavn', 'Ukraine': 'Kiev', 'Moldova (Republic of)': 'Chișinău', 'Oman': 'Muscat', "Korea (Democratic People's Republic of)": 'Pyongyang', 'Albania': 'Tirana', 'India': 'New Delhi', 'Viet Nam': 'Hanoi', 'Mongolia': 'Ulan Bator', 'Afghanistan': 'Kabul', 'Tokelau': 'Fakaofo', 'Montenegro': 'Podgorica', 'Colombia': 'Bogotá', 'Equatorial Guinea': 'Malabo', 'Croatia': 'Zagreb', 'Cuba': 'Havana', 'Panama': 'Panama City', 'Cyprus': 'Nicosia', 'Burundi': 'Bujumbura', 'Canada': 'Ottawa', 'Morocco': 'Rabat', 'Virgin Islands (British)': 'Road Town', 'Indonesia': 'Jakarta', 'Tunisia': 'Tunis', 'Ecuador': 'Quito', 'Libya': 'Tripoli', 'Barbados': 'Bridgetown', 'Seychelles': 'Victoria', 'Brunei Darussalam': 'Bandar Seri Begawan', 'Lithuania': 'Vilnius', 'Congo (Democratic Republic of the)': 'Kinshasa', 'Bolivia (Plurinational State of)': 'Sucre', 'Norway': 'Oslo', 'Swaziland': 'Lobamba', 'Australia': 'Canberra', 'Benin': 'Porto-Novo', 'Mayotte': 'Mamoudzou', 'Turkey': 'Ankara', 'Holy See': 'Rome', 'Dominican Republic': 'Santo Domingo', 'Andorra': 'Andorra la Vella', 'Dominica': 'Roseau', 'Montserrat': 'Plymouth', 'Vanuatu': 'Port Vila', 'Jersey': 'Saint Helier', 'Gabon': 'Libreville', 'Bosnia and Herzegovina': 'Sarajevo', 'Antarctica': '', 'Japan': 'Tokyo', 'Turks and Caicos Islands': 'Cockburn Town', 'Saint Pierre and Miquelon': 'Saint-Pierre', 'Thailand': 'Bangkok', 'Bahamas': 'Nassau', 'Sri Lanka': 'Colombo', 'Tonga': "Nuku'alofa", 'Korea (Republic of)': 'Seoul', 'Argentina': 'Buenos Aires', 'British Indian Ocean Territory': 'Diego Garcia', 'Iceland': 'Reykjavík', 'El Salvador': 'San Salvador', 'Germany': 'Berlin', 'Pitcairn': 'Adamstown', 'Comoros': 'Moroni', 'Azerbaijan': 'Baku', 'Switzerland': 'Bern', 'Georgia': 'Tbilisi', 'Northern Mariana Islands': 'Saipan', 'Malaysia': 'Kuala Lumpur', 'Aruba': 'Oranjestad', 'Uruguay': 'Montevideo', 'Sao Tome and Principe': 'São Tomé', 'Venezuela (Bolivarian Republic of)': 'Caracas',
'Saint Kitts and Nevis': 'Basseterre', 'South Georgia and the South Sandwich Islands': 'King Edward Point', 'Jamaica': 'Kingston', 'Belarus': 'Minsk', 'Saint Martin (French part)': 'Marigot', 'Portugal': 'Lisbon', 'Guyana': 'Georgetown', 'Martinique': 'Fort-de-France', 'French Guiana': 'Cayenne', 'Cook Islands': 'Avarua', 'Tuvalu': 'Funafuti', 'Estonia': 'Tallinn', 'Antigua and Barbuda': "Saint John's", 'Guernsey': 'St. Peter Port', 'Haiti': 'Port-au-Prince', 'Mali': 'Bamako', 'Gambia': 'Banjul', 'Micronesia (Federated States of)': 'Palikir', 'Armenia': 'Yerevan', 'Netherlands': 'Amsterdam', 'Austria': 'Vienna', 'Saint Barthélemy': 'Gustavia', 'Djibouti': 'Djibouti', 'Myanmar': 'Naypyidaw', 'Hungary': 'Budapest', 'Belize': 'Belmopan', 'Denmark': 'Copenhagen', 'Macao': '', 'Nepal': 'Kathmandu', 'Guatemala': 'Guatemala City', 'Chad': "N'Djamena", 'Algeria': 'Algiers'}

In [48]:
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')

# Build the matcher once, before the component is added to the pipeline.
# Calling nlp.pipe inside the component would re-enter the component itself.
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', None, *patterns)

def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    doc.ents = [Span(doc, start, end, label='GPE')
                for match_id, start, end in matcher(doc)]
    return doc

# Add the component to the pipeline
nlp.add_pipe(countries_component)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: capitals.get(span.text)

# Register the Span extension attribute 'capital' with the getter get_capital 
Span.set_extension('capital', getter=get_capital, force=True)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")

print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]


## Scaling and performance

#### Processing large volumes of text as streams

- The `nlp.pipe` method processes texts as a stream and yields `Doc` objects
- **Bad** : `docs = [nlp(text) for text in LOTS_OF_TEXTS]`
- **Good**: `docs = nlp.pipe(LOTS_OF_TEXTS)`
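Part of the reason `nlp.pipe` is faster is that it works on batches of texts internally instead of one text at a time, and it yields results lazily. The plain-Python sketch below only illustrates that batching-and-streaming pattern; `stream_in_batches` and `count_words` are hypothetical names, and nothing here reflects spaCy's internal code.

```python
from itertools import islice

def stream_in_batches(texts, process_batch, batch_size=2):
    """Lazily consume texts, process them batch by batch, yield results one at a time."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        for result in process_batch(batch):
            yield result

def count_words(batch):
    # Stand-in for the real per-batch work
    return [len(text.split()) for text in batch]

# The input is a generator, so texts are only read when they are needed
texts = (line for line in ['one two', 'three', 'four five six'])
results = list(stream_in_batches(texts, count_words))
print(results)
```

Because the function is itself a generator, nothing is processed until the caller iterates over it, which is also why `nlp.pipe` is usually wrapped in `list(...)` or consumed in a `for` loop.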

In [5]:
TEXTS = ['McDonalds is my favorite restaurant.', 'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..', 'People really still eat McDonalds :(', 'The McDonalds in Spain has chicken wings. My heart is so happy ', '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P', 'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D', 'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it']

In [50]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Process the texts and print the adjectives
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == 'ADJ'])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible', 'gettin', 'payin']


In [10]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) (@McDonalds,) (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) (WANT, McRib) (This morning,)


In [13]:
people = ['David Bowie', 'Angela Merkel', 'Lady Gaga']
# Create a list of pattern Docs for the PhraseMatcher
patterns = list(nlp.pipe(people))

#### Passing in context

- Setting `as_tuples=True` on `nlp.pipe` lets you pass in `(text, context)` tuples
- yields `(doc, context)` tuples
- useful for associating metadata with `doc`
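Conceptually, `as_tuples=True` keeps each context object paired with the result of processing its text. A minimal plain-Python sketch of that pairing, assuming a hypothetical `pipe_as_tuples` helper and made-up page numbers as context (this is not spaCy's implementation):

```python
def pipe_as_tuples(process, data):
    """Yield (result, context) for each (text, context) pair, preserving order."""
    for text, context in data:
        yield process(text), context

# Illustrative data: the 'page' values are invented for this sketch
DATA_SAMPLE = [
    ('It was the best of times, it was the worst of times.', {'page': 1}),
    ('It was a bright cold day in April.', {'page': 7}),
]

def count_words(text):
    # Stand-in for processing a text into a Doc
    return len(text.split())

for n_words, context in pipe_as_tuples(count_words, DATA_SAMPLE):
    print(context['page'], n_words)
```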

In [14]:
DATA = [('One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.', {'book': 'Metamorphosis', 'author': 'Franz Kafka'}), ("I know not all that may be coming, but be it what it will, I'll go to it laughing.", {'book': 'Moby-Dick or, The Whale', 'author': 'Herman Melville'}), ('It was the best of times, it was the worst of times.', {'book': 'A Tale of Two Cities', 'author': 'Charles Dickens'}), ('The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.', {'book': 'On the Road', 'author': 'Jack Kerouac'}), ('It was a bright cold day in April, and the clocks were striking thirteen.', {'book': '1984', 'author': 'George Orwell'}), ('Nowadays people know the price of everything and the value of nothing.', {'book': 'The Picture Of Dorian Gray', 'author': 'Oscar Wilde'})]

In [16]:
# Import the Doc class
from spacy.tokens import Doc

# Register the Doc extension 'author' (default None)
Doc.set_extension('author', default=None, force=True)

# Register the Doc extension 'book' (default None)
Doc.set_extension('book', default=None, force=True)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']
    
    #Print the text and custom attribute data
    print(doc.text, '\n', "— '{}' by {}".format(doc._.book, doc._.author), '\n')

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

Nowadays people know the price of everything and the value of nothing. 
 — 'The Picture Of Dorian Gray' by Oscar Wilde 



The same technique is useful for a variety of tasks. For example, you could pass in page or paragraph numbers to relate the processed Doc back to the position in a larger document. Or you could pass in other structured data like IDs referring to a knowledge base.

#### Selective processing

In [17]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

text = "Chick-fil-A is an American fast food restaurant chain headquartered in the city of College Park, Georgia, specializing in chicken sandwiches."

In [19]:
# Only tokenize the text
doc = nlp.make_doc(text)

print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


In [20]:
text = "Chick-fil-A is an American fast food restaurant chain headquartered in the city of College Park, Georgia, specializing in chicken sandwiches."

# Disable the tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

(American, College Park, Georgia)


# Training and updating models

Next, we will look at how to update spaCy's statistical models to customize them for our use case, for example to predict a new entity type in online comments. We will write our own training loop from scratch and cover the basics of how training works, along with tips and tricks that can make custom NLP projects more successful.

spaCy's components are supervised models for text annotations, meaning they can only learn to reproduce examples, not guess new labels from raw text.

## Creating training data

spaCy's rule-based `Matcher` is a great way to quickly create training data for named entity models.

In [44]:
TEXTS = ['How to preorder the iPhone X', 'iPhone X is coming', 'Should I pay $1,000 for the iPhone X?', 'The iPhone 8 reviews are here', 'Your iPhone goes up to 11 today', 'I need a new phone! Any tips?']

In [45]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

print(nlp.pipe_names)
# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Each pattern is a list of dictionaries, one per token
# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'},
            {'LOWER': 'x'}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{'LOWER': 'iphone'},
            {'IS_DIGIT': True, 'OP': '?'}]

# Add patterns to the matcher
matcher.add('GADGET', None, pattern1, pattern2)

['tagger', 'parser', 'ner']


Use the above patterns to quickly bootstrap some training data for our model.

In [46]:
from spacy.tokens import Span
from spacy.matcher import Matcher

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

matcher = Matcher(nlp.vocab)
# Add patterns to the matcher
matcher.add('GADGET', None, pattern1, pattern2)

TRAINING_DATA = []
# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Find the matches in the doc
    matches = matcher(doc)
    
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matches]
    
    # Get a list of (start, end, label) tuples of matches in the text
    entities = [(span.start_char, span.end_char, 'GADGET') for span in spans]
    
    #print(doc.text, entities)
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {'entities': entities})
    
    # Append the example to the training data
    TRAINING_DATA.append(training_example)
    
print(*TRAINING_DATA, sep='\n')

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})


We have now created some training examples using the `Matcher` and our patterns. Before training a model on this data, we should always double-check that the matcher didn't identify any false positives. Even with that review step, the process is still much faster than annotating everything manually.
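In the output above, the texts containing "iPhone X" receive two overlapping annotations, because both patterns match at the same position. spaCy's entity recognizer assigns each token to at most one entity, so one plausible clean-up step is to keep only the longest match at each position. `keep_longest` is a hypothetical helper operating on the `(start_char, end_char, label)` tuples; it is a sketch of one possible rule, not part of the original workflow.

```python
def keep_longest(entities):
    """Drop any entity that overlaps a longer (or equally long, earlier) one."""
    # Longest first; ties broken by start offset
    ordered = sorted(entities, key=lambda ent: (-(ent[1] - ent[0]), ent[0]))
    kept = []
    for start, end, label in ordered:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# First example from the generated training data
example = ('How to preorder the iPhone X',
           {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
text, annotations = example
cleaned = keep_longest(annotations['entities'])
print(text, cleaned)
```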

## Setting up the pipeline from scratch

Let's prepare the spaCy pipeline to train the entity recognizer to recognize 'GADGET' entities in a text, for example "iPhone X".

In [47]:
# Create a blank 'en' model
nlp = spacy.blank('en')

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label('GADGET')

## Building a training loop

In [48]:
print(TRAINING_DATA)

[('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]}), ('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]}), ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]}), ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}), ('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]}), ('I need a new phone! Any tips?', {'entities': []})]


In [49]:

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)

    # Reset the losses at the start of each iteration
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]

        # Update the model; the loss is accumulated in the losses dict
        nlp.update(texts, annotations, losses=losses)
        print(losses)

{'ner': 10.400000095367432}
{'ner': 21.78433907032013}
{'ner': 31.6200110912323}
{'ner': 6.9140013456344604}
{'ner': 13.714298665523529}
{'ner': 18.08437442779541}
{'ner': 2.892008237540722}
{'ner': 4.254102131351829}
{'ner': 7.10178685025312}
{'ner': 2.165703824372031}
{'ner': 2.9505524254855118}
{'ner': 4.399802905296383}
{'ner': 0.5348995538088275}
{'ner': 1.7707562224250069}
{'ner': 8.03996536892464}
{'ner': 2.747800108820229}
{'ner': 3.0647464738540293}
{'ner': 3.388460897228356}
{'ner': 0.1266126869886648}
{'ner': 1.515483751281863}
{'ner': 1.5165760675422888}
{'ner': 0.012677417390364099}
{'ner': 1.1824903084988136}
{'ner': 1.182629649928657}
{'ner': 0.7028325000344466}
{'ner': 0.7031372110304019}
{'ner': 0.7031603734277141}
{'ner': 3.566338841665129e-05}
{'ner': 1.9963283586681655}
{'ner': 1.9963336612291895}


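We have trained the spaCy model. The numbers printed above are the loss, which roughly indicates how much work is left for the optimizer: the lower the number, the better. Because `losses` is reset once per iteration and printed after every batch, the values accumulate over the three batches of each iteration. In real life, we normally want to use a lot more data than this, ideally at least a few hundred or a few thousand examples.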
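Since `losses` is reset once per iteration but printed after every batch, each group of three printed values (six examples in batches of two) is a running total, and the last value in a group is that iteration's total. A small sketch for extracting those totals from a list of printed values; `iteration_totals` is a hypothetical helper, and the numbers are the first six values from the output above, rounded to two decimals.

```python
import math

def iteration_totals(printed_losses, n_examples, batch_size):
    """Pick the last running total of each iteration from per-batch printouts."""
    batches_per_iteration = math.ceil(n_examples / batch_size)
    return printed_losses[batches_per_iteration - 1::batches_per_iteration]

# First two iterations of the training output, rounded
printed = [10.40, 21.78, 31.62, 6.91, 13.71, 18.08]
totals = iteration_totals(printed, n_examples=6, batch_size=2)
print(totals)
```

Comparing these per-iteration totals, rather than the per-batch printouts, gives a clearer picture of whether the loss is going down.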

## Exploring the model

In [42]:
TEST_DATA = ['Apple is slowing down the iPhone 8 and iPhone X - how to stop it', "I finally understand what the iPhone X 'notch' is for", 'Everything you need to know about the Samsung Galaxy S9', 'Looking to compare iPad models? Here’s how the 2018 lineup stacks up', 'The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple', 'what is the cheapest ipad, especially ipad pro???', 'Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics']

In [50]:
# Process each text in TEST_DATA
for doc in nlp.pipe(TEST_DATA):
    # Print the document text and entities
    print(doc.text)
    print(doc.ents, '\n\n')

Apple is slowing down the iPhone 8 and iPhone X - how to stop it
(iPhone, iPhone) 


I finally understand what the iPhone X 'notch' is for
(iPhone,) 


Everything you need to know about the Samsung Galaxy S9
() 


Looking to compare iPad models? Here’s how the 2018 lineup stacks up
() 


The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple
(iPhone 8, iPhone 8) 


what is the cheapest ipad, especially ipad pro???
() 


Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics
() 




## Training best practices

In [51]:
TRAINING_DATA = [
    ("i went to amsterdem last year and the canals were beautiful", {'entities': [(10, 19, 'GPE')]}),
    ("You should visit Paris once in your life, but the Eiffel Tower is kinda boring", {'entities': [(17, 22, 'GPE')]}),
    ("There's also a Paris in Arkansas, lol", {'entities': [(15, 20, 'GPE'), (24, 32, 'GPE')]}),
    ("Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!", {'entities': [(0, 6, 'GPE')]})
]

In [52]:
print(*TRAINING_DATA, sep='\n')

('i went to amsterdem last year and the canals were beautiful', {'entities': [(10, 19, 'GPE')]})
('You should visit Paris once in your life, but the Eiffel Tower is kinda boring', {'entities': [(17, 22, 'GPE')]})
("There's also a Paris in Arkansas, lol", {'entities': [(15, 20, 'GPE'), (24, 32, 'GPE')]})
('Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!', {'entities': [(0, 6, 'GPE')]})
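Because the annotations are character offsets, being off by one character silently changes the example. A quick check is to slice each text with its offsets and read the result. `show_entities` is a hypothetical helper, and the two examples are copied from the data above.

```python
def show_entities(training_data):
    """Return (substring, label) pairs so that offsets can be checked by eye."""
    rows = []
    for text, annotations in training_data:
        for start, end, label in annotations['entities']:
            rows.append((text[start:end], label))
    return rows

SAMPLE = [
    ("i went to amsterdem last year and the canals were beautiful",
     {'entities': [(10, 19, 'GPE')]}),
    ("There's also a Paris in Arkansas, lol",
     {'entities': [(15, 20, 'GPE'), (24, 32, 'GPE')]}),
]
rows = show_entities(SAMPLE)
print(rows)
```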


### Training multiple labels

In [61]:
TRAINING_DATA = [
    ("Reddit partners with Patreon to help creators build communities", 
     {'entities': [(0,6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),
  
    ("PewDiePie smashes YouTube record", 
     {'entities': [(18, 25, 'WEBSITE'), (0, 9,'PERSON')]}),
  
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans", 
     {'entities': [(0, 6, 'WEBSITE'), (15, 29,'PERSON')]})
]

In [62]:
TRAINING_DATA[1]

('PewDiePie smashes YouTube record',
 {'entities': [(18, 25, 'WEBSITE'), (0, 9, 'PERSON')]})
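When training several labels at once, it can help to check how many annotations each label has, since a label with very few examples is unlikely to be learned well. A small sketch using the standard library; `TRAINING_SAMPLE` simply repeats the three examples above, and `label_counts` is a name chosen for this illustration.

```python
from collections import Counter

TRAINING_SAMPLE = [
    ("Reddit partners with Patreon to help creators build communities",
     {'entities': [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),
    ("PewDiePie smashes YouTube record",
     {'entities': [(18, 25, 'WEBSITE'), (0, 9, 'PERSON')]}),
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
     {'entities': [(0, 6, 'WEBSITE'), (15, 29, 'PERSON')]}),
]

# Count how often each label occurs across all examples
label_counts = Counter(label
                       for _, annotations in TRAINING_SAMPLE
                       for _, _, label in annotations['entities'])
print(label_counts)
```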

We are doing this labeling by hand. In real life, we would probably want to automate this and use an annotation tool, e.g. [Brat](http://brat.nlplab.org/), an open-source solution, or [Prodigy](https://prodi.gy/), Explosion's annotation tool that integrates with spaCy.