# Advanced NLP with spaCy


## 1. Finding words, phrases, names and concepts

In [1]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")
type(doc)

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)
    
# Index into the Doc to get a single Token
token = doc[1]
type(token)

# A slice from the Doc is a Span object
span = doc[1:4]
type(span)

# Get span text
print(span.text)


Hello
world
!
world!


### Token lexical attributes

In [2]:
doc = nlp("It costs, $5")
print('Index: ', [token.i for token in doc])
print('Text: ', [token.text for token in doc])
print('is_alpha: ', [token.is_alpha for token in doc])
print('is_punct: ', [token.is_punct for token in doc])
print('like_num: ', [token.like_num for token in doc])

Index:  [0, 1, 2, 3, 4]
Text:  ['It', 'costs', ',', '$', '5']
is_alpha:  [True, True, False, False, False]
is_punct:  [False, False, True, False, False]
like_num:  [False, False, False, False, True]
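
Lexical attributes depend only on the token's own text, not on its context. As a rough illustration, the flags printed above can be approximated in plain Python. The helper names below are hypothetical, and spaCy's own `like_num` is broader (it also accepts number words such as "ten"), so treat this as a sketch rather than spaCy's implementation.

```python
import unicodedata

def is_alpha(text):
    # True when every character is alphabetic
    return text.isalpha()

def is_punct(text):
    # True when every character falls in a Unicode punctuation category (P*)
    return bool(text) and all(unicodedata.category(ch).startswith("P") for ch in text)

def like_num(text):
    # Simplified: digits with optional thousands separators and one decimal point
    stripped = text.replace(",", "").replace(".", "", 1)
    return stripped.isdigit()

tokens = ["It", "costs", ",", "$", "5"]
print("is_alpha:", [is_alpha(t) for t in tokens])
print("is_punct:", [is_punct(t) for t in tokens])
print("like_num:", [like_num(t) for t in tokens])
```

Note that `$` is a currency symbol rather than punctuation in Unicode, which is consistent with the `is_punct` value shown above.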


#### German example

In [3]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


### Lexical attributes

In [4]:
from spacy.lang.en import English
nlp = English()

# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == '%':
            print('Percentage found:', token.text)

Percentage found: 60
Percentage found: 4
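
Note that `doc[token.i + 1]` would raise an `IndexError` if a number happened to be the last token in the text. One way to avoid this is to iterate over adjacent pairs. The sketch below shows the idea on a hand-written list of strings, with `str.isdigit` standing in for `token.like_num` (an assumption made to keep the example independent of spaCy).

```python
# Hand-written token list; the final token is a number with nothing after it
tokens = ["In", "1990", ",", "more", "than", "60", "%", "of", "people",
          "were", "poor", ".", "Now", "less", "than", "4", "%", "are", "in", "2020"]

percentages = []
# zip stops at the shorter sequence, so the last token is never indexed past the end
for current, following in zip(tokens, tokens[1:]):
    if current.isdigit() and following == "%":
        percentages.append(current)

print("Percentages found:", percentages)
```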


### Statistical models

- Enable spaCy to predict linguistic attributes in context
  - Part-of-speech tags
  - Syntactic dependencies
  - Named entities
- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

### Pretrained model packages
- en_core_web_sm (trained on web text)
  - Binary weights
  - Vocabulary
  - Meta information (language, pipeline)
- de_core_news_sm

### Predict part-of-speech tags

In [11]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Process text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)
    
    # Without the trailing underscore, the attribute returns the tag's integer ID
    print(token.text, token.pos)

She PRON
She 95
ate VERB
ate 100
the DET
the 90
pizza NOUN
pizza 92


### Predict Syntactic Dependencies

In [9]:
# Process text
doc = nlp("She ate the pizza")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


*Label definitions*

- *nsubj*: nominal subject (She)
- *dobj*: direct object (pizza)
- *det*: determiner, i.e. article (the)
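
Because every token points to its head, the parser output above is enough to recover who did what. A minimal sketch, using rows copied by hand from that output rather than a live `Doc`:

```python
# Rows copied from the parser output above: (token text, dependency label, head text)
parse = [
    ("She", "nsubj", "ate"),
    ("ate", "ROOT", "ate"),
    ("the", "det", "pizza"),
    ("pizza", "dobj", "ate"),
]

# The root is the token labelled ROOT (it is its own head in this output)
root = next(text for text, dep, head in parse if dep == "ROOT")

# Collect the root's direct dependents, keyed by their dependency label
dependents = {dep: text for text, dep, head in parse if head == root and dep != "ROOT"}

print(dependents["nsubj"], root, dependents["dobj"])
```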

### Predict Named Entities

In [18]:
# Process a text
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    
    # Print entity text and its label
    print(ent.text, ent.label_, ent.label)

Apple ORG 383
U.K. GPE 384
$1 billion MONEY 394


GPE stands for geopolitical entity.

To get a quick definition of a tag or label, use spacy.explain:

In [17]:
# SYM, relcl, quantmod etc.
spacy.explain('quantmod')

'modifier of quantifier'

### Predicting linguistic annotations

In [15]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print('{:<12}{:<10}{:<10}'.format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          VERB      compound  
official    NOUN      ROOT      
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


### Rule based matcher

- Match on Doc objects, not just strings
- Match on tokens and token attributes
- Use the model's predictions
- Example: find "duck" (verb) not "duck" (noun)

#### Match patterns
- List of dictionaries, one per token
- Match exact token texts. This example looks for two tokens in the text:
```
[{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
```
- Match lexical attributes
```
[{'LOWER': 'iphone'}, {'LOWER': 'x'}]
```
- Match any token attributes. A lemma is the base form of a word, so this pattern matches phrases like:
  - buying milk
  - bought coffee
```
[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]
```

In [20]:
import spacy

# Import matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

for match_id, start, end in matches:
    # match_id: hash value of the pattern name
    # start: start index of matched span
    # end: end index of matched span
    
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


In [25]:
matcher = Matcher(nlp.vocab)
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
matcher.add('WORLD_CUP_PATTERN', None, pattern)

doc = nlp("2018 FIFA World Cup: France won!")
matches = matcher(doc)
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


In [31]:
matcher = Matcher(nlp.vocab)
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
matcher.add('LOVE', None, pattern)
doc = nlp("I loved dogs but now I love cats more.")

matches = matcher(doc)
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


In [47]:
# Matching operators and quantifiers
matcher = Matcher(nlp.vocab)
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'}, # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
matcher.add('BUY_THINGS', None, pattern)

doc = nlp("I bought a smartphone. Now I'm buying apps")
matches = matcher(doc)
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps


- ```{'OP': '!'}``` - negation: match 0 times
- ```{'OP': '?'}``` - optional: match 0 or 1 times
- ```{'OP': '+'}``` - match 1 or more times
- ```{'OP': '*'}``` - match 0 or more times
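
To make the quantifiers concrete, here is a toy matcher in plain Python. It is only a sketch of the semantics, not spaCy's implementation: tokens are hand-annotated dictionaries, `find_matches` and its helpers are hypothetical names, and the `!` operator is left out.

```python
def token_matches(spec, token):
    # Every attribute in the spec (ignoring OP) must equal the token's value
    return all(token.get(key) == value for key, value in spec.items() if key != "OP")

def match_ends(pattern, tokens, start):
    # Return every index at which the pattern can end when it begins at `start`
    if not pattern:
        return {start}
    spec, rest = pattern[0], pattern[1:]
    op = spec.get("OP", "1")
    ends = set()
    if op in ("?", "*"):
        # Zero occurrences: skip this spec entirely
        ends |= match_ends(rest, tokens, start)
    if start < len(tokens) and token_matches(spec, tokens[start]):
        if op in ("+", "*"):
            # Consume one token, then allow zero or more further repeats
            ends |= match_ends([dict(spec, OP="*")] + rest, tokens, start + 1)
        else:
            # Exactly one occurrence ("1"), or the single optional one ("?")
            ends |= match_ends(rest, tokens, start + 1)
    return ends

def find_matches(pattern, tokens):
    # Try every start position and keep the non-empty spans
    return [(start, end)
            for start in range(len(tokens))
            for end in sorted(match_ends(pattern, tokens, start))
            if end > start]

# Hand-annotated (text, lemma, POS) rows for "I bought a smartphone. Now I'm buying apps"
rows = [("I", "I", "PRON"), ("bought", "buy", "VERB"), ("a", "a", "DET"),
        ("smartphone", "smartphone", "NOUN"), (".", ".", "PUNCT"),
        ("Now", "now", "ADV"), ("I", "I", "PRON"), ("'m", "be", "AUX"),
        ("buying", "buy", "VERB"), ("apps", "app", "NOUN")]
tokens = [{"TEXT": t, "LEMMA": l, "POS": p} for t, l, p in rows]

pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
for start, end in find_matches(pattern, tokens):
    print(" ".join(tok["TEXT"] for tok in tokens[start:end]))
```

Like the real Matcher, this sketch reports every span that satisfies the pattern, so a `+` pattern can yield overlapping matches of different lengths.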

#### Examples

In [48]:
# Using the Matcher
# Import the Matcher and initialize it with the shared vocabulary
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# Add the pattern to the matcher
matcher.add('IPHONE_X_PATTERN', None, pattern)

doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Use the matcher on the doc
matches = matcher(doc)
print('Matches:', [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


In [43]:
# Writing match patterns
matcher = Matcher(nlp.vocab)
doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [45]:
# Writing match patterns
matcher = Matcher(nlp.vocab)
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses



## 2. Large-scale data analysis with spaCy

### Data structures: Vocab, Lexemes and StringStore

#### Shared vocab and string store
 - Vocab: stores data shared across multiple documents
 - To save memory, spaCy encodes all strings to hash values
 - Strings are only stored once in the StringStore via nlp.vocab.strings
 - String store: lookup table in both directions

In [51]:
coffee_hash = nlp.vocab.strings['coffee']
# Look up the string by its hash value (not by the literal text 'coffee_hash')
coffee_string = nlp.vocab.strings[coffee_hash]
print(coffee_hash)
print(coffee_string)

3197928453018144401
coffee


 - Hashes can't be reversed, which is why we need to provide the shared vocab
 - Look up the string and hash in nlp.vocab.strings

In [56]:
doc = nlp("I love coffee")

hash_value = nlp.vocab.strings['coffee']
print('hash value:', hash_value)

# Raises an error if we haven't seen the string before
string = nlp.vocab.strings[hash_value]
print('string value: ', string)

hash value: 3197928453018144401
string value:  coffee
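
The two-way lookup can be pictured with a small toy class. This is a sketch under assumptions, not spaCy's code: `ToyStringStore` and `hash_string` are hypothetical names, and a truncated BLAKE2 digest stands in for the hash function spaCy actually uses, so the numbers will differ from those above.

```python
import hashlib

def hash_string(text):
    # Stand-in 64-bit hash (assumption: any stable hash works for the illustration)
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")

class ToyStringStore:
    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        # Store the string once and return its hash
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # String -> hash can always be computed
            return hash_string(item)
        # Hash -> string only works for strings that were added before
        return self._by_hash[item]

store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])          # the stored string
try:
    store[hash_string("tea")]      # never added, so the lookup fails
except KeyError:
    print("unknown hash")
```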


 - the doc also exposes the vocab and strings

In [58]:
doc = nlp("I love coffee")
print('hash value: ', doc.vocab.strings['coffee'])

hash value:  3197928453018144401


#### Lexemes: entries in the vocabulary
 - a Lexeme object is an entry in the vocabulary

In [61]:
doc = nlp("I love coffee")
lexeme = nlp.vocab['coffee']
# print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


 - contains the context-independent information about a word
    - word text: lexeme.text and lexeme.orth (the hash)
    - lexical attributes like lexeme.is_alpha
    - not context-dependent part-of-speech tags, dependencies or entity labels

![id](images/vocab_hashes_lexemes.png "Vocab, hashes and lexemes") 

### Data structures: Doc, Span and Token

In [71]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words = words, spaces = spaces)
print("Doc text: ", doc.text)

# Create a span manually
span = Span(doc, 0, 2)
print("Span text: ", span.text)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label = "GREETING")
print("Span with label text: ", span_with_label.text)

# Add span to the doc.ents
doc.ents = [span_with_label]
print("New doc text: ", doc.text)

Doc text:  Hello world!
Span text:  Hello world
Span with label text:  Hello world
New doc text:  Hello world!
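
The `spaces` list records whether each word is followed by a space, which is what allows the original text to be rebuilt exactly. A plain-Python sketch of that reconstruction (the helper name `join_tokens` is made up for illustration):

```python
def join_tokens(words, spaces):
    # Append a single space after each word whose flag is True
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))

words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Whole document
print(join_tokens(words, spaces))

# A span over tokens 0 and 1; its text drops any trailing whitespace of the last token
print(join_tokens(words[0:2], spaces[0:2]).rstrip())
```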


![id](images/span_object.png "The Span object")

#### Best practices
 - Doc and Span are very powerful and hold references and relationships of words and sentences
   - Convert result to strings as late as possible
   - Use token attributes if available - for example, token.i for the token index
 

#### Exercise

In [72]:
# Exercise
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label='PERSON')

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

[('David Bowie', 'PERSON')]


In [98]:
# Exercise
nlp = spacy.load('en_core_web_sm')
doc = nlp("Berlin is a nice city")
# Get all tokens and part-of-speech tags
pos_tags = [token.pos_ for token in doc]

for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == 'PROPN':
        # Check if the next token is an auxiliary verb ("is" is tagged AUX, not VERB)
        if doc[token.i + 1].pos_ == 'AUX':
            print('Found a verb after a proper noun!')

Found a verb after a proper noun!


### Word vector and semantic similarity

#### Comparing semantic similarity
- spaCy can compare two objects and predict similarity (Span, Doc, Token)
- Doc.similarity(), Span.similarity() and Token.similarity()
- Take another object and return a similarity score (0 to 1)
- Important: needs a model that has word vectors included, for example:
  - YES: en_core_web_md (medium model)
  - YES: en_core_web_lg (large model)
  - NO: en_core_web_sm (small model)

#### Similarity examples

In [3]:
import spacy

# Load a medium model with vectors
nlp = spacy.load('en_core_web_md')

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8627203210548107


In [4]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.7369546


In [5]:
# Compare a document with a token
doc = nlp("I like pizza")
token=nlp("soap")[0]

print(doc.similarity(token))

0.3253198600655889


In [6]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.6199092090831612


#### How does spaCy predict similarity?
- Similarity is determined using word vectors
- Multi-dimensional meaning representations of words
- Generated using an algorithm like Word2Vec and lots of text
- Can be added to spaCy's statistical models
- Default: cosine similarity, but can be adjusted
- Doc and Span vectors default to average of token vectors
- Short phrases are better than long documents with many irrelevant words
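
The points about cosine similarity and averaged vectors can be sketched with the standard library alone. The three-dimensional vectors below are invented for illustration (real word vectors typically have hundreds of dimensions), and `cosine` and `average` are hypothetical helper names.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average(vectors):
    # Element-wise mean, mirroring how a Doc or Span vector defaults to the token average
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Made-up toy vectors, not taken from any model
vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "soap": [0.0, 0.1, 0.9],
}

print(cosine(vectors["pizza"], vectors["pasta"]))   # close to 1: similar direction
print(cosine(vectors["pizza"], vectors["soap"]))    # close to 0: nearly orthogonal

# A "phrase" vector is the average of its word vectors
phrase = average([vectors["pizza"], vectors["pasta"]])
print(cosine(phrase, vectors["pizza"]))
```

Averaging also hints at why long documents compare poorly: many unrelated words pull the mean vector away from the few words that matter.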

#### Word vector in spaCy

In [10]:
doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print(doc[3].vector)

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

#### Similarity depends on the application context
- Useful for many applications: recommendation systems, flagging duplicates etc.
- There's no objective definition of "similarity"
- Depends on the context and what the application needs to do

In [11]:
# These texts score as similar because both are statements about cats
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")
print(doc1.similarity(doc2))

0.9501447503553421


In [12]:
doc1 = nlp("like")
doc2 = nlp("hate")
print(doc1.similarity(doc2))

0.6574650996652592


### Combining models and rules

|                     | Statistical models                                          | Rule-based systems                                     |
|---------------------|-------------------------------------------------------------|--------------------------------------------------------|
| Use cases           | application needs to generalize based on examples           | dictionary with finite number of examples              |
| Real-world examples | product names, person names, subject/object relationships   | countries of the world, cities, drug names, dog breeds |
| spaCy features      | entity recognizer, dependency parser, part-of-speech tagger | tokenizer, Matcher, PhraseMatcher                      |

#### Recap: Rule-based Matching

In [27]:
# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
# pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
# matcher.add('LOVE_CATS', None, pattern)

# Operators can specify how often a token should be matched
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]
matcher.add('HAPPY', None, pattern)

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

print('Matches:', [doc[start:end].text for match_id, start, end in matches])

Matches: ['very happy', 'very very happy']


#### Adding statistical predictions

In [32]:
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriver'}])
doc = nlp("I have a Golden Retriver")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matcher span:', span.text)
    
    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    
    # Get the previous token and its POS tag
    print('Previous token: ', doc[start - 1].text, doc[start - 1].pos_)

Matcher span: Golden Retriver
Root token: Retriver
Root head token: have
Previous token:  a DET


#### Efficient phrase matching
- PhraseMatcher works like regular expressions or keyword search, but with access to the tokens
- Takes Doc objects as patterns
- More efficient and faster than the Matcher
- Great for matching large word lists

In [34]:
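from spacy.matcher import PhraseMatcher

# Initialize the PhraseMatcher and add a Doc object as the pattern
matcher = PhraseMatcher(nlp.vocab)
pattern = nlp("Golden Retriver")
matcher.add('DOG', None, pattern)

doc = nlp("I have a Golden Retriver")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matcher span:', span.text)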

Matcher span: Golden Retriver
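
One way to see why phrase lookup scales to large word lists: if each phrase is stored as a tuple of token texts in a set, every candidate span costs a single hash lookup. The sketch below illustrates that idea only. It is not how spaCy's `PhraseMatcher` is implemented, it splits on whitespace instead of using a real tokenizer, and `find_phrases` is a hypothetical helper.

```python
def find_phrases(tokens, phrases):
    # Index the phrases as tuples of tokens for constant-time membership tests
    index = {tuple(phrase.split()) for phrase in phrases}
    longest = max(len(key) for key in index)
    matches = []
    for start in range(len(tokens)):
        # Only spans up to the longest phrase length can possibly match
        for end in range(start + 1, min(start + longest, len(tokens)) + 1):
            if tuple(tokens[start:end]) in index:
                matches.append((start, end))
    return matches

countries = ["Czech Republic", "Slovakia", "South Africa"]
tokens = "Czech Republic may help Slovakia protect its airspace".split()

for start, end in find_phrases(tokens, countries):
    print(" ".join(tokens[start:end]))
```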


#### Debugging patterns

In [59]:
doc = nlp("Twitch Prime, the perks program for Amazon Prime members offering free loot, games and other benefits, is ditching one of its best features: ad-free viewing. According to an email sent out to Amazon Prime members today, ad-free viewing will no longer be included as a part of Twitch Prime for new members, beginning on September 14. However, members with existing annual subscriptions will be able to continue to enjoy ad-free viewing until their subscription comes up for renewal. Those with monthly subscriptions will have access to ad-free viewing until October 15.")

# Create the match patterns
pattern1 = [{'LOWER': 'amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
pattern2 = [{'LOWER': 'ad'}, {'LOWER': '-'}, {'LOWER': 'free'}, {'POS': 'NOUN'}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', None, pattern1)
matcher.add('PATTERN2', None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


In [58]:
[token.text for token in nlp("ad-free viewing")]

['ad', '-', 'free', 'viewing']

In [64]:
COUNTRIES = ['Afghanistan',
 'Åland Islands',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antarctica',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bonaire, Sint Eustatius and Saba',
 'Bosnia and Herzegovina',
 'Botswana',
 'Bouvet Island',
 'Brazil',
 'British Indian Ocean Territory',
 'United States Minor Outlying Islands',
 'Virgin Islands (British)',
 'Virgin Islands (U.S.)',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cabo Verde',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Christmas Island',
 'Cocos (Keeling) Islands',
 'Colombia',
 'Comoros',
 'Congo',
 'Congo (Democratic Republic of the)',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Curaçao',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Falkland Islands (Malvinas)',
 'Faroe Islands',
 'Fiji',
 'Finland',
 'France',
 'French Guiana',
 'French Polynesia',
 'French Southern Territories',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Gibraltar',
 'Greece',
 'Greenland',
 'Grenada',
 'Guadeloupe',
 'Guam',
 'Guatemala',
 'Guernsey',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Heard Island and McDonald Islands',
 'Holy See',
 'Honduras',
 'Hong Kong',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 "Côte d'Ivoire",
 'Iran (Islamic Republic of)',
 'Iraq',
 'Ireland',
 'Isle of Man',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Jersey',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kiribati',
 'Kuwait',
 'Kyrgyzstan',
 "Lao People's Democratic Republic",
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Macao',
 'Macedonia (the former Yugoslav Republic of)',
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Marshall Islands',
 'Martinique',
 'Mauritania',
 'Mauritius',
 'Mayotte',
 'Mexico',
 'Micronesia (Federated States of)',
 'Moldova (Republic of)',
 'Monaco',
 'Mongolia',
 'Montenegro',
 'Montserrat',
 'Morocco',
 'Mozambique',
 'Myanmar',
 'Namibia',
 'Nauru',
 'Nepal',
 'Netherlands',
 'New Caledonia',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'Niue',
 'Norfolk Island',
 "Korea (Democratic People's Republic of)",
 'Northern Mariana Islands',
 'Norway',
 'Oman',
 'Pakistan',
 'Palau',
 'Palestine, State of',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Pitcairn',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Kosovo',
 'Réunion',
 'Romania',
 'Russian Federation',
 'Rwanda',
 'Saint Barthélemy',
 'Saint Helena, Ascension and Tristan da Cunha',
 'Saint Kitts and Nevis',
 'Saint Lucia',
 'Saint Martin (French part)',
 'Saint Pierre and Miquelon',
 'Saint Vincent and the Grenadines',
 'Samoa',
 'San Marino',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Seychelles',
 'Sierra Leone',
 'Singapore',
 'Sint Maarten (Dutch part)',
 'Slovakia',
 'Slovenia',
 'Solomon Islands',
 'Somalia',
 'South Africa',
 'South Georgia and the South Sandwich Islands',
 'Korea (Republic of)',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'Sudan',
 'Suriname',
 'Svalbard and Jan Mayen',
 'Swaziland',
 'Sweden',
 'Switzerland',
 'Syrian Arab Republic',
 'Taiwan',
 'Tajikistan',
 'Tanzania, United Republic of',
 'Thailand',
 'Timor-Leste',
 'Togo',
 'Tokelau',
 'Tonga',
 'Trinidad and Tobago',
 'Tunisia',
 'Turkey',
 'Turkmenistan',
 'Turks and Caicos Islands',
 'Tuvalu',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom of Great Britain and Northern Ireland',
 'United States of America',
 'Uruguay',
 'Uzbekistan',
 'Vanuatu',
 'Venezuela (Bolivarian Republic of)',
 'Viet Nam',
 'Wallis and Futuna',
 'Western Sahara',
 'Yemen',
 'Zambia',
 'Zimbabwe']

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', None, *patterns)
doc = nlp("Czech Republic may help Slovakia protect its airspace")
# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

# Result:
# [Czech Republic, Slovakia]

[Czech Republic, Slovakia]


In [82]:
# Extracting countries and relationship
from spacy.tokens import Doc, Span

# COUNTRIES is the same list defined in the previous cell

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', None, *patterns)

# Create a doc and find matches in it
text = """After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades. Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait. Brian Urquhart, Under-Secretary-General from 1971 to 1985, later described the hopes raised by these successes as a \"false renaissance\" for the organization, given the more troubled missions that followed. Though the UN Charter had been written primarily to prevent aggression by one nation against another, in the early 1990s the UN faced a number of simultaneous, serious crises within nations such as Somalia, Haiti, Mozambique, and the former Yugoslavia. The UN mission in Somalia was widely viewed as a failure after the US withdrawal following casualties in the Battle of Mogadishu, and the UN mission to Bosnia faced \"worldwide ridicule\" for its indecisive and confused mission in the face of ethnic cleansing. In 1994, the UN Assistance Mission for Rwanda failed to intervene in the Rwandan genocide amid indecision in the Security Council. Beginning in the last decades of the Cold War, American and European critics of the UN condemned the organization for perceived mismanagement and corruption. In 1984, the US President, Ronald Reagan, withdrew his nation's funding from UNESCO (the United Nations Educational, Scientific and Cultural Organization, founded 1946) over allegations of mismanagement, followed by Britain and Singapore. 
Boutros Boutros-Ghali, Secretary-General from 1992 to 1996, initiated a reform of the Secretariat, reducing the size of the organization somewhat. His successor, Kofi Annan (1997–2006), initiated further management reforms in the face of threats from the United States to withhold its UN dues. In the late 1990s and 2000s, international interventions authorized by the UN took a wider variety of forms. The UN mission in the Sierra Leone Civil War of 1991–2002 was supplemented by British Royal Marines, and the invasion of Afghanistan in 2001 was overseen by NATO. In 2003, the United States invaded Iraq despite failing to pass a UN Security Council resolution for authorization, prompting a new round of questioning of the organization's effectiveness. Under the eighth Secretary-General, Ban Ki-moon, the UN has intervened with peacekeepers in crises including the War in Darfur in Sudan and the Kivu conflict in the Democratic Republic of Congo and sent observers and chemical weapons inspectors to the Syrian Civil War. In 2013, an internal review of UN actions in the final battles of the Sri Lankan Civil War in 2009 concluded that the organization had suffered \"systemic failure\". One hundred and one UN personnel died in the 2010 Haiti earthquake, the worst loss of life in the organization's history. The Millennium Summit was held in 2000 to discuss the UN's role in the 21st century. The three day meeting was the largest gathering of world leaders in history, and culminated in the adoption by all member states of the Millennium Development Goals (MDGs), a commitment to achieve international development in areas such as poverty reduction, gender equality, and public health. Progress towards these goals, which were to be met by 2015, was ultimately uneven. The 2005 World Summit reaffirmed the UN's focus on promoting development, peacekeeping, human rights, and global security. 
The Sustainable Development Goals were launched in 2015 to succeed the Millennium Development Goals. In addition to addressing global challenges, the UN has sought to improve its accountability and democratic legitimacy by engaging more with civil society and fostering a global constituency. In an effort to enhance transparency, in 2016 the organization held its first public debate between candidates for Secretary-General. On 1 January 2017, Portuguese diplomat António Guterres, who previously served as UN High Commissioner for Refugees, became the ninth Secretary-General. Guterres has highlighted several key goals for his administration, including an emphasis on diplomacy for preventing conflicts, more effective peacekeeping efforts, and streamlining the organization to be more responsive and versatile to global needs."""
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")
    print(span)

    # Adding the span to doc.ents is left disabled: a span that overlaps an
    # existing entity raises an error, so overlaps would have to be filtered first
    # doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, '-->', span.text)
# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])

Namibia
in --> Namibia
South Africa
in --> South Africa
Cambodia
Africa --> Cambodia
Kuwait
of --> Kuwait
Somalia
as --> Somalia
Haiti
Somalia --> Haiti
Mozambique
Haiti --> Mozambique
Somalia
in --> Somalia
Rwanda
for --> Rwanda
Singapore
Britain --> Singapore
Sierra Leone
War --> Sierra Leone
Afghanistan
of --> Afghanistan
Iraq
invaded --> Iraq
Sudan
in --> Sudan
Congo
of --> Congo
Haiti
earthquake --> Haiti
[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('US', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Yugoslavia', 'GPE'), ('Somalia', 'GPE'), ('US', 'GPE'), ('Mogadishu', 'GPE'), ('Bosnia', 'GPE'), ('Rwanda', 'GPE'), ('US', 'GPE'), ('Britain', 'GPE'), ('Singapore', 'GPE'), ('the United States', 'GPE'), ('Afghanistan', 'GPE'), ('the United States', 'GPE'), ('Iraq', 'GPE'), ('Darfur', 'GPE'), ('Sudan', 'GPE'), ('Kivu', 'GPE'), ('the Democratic Republic of Congo', 'GPE'), ('Haiti', 'GPE')]
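
The loop above cannot simply append every matched span to `doc.ents`, because entity spans must not overlap and a matched country may collide with an entity the model already predicted. One possible way around this, sketched below without any spaCy dependency, is to keep the longest span whenever two spans overlap. The `(start, end, label)` tuples, the sample offsets and the helper name `filter_overlapping` are assumptions made for illustration; newer spaCy releases ship `spacy.util.filter_spans`, which applies a similar longest-first rule to real `Span` objects.

```python
def filter_overlapping(spans):
    """Keep the longest spans and drop any span that overlaps a kept one."""
    # Longest spans first; ties are broken by the earlier start offset
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end, label in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)

# Hypothetical token offsets: entities from the model plus matcher results
model_ents = [(5, 9, 'GPE'), (12, 13, 'GPE')]
matched = [(8, 9, 'GPE'), (20, 21, 'GPE')]

merged = filter_overlapping(model_ents + matched)
print(merged)
```

Here the one-token match at `(8, 9)` is dropped because it lies inside the longer entity `(5, 9)`, while the non-overlapping match at `(20, 21)` is kept.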


## 3. Processing Pipelines

### Processing pipeline

![id](images/nlp_processing_process.png "What happens when you call nlp?")

In [None]:
doc = nlp("This is a sentence.")

#### Built-in pipeline components

| Name    | Description             | Creates                                           |
|---------|-------------------------|---------------------------------------------------|
| tagger  | Part-of-speech tagger   | Token.tag                                         |
| parser  | Dependency parser       | Token.dep, Token.head, Doc.sents, Doc.noun_chunks |
| ner     | Named entity recognizer | Doc.ents, Token.ent_iob, Token.ent_type           |
| textcat | Text classifier         | Doc.cats                                          |

The text classifier is not included in any of the pretrained models because its categories are always very specific to a particular use case. You can still add it to a pipeline and train your own model with it.

#### Under the hood

- Pipeline defined in model's meta.json in order
- Built-in components need binary data to make predictions

#### Pipeline attributes

- ```nlp.pipe_names```: list of pipeline component names
- ```nlp.pipeline```: list of (name, component) tuples
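
The two attributes above describe what happens when `nlp` is called: the text is tokenized, and the resulting doc is then passed through each component in order. The sketch below mimics that flow in plain Python. The dictionary standing in for a `Doc`, the toy `tokenizer`, `tagger` and `parser` functions and the `run` helper are all hypothetical; this is not how spaCy is implemented internally.

```python
def tokenizer(text):
    # Hypothetical stand-in for the tokenizer: a dict plays the role of the Doc
    return {"text": text, "tokens": text.split(), "log": []}

def tagger(doc):
    # A component receives the doc, modifies it and returns it
    doc["log"].append("tagger")
    return doc

def parser(doc):
    doc["log"].append("parser")
    return doc

# Analogue of nlp.pipeline: an ordered list of (name, component) tuples
pipeline = [("tagger", tagger), ("parser", parser)]
# Analogue of nlp.pipe_names: just the names, in the same order
pipe_names = [name for name, _ in pipeline]

def run(text):
    """Tokenize the text, then apply every component in pipeline order."""
    doc = tokenizer(text)
    for name, component in pipeline:
        doc = component(doc)
    return doc

doc = run("This is a sentence.")
print(pipe_names)
print(doc["log"])
```

The log shows that components run in exactly the order in which they are listed, which is why the position of a component in the pipeline matters.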

In [6]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()
doc = nlp("This is a sentence.")
print(nlp.pipe_names)
print(nlp.pipeline)

[]
[]


#### Examples

##### 1. Inspecting the pipeline

In [2]:
import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f8d63597048>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f8d5d799768>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f8d5d7997c8>)]


### Custom pipeline components

- Make a function execute automatically when you call nlp
- Add your own metadata to documents and tokens
- Update built-in attributes like doc.ents

#### Anatomy of component
- Function that takes a doc, modifies it and returns it
- Can be added using the nlp.add_pipe method

```
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)
```

Use one of the following arguments to specify where the component is added:

| Argument | Description          | Example                                 |
|----------|----------------------|-----------------------------------------|
| last     | If True, add last    | nlp.add_pipe(component, last=True)      |
| first    | If True, add first   | nlp.add_pipe(component, first=True)     |
| before   | Add before component | nlp.add_pipe(component, before='ner')   |
| after    | Add after component  | nlp.add_pipe(component, after='tagger') |
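
The effect of the four arguments in the table can be illustrated with a plain list of component names. The function below is a hypothetical re-implementation written only for this illustration (its name and signature are assumptions, and it is not spaCy's code); it assumes that a component is added last when no position is given, which matches the outputs shown in these notes.

```python
def add_pipe(pipe_names, name, first=False, before=None, after=None):
    """Return a new list of names with `name` inserted at the requested position."""
    names = list(pipe_names)
    if first:
        index = 0
    elif before is not None:
        index = names.index(before)
    elif after is not None:
        index = names.index(after) + 1
    else:
        index = len(names)  # no position given: add last
    names.insert(index, name)
    return names

base = ["tagger", "parser", "ner"]
print(add_pipe(base, "custom"))
print(add_pipe(base, "custom", first=True))
print(add_pipe(base, "custom", before="ner"))
print(add_pipe(base, "custom", after="tagger"))
```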


#### Example: a simple component

In [12]:
import spacy

nlp = spacy.load('en_core_web_sm')

def custom_component(doc):
    print('Doc length:', len(doc))
    # Return modified doc
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)
# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

# Process doc
doc = nlp("Hello world!")

Pipeline: ['custom_component', 'tagger', 'parser', 'ner']
Doc length: 3


#### Example: Complex component

In this exercise, you'll be writing a custom component that uses the PhraseMatcher to find animal names in the document and adds the matched spans to the doc.ents.

A PhraseMatcher with the animal patterns has already been created as the variable matcher. The small English model is available as the variable nlp. The Span object has already been imported for you.

animal_patterns: \[Golden Retriever, cat, turtle, Rattus norvegicus\]

In [22]:
import spacy
from spacy.tokens import Span
from spacy.matcher import Matcher

# Load the model first so that the matcher can share its vocab
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
matcher.add('CAT', None, [{'LOWER': 'cat'}])

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label='ANIMAL')
             for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc

nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


### Extension attributes

#### Extension attributes
##### Setting custom attributes
- add custom metadata to documents, tokens and spans; the data can be added once or computed dynamically
- accessible via the ._ property

```
doc._.title = 'My documents'
token._.is_color = True
span._.has_color = False
```

- register on the global Doc, Token or Span using the set_extension method

##### Extension attribute types
1. Attribute extensions
2. Property extensions
3. Method extensions

##### Attribute extensions
- set a default value that can be overwritten

##### Property extensions
- Define a getter and an optional setter function
- Getter is only called when you retrieve the attribute value
- Span extensions should almost always use a getter
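
To make the difference between the three extension types concrete, here is a small plain-Python sketch of a `._`-style namespace. The `Underscore` class, the `set_extension` helper and the example extensions are hypothetical and heavily simplified; they only illustrate the lookup order (getter, then method, then stored value or default) and are not spaCy's implementation.

```python
class Underscore:
    """Hypothetical stand-in for the `._` namespace of a single object."""
    registry = {}  # name -> (default, getter, method), shared by all instances

    def __init__(self, obj):
        # Bypass our own __setattr__ for the two internal fields
        object.__setattr__(self, "_obj", obj)
        object.__setattr__(self, "_values", {})

    def __getattr__(self, name):
        if name not in Underscore.registry:
            raise AttributeError(name)
        default, getter, method = Underscore.registry[name]
        if getter is not None:
            # Property extension: computed every time the value is retrieved
            return getter(self._obj)
        if method is not None:
            # Method extension: a callable that receives the object first
            return lambda *args: method(self._obj, *args)
        # Attribute extension: stored value, falling back to the default
        return self._values.get(name, default)

    def __setattr__(self, name, value):
        self._values[name] = value

def set_extension(name, default=None, getter=None, method=None):
    Underscore.registry[name] = (default, getter, method)

set_extension("is_color", default=False)
set_extension("upper", getter=lambda text: text.upper())
set_extension("starts_with", method=lambda text, prefix: text.startswith(prefix))

token = Underscore("blue")  # a plain string plays the role of a token
before_overwrite = token.is_color
token.is_color = True
after_overwrite = token.is_color
print(before_overwrite, after_overwrite, token.upper, token.starts_with("bl"))
```

The attribute extension starts at its default and can be overwritten per object, whereas the getter and the method are evaluated on access and need no stored value.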

In [24]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span, each with a default value
# (force=True avoids an error if the extension is already registered)
Doc.set_extension('title', default = None, force = True)
Token.set_extension('is_color', default = False, force = True)
Span.set_extension('has_color', default = False, force = True)

# Overwrite the default value on a single token
doc = nlp("The sky is blue.")
doc[3]._.is_color = True

In [26]:
from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors
    
# Set extension on the Token with getter
Token.set_extension('is_color', getter = get_is_color, force = True)

doc=nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)

True - blue


In [28]:
from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)
    
# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color, force = True)

doc=nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)

True - sky is blue
False - The sky


##### Method extensions
- assign a function that becomes available as an object method
- let you pass arguments to the extension function

In [35]:
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method = has_token, force = True)
doc = nlp("The sky is blue.")

print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

True - blue
False - cloud


##### Example: set extension attribute

In [36]:
# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]
  
# Register the Token property extension 'reversed' with the getter get_reversed
Token.set_extension('reversed', getter=get_reversed, force=True)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print('reversed:', token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


In [39]:
# Define the getter function
def get_has_number(doc):
    # Return if any of the tokens in the doc return True for token.like_num
    return any(token.like_num for token in doc)

# Register the Doc property extension 'has_number' with the getter get_has_number
Doc.set_extension('has_number', getter = get_has_number, force=True)

# Process the text and check the custom has_number attribute 
doc = nlp("The museum closed for five years in 2012.")
print('has_number:', doc._.has_number)

has_number: True


In [40]:
# Define the method
def to_html(span, tag):
    # Wrap the span text in a HTML tag and return it
    return '<{tag}>{text}</{tag}>'.format(tag=tag, text=span.text)

# Register the Span method extension 'to_html' with the method to_html
Span.set_extension('to_html', method=to_html, force = True)

# Process the text and call the to_html method on the span with the tag name 'strong'
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html('strong'))

<strong>Hello world</strong>


##### Example: entities and extensions
In this exercise, you'll combine custom extension attributes with the model's predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.

The Span class is already imported and the nlp object has been created for you.

In [45]:
import spacy

nlp = spacy.load('en_core_web_sm')

def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ('PERSON', 'ORG', 'GPE', 'LOC'):
        entity_text = span.text.replace(' ', '_')
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url, force = True)

doc = nlp("In over fifty years from his very first recordings right through to his last album, David Bowie was at the vanguard of contemporary culture.")
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

fifty years None
first None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie


##### Example: components with extensions
Extension attributes are especially powerful if they're combined with custom pipeline components. In this exercise, you'll write a pipeline component that finds country names and a custom extension attribute that returns a country's capital, if available.

The nlp object has already been created and the Span class is already imported. A phrase matcher with all countries is available as the variable matcher. A dictionary of countries mapped to their capital cities is available as the variable capitals.

In [62]:
import spacy

nlp = spacy.load('en_core_web_sm')

capitals = {'Afghanistan': 'Kabul',
 'Albania': 'Tirana',
 'Algeria': 'Algiers',
 'American Samoa': 'Pago Pago',
 'Andorra': 'Andorra la Vella',
 'Angola': 'Luanda',
 'Anguilla': 'The Valley',
 'Antarctica': '',
 'Antigua and Barbuda': "Saint John's",
 'Argentina': 'Buenos Aires',
 'Armenia': 'Yerevan',
 'Aruba': 'Oranjestad',
 'Australia': 'Canberra',
 'Austria': 'Vienna',
 'Azerbaijan': 'Baku',
 'Bahamas': 'Nassau',
 'Bahrain': 'Manama',
 'Bangladesh': 'Dhaka',
 'Barbados': 'Bridgetown',
 'Belarus': 'Minsk',
 'Belgium': 'Brussels',
 'Belize': 'Belmopan',
 'Benin': 'Porto-Novo',
 'Bermuda': 'Hamilton',
 'Bhutan': 'Thimphu',
 'Bolivia (Plurinational State of)': 'Sucre',
 'Bonaire, Sint Eustatius and Saba': 'Kralendijk',
 'Bosnia and Herzegovina': 'Sarajevo',
 'Botswana': 'Gaborone',
 'Bouvet Island': '',
 'Brazil': 'Brasília',
 'British Indian Ocean Territory': 'Diego Garcia',
 'Brunei Darussalam': 'Bandar Seri Begawan',
 'Bulgaria': 'Sofia',
 'Burkina Faso': 'Ouagadougou',
 'Burundi': 'Bujumbura',
 'Cabo Verde': 'Praia',
 'Cambodia': 'Phnom Penh',
 'Cameroon': 'Yaoundé',
 'Canada': 'Ottawa',
 'Cayman Islands': 'George Town',
 'Central African Republic': 'Bangui',
 'Chad': "N'Djamena",
 'Chile': 'Santiago',
 'China': 'Beijing',
 'Christmas Island': 'Flying Fish Cove',
 'Cocos (Keeling) Islands': 'West Island',
 'Colombia': 'Bogotá',
 'Comoros': 'Moroni',
 'Congo': 'Brazzaville',
 'Congo (Democratic Republic of the)': 'Kinshasa',
 'Cook Islands': 'Avarua',
 'Costa Rica': 'San José',
 'Croatia': 'Zagreb',
 'Cuba': 'Havana',
 'Curaçao': 'Willemstad',
 'Cyprus': 'Nicosia',
 'Czech Republic': 'Prague',
 "Côte d'Ivoire": 'Yamoussoukro',
 'Denmark': 'Copenhagen',
 'Djibouti': 'Djibouti',
 'Dominica': 'Roseau',
 'Dominican Republic': 'Santo Domingo',
 'Ecuador': 'Quito',
 'Egypt': 'Cairo',
 'El Salvador': 'San Salvador',
 'Equatorial Guinea': 'Malabo',
 'Eritrea': 'Asmara',
 'Estonia': 'Tallinn',
 'Ethiopia': 'Addis Ababa',
 'Falkland Islands (Malvinas)': 'Stanley',
 'Faroe Islands': 'Tórshavn',
 'Fiji': 'Suva',
 'Finland': 'Helsinki',
 'France': 'Paris',
 'French Guiana': 'Cayenne',
 'French Polynesia': 'Papeetē',
 'French Southern Territories': 'Port-aux-Français',
 'Gabon': 'Libreville',
 'Gambia': 'Banjul',
 'Georgia': 'Tbilisi',
 'Germany': 'Berlin',
 'Ghana': 'Accra',
 'Gibraltar': 'Gibraltar',
 'Greece': 'Athens',
 'Greenland': 'Nuuk',
 'Grenada': "St. George's",
 'Guadeloupe': 'Basse-Terre',
 'Guam': 'Hagåtña',
 'Guatemala': 'Guatemala City',
 'Guernsey': 'St. Peter Port',
 'Guinea': 'Conakry',
 'Guinea-Bissau': 'Bissau',
 'Guyana': 'Georgetown',
 'Haiti': 'Port-au-Prince',
 'Heard Island and McDonald Islands': '',
 'Holy See': 'Rome',
 'Honduras': 'Tegucigalpa',
 'Hong Kong': 'City of Victoria',
 'Hungary': 'Budapest',
 'Iceland': 'Reykjavík',
 'India': 'New Delhi',
 'Indonesia': 'Jakarta',
 'Iran (Islamic Republic of)': 'Tehran',
 'Iraq': 'Baghdad',
 'Ireland': 'Dublin',
 'Isle of Man': 'Douglas',
 'Israel': 'Jerusalem',
 'Italy': 'Rome',
 'Jamaica': 'Kingston',
 'Japan': 'Tokyo',
 'Jersey': 'Saint Helier',
 'Jordan': 'Amman',
 'Kazakhstan': 'Astana',
 'Kenya': 'Nairobi',
 'Kiribati': 'South Tarawa',
 "Korea (Democratic People's Republic of)": 'Pyongyang',
 'Korea (Republic of)': 'Seoul',
 'Kuwait': 'Kuwait City',
 'Kyrgyzstan': 'Bishkek',
 "Lao People's Democratic Republic": 'Vientiane',
 'Latvia': 'Riga',
 'Lebanon': 'Beirut',
 'Lesotho': 'Maseru',
 'Liberia': 'Monrovia',
 'Libya': 'Tripoli',
 'Liechtenstein': 'Vaduz',
 'Lithuania': 'Vilnius',
 'Luxembourg': 'Luxembourg',
 'Macao': '',
 'Macedonia (the former Yugoslav Republic of)': 'Skopje',
 'Madagascar': 'Antananarivo',
 'Malawi': 'Lilongwe',
 'Malaysia': 'Kuala Lumpur',
 'Maldives': 'Malé',
 'Mali': 'Bamako',
 'Malta': 'Valletta',
 'Marshall Islands': 'Majuro',
 'Martinique': 'Fort-de-France',
 'Mauritania': 'Nouakchott',
 'Mauritius': 'Port Louis',
 'Mayotte': 'Mamoudzou',
 'Mexico': 'Mexico City',
 'Micronesia (Federated States of)': 'Palikir',
 'Moldova (Republic of)': 'Chișinău',
 'Monaco': 'Monaco',
 'Mongolia': 'Ulan Bator',
 'Montenegro': 'Podgorica',
 'Montserrat': 'Plymouth',
 'Morocco': 'Rabat',
 'Mozambique': 'Maputo',
 'Myanmar': 'Naypyidaw',
 'Namibia': 'Windhoek',
 'Nauru': 'Yaren',
 'Nepal': 'Kathmandu',
 'Netherlands': 'Amsterdam',
 'New Caledonia': 'Nouméa',
 'New Zealand': 'Wellington',
 'Nicaragua': 'Managua',
 'Niger': 'Niamey',
 'Nigeria': 'Abuja',
 'Niue': 'Alofi',
 'Norfolk Island': 'Kingston',
 'Northern Mariana Islands': 'Saipan',
 'Norway': 'Oslo',
 'Oman': 'Muscat',
 'Pakistan': 'Islamabad',
 'Palau': 'Ngerulmud',
 'Palestine, State of': 'Ramallah',
 'Panama': 'Panama City',
 'Papua New Guinea': 'Port Moresby',
 'Paraguay': 'Asunción',
 'Peru': 'Lima',
 'Philippines': 'Manila',
 'Pitcairn': 'Adamstown',
 'Poland': 'Warsaw',
 'Portugal': 'Lisbon',
 'Puerto Rico': 'San Juan',
 'Qatar': 'Doha',
 'Republic of Kosovo': 'Pristina',
 'Romania': 'Bucharest',
 'Russian Federation': 'Moscow',
 'Rwanda': 'Kigali',
 'Réunion': 'Saint-Denis',
 'Saint Barthélemy': 'Gustavia',
 'Saint Helena, Ascension and Tristan da Cunha': 'Jamestown',
 'Saint Kitts and Nevis': 'Basseterre',
 'Saint Lucia': 'Castries',
 'Saint Martin (French part)': 'Marigot',
 'Saint Pierre and Miquelon': 'Saint-Pierre',
 'Saint Vincent and the Grenadines': 'Kingstown',
 'Samoa': 'Apia',
 'San Marino': 'City of San Marino',
 'Sao Tome and Principe': 'São Tomé',
 'Saudi Arabia': 'Riyadh',
 'Senegal': 'Dakar',
 'Serbia': 'Belgrade',
 'Seychelles': 'Victoria',
 'Sierra Leone': 'Freetown',
 'Singapore': 'Singapore',
 'Sint Maarten (Dutch part)': 'Philipsburg',
 'Slovakia': 'Bratislava',
 'Slovenia': 'Ljubljana',
 'Solomon Islands': 'Honiara',
 'Somalia': 'Mogadishu',
 'South Africa': 'Pretoria',
 'South Georgia and the South Sandwich Islands': 'King Edward Point',
 'South Sudan': 'Juba',
 'Spain': 'Madrid',
 'Sri Lanka': 'Colombo',
 'Sudan': 'Khartoum',
 'Suriname': 'Paramaribo',
 'Svalbard and Jan Mayen': 'Longyearbyen',
 'Swaziland': 'Lobamba',
 'Sweden': 'Stockholm',
 'Switzerland': 'Bern',
 'Syrian Arab Republic': 'Damascus',
 'Taiwan': 'Taipei',
 'Tajikistan': 'Dushanbe',
 'Tanzania, United Republic of': 'Dodoma',
 'Thailand': 'Bangkok',
 'Timor-Leste': 'Dili',
 'Togo': 'Lomé',
 'Tokelau': 'Fakaofo',
 'Tonga': "Nuku'alofa",
 'Trinidad and Tobago': 'Port of Spain',
 'Tunisia': 'Tunis',
 'Turkey': 'Ankara',
 'Turkmenistan': 'Ashgabat',
 'Turks and Caicos Islands': 'Cockburn Town',
 'Tuvalu': 'Funafuti',
 'Uganda': 'Kampala',
 'Ukraine': 'Kiev',
 'United Arab Emirates': 'Abu Dhabi',
 'United Kingdom of Great Britain and Northern Ireland': 'London',
 'United States Minor Outlying Islands': '',
 'United States of America': 'Washington, D.C.',
 'Uruguay': 'Montevideo',
 'Uzbekistan': 'Tashkent',
 'Vanuatu': 'Port Vila',
 'Venezuela (Bolivarian Republic of)': 'Caracas',
 'Viet Nam': 'Hanoi',
 'Virgin Islands (British)': 'Road Town',
 'Virgin Islands (U.S.)': 'Charlotte Amalie',
 'Wallis and Futuna': 'Mata-Utu',
 'Western Sahara': 'El Aaiún',
 'Yemen': "Sana'a",
 'Zambia': 'Lusaka',
 'Zimbabwe': 'Harare',
 'Åland Islands': 'Mariehamn'}

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(capitals.keys()))
matcher.add('COUNTRY', None, *patterns)

def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    doc.ents = [Span(doc, start, end, label='GPE')
                for match_id, start, end in matcher(doc)]
    return doc

# Add the component to the pipeline
nlp.add_pipe(countries_component)
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: capitals.get(span.text)

# Register the Span extension attribute 'capital' with the getter get_capital 
Span.set_extension('capital', getter=get_capital, force = True)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

['tagger', 'parser', 'ner', 'countries_component']
Ents: Czech Republic may help Slovakia protect its airspace (Czech Republic, Slovakia)
[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]


### Scaling and performance

#### Processing large volumes of text
- Use nlp.pipe method
- Processes texts as a stream, yields Doc objects
- Much faster than calling nlp on each text\
BAD:\
```docs = [nlp(text) for text in LOTS_OF_TEXTS]```\
GOOD:\
```docs = list(nlp.pipe(LOTS_OF_TEXTS))```
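
One reason streaming tends to be faster is that texts are buffered and handled in batches rather than one call at a time. The generator below sketches that idea in plain Python. The names `pipe` and `process_batch`, the whitespace splitting and the batch size of 2 are assumptions chosen for illustration and do not reflect spaCy's actual implementation.

```python
def process_batch(batch):
    # Hypothetical stand-in for running a model on several texts at once
    return [text.split() for text in batch]

def pipe(texts, batch_size=2):
    """Yield one result per text while processing the input stream in batches."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield from process_batch(batch)
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch
        yield from process_batch(batch)

LOTS_OF_TEXTS = ["This is a text", "And another text", "One more"]
docs = list(pipe(LOTS_OF_TEXTS))
print(docs)
```

Because `pipe` is a generator, nothing is processed until the results are consumed, which is why the notes wrap the call in `list(...)` when all docs are needed at once.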

#### Passing in context
- Setting ```as_tuples = True``` on ```nlp.pipe``` lets you pass in ```(text, context)``` tuples
- Yields ```(doc, context)``` tuples
- Useful for associating metadata with the ```doc```


In [64]:
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16})
]
for doc, context in nlp.pipe(data, as_tuples = True):
    print(doc.text, context['page_number'])

Ents: This is a text ()
This is a text 15
Ents: And another text ()
And another text 16


In [69]:
from spacy.tokens import Doc

Doc.set_extension('id', default = None, force = True)
Doc.set_extension('page_number', default = None, force = True)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16})
]
for doc, context in nlp.pipe(data, as_tuples = True):
    print(context)
    doc._.id = context['id']
    doc._.page_number = context['page_number']

Ents: This is a text ()
{'id': 1, 'page_number': 15}
Ents: And another text ()
{'id': 2, 'page_number': 16}


#### Using only the tokenizer
- don't run the whole pipeline!
- use ```nlp.make_doc``` to turn a text into a Doc object\
BAD:\
```doc = nlp("Hello world")```\
GOOD:\
```doc = nlp.make_doc("Hello world")```

#### Disable pipeline components
- use ```nlp.disable_pipes``` to temporarily disable one or more pipes

```
# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
   # Process the text and print the entities
   doc = nlp(text)
   print(doc.ents)
```
- restores them after the ```with``` block
- only runs the remaining components
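
The restore-after-the-block behaviour is the usual context manager pattern. The sketch below reproduces it on a plain list of `(name, component)` tuples. The function is a hypothetical illustration with placeholder components, not spaCy's implementation of `nlp.disable_pipes`.

```python
from contextlib import contextmanager

# Placeholder pipeline: strings stand in for the real component objects
pipeline = [("tagger", "T"), ("parser", "P"), ("ner", "N")]

@contextmanager
def disable_pipes(pipeline, *names):
    """Temporarily remove the named components, then restore the full pipeline."""
    original = list(pipeline)
    pipeline[:] = [(n, c) for n, c in pipeline if n not in names]
    try:
        yield pipeline
    finally:
        # Runs even if the body of the with block raises an exception
        pipeline[:] = original

with disable_pipes(pipeline, "tagger", "parser"):
    names_inside = [name for name, _ in pipeline]
names_after = [name for name, _ in pipeline]

print(names_inside)
print(names_after)
```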

#### Example: processing streams
In this exercise, you'll be using nlp.pipe for more efficient text processing. The nlp object has already been created for you. A list of tweets about a popular American fast food chain is available as the variable TEXTS.

In [76]:
import spacy

nlp = spacy.load('en_core_web_sm')

TEXTS = ['McDonalds is my favorite restaurant.',
 'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..',
 'People really still eat McDonalds :(',
 'The McDonalds in Spain has chicken wings. My heart is so happy ',
 '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P',
 'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D',
 'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it']
# Process the texts and print the adjectives
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == 'ADJ'])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
['open', 'BAD']
['terrible']


In [77]:
# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) () (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) (hurry,) ()


In [78]:
people = ['David Bowie', 'Angela Merkel', 'Lady Gaga']

# Create a list of patterns for the PhraseMatcher
patterns = [nlp(person) for person in people] # bad
patterns = list(nlp.pipe(people)) # good

#### Example: processing data with context
In this exercise, you'll be using custom attributes to add author and book meta information to quotes.

A list of (text, context) examples is available as the variable DATA. The texts are quotes from famous books, and the contexts are dictionaries with the keys 'author' and 'book'. The nlp object has already been created for you.

In [79]:
# Import the Doc class
from spacy.tokens import Doc

DATA = [('One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.',
  {'author': 'Franz Kafka', 'book': 'Metamorphosis'}),
 ("I know not all that may be coming, but be it what it will, I'll go to it laughing.",
  {'author': 'Herman Melville', 'book': 'Moby-Dick or, The Whale'}),
 ('It was the best of times, it was the worst of times.',
  {'author': 'Charles Dickens', 'book': 'A Tale of Two Cities'}),
 ('The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.',
  {'author': 'Jack Kerouac', 'book': 'On the Road'}),
 ('It was a bright cold day in April, and the clocks were striking thirteen.',
  {'author': 'George Orwell', 'book': '1984'}),
 ('Nowadays people know the price of everything and the value of nothing.',
  {'author': 'Oscar Wilde', 'book': 'The Picture Of Dorian Gray'})]

# Register the Doc extension 'author' (default None)
Doc.set_extension('author', default = None, force = True)

# Register the Doc extension 'book' (default None)
Doc.set_extension('book', default = None, force = True)

for doc, context in nlp.pipe(DATA, as_tuples = True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']
    
    # Print the text and custom attribute data
    print(doc.text, '\n', "— '{}' by {}".format(doc._.book, doc._.author), '\n')


One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

Nowadays people know the price of everything and the value of nothing. 
 — 'The Picture Of Dorian Gray' by Oscar Wilde 



#### Example: selective processing
In this exercise, you'll use the nlp.make_doc and nlp.disable_pipes methods to only run selected components when processing a text. The small English model is already loaded in as the nlp object.

In [80]:
text = "Chick-fil-A is an American fast food restaurant chain headquartered in the city of College Park, Georgia, specializing in chicken sandwiches."

# Only tokenize the text
doc = nlp.make_doc(text)

print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


In [81]:
text = "Chick-fil-A is an American fast food restaurant chain headquartered in the city of College Park, Georgia, specializing in chicken sandwiches."

# Disable the tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

(American, College Park, Georgia)



## 4. Training a neural network model

### Training and update models
#### Why update the model?
- Better results on your specific domain
- Learn classification schemes specifically for your problem
- Essential for text classification
- Very useful for named entity recognition
- Less critical for part-of-speech tagging and dependency parsing

#### How training works
1. **Initialize** the model weights randomly with ```nlp.begin_training```
2. **Predict** a few examples with the current weights by calling ```nlp.update```
3. **Compare** predictions with true labels
4. **Calculate** how to change the weights to improve the predictions
5. **Update** the weights slightly
6. Go back to 2

![id](images/model_training.png "How training works?")

- **Training data**: Examples and their annotations.
- **Text**: The input text the model should predict a label for.
- **Label**: The label the model should predict.
- **Gradient**: How to change the weights.
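
The six steps can be followed on a toy problem that is far smaller than any spaCy model: a single weight that should learn the rule `label = 3 * input`. Everything in this sketch (the data, the learning rate, the squared-error loss) is an assumption chosen for illustration; spaCy's statistical models have many weights and use different losses, but the loop has the same shape.

```python
import random

random.seed(0)

# Training data: (input, label) pairs that follow the rule label = 3 * input
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

# 1. Initialize the weight randomly
weight = random.uniform(-1.0, 1.0)
learning_rate = 0.01

for epoch in range(200):
    random.shuffle(examples)
    for x, label in examples:
        prediction = weight * x             # 2. Predict with the current weight
        error = prediction - label          # 3. Compare with the true label
        gradient = 2 * error * x            # 4. Gradient of the squared error
        weight -= learning_rate * gradient  # 5. Update the weight slightly
    # 6. Go back to step 2 for the next pass over the data

print(round(weight, 3))
```

After enough passes the weight ends up very close to 3, the value that makes every prediction match its label.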

#### Example: Training the entity recognizer
- The entity recognizer tags words and phrases in context
- Each token can only be part of one entity
- Examples need to come with context
```("iPhone X is coming", {'entities': [(0, 8, 'GADGET')]})```
- Texts with no entities are also important
```("I need a new phone! Any tips?", {'entities': []})```
- **Goal**: teach the model to generalize
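
Because each token can belong to only one entity, two annotations on the same text should never share characters. A minimal sketch of a check for this, assuming the `(start, end, label)` tuple format shown above; `find_overlaps` is a hypothetical helper, not part of spaCy:

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) annotations that share characters."""
    ordered = sorted(entities)
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            # `ordered` is sorted by start, so `second` never starts before `first`
            if second[0] < first[1]:
                overlaps.append((first, second))
    return overlaps


good = [(0, 8, 'GADGET')]
bad = [(0, 8, 'GADGET'), (0, 6, 'GADGET')]  # 'iPhone X' and 'iPhone' overlap

print(find_overlaps(good))
print(find_overlaps(bad))
```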

#### The training data
- Examples of what we want the model to predict in context
- Update an **existing model**: a few hundred to a few thousand examples
- Train a **new category**: a few thousand to a million examples
  - spaCy's English models: trained on 2 million words
- Usually created manually by human annotators
- Can be semi-automated - for example, using spaCy's ```Matcher```!



#### Creating training data

spaCy's rule-based Matcher is a great way to quickly create training data for named entity models. A list of sentences is available as the variable TEXTS. You can print it in the IPython shell to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as 'GADGET'.

The nlp object has already been created for you and the Matcher is available as the variable matcher.

In [16]:
import spacy
from spacy.tokens import Span
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

TEXTS = ['How to preorder the iPhone X',
 'iPhone X is coming',
 'Should I pay $1,000 for the iPhone X?',
 'The iPhone 8 reviews are here',
 'Your iPhone goes up to 11 today',
 'I need a new phone! Any tips?']

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

# Token whose lowercase form matches 'iphone', followed by a digit token
# ('OP': '1' requires the digit, so a bare "iPhone" is not matched; see the output below)
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '1'}]

# Add patterns to the matcher
matcher.add('GADGET', None, pattern1, pattern2)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, 'GADGET') for span in spans]
    
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {'entities': entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)
    
print(*TRAINING_DATA, sep='\n')

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': []})
('I need a new phone! Any tips?', {'entities': []})


### The training loop
#### The steps of a training loop
- **Loop** for a number of times
- **Shuffle** the training data
- **Divide** the data into batches
- **Update** the model for each batch

#### Example loop
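
The four steps can be sketched in plain Python. This is a skeleton under assumptions: `minibatch` is a hypothetical stand-in written here for illustration (spaCy ships its own `spacy.util.minibatch`), and `update_model` is a placeholder for the real update call.

```python
import random


def minibatch(items, size):
    """Yield consecutive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def update_model(texts, annotations):
    """Placeholder for the real update step; returns the batch size."""
    return len(texts)


examples = [
    ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]}),
    ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}),
    ('I need a new phone! Any tips?', {'entities': []}),
]

seen = 0
for iteration in range(10):                    # Loop for a number of times
    random.shuffle(examples)                   # Shuffle the training data
    for batch in minibatch(examples, size=2):  # Divide the data into batches
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        seen += update_model(texts, annotations)  # Update the model for each batch

print(seen)
```

Shuffling before every pass changes which examples end up in the same batch, so the model does not see them in a fixed order.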

#### Update an existing model
- Improve prediction on new data
- Especially useful to improve existing categories, like `PERSON`
- Also possible to add new categories
- Be careful and make sure the model doesn't "forget" the old ones

#### Setting up a new pipeline from scratch
Start from a blank English model, which contains only the language data and tokenization rules.

In [21]:
import spacy
import random

examples = [
    ("How to preorder the iPhone X", {'entities': [(20, 28, 'GADGET')]})
    # And many more examples ...
]

# Start with blank English model
nlp = spacy.blank('en')

# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

# Add a new label
ner.add_label('GADGET')

# Start the training
nlp.begin_training()

# Train for 10 iterations
for itn in range(10):
    random.shuffle(examples)
    
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotations in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
# Save model to disk
# nlp.to_disk(path_to_model)

#### Example: Setting up the pipeline
In this exercise, you'll prepare a spaCy pipeline to train the entity recognizer to recognize 'GADGET' entities in a text – for example, "iPhone X".

spacy has already been imported for you.

In [22]:
import spacy

# Create a blank 'en' model
nlp = spacy.blank('en')

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label('GADGET')

#### Example: Building a training loop
Let's write a simple training loop from scratch!

The pipeline you've created in the previous exercise is available as the nlp object. It already contains the entity recognizer with the added label 'GADGET'.

The small set of labelled examples that you've created previously is available as the global variable TRAINING_DATA. To see the examples, you can print them in your script or in the IPython shell. spacy and random have already been imported for you.

In [26]:
import spacy
import random

TRAINING_DATA = [('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]}),
 ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}),
 ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]}),
 ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}),
 ('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]}),
 ('I need a new phone! Any tips?', {'entities': []})]

# Create a blank 'en' model
nlp = spacy.blank('en')

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label('GADGET')

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)
    

{'ner': 10.833332419395447}
{'ner': 23.59615170955658}
{'ner': 33.35446310043335}
{'ner': 6.554355084896088}
{'ner': 14.063887894153595}
{'ner': 19.46621137857437}
{'ner': 2.6135728657245636}
{'ner': 5.711166794411838}
{'ner': 9.176530804252252}
{'ner': 1.5220626677037217}
{'ner': 3.885590207937639}
{'ner': 5.911869316420052}
{'ner': 4.600181766785681}
{'ner': 8.52852916996926}
{'ner': 9.536159337498248}
{'ner': 3.132711953483522}
{'ner': 4.775071158539504}
{'ner': 5.988043637713417}
{'ner': 1.3914463706314564}
{'ner': 3.8246281114643352}
{'ner': 4.771497246832951}
{'ner': 0.6008064360357821}
{'ner': 3.37357012436496}
{'ner': 3.389595909142031}
{'ner': 1.343033199444335}
{'ner': 1.3456716429002427}
{'ner': 1.3481531683949015}
{'ner': 0.00028501169049377495}
{'ner': 0.6737396408173026}
{'ner': 0.6739416004238947}
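
Since `losses` is reset at the start of each iteration and passed to every `nlp.update` call, the printed values appear to be running totals within an iteration: each group of three lines increases. A small sketch, assuming three batches per iteration (6 examples with `size=2`) and using the values above rounded to two decimals, that keeps only the total of each iteration:

```python
# 'ner' losses printed by the run above, rounded to two decimals
printed = [10.83, 23.60, 33.35, 6.55, 14.06, 19.47, 2.61, 5.71, 9.18,
           1.52, 3.89, 5.91, 4.60, 8.53, 9.54, 3.13, 4.78, 5.99,
           1.39, 3.82, 4.77, 0.60, 3.37, 3.39, 1.34, 1.35, 1.35,
           0.00, 0.67, 0.67]

batches_per_iteration = 3  # assumption: 6 examples split into batches of 2

# The last value of each group is the loss summed over the whole iteration
totals = printed[batches_per_iteration - 1::batches_per_iteration]
print(totals)
```

Read this way, the loss per iteration falls from about 33 to below 1, with a temporary rise in the fifth iteration.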


#### Example: Exploring the model
Let's see how the model performs on unseen data! To speed things up a little, here's a trained model for the label 'GADGET', using the examples from the previous exercise, plus a few hundred more. The loaded model is already available as the nlp object. A list of test texts is available as TEST_DATA.

In [28]:
# Run the previous example before starting this one
TEST_DATA = ['Apple is slowing down the iPhone 8 and iPhone X - how to stop it',
 "I finally understand what the iPhone X 'notch' is for",
 'Everything you need to know about the Samsung Galaxy S9',
 'Looking to compare iPad models? Here’s how the 2018 lineup stacks up',
 'The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple',
 'what is the cheapest ipad, especially ipad pro???',
 'Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics']

for doc in nlp.pipe(TEST_DATA):
    # Print the document text and entities
    print(doc.text)
    print(doc.ents, '\n\n')

Apple is slowing down the iPhone 8 and iPhone X - how to stop it
(iPhone 8, iPhone X) 


I finally understand what the iPhone X 'notch' is for
(iPhone X,) 


Everything you need to know about the Samsung Galaxy S9
() 


Looking to compare iPad models? Here’s how the 2018 lineup stacks up
() 


The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple
(iPhone 8, iPhone 8) 


what is the cheapest ipad, especially ipad pro???
() 


Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics
() 




### Training best practices
#### Problem 1: Models can "forget" things
- Existing model can overfit on new data
  - e.g.: if you only update it with `WEBSITE`, it can "unlearn" what a `PERSON` is
- Also known as the "catastrophic forgetting" problem

#### Solution 1: Mix in previously correct predictions
- For example, if you're training `WEBSITE`, also include examples of `PERSON`
- Run existing spaCy model over data and extract all other relevant entities

**BAD:**
```
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
]
```

**GOOD:**
```
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})
]
```
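
Mixing in earlier predictions can be sketched as a merge of two annotation lists. This is an assumption-laden illustration: `mix_in_predictions` is a hypothetical helper, the sentence is made up, and `predicted` stands in for what an existing model might return (for example, tuples built from `ent.start_char`, `ent.end_char` and `ent.label_` for each entity in `doc.ents`).

```python
def mix_in_predictions(text, new_entities, predicted_entities):
    """Combine new annotations with existing predictions that do not overlap them."""
    combined = list(new_entities)
    for start, end, label in predicted_entities:
        # Keep a prediction only if it shares no characters with a kept span
        overlaps = any(start < e and s < end for s, e, _ in combined)
        if not overlaps:
            combined.append((start, end, label))
    return (text, {'entities': sorted(combined)})


text = 'Obama visited Reddit'
new_entities = [(14, 20, 'WEBSITE')]             # the label being trained
predicted = [(0, 5, 'PERSON'), (14, 20, 'ORG')]  # hypothetical model output

example = mix_in_predictions(text, new_entities, predicted)
print(example)
```

The new `WEBSITE` annotation takes priority over the conflicting prediction, while the `PERSON` prediction is kept so the model keeps seeing that label during the update.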

#### Problem 2: Models can't learn everything
- spaCy's models make predictions based on **local context**
- Model can struggle to learn if decision is difficult to make based on context
- Label scheme needs to be consistent and not too specific
  - For example: `CLOTHING` is better than `ADULT_CLOTHING` and `CHILDRENS_CLOTHING`

#### Solution 2: Plan your label scheme carefully
- Pick categories that are reflected in local context
- More generic is better than too specific
- Use rules to go from generic labels to specific categories

**BAD:**
```
LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']
```

**GOOD:**
```
LABELS = ['CLOTHING', 'BAND']
```
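
Going from a generic label to a specific category with rules might look like the sketch below. Everything here is hypothetical: the keyword set, the `refine_label` helper and the choice to look at surrounding words are illustrative assumptions, and a real project might consult a product catalogue instead.

```python
# Hypothetical keywords that signal children's clothing
CHILDREN_KEYWORDS = {'kids', 'children', 'toddler'}


def refine_label(label, context_words):
    """Map the generic CLOTHING label to a more specific one using simple rules."""
    if label != 'CLOTHING':
        return label  # other labels are left unchanged
    words = {word.lower() for word in context_words}
    if words & CHILDREN_KEYWORDS:
        return 'CHILDRENS_CLOTHING'
    return 'ADULT_CLOTHING'


print(refine_label('CLOTHING', ['Kids', 'rain', 'jacket']))
print(refine_label('CLOTHING', ['rain', 'jacket']))
print(refine_label('BAND', ['Metallica']))
```

The model only has to learn the generic `CLOTHING` label; the fine-grained distinction is made afterwards by rules that are easy to inspect and change.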

#### Example: Good data vs. bad data
Here's an excerpt from a training set that labels the entity type TOURIST_DESTINATION in traveler reviews.

**Solution**
Replace the label TOURIST_DESTINATION with GPE, since whether a place is a tourist destination is hard to decide from local context.

Once the model achieves good results on detecting GPE entities in the traveler reviews, you could add a rule-based component to determine whether the entity is a tourist destination in this context. For example, you could resolve the entity types back to a knowledge base or look them up in a travel wiki.

#### Training multiple labels
Here's a small sample of a dataset created to train a new entity type WEBSITE. The original dataset contains a few thousand sentences. In this exercise, you'll be doing the labeling by hand. In real life, you probably want to automate this and use an annotation tool – for example, Brat, a popular open-source solution, or Prodigy, our own annotation tool that integrates with spaCy.

Fixed by adding `PERSON` labels to the training data:
```
TRAINING_DATA = [
    ("Reddit partners with Patreon to help creators build communities", 
     {'entities': [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),
  
    ("PewDiePie smashes YouTube record", 
     {'entities': [(0, 9, 'PERSON'), (18, 25, 'WEBSITE')]}),
  
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans", 
     {'entities': [(0, 6, 'WEBSITE'), (15, 29, 'PERSON')]}),
    # And so on...
]
```
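
Hand-written character offsets are easy to get wrong, so it helps to slice each text with its offsets and read the result. A minimal sketch using two of the examples above; `labelled_spans` is a hypothetical helper:

```python
TRAINING_DATA = [
    ("Reddit partners with Patreon to help creators build communities",
     {'entities': [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),
    ("PewDiePie smashes YouTube record",
     {'entities': [(0, 9, 'PERSON'), (18, 25, 'WEBSITE')]}),
]


def labelled_spans(example):
    """Return (substring, label) pairs for one (text, annotations) example."""
    text, annotations = example
    return [(text[start:end], label)
            for start, end, label in annotations['entities']]


for example in TRAINING_DATA:
    print(labelled_spans(example))
```

An offset that is off by one shows up immediately as a truncated or padded substring.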

### Wrapping up
#### Your new spaCy skills
- Extracting linguistic features: part-of-speech tags, dependencies, named entities
- Work with pre-trained **statistical models**
- Find words and phrases using `Matcher` and `PhraseMatcher` **match rules**
- Best practices for working with **data structures** `Doc`, `Token`, `Span`, `Vocab`, `Lexeme`
- Find **semantic similarity** using **word vectors**
- Write custom **pipeline components** with **extension attributes**
- **Scale up** your spaCy pipelines and make them fast
- Create **training data** for spaCy's statistical models

#### More things to do with spaCy
- **Training and updating** other pipeline components
  - Part-of-speech tagger
  - Dependency parser
  - Text classifier
- **Customize the tokenizer**
  - Adding rules and exceptions to split text differently
- **Add or improve support for other languages**
  - 45+ languages currently
  - Lots of room for improvement and more languages
  - Allow training models for other languages

For more examples, see [spacy.io](https://spacy.io).