# WHAT I LEARNED

1. Extract linguistic features: part-of-speech tags, dependencies, named entities

2. Work with pre-trained statistical models

3. Find words and phrases using Matcher and PhraseMatcher match rules

4. Best practices for working with data structures Doc, Token, Span, Vocab, Lexeme

5. Find semantic similarities using word vectors

6. Write custom pipeline components with extension attributes

7. Scale up your spaCy pipelines and make them faster

8. Create training data for spaCy's statistical models

9. Train and update spaCy neural network models with new data

# Course Description
If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other? In this course, you'll learn how to use spaCy, a fast-growing industry standard library for NLP in Python, to build advanced natural language understanding systems, using both rule-based and machine learning approaches.

CHEAT SHEET: https://images.datacamp.com/image/upload/v1653829192/Marketing/Blog/spaCy_Cheat_Sheet_final.pdf

# CHAPTER 1 Finding words, phrases, names and concepts


This chapter will introduce you to the basics of text processing with spaCy. You'll learn about the data structures, how to work with statistical models, and how to use them to predict linguistic features in your text.

### Getting Started
Let's get started and try out spaCy! In this exercise, you'll be able to try out some of the 45+ available languages.

This course introduces a lot of new concepts, so if you ever need a quick refresher, download the spaCy Cheat Sheet and keep it handy!

In [1]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(doc.text)

This is a sentence.


In [2]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [3]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?


### Documents, spans and tokens
When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you'll learn more about the Doc, as well as its views Token and Span.

In [4]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


In [5]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


### Lexical attributes
In this example, you'll use spaCy's Doc and Token objects, and lexical attributes to find percentages in a text. You'll be looking for two consecutive tokens: a number and a percent sign. The English nlp object has already been created.

In [6]:
# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == '%':
            print('Percentage found:', token.text)

Percentage found: 60
Percentage found: 4


## Statistical models


### Loading models
Let's start by loading a model. spacy is already imported.

In [7]:
import spacy
# Load the 'en_core_web_sm' model – spaCy is already imported
nlp = spacy.load('en_core_web_sm')

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


In [72]:
# Load the 'de_core_news_sm' model – spaCy is already imported
nlp = spacy.load('de_core_news_sm')

text = "Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

### Predicting linguistic annotations
You'll now get to try one of spaCy's pre-trained model packages and see its predictions in action. Feel free to try it out on your own text! The small English model is already available as the variable nlp.

To find out what a tag or label means, you can call spacy.explain in the IPython shell. For example: spacy.explain('PROPN') or spacy.explain('GPE').

In [8]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print('{:<12}{:<10}{:<10}'.format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          VERB      ccomp     
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [9]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


### Predicting named entities in context
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you're processing. Let's take a look at an example. The small English model is available as the variable nlp.

In [10]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # print the entity text and label
    print(ent.text, ent.label_)

# Looks like the model didn't label "iPhone X"

Apple ORG


In [11]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print('Missing entity:', iphone_x.text)

# You don't always have to do this manually: spaCy's rule-based Matcher can help you find certain words and phrases in text.

Apple ORG
Missing entity: iPhone X


## Rule-based matching


### Using the Matcher
Let's try spaCy's rule-based Matcher. You'll use the example from the previous exercise and write a pattern that can match the phrase "iPhone X" in the text. The nlp object and a processed doc are already available.

In [12]:
# Import the Matcher
from spacy.matcher import Matcher

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

In [13]:
# Import the Matcher and initialize it with the shared vocabulary
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# Add the pattern to the matcher
matcher.add('IPHONE_X_PATTERN', [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print('Matches:', [doc[start:end].text for match_id, start, end in matches])

# The match corresponds to the tokens at doc[1:3], the span for "iPhone X".

Matches: ['iPhone X']


### Writing match patterns
In this exercise, you'll practice writing more complex match patterns using different token attributes and operators. A matcher is already initialized and available as the variable matcher.



In [14]:
doc = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [15]:
doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [16]:
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
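
The last two matches come from the optional operator: `'OP': '?'` lets the second noun match zero or one time, so both "optional voice" and "optional voice responses" are reported. As a rough mental model, an optional item can be expanded into one variant that includes it and one that leaves it out. The sketch below is plain Python and is not how spaCy's Matcher is implemented; the hand-written POS tags and the helper names `expand_optional` and `find_matches` are assumptions made for illustration.

```python
from itertools import product


def expand_optional(pattern):
    """Expand items marked 'OP': '?' into variants with and without them."""
    choices = []
    for item in pattern:
        if item.get("OP") == "?":
            choices.append(([item], []))  # present or absent
        else:
            choices.append(([item],))
    return [[item for part in combo for item in part] for combo in product(*choices)]


def find_matches(tokens, pattern):
    """Return sorted (start, end) offsets where any variant matches the tokens."""
    matches = set()
    for variant in expand_optional(pattern):
        size = len(variant)
        if size == 0:
            continue
        for start in range(len(tokens) - size + 1):
            window = tokens[start:start + size]
            if all(
                tok.get(key) == value
                for tok, item in zip(window, variant)
                for key, value in item.items()
                if key != "OP"
            ):
                matches.add((start, start + size))
    return sorted(matches)


# Hand-tagged tokens (assumed tags, no model involved)
tokens = [
    {"TEXT": "automatic", "POS": "ADJ"},
    {"TEXT": "labels", "POS": "NOUN"},
    {"TEXT": "and", "POS": "CCONJ"},
    {"TEXT": "optional", "POS": "ADJ"},
    {"TEXT": "voice", "POS": "NOUN"},
    {"TEXT": "responses", "POS": "NOUN"},
]
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

found = [" ".join(t["TEXT"] for t in tokens[s:e]) for s, e in find_matches(tokens, pattern)]
print(found)  # ['automatic labels', 'optional voice', 'optional voice responses']
```

If only the longest match per position is wanted, the overlapping shorter spans have to be filtered out afterwards.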


# CHAPTER 2 Large-scale data analysis with spaCy

In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

## Data Structures: Vocab, Lexemes and StringStore

### Strings to hashes
The nlp object has already been created for you.

In [17]:
# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings['cat']
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


In [18]:
# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON


### Vocab, hashes and lexemes
Why does this code throw an error? The English language class is already available as the nlp object.

from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings['Bowie']
print(bowie_id)

# Look up the ID for 'Bowie' in the German vocab
print(nlp_de.vocab.strings[bowie_id])

The string 'Bowie' isn't in the German vocab, so the hash can't be resolved back to a string in its string store.
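
To make the failure concrete, here is a toy two-way lookup table in plain Python. It is only a sketch of the idea that hashes are computed from strings but can be turned back into strings only by a store that has seen them; `ToyStringStore` and its SHA-256 based hash are assumptions for illustration and do not reflect spaCy's StringStore internals.

```python
import hashlib


class ToyStringStore:
    """Hypothetical two-way lookup table (not spaCy's implementation)."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_of(text):
        # Any deterministic hash works for the illustration
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        key = self.hash_of(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash is a pure computation
            return self.hash_of(key)
        # Hash -> string needs the string to have been stored here
        return self._by_hash[key]


english = ToyStringStore()
german = ToyStringStore()

bowie_id = english.add("Bowie")
print(english[bowie_id])  # Bowie

try:
    german[bowie_id]
    resolved_in_german = True
except KeyError:
    resolved_in_german = False

print(resolved_in_german)  # False
```

Both stores compute the same hash for "Bowie", but only the one that stored the string can map the hash back.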

## Data Structures: Doc, Span and Token

Convert results to strings as late as possible

### Creating a Doc
Let's create some Doc objects from scratch! The nlp object has already been created for you.

By the way, if you haven't downloaded it already, check out the spaCy Cheat Sheet. It includes an overview of the most important concepts and methods and might come in handy if you ever need a quick refresher!

In [19]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
sentence= ['spaCy is cool!']
words = ['spaCy', 'is', 'cool', '!']
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


In [20]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ['Go', ',', 'get', 'started', '!']
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


In [21]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ['Oh', ',', 'really', '?', '!']
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!
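
The `spaces` lists above are easy to get wrong, so it can help to check them without spaCy first. Each boolean says whether the word at the same position is followed by a space. The helper below is a minimal plain-Python sketch of that rule; `join_words` is a hypothetical name, not a spaCy function.

```python
def join_words(words, spaces):
    """Rebuild text from tokens; spaces[i] says whether words[i] is followed by a space."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


print(join_words(["spaCy", "is", "cool", "!"], [True, True, False, False]))
# spaCy is cool!
print(join_words(["Oh", ",", "really", "?", "!"], [False, True, False, False, False]))
# Oh, really?!
```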


### Docs, spans and entities from scratch
In this exercise, you'll create the Doc and Span objects manually, and update the named entities – just like spaCy does behind the scenes. A shared nlp object has already been created.

In [22]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ['I', 'like', 'David', 'Bowie']
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

I like David Bowie


In [23]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label='PERSON')

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

[('David Bowie', 'PERSON')]


### Data structures best practices
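The code in this example is trying to analyze a text and collect all proper nouns. If the token following the proper noun is a verb, it should also be extracted. A doc object has already been created. The first version below converts the tags to a list of strings up front; the cells after it use native token attributes instead.

# Get all tokens and part-of-speech tags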

pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == 'PROPN':
        # Check if the next token is a verb
        if pos_tags[index + 1] == 'VERB':
            print('Found a verb after a proper noun!')

In [24]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['Berlin', 'is', 'a', 'nice', 'city'], spaces=[True, True, True, True, False])

print(doc.text)

Berlin is a nice city


In [25]:
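# Iterate over the tokens (a Doc built from words and spaces has no
# predicted tags, so process the text with a loaded model to see output)
for token in doc:
    # Check if the current token is a proper noun and not the last token
    if token.pos_ == 'PROPN' and token.i + 1 < len(doc):
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == 'VERB':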
            print('Found a verb after a proper noun!')

## Word vectors and similarity

Similarity is determined using word vectors
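
By default, spaCy's similarity methods use cosine similarity, and a Doc or Span vector defaults to the average of its token vectors. The sketch below shows both calculations in plain Python. The three-dimensional vectors are made up for illustration (real model vectors have hundreds of dimensions), and `cosine_similarity` and `average_vector` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no direction to compare
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Element-wise mean, the usual default for a multi-token vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Made-up toy vectors, not taken from any model
warm = [0.9, 0.1, 0.0]
sunny = [0.8, 0.2, 0.1]
books = [0.0, 0.1, 0.9]

print(cosine_similarity(warm, sunny) > cosine_similarity(warm, books))  # True
print(average_vector([[1.0, 0.0], [0.0, 1.0]]))  # [0.5, 0.5]
```

Because the score depends only on the angle between vectors, averaging over many tokens can make long texts look similar even when their meaning differs.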

### Inspecting word vectors
In this exercise, you'll use a larger English model, which includes around 20,000 word vectors. Because vectors take a little longer to load, we're using a slightly more compressed version of the model than the one you can download with spaCy. The model is already pre-installed, and spacy has already been imported for you.

https://spacy.io/models/en

In [26]:
# Download the 'en_core_web_md' model
spacy.cli.download("en_core_web_md")
nlp = spacy.load('en_core_web_md')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [27]:
# Load the en_core_web_md model
nlp = spacy.load('en_core_web_md')

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

# Next, you'll use spaCy to predict similarities between documents, spans and
# tokens via the word vectors under the hood.

[-2.1689e-01 -2.5989e+00 -1.3144e+00  2.2500e+00 -4.6767e-01 -2.0695e+00
 -6.3379e-01 -4.0222e-01 -3.4022e+00 -3.6932e-01 -7.9938e-01 -1.0412e+00
  9.3756e-01  1.6070e+00  8.8330e-01 -2.8483e+00  1.3349e-01 -3.1656e+00
  8.1896e-01 -4.8113e+00  1.5655e+00  1.6665e+00 -4.7081e-01 -1.9475e+00
 -1.1779e+00 -1.3810e+00 -2.0071e+00 -2.1639e-01  9.0609e-01  1.5279e+00
  1.2587e-04 -2.9000e+00  7.6069e-01 -2.2825e+00  1.2495e-02 -1.5653e+00
  2.0052e+00 -1.7747e+00  5.9220e-01 -1.1428e+00 -1.3441e+00  3.4784e-01
  1.7492e+00  1.9086e+00  1.0600e+00  1.2965e+00  4.1431e-01  7.9416e-01
 -1.1277e+00 -1.1403e+00  7.5891e-01 -9.4419e-01  1.4413e+00 -2.2554e+00
  1.6226e-01  3.8901e-01  1.2299e-01  1.1577e+00  1.5524e+00  1.3853e+00
  1.1112e+00  7.5767e-01  3.9431e+00 -2.8506e-01 -2.1645e+00 -1.0862e+00
 -1.4973e+00 -1.2781e+00  2.4643e+00 -1.5886e+00  2.5679e-01  6.4918e-01
  1.6809e-01  5.7693e-01  3.1121e-01 -4.5278e-01 -2.7555e+00 -2.1846e+00
  4.4865e+00  2.7107e-01 -5.3831e-01  8.3013e-01  6

### Comparing similarities
In this exercise, you'll be using spaCy's similarity methods to compare Doc, Token and Span objects and get similarity scores. The medium English model is already available as the nlp object.

In [28]:
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8220092482601077


In [29]:
doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books"
similarity = token1.similarity(token2)
print(similarity)

0.10219937562942505


In [30]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[-4:-1]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

# Once you're getting serious about developing NLP applications that leverage semantic similarity,
# you might want to train vectors on your own data, or tweak the similarity algorithm.

0.6348510384559631


## Combining models and rules

### Debugging patterns (2)
Both patterns in this exercise contain mistakes and won't match as expected. Can you fix them?

The nlp object and a doc have already been created for you. If you get stuck, try printing the tokens in the doc to see how the text will be split and adjust the pattern so that each dictionary represents one token.

In [31]:
# Create the match patterns
pattern1 = [{'LOWER': 'amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
pattern2 = [{'LOWER': 'ad'}, {'TEXT': '-'}, {'LOWER': 'free'}, {'POS': 'NOUN'}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', [pattern1])
matcher.add('PATTERN2', [pattern2])

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

# Well done! For the token '-', you can match on the attribute TEXT, LOWER or even SHAPE. All of those are correct. As you can see,
# paying close attention to the tokenization is very important when working with the token-based Matcher.
# Sometimes it's much easier to just match exact strings instead and use the PhraseMatcher, which we'll get to in the next exercise.

### Efficient phrase matching
Sometimes it's more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world.

We already have a list of countries, so let's use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES. The nlp object and a test doc have already been created and the doc.text has been printed to the shell.

In [32]:
COUNTRIES= ['Afghanistan',
 'Åland Islands',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antarctica',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bonaire, Sint Eustatius and Saba',
 'Bosnia and Herzegovina',
 'Botswana',
 'Bouvet Island',
 'Brazil',
 'British Indian Ocean Territory',
 'United States Minor Outlying Islands',
 'Virgin Islands (British)',
 'Virgin Islands (U.S.)',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cabo Verde',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Christmas Island',
 'Cocos (Keeling) Islands',
 'Colombia',
 'Comoros',
 'Congo',
 'Congo (Democratic Republic of the)',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Curaçao',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Falkland Islands (Malvinas)',
 'Faroe Islands',
 'Fiji',
 'Finland',
 'France',
 'French Guiana',
 'French Polynesia',
 'French Southern Territories',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Gibraltar',
 'Greece',
 'Greenland',
 'Grenada',
 'Guadeloupe',
 'Guam',
 'Guatemala',
 'Guernsey',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Heard Island and McDonald Islands',
 'Holy See',
 'Honduras',
 'Hong Kong',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 "Côte d'Ivoire",
 'Iran (Islamic Republic of)',
 'Iraq',
 'Ireland',
 'Isle of Man',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Jersey',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kiribati',
 'Kuwait',
 'Kyrgyzstan',
 "Lao People's Democratic Republic",
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Macao',
 'Macedonia (the former Yugoslav Republic of)',
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Marshall Islands',
 'Martinique',
 'Mauritania',
 'Mauritius',
 'Mayotte',
 'Mexico',
 'Micronesia (Federated States of)',
 'Moldova (Republic of)',
 'Monaco',
 'Mongolia',
 'Montenegro',
 'Montserrat',
 'Morocco',
 'Mozambique',
 'Myanmar',
 'Namibia',
 'Nauru',
 'Nepal',
 'Netherlands',
 'New Caledonia',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'Niue',
 'Norfolk Island',
 "Korea (Democratic People's Republic of)",
 'Northern Mariana Islands',
 'Norway',
 'Oman',
 'Pakistan',
 'Palau',
 'Palestine, State of',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Pitcairn',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Kosovo',
 'Réunion',
 'Romania',
 'Russian Federation',
 'Rwanda',
 'Saint Barthélemy',
 'Saint Helena, Ascension and Tristan da Cunha',
 'Saint Kitts and Nevis',
 'Saint Lucia',
 'Saint Martin (French part)',
 'Saint Pierre and Miquelon',
 'Saint Vincent and the Grenadines',
 'Samoa',
 'San Marino',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Seychelles',
 'Sierra Leone',
 'Singapore',
 'Sint Maarten (Dutch part)',
 'Slovakia',
 'Slovenia',
 'Solomon Islands',
 'Somalia',
 'South Africa',
 'South Georgia and the South Sandwich Islands',
 'Korea (Republic of)',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'Sudan',
 'Suriname',
 'Svalbard and Jan Mayen',
 'Swaziland',
 'Sweden',
 'Switzerland',
 'Syrian Arab Republic',
 'Taiwan',
 'Tajikistan',
 'Tanzania, United Republic of',
 'Thailand',
 'Timor-Leste',
 'Togo',
 'Tokelau',
 'Tonga',
 'Trinidad and Tobago',
 'Tunisia',
 'Turkey',
 'Turkmenistan',
 'Turks and Caicos Islands',
 'Tuvalu',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom of Great Britain and Northern Ireland',
 'United States of America',
 'Uruguay',
 'Uzbekistan',
 'Vanuatu',
 'Venezuela (Bolivarian Republic of)',
 'Viet Nam',
 'Wallis and Futuna',
 'Western Sahara',
 'Yemen',
 'Zambia',
 'Zimbabwe']

In [33]:
doc= Doc(nlp.vocab, words= ['Czech', 'Republic', 'may', 'help', 'Slovakia', 'protect', 'its', 'airspace'], spaces=[True, True, True, True, True, True, True, False])
doc

Czech Republic may help Slovakia protect its airspace

In [34]:
# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia]
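
To see why exact phrase lookup scales well to long lists, the idea can be sketched in plain Python: store each phrase as a tuple of tokens in a set, then test every window of the text with a single set lookup. This is an illustration only and not how PhraseMatcher is implemented; `find_phrases` is a hypothetical helper, whitespace splitting stands in for spaCy's tokenizer, and the short country list is a stand-in for COUNTRIES.

```python
def find_phrases(tokens, phrases):
    """Return (start, end) token offsets of every exact phrase occurrence."""
    # Index phrases as token tuples so each candidate window is one set lookup
    phrase_set = {tuple(phrase.split()) for phrase in phrases}
    lengths = sorted({len(phrase) for phrase in phrase_set})
    matches = []
    for start in range(len(tokens)):
        for size in lengths:
            end = start + size
            if end <= len(tokens) and tuple(tokens[start:end]) in phrase_set:
                matches.append((start, end))
    return matches


countries = ["Czech Republic", "Slovakia", "South Africa"]  # small stand-in list
tokens = ["Czech", "Republic", "may", "help", "Slovakia", "protect", "its", "airspace"]

found = [" ".join(tokens[start:end]) for start, end in find_phrases(tokens, countries)]
print(found)  # ['Czech Republic', 'Slovakia']
```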


### Extracting countries and relationships
In the previous exercise, you wrote a script using spaCy's PhraseMatcher to find country names in text. Let's use that country matcher on a longer text, analyze the syntax and update the document's entities with the matched countries. The nlp object has already been created.

The text is available as the variable text, the PhraseMatcher with the country patterns as the variable matcher. The Span class has already been imported.

In [35]:
text= '''
After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait. Brian Urquhart, Under-Secretary-General from 1971 to 1985, later described the hopes raised by these successes as a "false renaissance" for the organization, given the more troubled missions that followed. Though the UN Charter had been written primarily to prevent aggression by one nation against another, in the early 1990s the UN faced a number of simultaneous, serious crises within nations such as Somalia, Haiti, Mozambique, and the former Yugoslavia. The UN mission in Somalia was widely viewed as a failure after the US withdrawal following casualties in the Battle of Mogadishu, and the UN mission to Bosnia faced "worldwide ridicule" for its indecisive and confused mission in the face of ethnic cleansing. In 1994, the UN Assistance Mission for Rwanda failed to intervene in the Rwandan genocide amid indecision in the Security Council. Beginning in the last decades of the Cold War, American and European critics of the UN condemned the organization for perceived mismanagement and corruption. In 1984, the US President, Ronald Reagan, withdrew his nation\'s funding from UNESCO (the United Nations Educational, Scientific and Cultural Organization, founded 1946) over allegations of mismanagement, followed by Britain and Singapore. Boutros Boutros-Ghali, Secretary-General from 1992 to 1996, initiated a reform of the Secretariat, reducing the size of the organization somewhat. 
His successor, Kofi Annan (1997–2006), initiated further management reforms in the face of threats from the United States to withhold its UN dues. In the late 1990s and 2000s, international interventions authorized by the UN took a wider variety of forms. The UN mission in the Sierra Leone Civil War of 1991–2002 was supplemented by British Royal Marines, and the invasion of Afghanistan in 2001 was overseen by NATO. In 2003, the United States invaded Iraq despite failing to pass a UN Security Council resolution for authorization, prompting a new round of questioning of the organization\'s effectiveness. Under the eighth Secretary-General, Ban Ki-moon, the UN has intervened with peacekeepers in crises including the War in Darfur in Sudan and the Kivu conflict in the Democratic Republic of Congo and sent observers and chemical weapons inspectors to the Syrian Civil War. In 2013, an internal review of UN actions in the final battles of the Sri Lankan Civil War in 2009 concluded that the organization had suffered "systemic failure". One hundred and one UN personnel died in the 2010 Haiti earthquake, the worst loss of life in the organization\'s history. The Millennium Summit was held in 2000 to discuss the UN\'s role in the 21st century. The three day meeting was the largest gathering of world leaders in history, and culminated in the adoption by all member states of the Millennium Development Goals (MDGs), a commitment to achieve international development in areas such as poverty reduction, gender equality, and public health. Progress towards these goals, which were to be met by 2015, was ultimately uneven. The 2005 World Summit reaffirmed the UN\'s focus on promoting development, peacekeeping, human rights, and global security. The Sustainable Development Goals were launched in 2015 to succeed the Millennium Development Goals. 
In addition to addressing global challenges, the UN has sought to improve its accountability and democratic legitimacy by engaging more with civil society and fostering a global constituency. In an effort to enhance transparency, in 2016 the organization held its first public debate between candidates for Secretary-General. On 1 January 2017, Portuguese diplomat António Guterres, who previously served as UN High Commissioner for Refugees, became the ninth Secretary-General. Guterres has highlighted several key goals for his administration, including an emphasis on diplomacy for preventing conflicts, more effective peacekeeping efforts, and streamlining the organization to be more responsive and versatile to global needs.
'''
print(text)


After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait. Brian Urquhart, Under-Secretary-General from 1971 to 1985, later described the hopes raised by these successes as a "false renaissance" for the organization, given the more troubled missions that followed. Though the UN Charter had been written primarily to prevent aggression by one nation against another, in the early 1990s the UN faced a number of simultaneous, serious crises within nations su

In [61]:
# Create a doc and find matches in it
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label='GPE')

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])

#Output: [('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'),
 #('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'),
  #('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]

This document is 27 tokens long.
[]


In [62]:
# Create a doc and find matches in it
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE" and overwrite the doc.ents
    span = Span(doc, start, end, label='GPE')
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, '-->', span.text)

# <script.py> output:
    # Namibia --> Namibia
    # South --> South Africa
    # Cambodia --> Cambodia
    # Kuwait --> Kuwait
    # Somalia --> Somalia
    # Haiti --> Haiti
    # Mozambique --> Mozambique
    # Somalia --> Somalia
    # Rwanda --> Rwanda
    # Singapore --> Singapore
    # Sierra --> Sierra Leone
    # Afghanistan --> Afghanistan
    # Iraq --> Iraq
    # Sudan --> Sudan
    # Congo --> Congo
    # Haiti --> Haiti

This document is 27 tokens long.


# CHAPTER 3 Processing Pipelines

This chapter will show you everything you need to know about spaCy's processing pipeline. You'll learn what goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to the documents, spans and tokens.

## Processing pipelines

tagger, parser, ner, textcat

### What happens when you call nlp?
What does spaCy do when you call nlp on a string of text? The IPython shell has a pre-loaded nlp object that logs what's going on under the hood. Try processing a text with it!

doc = nlp("This is a sentence.")

1) Tokenizing text: hi

2) Calling pipeline component 'tagger' on Doc

3) Calling pipeline component 'parser' on Doc

4) Calling pipeline component 'ner' on Doc

5) Returning processed Doc
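
The order logged above can be reproduced in miniature. The sketch below is only an illustration: it assumes a blank English pipeline instead of the course's pre-loaded logging nlp object, and the component names `log_first` and `log_second` are made up for this example.

```python
import spacy
from spacy.language import Language

# Records the order in which the (hypothetical) components are called
call_order = []

@Language.component("log_first")
def log_first(doc):
    call_order.append("log_first")
    return doc

@Language.component("log_second")
def log_second(doc):
    call_order.append("log_second")
    return doc

# A blank pipeline only contains the tokenizer, so no model download is needed
nlp = spacy.blank("en")
nlp.add_pipe("log_first")
nlp.add_pipe("log_second")

# The tokenizer creates the Doc, then each component runs in pipeline order
doc = nlp("This is a sentence.")
print(nlp.pipe_names)
print(call_order)
```

Each component receives the Doc produced by the previous step and must return it, which is why the order of `nlp.pipe_names` matters.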

### Inspecting the pipeline
Let's inspect the small English model's pipeline!

In [39]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7e27641351e0>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7e275ee39fc0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7e289a7ec890>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7e276427ecc0>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7e276b03ebc0>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7e2754a352a0>)]


## Custom Pipeline Components



### Use cases for custom components

These problems can be solved by custom pipeline components:

Computing your own values based on tokens and their attributes

Adding named entities, for example based on a dictionary



### Simple components
The example shows a custom component that prints the number of tokens in a document. Can you complete it? spacy has already been imported for you.

In [40]:

from spacy.language import Language
from spacy.tokens import Doc, Token

# Define the custom component
@Language.component("length_component")
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc

# Load the small English model and add the component first in the pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("length_component", first=True)

# Process a text
doc = nlp("This is a sentence")

This document is 4 tokens long.


### Complex components
In this exercise, you'll be writing a custom component that uses the PhraseMatcher to find animal names in the document and adds the matched spans to the doc.ents.

A PhraseMatcher with the animal patterns has already been created as the variable matcher. The small English model is available as the variable nlp. The Span object has already been imported for you.

In [41]:
animal_patterns= ['Golden Retriever', 'cat', 'turtle', 'Rattus norvegicus']


doc= Doc(nlp.vocab, words= ['I', 'have', 'a', 'cat', 'and', 'a', 'Golden', 'Retriever'], spaces=[True, True, True, True, True, True, True, False])
doc

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(animal) for animal in animal_patterns]
patterns = list(nlp.pipe(animal_patterns))
matcher.add('ANIMAL', patterns)

This document is 2 tokens long.
This document is 1 tokens long.
This document is 1 tokens long.
This document is 2 tokens long.


In [42]:
# Define the custom component
@Language.component("animal_component")
def animal_component(doc):
    # Create a Span for each match and assign the label 'ANIMAL'
    # and overwrite the doc.ents with the matched spans
    doc.ents = [Span(doc, start, end, label='ANIMAL')
                for match_id, start, end in matcher(doc)]
    return doc

# Add the component to the pipeline after the 'ner' component
nlp.add_pipe('animal_component', after='ner')

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

This document is 8 tokens long.
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
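
The component above replaces doc.ents wholesale, so entities set by earlier components are dropped. If the goal is to merge instead, overlapping spans have to be resolved first, because doc.ents cannot hold overlapping entities. A minimal sketch using `spacy.util.filter_spans`, assuming a blank pipeline and hand-built spans in place of real matcher output (the `COLOR` label is invented for illustration):

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("I have a cat and a Golden Retriever")

# Pretend an earlier component already set this entity (hypothetical label)
existing = [Span(doc, 6, 7, label="COLOR")]        # "Golden"

# Spans a matcher might return for the animal patterns
matched = [Span(doc, 3, 4, label="ANIMAL"),        # "cat"
           Span(doc, 6, 8, label="ANIMAL")]        # "Golden Retriever"

# filter_spans drops overlaps, preferring the longest span
doc.ents = filter_spans(existing + matched)
print([(ent.text, ent.label_) for ent in doc.ents])
```

Here the longer "Golden Retriever" span wins over the overlapping "Golden" span.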


## Extension attributes


### Setting extension attributes (1)
Let's practice setting some extension attributes. The nlp object has already been created for you and the Doc, Token and Span classes are already imported.

Remember that if you run your code more than once, you might see an error message that the extension already exists. That's because DataCamp will re-run your code in the same session. To solve this, you can set force=True on set_extension, or reload to start a new Python session. None of this will affect the answer you submit.
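
The two ways around the "extension already exists" error can be sketched as follows. This is an illustration on a blank English pipeline, not the course's session; which of the two guards to use is a matter of taste.

```python
import spacy
from spacy.tokens import Token

# Option 1: register the attribute only if it is not registered yet
if not Token.has_extension("is_country"):
    Token.set_extension("is_country", default=False)

# Option 2: force=True overwrites an existing registration
Token.set_extension("is_country", default=False, force=True)

nlp = spacy.blank("en")
doc = nlp("I live in Spain.")
doc[3]._.is_country = True
print([(token.text, token._.is_country) for token in doc])
```

Either guard makes the cell safe to re-run in the same Python session.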

In [43]:
# Use Token.set_extension to register is_country (default False).
# Update it for "Spain" and print it for all tokens.

# Register the Token extension attribute 'is_country' with the default value False
Token.set_extension('is_country', default=False)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

This document is 5 tokens long.
[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


In [44]:
# Use Token.set_extension to register 'reversed' (getter function get_reversed).
# print its value for each token.

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]

# Register the Token property extension 'reversed' with the getter get_reversed
Token.set_extension('reversed', getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print('reversed:', token._.reversed)

This document is 9 tokens long.
reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


### Setting extension attributes (2)
Let's try setting some more complex attributes using getters and method extensions. The nlp object has already been created for you and the Doc, Token and Span classes are already imported.


In [45]:
# Complete the has_number function
# Use Doc.set_extension to register 'has_number' (getter get_has_number) and print its value.

# Define the getter function
def get_has_number(doc):
    # Return if any of the tokens in the doc return True for token.like_num
    return any(token.like_num for token in doc)

# Register the Doc property extension 'has_number' with the getter get_has_number
Doc.set_extension('has_number', getter=get_has_number)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2012.")
print('has_number:', doc._.has_number)

This document is 9 tokens long.
has_number: True


In [46]:
# Use Span.set_extension to register 'to_html' (method to_html).
# Call it on doc[0:2] with the tag 'strong'.

# Define the method
def to_html(span, tag):
    # Wrap the span text in a HTML tag and return it
    return '<{tag}>{text}</{tag}>'.format(tag=tag, text=span.text)

# Register the Span method extension 'to_html' with the method to_html
Span.set_extension('to_html', method=to_html)

# Process the text and call the to_html method on the span with the tag name 'strong'
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html('strong'))

This document is 8 tokens long.
<strong>Hello world</strong>


### Entities and extensions
In this exercise, you'll combine custom extension attributes with the model's predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.

The Span class is already imported and the nlp object has been created for you.

In [47]:
def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ('PERSON', 'ORG', 'GPE', 'LOC'):
        entity_text = span.text.replace(' ', '_')
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url, force=True)

doc = nlp("In over fifty years from his very first recordings right through to his last album, David Bowie was at the vanguard of contemporary culture.")
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

# You now have a pipeline component that uses named entities predicted by the model to generate Wikipedia URLs
# and adds them as a custom attribute. Try opening the link in your browser to see what happens!



This document is 26 tokens long.
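
The getter above builds the URL by replacing spaces with underscores. For entity text containing characters such as `&` or accents, a safer variant might URL-encode the query instead. The helper below is a sketch with a made-up name (`build_wikipedia_search_url`) and uses only the standard library; it takes plain text so it can be tried without a spaCy pipeline.

```python
from urllib.parse import urlencode

def build_wikipedia_search_url(entity_text):
    # urlencode escapes spaces, ampersands and non-ASCII characters
    query = urlencode({"search": entity_text})
    return "https://en.wikipedia.org/w/index.php?" + query

print(build_wikipedia_search_url("David Bowie"))
print(build_wikipedia_search_url("AT&T"))
```

Inside a Span getter, the same helper could be called with `span.text`.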


### Components with extensions
Extension attributes are especially powerful if they're combined with custom pipeline components. In this exercise, you'll write a pipeline component that finds country names and a custom extension attribute that returns a country's capital, if available.

The nlp object has already been created and the Span class is already imported. A phrase matcher with all countries is available as the variable matcher. A dictionary of countries mapped to their capital cities is available as the variable capitals.

In [48]:
capitals= {'Afghanistan': 'Kabul',
 'Albania': 'Tirana',
 'Algeria': 'Algiers',
 'American Samoa': 'Pago Pago',
 'Andorra': 'Andorra la Vella',
 'Angola': 'Luanda',
 'Anguilla': 'The Valley',
 'Antarctica': '',
 'Antigua and Barbuda': "Saint John's",
 'Argentina': 'Buenos Aires',
 'Armenia': 'Yerevan',
 'Aruba': 'Oranjestad',
 'Australia': 'Canberra',
 'Austria': 'Vienna',
 'Azerbaijan': 'Baku',
 'Bahamas': 'Nassau',
 'Bahrain': 'Manama',
 'Bangladesh': 'Dhaka',
 'Barbados': 'Bridgetown',
 'Belarus': 'Minsk',
 'Belgium': 'Brussels',
 'Belize': 'Belmopan',
 'Benin': 'Porto-Novo',
 'Bermuda': 'Hamilton',
 'Bhutan': 'Thimphu',
 'Bolivia (Plurinational State of)': 'Sucre',
 'Bonaire, Sint Eustatius and Saba': 'Kralendijk',
 'Bosnia and Herzegovina': 'Sarajevo',
 'Botswana': 'Gaborone',
 'Bouvet Island': '',
 'Brazil': 'Brasília',
 'British Indian Ocean Territory': 'Diego Garcia',
 'Brunei Darussalam': 'Bandar Seri Begawan',
 'Bulgaria': 'Sofia',
 'Burkina Faso': 'Ouagadougou',
 'Burundi': 'Bujumbura',
 'Cabo Verde': 'Praia',
 'Cambodia': 'Phnom Penh',
 'Cameroon': 'Yaoundé',
 'Canada': 'Ottawa',
 'Cayman Islands': 'George Town',
 'Central African Republic': 'Bangui',
 'Chad': "N'Djamena",
 'Chile': 'Santiago',
 'China': 'Beijing',
 'Christmas Island': 'Flying Fish Cove',
 'Cocos (Keeling) Islands': 'West Island',
 'Colombia': 'Bogotá',
 'Comoros': 'Moroni',
 'Congo': 'Brazzaville',
 'Congo (Democratic Republic of the)': 'Kinshasa',
 'Cook Islands': 'Avarua',
 'Costa Rica': 'San José',
 'Croatia': 'Zagreb',
 'Cuba': 'Havana',
 'Curaçao': 'Willemstad',
 'Cyprus': 'Nicosia',
 'Czech Republic': 'Prague',
 "Côte d'Ivoire": 'Yamoussoukro',
 'Denmark': 'Copenhagen',
 'Djibouti': 'Djibouti',
 'Dominica': 'Roseau',
 'Dominican Republic': 'Santo Domingo',
 'Ecuador': 'Quito',
 'Egypt': 'Cairo',
 'El Salvador': 'San Salvador',
 'Equatorial Guinea': 'Malabo',
 'Eritrea': 'Asmara',
 'Estonia': 'Tallinn',
 'Ethiopia': 'Addis Ababa',
 'Falkland Islands (Malvinas)': 'Stanley',
 'Faroe Islands': 'Tórshavn',
 'Fiji': 'Suva',
 'Finland': 'Helsinki',
 'France': 'Paris',
 'French Guiana': 'Cayenne',
 'French Polynesia': 'Papeetē',
 'French Southern Territories': 'Port-aux-Français',
 'Gabon': 'Libreville',
 'Gambia': 'Banjul',
 'Georgia': 'Tbilisi',
 'Germany': 'Berlin',
 'Ghana': 'Accra',
 'Gibraltar': 'Gibraltar',
 'Greece': 'Athens',
 'Greenland': 'Nuuk',
 'Grenada': "St. George's",
 'Guadeloupe': 'Basse-Terre',
 'Guam': 'Hagåtña',
 'Guatemala': 'Guatemala City',
 'Guernsey': 'St. Peter Port',
 'Guinea': 'Conakry',
 'Guinea-Bissau': 'Bissau',
 'Guyana': 'Georgetown',
 'Haiti': 'Port-au-Prince',
 'Heard Island and McDonald Islands': '',
 'Holy See': 'Rome',
 'Honduras': 'Tegucigalpa',
 'Hong Kong': 'City of Victoria',
 'Hungary': 'Budapest',
 'Iceland': 'Reykjavík',
 'India': 'New Delhi',
 'Indonesia': 'Jakarta',
 'Iran (Islamic Republic of)': 'Tehran',
 'Iraq': 'Baghdad',
 'Ireland': 'Dublin',
 'Isle of Man': 'Douglas',
 'Israel': 'Jerusalem',
 'Italy': 'Rome',
 'Jamaica': 'Kingston',
 'Japan': 'Tokyo',
 'Jersey': 'Saint Helier',
 'Jordan': 'Amman',
 'Kazakhstan': 'Astana',
 'Kenya': 'Nairobi',
 'Kiribati': 'South Tarawa',
 "Korea (Democratic People's Republic of)": 'Pyongyang',
 'Korea (Republic of)': 'Seoul',
 'Kuwait': 'Kuwait City',
 'Kyrgyzstan': 'Bishkek',
 "Lao People's Democratic Republic": 'Vientiane',
 'Latvia': 'Riga',
 'Lebanon': 'Beirut',
 'Lesotho': 'Maseru',
 'Liberia': 'Monrovia',
 'Libya': 'Tripoli',
 'Liechtenstein': 'Vaduz',
 'Lithuania': 'Vilnius',
 'Luxembourg': 'Luxembourg',
 'Macao': '',
 'Macedonia (the former Yugoslav Republic of)': 'Skopje',
 'Madagascar': 'Antananarivo',
 'Malawi': 'Lilongwe',
 'Malaysia': 'Kuala Lumpur',
 'Maldives': 'Malé',
 'Mali': 'Bamako',
 'Malta': 'Valletta',
 'Marshall Islands': 'Majuro',
 'Martinique': 'Fort-de-France',
 'Mauritania': 'Nouakchott',
 'Mauritius': 'Port Louis',
 'Mayotte': 'Mamoudzou',
 'Mexico': 'Mexico City',
 'Micronesia (Federated States of)': 'Palikir',
 'Moldova (Republic of)': 'Chișinău',
 'Monaco': 'Monaco',
 'Mongolia': 'Ulan Bator',
 'Montenegro': 'Podgorica',
 'Montserrat': 'Plymouth',
 'Morocco': 'Rabat',
 'Mozambique': 'Maputo',
 'Myanmar': 'Naypyidaw',
 'Namibia': 'Windhoek',
 'Nauru': 'Yaren',
 'Nepal': 'Kathmandu',
 'Netherlands': 'Amsterdam',
 'New Caledonia': 'Nouméa',
 'New Zealand': 'Wellington',
 'Nicaragua': 'Managua',
 'Niger': 'Niamey',
 'Nigeria': 'Abuja',
 'Niue': 'Alofi',
 'Norfolk Island': 'Kingston',
 'Northern Mariana Islands': 'Saipan',
 'Norway': 'Oslo',
 'Oman': 'Muscat',
 'Pakistan': 'Islamabad',
 'Palau': 'Ngerulmud',
 'Palestine, State of': 'Ramallah',
 'Panama': 'Panama City',
 'Papua New Guinea': 'Port Moresby',
 'Paraguay': 'Asunción',
 'Peru': 'Lima',
 'Philippines': 'Manila',
 'Pitcairn': 'Adamstown',
 'Poland': 'Warsaw',
 'Portugal': 'Lisbon',
 'Puerto Rico': 'San Juan',
 'Qatar': 'Doha',
 'Republic of Kosovo': 'Pristina',
 'Romania': 'Bucharest',
 'Russian Federation': 'Moscow',
 'Rwanda': 'Kigali',
 'Réunion': 'Saint-Denis',
 'Saint Barthélemy': 'Gustavia',
 'Saint Helena, Ascension and Tristan da Cunha': 'Jamestown',
 'Saint Kitts and Nevis': 'Basseterre',
 'Saint Lucia': 'Castries',
 'Saint Martin (French part)': 'Marigot',
 'Saint Pierre and Miquelon': 'Saint-Pierre',
 'Saint Vincent and the Grenadines': 'Kingstown',
 'Samoa': 'Apia',
 'San Marino': 'City of San Marino',
 'Sao Tome and Principe': 'São Tomé',
 'Saudi Arabia': 'Riyadh',
 'Senegal': 'Dakar',
 'Serbia': 'Belgrade',
 'Seychelles': 'Victoria',
 'Sierra Leone': 'Freetown',
 'Singapore': 'Singapore',
 'Sint Maarten (Dutch part)': 'Philipsburg',
 'Slovakia': 'Bratislava',
 'Slovenia': 'Ljubljana',
 'Solomon Islands': 'Honiara',
 'Somalia': 'Mogadishu',
 'South Africa': 'Pretoria',
 'South Georgia and the South Sandwich Islands': 'King Edward Point',
 'South Sudan': 'Juba',
 'Spain': 'Madrid',
 'Sri Lanka': 'Colombo',
 'Sudan': 'Khartoum',
 'Suriname': 'Paramaribo',
 'Svalbard and Jan Mayen': 'Longyearbyen',
 'Swaziland': 'Lobamba',
 'Sweden': 'Stockholm',
 'Switzerland': 'Bern',
 'Syrian Arab Republic': 'Damascus',
 'Taiwan': 'Taipei',
 'Tajikistan': 'Dushanbe',
 'Tanzania, United Republic of': 'Dodoma',
 'Thailand': 'Bangkok',
 'Timor-Leste': 'Dili',
 'Togo': 'Lomé',
 'Tokelau': 'Fakaofo',
 'Tonga': "Nuku'alofa",
 'Trinidad and Tobago': 'Port of Spain',
 'Tunisia': 'Tunis',
 'Turkey': 'Ankara',
 'Turkmenistan': 'Ashgabat',
 'Turks and Caicos Islands': 'Cockburn Town',
 'Tuvalu': 'Funafuti',
 'Uganda': 'Kampala',
 'Ukraine': 'Kiev',
 'United Arab Emirates': 'Abu Dhabi',
 'United Kingdom of Great Britain and Northern Ireland': 'London',
 'United States Minor Outlying Islands': '',
 'United States of America': 'Washington, D.C.',
 'Uruguay': 'Montevideo',
 'Uzbekistan': 'Tashkent',
 'Vanuatu': 'Port Vila',
 'Venezuela (Bolivarian Republic of)': 'Caracas',
 'Viet Nam': 'Hanoi',
 'Virgin Islands (British)': 'Road Town',
 'Virgin Islands (U.S.)': 'Charlotte Amalie',
 'Wallis and Futuna': 'Mata-Utu',
 'Western Sahara': 'El Aaiún',
 'Yemen': "Sana'a",
 'Zambia': 'Lusaka',
 'Zimbabwe': 'Harare',
 'Åland Islands': 'Mariehamn'}

In [51]:
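# Remove the component first if it is already in the pipeline (this cell was run more than once)
if "countries_component" in nlp.pipe_names:
    nlp.remove_pipe("countries_component")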

# Define the custom component
@Language.component("countries_component")
def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    doc.ents = [Span(doc, start, end, label='GPE')
                for match_id, start, end in matcher(doc)]
    return doc


nlp.add_pipe("countries_component", before="ner")

# Register capital and getter that looks up the span text in country capitals
Span.set_extension('capital', getter=lambda span: capitals.get(span.text), force=True)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

# <script.py> output:
#    [('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]

This document is 8 tokens long.
[]


## Scaling and performance


### Processing streams
In this exercise, you'll be using nlp.pipe for more efficient text processing. The nlp object has already been created for you. A list of tweets about a popular American fast food chain is available as the variable TEXTS.

In [52]:
TEXTS= ['McDonalds is my favorite restaurant.',
 'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..',
 'People really still eat McDonalds :(',
 'The McDonalds in Spain has chicken wings. My heart is so happy ',
 '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P',
 'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D',
 'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it']

In [53]:
# Process the texts and print the adjectives
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == 'ADJ'])

This document is 6 tokens long.
This document is 27 tokens long.
This document is 6 tokens long.
This document is 13 tokens long.
This document is 18 tokens long.
This document is 15 tokens long.
This document is 18 tokens long.
['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible']


In [54]:
# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

This document is 6 tokens long.
This document is 27 tokens long.
This document is 6 tokens long.
This document is 13 tokens long.
This document is 18 tokens long.
This document is 15 tokens long.
This document is 18 tokens long.
() () () () () () ()


In [55]:
# Original (slower): calls nlp separately on each text
people = ['David Bowie', 'Angela Merkel', 'Lady Gaga']

# Create a list of patterns for the PhraseMatcher
patterns = [nlp(person) for person in people]


# Better: nlp.pipe processes the texts as a stream
people = ['David Bowie', 'Angela Merkel', 'Lady Gaga']

# Create a list of patterns for the PhraseMatcher
patterns = list(nlp.pipe(people))

This document is 2 tokens long.
This document is 2 tokens long.
This document is 2 tokens long.
This document is 2 tokens long.
This document is 2 tokens long.
This document is 2 tokens long.


### Processing data with context
In this exercise, you'll be using custom attributes to add author and book meta information to quotes.

A list of (text, context) examples is available as the variable DATA. The texts are quotes from famous books, and the contexts are dictionaries with the keys 'author' and 'book'. The nlp object has already been created for you.

In [56]:
# Import the Doc class
from spacy.tokens import Doc

# Register the Doc extension 'author' (default None)
Doc.set_extension('author', default=None)

# Register the Doc extension 'book' (default None)
Doc.set_extension('book', default=None)

In [57]:
DATA= [('One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.',
  {'author': 'Franz Kafka', 'book': 'Metamorphosis'}),
 ("I know not all that may be coming, but be it what it will, I'll go to it laughing.",
  {'author': 'Herman Melville', 'book': 'Moby-Dick or, The Whale'}),
 ('It was the best of times, it was the worst of times.',
  {'author': 'Charles Dickens', 'book': 'A Tale of Two Cities'}),
 ('The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.',
  {'author': 'Jack Kerouac', 'book': 'On the Road'}),
 ('It was a bright cold day in April, and the clocks were striking thirteen.',
  {'author': 'George Orwell', 'book': '1984'}),
 ('Nowadays people know the price of everything and the value of nothing.',
  {'author': 'Oscar Wilde', 'book': 'The Picture Of Dorian Gray'})]

In [58]:
# Import the Doc class and register the extensions 'author' and 'book'
from spacy.tokens import Doc
Doc.set_extension('book', default=None, force=True)
Doc.set_extension('author', default=None, force=True)

for doc, context in nlp.pipe(DATA, as_tuples=True):

    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']

    # Print the text and custom attribute data
    print(doc.text, '\n', "— '{}' by {}".format(doc._.book, doc._.author), '\n')

This document is 23 tokens long.
This document is 23 tokens long.
This document is 14 tokens long.
This document is 64 tokens long.
This document is 16 tokens long.
This document is 13 tokens long.
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '19

### Selective processing
In this exercise, you'll use the nlp.make_doc and nlp.disable_pipes methods to only run selected components when processing a text. The small English model is already loaded in as the nlp object.

In [59]:
text = "Chick-fil-A is an American fast food restaurant chain headquartered in the city of College Park, Georgia, specializing in chicken sandwiches."

# Only tokenize the text
doc = nlp.make_doc(text)

print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


In [60]:
text = "Chick-fil-A is an American fast food restaurant chain headquartered in the city of College Park, Georgia, specializing in chicken sandwiches."

# Disable the tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)


#<script.py> output:
#    (American, College Park, Georgia)


This document is 27 tokens long.
()
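
In spaCy v3, nlp.select_pipes is the documented replacement for nlp.disable_pipes. The sketch below shows the same idea on a blank English pipeline with only a `sentencizer`, which is an assumption made so that no model has to be downloaded; the text is also made up.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."

# Temporarily disable the component inside the with block
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)
    doc = nlp(text)
    has_sents_inside = doc.has_annotation("SENT_START")

# Outside the block the component is restored automatically
active_after = list(nlp.pipe_names)
doc = nlp(text)
print(active_inside, has_sents_inside)
print(active_after, len(list(doc.sents)))
```

select_pipes also accepts `enable=[...]`, which disables everything except the listed components.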




# CHAPTER 4 Training a neural network model

In this chapter, you'll learn how to update spaCy's statistical models to customize them for your use case – for example, to predict a new entity type in online comments. You'll write your own training loop from scratch, and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful.


## Training and updating models

### Purpose of training
While spaCy comes with a range of pre-trained models to predict linguistic annotations, you almost always want to fine-tune them with more examples. You can do this by training them with more labelled data.

What does training not help with?

Discovering patterns in unlabelled data

### Creating training data (1)
spaCy's rule-based Matcher is a great way to quickly create training data for named entity models. A list of sentences is available as the variable TEXTS. You can print it in the IPython shell to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as 'GADGET'.

The nlp object has already been created for you and the Matcher is available as the variable matcher.

In [63]:
TEXTS= ['How to preorder the iPhone X',
 'iPhone X is coming',
 'Should I pay $1,000 for the iPhone X?',
 'The iPhone 8 reviews are here',
 'Your iPhone goes up to 11 today',
 'I need a new phone! Any tips?']


In [76]:
# Import and initialize the Matcher
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}]

# Add patterns to the matcher
matcher.add('GADGET', [pattern1])
matcher.add('GADGET', [pattern2])
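
Because pattern2 contains an optional token, the matcher may return overlapping candidates for the same mention (for example both "iPhone" and "iPhone 8"). One way to keep only the longest candidate is `spacy.util.filter_spans`. This is a sketch on a blank English pipeline; the helper name `longest_matches` is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("GADGET", [
    [{"LOWER": "iphone"}, {"LOWER": "x"}],
    [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}],
])

def longest_matches(text):
    doc = nlp(text)
    # Turn (match_id, start, end) tuples into Span objects
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Drop overlapping spans, preferring the longest one
    return [span.text for span in filter_spans(spans)]

print(longest_matches("The iPhone 8 reviews are here"))
print(longest_matches("How to preorder the iPhone X"))
```

Both patterns rely only on lexical attributes (LOWER, IS_DIGIT), so they work without a trained model.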



### Creating training data (2)
Let's use the match patterns we've created in the previous exercise to bootstrap a set of training examples. The nlp object has already been created for you and the Matcher with the added patterns pattern1 and pattern2 is available as the variable matcher. A list of sentences is available as the variable TEXTS.

In [78]:
# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Find the matches in the doc
    matches = matcher(doc)

    # Get a list of (start, end, label) tuples of matches in the text
    entities = [(start, end, 'GADGET') for match_id, start, end in matches]
    print(doc.text, entities)

#OUTPUT:
# How to preorder the iPhone X [(4, 6, 'GADGET')]
# iPhone X is coming [(0, 2, 'GADGET')]
# Should I pay $1,000 for the iPhone X? [(7, 9, 'GADGET')]
# The iPhone 8 reviews are here [(1, 3, 'GADGET')]
# Your iPhone goes up to 11 today []
# I need a new phone! Any tips? []

In [80]:
TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, 'GADGET') for span in spans]

    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {'entities': entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep='\n')

#<script.py> output:
    # ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
    # ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
    # ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})
    # ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
    # ('Your iPhone goes up to 11 today', {'entities': []})
    # ('I need a new phone! Any tips?', {'entities': []})
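
Character offsets are easy to get wrong, so it can help to slice each text with its offsets and check that the result is the intended entity text. A small standard-library sketch, using three of the examples above; the helper name `entity_texts` is made up.

```python
TRAINING_DATA = [
    ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]}),
    ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}),
    ('I need a new phone! Any tips?', {'entities': []}),
]

def entity_texts(example):
    # Slice the text with each (start_char, end_char) pair
    text, annotations = example
    return [(text[start:end], label)
            for start, end, label in annotations['entities']]

for example in TRAINING_DATA:
    print(entity_texts(example))
```

An offset that is off by one shows up immediately as a truncated or padded entity string.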

## The training loop


### Setting up the pipeline
In this exercise, you'll prepare a spaCy pipeline to train the entity recognizer to recognize 'GADGET' entities in a text – for example, "iPhone X".

spacy has already been imported for you.

In [82]:
# Create a blank 'en' model
nlp = spacy.blank('en')

# Create a new entity recognizer and add it to the pipeline
# (spaCy v3: add_pipe takes the component name and returns the component)
ner = nlp.add_pipe('ner')

# Add the label 'GADGET' to the entity recognizer
ner.add_label('GADGET')

### Building a training loop
Let's write a simple training loop from scratch!

The pipeline you've created in the previous exercise is available as the nlp object. It already contains the entity recognizer with the added label 'GADGET'.

The small set of labelled examples that you've created previously is available as the global variable TRAINING_DATA. To see the examples, you can print them in your script or in the IPython shell. spacy and random have already been imported for you.
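
The shuffle-then-batch step of a training loop can be sketched without spaCy. The function `minibatch_sketch` below is a simplified stand-in written for illustration, not spaCy's own `spacy.util.minibatch`, and the example items are placeholders.

```python
import random
from itertools import islice

def minibatch_sketch(items, size):
    # Yield consecutive lists of up to `size` items
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            break
        yield batch

examples = ['a', 'b', 'c', 'd', 'e']

# Shuffle in place so each iteration sees the examples in a new order
random.seed(0)
random.shuffle(examples)

batches = list(minibatch_sketch(examples, size=2))
print(batches)
```

With five examples and a batch size of 2, the last batch holds a single example.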

In [None]:
TRAINING_DATA= [('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]}),
 ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}),
 ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]}),
 ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}),
 ('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]}),
 ('I need a new phone! Any tips?', {'entities': []})]

In [None]:
# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
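    # Shuffle the training data in place
    random.shuffle(TRAINING_DATA)
    # Note: random.shuffle() returns None, so this line prints "random: None"
    # (and shuffles the data a second time)
    print('random:', random.shuffle(TRAINING_DATA))
    # Reset the losses at the start of every iteration
    losses = {}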

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        print('batch:', batch)
        texts = [text for text, entities in batch]
        print('texts:', texts)
        annotations = [entities for text, entities in batch]
        print('annotations:', annotations)

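        # Update the model (spaCy v2 signature; spaCy v3 expects a list of
        # Example objects instead of separate texts and annotations)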
        nlp.update(texts, annotations, losses=losses)
        print(losses)

# <script.py> output:
#     random: None
#     batch: [('I need a new phone! Any tips?', {'entities': []}), ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})]
#     texts: ['I need a new phone! Any tips?', 'How to preorder the iPhone X']
#     annotations: [{'entities': []}, {'entities': [(20, 28, 'GADGET')]}]
#     {'ner': 11.999999642372131}
#     batch: [('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}), ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})]
#     texts: ['iPhone X is coming', 'Should I pay $1,000 for the iPhone X?']
#     annotations: [{'entities': [(0, 8, 'GADGET')]}, {'entities': [(28, 36, 'GADGET')]}]
#     {'ner': 22.53183114528656}
#     batch: [('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}), ('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})]
#     texts: ['The iPhone 8 reviews are here', 'Your iPhone goes up to 11 today']
#     annotations: [{'entities': [(4, 12, 'GADGET')]}, {'entities': [(5, 11, 'GADGET')]}]
#     {'ner': 31.701135993003845}
#     random: None
#     batch: [('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]}), ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})]
#     texts: ['Your iPhone goes up to 11 today', 'iPhone X is coming']
#     annotations: [{'entities': [(5, 11, 'GADGET')]}, {'entities': [(0, 8, 'GADGET')]}]
#     {'ner': 6.115316092967987}
#     batch: [('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}), ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})]
#     texts: ['The iPhone 8 reviews are here', 'How to preorder the iPhone X']
#     annotations: [{'entities': [(4, 12, 'GADGET')]}, {'entities': [(20, 28, 'GADGET')]}]
#     {'ner': 11.430656850337982}
#     batch: [('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]}), ('I need a new phone! Any tips?', {'entities': []})]
#     texts: ['Should I pay $1,000 for the iPhone X?', 'I need a new phone! Any tips?']
#     annotations: [{'entities': [(28, 36, 'GADGET')]}, {'entities': []}]
#     {'ner': 16.42770153284073}
#     random: None
#     batch: [('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]}), ('I need a new phone! Any tips?', {'entities': []})]
#     texts: ['Your iPhone goes up to 11 today', 'I need a new phone! Any tips?']
#     annotations: [{'entities': [(5, 11, 'GADGET')]}, {'entities': []}]
#     {'ner': 2.3262053430080414}
#     batch: [('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}), ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})]
#     texts: ['iPhone X is coming', 'Should I pay $1,000 for the iPhone X?']
#     annotations: [{'entities': [(0, 8, 'GADGET')]}, {'entities': [(28, 36, 'GADGET')]}]
#     {'ner': 5.552737276535481}
#     batch: [('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}), ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})]
#     texts: ['The iPhone 8 reviews are here', 'How to preorder the iPhone X']
#     annotations: [{'entities': [(4, 12, 'GADGET')]}, {'entities': [(20, 28, 'GADGET')]}]
#     {'ner': 9.060084632714279}
#     random: None
#     batch: [('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]}), ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})]
#     texts: ['Your iPhone goes up to 11 today', 'The iPhone 8 reviews are here']
#     annotations: [{'entities': [(5, 11, 'GADGET')]}, {'entities': [(4, 12, 'GADGET')]}]
#     {'ner': 3.2921266618941445}
#     batch: [('I need a new phone! Any tips?', {'entities': []}), ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})]
#     texts: ['I need a new phone! Any tips?', 'Should I pay $1,000 for the iPhone X?']
#     annotations: [{'entities': []}, {'entities': [(28, 36, 'GADGET')]}]
#     {'ner': 4.31446285227139}
#     batch: [('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}), ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})]
#     texts: ['iPhone X is coming', 'How to preorder the iPhone X']
#     annotations: [{'entities': [(0, 8, 'GADGET')]}, {'entities': [(20, 28, 'GADGET')]}]
#     {'ner': 7.411704116631881}
#     random: None
#     batch: [('I need a new phone! Any tips?', {'entities': []}), ('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})]
#     texts: ['I need a new phone! Any tips?', 'Your iPhone goes up to 11 today']
#     annotations: [{'entities': []}, {'entities': [(5, 11, 'GADGET')]}]
#     {'ner': 0.8126044412201736}
#     batch: [('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]}), ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})]
#     texts: ['Should I pay $1,000 for the iPhone X?', 'The iPhone 8 reviews are here']
#     annotations: [{'entities': [(28, 36, 'GADGET')]}, {'entities': [(4, 12, 'GADGET')]}]
#     {'ner': 5.228105610149214}
#     batch: [('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}), ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})]
#     texts: ['iPhone X is coming', 'How to preorder the iPhone X']
#     annotations: [{'entities': [(0, 8, 'GADGET')]}, {'entities': [(20, 28, 'GADGET')]}]
#     {'ner': 8.659203528804937}
#     random: None
#     batch: [('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}), ('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})]
#     texts: ['iPhone X is coming', 'Your iPhone goes up to 11 today']
#     annotations: [{'entities': [(0, 8, 'GADGET')]}, {'entities': [(5, 11, 'GADGET')]}]
#     {'ner': 2.438779136398807}
#     batch: [('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]}), ('I need a new phone! Any tips?', {'entities': []})]
#     texts: ['How to preorder the iPhone X', 'I need a new phone! Any tips?']
#     annotations: [{'entities': [(20, 28, 'GADGET')]}, {'entities': []}]
#     {'ner': 3.4826324112946168}
#     batch: [('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}), ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})]
#     texts: ['The iPhone 8 reviews are here', 'Should I pay $1,000 for the iPhone X?']
#     annotations: [{'entities': [(4, 12, 'GADGET')]}, {'entities': [(28, 36, 'GADGET')]}]
#     {'ner': 6.446082263835706}
#     random: None
#     batch: [('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]}), ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})]
#     texts: ['Should I pay $1,000 for the iPhone X?', 'How to preorder the iPhone X']
#     annotations: [{'entities': [(28, 36, 'GADGET')]}, {'entities': [(20, 28, 'GADGET')]}]
#     {'ner': 1.4833969387691468}
#     batch: [('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]}), ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})]
#     texts: ['Your iPhone goes up to 11 today', 'The iPhone 8 reviews are here']
#     annotations: [{'entities': [(5, 11, 'GADGET')]}, {'entities': [(4, 12, 'GADGET')]}]
#     {'ner': 2.7859302376164123}
#     batch: [('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}), ('I need a new phone! Any tips?', {'entities': []})]
#     texts: ['iPhone X is coming', 'I need a new phone! Any tips?']
#     annotations: [{'entities': [(0, 8, 'GADGET')]}, {'entities': []}]
#     {'ner': 3.2335402196100063}
#     random: None
#     batch: [('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]}), ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})]
#     texts: ['How to preorder the iPhone X', 'Should I pay $1,000 for the iPhone X?']
#     annotations: [{'entities': [(20, 28, 'GADGET')]}, {'entities': [(28, 36, 'GADGET')]}]
#     {'ner': 0.23502754060609732}
#     batch: [('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]}), ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})]
#     texts: ['Your iPhone goes up to 11 today', 'The iPhone 8 reviews are here']
#     annotations: [{'entities': [(5, 11, 'GADGET')]}, {'entities': [(4, 12, 'GADGET')]}]
#     {'ner': 2.774232676034444}
#     batch: [('I need a new phone! Any tips?', {'entities': []}), ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})]
#     texts: ['I need a new phone! Any tips?', 'iPhone X is coming']
#     annotations: [{'entities': []}, {'entities': [(0, 8, 'GADGET')]}]
#     {'ner': 2.8161797297798827}
#     random: None
#     batch: [('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}), ('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})]
#     texts: ['The iPhone 8 reviews are here', 'Your iPhone goes up to 11 today']
#     annotations: [{'entities': [(4, 12, 'GADGET')]}, {'entities': [(5, 11, 'GADGET')]}]
#     {'ner': 2.651838634375359}
#     batch: [('I need a new phone! Any tips?', {'entities': []}), ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})]
#     texts: ['I need a new phone! Any tips?', 'iPhone X is coming']
#     annotations: [{'entities': []}, {'entities': [(0, 8, 'GADGET')]}]
#     {'ner': 2.655443485509177}
#     batch: [('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]}), ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})]
#     texts: ['How to preorder the iPhone X', 'Should I pay $1,000 for the iPhone X?']
#     annotations: [{'entities': [(20, 28, 'GADGET')]}, {'entities': [(28, 36, 'GADGET')]}]
#     {'ner': 2.6557519075226947}
#     random: None
#     batch: [('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]}), ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})]
#     texts: ['Should I pay $1,000 for the iPhone X?', 'How to preorder the iPhone X']
#     annotations: [{'entities': [(28, 36, 'GADGET')]}, {'entities': [(20, 28, 'GADGET')]}]
#     {'ner': 6.508660462145599e-05}
#     batch: [('I need a new phone! Any tips?', {'entities': []}), ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})]
#     texts: ['I need a new phone! Any tips?', 'The iPhone 8 reviews are here']
#     annotations: [{'entities': []}, {'entities': [(4, 12, 'GADGET')]}]
#     {'ner': 0.00017008849332811327}
#     batch: [('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}), ('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})]
#     texts: ['iPhone X is coming', 'Your iPhone goes up to 11 today']
#     annotations: [{'entities': [(0, 8, 'GADGET')]}, {'entities': [(5, 11, 'GADGET')]}]
#     {'ner': 1.6183852915911352}
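
The shuffle-and-batch step of the loop can be tried out without spaCy. The sketch below uses `simple_minibatch`, a hypothetical plain-Python stand-in that only mimics fixed-size batching; it is not how `spacy.util.minibatch` is implemented. The seed is there only to make the sketch repeatable.

```python
import random


def simple_minibatch(items, size):
    """Hypothetical stand-in for fixed-size batching:
    yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


examples = [
    ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}),
    ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}),
    ('I need a new phone! Any tips?', {'entities': []}),
]

random.seed(0)            # only to make the shuffle repeatable in this sketch
random.shuffle(examples)  # shuffles in place and returns None

batches = list(simple_minibatch(examples, size=2))
for batch in batches:
    # Unzip the (text, annotations) pairs into two parallel lists
    texts = [text for text, _ in batch]
    annotations = [annots for _, annots in batch]
    print(len(batch), texts, annotations)
```

With three examples and a batch size of 2, the last batch holds a single example, which is why the number of examples does not have to be a multiple of the batch size.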


# The numbers printed to the IPython shell are the loss: roughly, the amount of work left for the optimizer. losses is reset at the start of each iteration and grows with every batch, so the last number printed in an iteration is the total for that iteration. The lower the number, the better.
# In real life, you normally want to use a lot more data than this, ideally at least a few hundred or a few thousand examples.
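
To make the trend easier to see, the sketch below takes the last 'ner' loss printed in each of the 10 iterations of the output above (rounded to two decimals) and marks whether it went up or down. Reading the last value as the iteration total is an assumption based on how the printed values grow across the three batches of each iteration.

```python
# Last 'ner' loss printed in each of the 10 iterations above,
# rounded to two decimals.
iteration_losses = [31.70, 16.43, 9.06, 7.41, 8.66, 6.45, 3.23, 2.82, 2.66, 1.62]

for itn, loss in enumerate(iteration_losses, start=1):
    # Compare with the previous iteration to show the direction of change
    if itn == 1:
        trend = 'start'
    elif loss < iteration_losses[itn - 2]:
        trend = 'down'
    else:
        trend = 'up'
    print(f'iteration {itn:2d}: loss {loss:6.2f} ({trend})')

# Iterations whose total loss was higher than the one before
increases = [itn for itn in range(2, len(iteration_losses) + 1)
             if iteration_losses[itn - 1] > iteration_losses[itn - 2]]
print('iterations with a higher loss than the previous one:', increases)
```

The loss does not have to fall on every single iteration; what matters is that it trends downward over the run.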

### Exploring the model
Let's see how the model performs on unseen data! To speed things up a little, here's a trained model for the label 'GADGET', using the examples from the previous exercise, plus a few hundred more. The loaded model is already available as the nlp object. A list of test texts is available as TEST_DATA.

In [84]:
TEST_DATA= ['Apple is slowing down the iPhone 8 and iPhone X - how to stop it',
 "I finally understand what the iPhone X 'notch' is for",
 'Everything you need to know about the Samsung Galaxy S9',
 'Looking to compare iPad models? Here’s how the 2018 lineup stacks up',
 'The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple',
 'what is the cheapest ipad, especially ipad pro???',
 'Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics']

In [85]:
# Process each text in TEST_DATA
for doc in nlp.pipe(TEST_DATA):
    # Print the document text and entities
    print(doc.text)
    print(doc.ents, '\n\n')

#<script.py> output:
    # Apple is slowing down the iPhone 8 and iPhone X - how to stop it
    # (iPhone 8, iPhone X)


    # I finally understand what the iPhone X 'notch' is for
    # (iPhone X,)


    # Everything you need to know about the Samsung Galaxy S9
    # (Samsung Galaxy,)


    # Looking to compare iPad models? Here’s how the 2018 lineup stacks up
    # (iPad,)


    # The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple
    # (iPhone 8, iPhone 8)


    # what is the cheapest ipad, especially ipad pro???
    # (ipad, ipad)


    # Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics
    # (Samsung Galaxy,)

Apple is slowing down the iPhone 8 and iPhone X - how to stop it
() 


I finally understand what the iPhone X 'notch' is for
() 


Everything you need to know about the Samsung Galaxy S9
() 


Looking to compare iPad models? Here’s how the 2018 lineup stacks up
() 


The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple
() 


what is the cheapest ipad, especially ipad pro???
() 


Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics
() 
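
One rough way to judge the commented `<script.py>` output above is to compare the predicted spans with a reference list. In the sketch below, `predicted` is copied from that output, while `expected` is a hand-written assumption about what the full entities should be; it is not taken from the course. Pairing entities by position is a simplification that only works here because both lists have the same length per text.

```python
# Predicted entities, copied from the commented <script.py> output above.
predicted = [
    ['iPhone 8', 'iPhone X'],
    ['iPhone X'],
    ['Samsung Galaxy'],
    ['iPad'],
    ['iPhone 8', 'iPhone 8'],
    ['ipad', 'ipad'],
    ['Samsung Galaxy'],
]

# ASSUMPTION: hand-written reference labels, not part of the original notes.
expected = [
    ['iPhone 8', 'iPhone X'],
    ['iPhone X'],
    ['Samsung Galaxy S9'],
    ['iPad'],
    ['iPhone 8', 'iPhone 8 Plus'],
    ['ipad', 'ipad pro'],
    ['Samsung Galaxy'],
]

# Count exact matches, pairing entities by their position in each text
correct = sum(p == e
              for pred, gold in zip(predicted, expected)
              for p, e in zip(pred, gold))
total = sum(len(gold) for gold in expected)
print(f'{correct} of {total} entities match the reference labels')
```

Under these assumed labels, the misses are all partial matches ("Samsung Galaxy" for "Samsung Galaxy S9", "iPhone 8" for "iPhone 8 Plus", "ipad" for "ipad pro").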




## Training best practices

Problem 1: Models can forget things

Problem 2: Models can't learn everything

### Good data vs. bad data
Here's an excerpt from a training set that labels the entity type TOURIST_DESTINATION in traveler reviews.

In [87]:
TRAINING_DATA = [
    ("i went to amsterdem last year and the canals were beautiful", {'entities': [(10, 19, 'TOURIST_DESTINATION')]}),
    ("You should visit Paris once in your life, but the Eiffel Tower is kinda boring", {'entities': [(17, 22, 'TOURIST_DESTINATION')]}),
    ("There's also a Paris in Arkansas, lol", {'entities': []}),
    ("Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!", {'entities': [(0, 6, 'TOURIST_DESTINATION')]})
]

#print(*TRAINING_DATA, sep='\n')

# Rewrite the TRAINING_DATA to only use the label GPE (cities, states, countries) instead of TOURIST_DESTINATION.

TRAINING_DATA = [
    ("i went to amsterdem last year and the canals were beautiful", {'entities': [(10, 19, 'GPE')]}),
    ("You should visit Paris once in your life, but the Eiffel Tower is kinda boring", {'entities': [(17, 22, 'GPE')]}),
    ("There's also a Paris in Arkansas, lol", {'entities': [(15, 20, 'GPE'), (24, 32, 'GPE')]}),
    ("Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!", {'entities': [(0, 6, 'GPE')]})
]

print(*TRAINING_DATA, sep='\n')

('i went to amsterdem last year and the canals were beautiful', {'entities': [(10, 19, 'GPE')]})
('You should visit Paris once in your life, but the Eiffel Tower is kinda boring', {'entities': [(17, 22, 'GPE')]})
("There's also a Paris in Arkansas, lol", {'entities': [(15, 20, 'GPE'), (24, 32, 'GPE')]})
('Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!', {'entities': [(0, 6, 'GPE')]})
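
Renaming a label on existing spans can be automated, as in the sketch below (the helper `relabel` is hypothetical). Note what it cannot do: the sentence about Paris in Arkansas had no entities under TOURIST_DESTINATION, so its two GPE spans still have to be added by hand, as in the rewritten data above.

```python
def relabel(training_data, old_label, new_label):
    """Hypothetical helper: return a copy of the data with one label renamed."""
    relabelled = []
    for text, annotations in training_data:
        entities = [(start, end, new_label if label == old_label else label)
                    for start, end, label in annotations['entities']]
        relabelled.append((text, {'entities': entities}))
    return relabelled


original = [
    ("You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
     {'entities': [(17, 22, 'TOURIST_DESTINATION')]}),
    ("There's also a Paris in Arkansas, lol", {'entities': []}),
]

renamed = relabel(original, 'TOURIST_DESTINATION', 'GPE')
print(*renamed, sep='\n')
```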


### Training multiple labels
Here's a small sample of a dataset created to train a new entity type WEBSITE. The original dataset contains a few thousand sentences. In this exercise, you'll be doing the labeling by hand. In real life, you probably want to automate this and use an annotation tool, for example Brat (http://brat.nlplab.org/), a popular open-source solution, or Prodigy (https://prodi.gy/), our own annotation tool that integrates with spaCy.

After this exercise you will be nearly done with the course! If you enjoyed it, feel free to send Ines a thank you via Twitter - she'll appreciate it!

In [88]:
TRAINING_DATA = [
    ("Reddit partners with Patreon to help creators build communities",
     {'entities': [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),

    ("PewDiePie smashes YouTube record",
     {'entities': [(18, 25, 'WEBSITE')]}),

    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
     {'entities': [(0, 6, 'WEBSITE')]}),
    # And so on...
]

In [89]:
# Updated training data: PERSON annotations added for 'PewDiePie' and 'Alexis Ohanian'
TRAINING_DATA = [
    ("Reddit partners with Patreon to help creators build communities",
     {'entities': [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),

    ("PewDiePie smashes YouTube record",
     {'entities': [(0, 9, 'PERSON'), (18, 25, 'WEBSITE')]}),

    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
     {'entities': [(0, 6, 'WEBSITE'), (15, 29, 'PERSON')]}),
    # And so on...
]
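
Counting character offsets by hand is error-prone, so one option is to compute them from the phrases themselves. The sketch below is a minimal illustration, not a replacement for an annotation tool: `make_example` is a hypothetical helper, and because it relies on `str.find` it only locates the first occurrence of each phrase.

```python
def make_example(text, labelled_phrases):
    """Hypothetical helper: build a (text, annotations) pair from
    (phrase, label) pairs by locating each phrase in the text."""
    entities = []
    for phrase, label in labelled_phrases:
        start = text.find(phrase)  # first occurrence only
        if start == -1:
            raise ValueError(f'{phrase!r} not found in {text!r}')
        entities.append((start, start + len(phrase), label))
    # Sort by start offset so the spans appear in reading order
    return (text, {'entities': sorted(entities)})


example = make_example(
    "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
    [('Reddit', 'WEBSITE'), ('Alexis Ohanian', 'PERSON')],
)
print(example)
```

For texts where a phrase appears more than once, or where entities overlap, the offsets would still need to be checked manually.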