# Chapter 1: Finding words, phrases, names and concepts

In [1]:
import numpy as np
import pandas as pd

### Getting Started
Let's get started and try out spaCy! In this exercise, you'll be able to try out some of the 45+ available languages.

This course introduces a lot of new concepts, so if you ever need a quick refresher, download the spaCy Cheat Sheet and keep it handy!

#### Instructions 1/3

1. Import the English class from spacy.lang.en and create the nlp object. Create a doc and print its text.
2. Import the German class from spacy.lang.de and create the nlp object. Create a doc and print its text.
3. Import the Spanish class from spacy.lang.es and create the nlp object. Create a doc and print its text.

In [2]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(doc.text)

This is a sentence.


In [3]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [4]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?
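
The three cells above can also be written with `spacy.blank`, which looks up the language class from its language code. This is a minimal sketch that assumes only that spaCy itself is installed (no trained pipeline is needed); the `samples` dictionary and the names `nlp_blank` and `blank_doc` are illustrative and not part of the course.

```python
import spacy

# Language code -> sample text (the same sentences used above)
samples = {
    "en": "This is a sentence.",
    "de": "Liebe Grüße!",
    "es": "¿Cómo estás?",
}

texts = []
for code, sample in samples.items():
    # spacy.blank returns a tokenizer-only pipeline for the given language
    nlp_blank = spacy.blank(code)
    blank_doc = nlp_blank(sample)
    texts.append((nlp_blank.lang, blank_doc.text))
    print(nlp_blank.lang, "->", blank_doc.text)
```

Each pipeline only tokenizes here, so `blank_doc.text` reproduces the input string unchanged.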


---

### Documents, spans and tokens
When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you'll learn more about the Doc, as well as its views Token and Span.

#### Instructions 1/2

Import the English language class and create the nlp object.
Process the text and instantiate a Doc object in the variable doc.
Select the first token of the Doc and print its text.

In [5]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


In [6]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:-1]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals
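
Because Token and Span are views of the Doc, they keep track of where they sit in it. The sketch below shows the index attributes involved (`token.i`, `span.start`, `span.end`, `span.doc`) on the same sentence; the names `nlp_blank`, `demo_doc` and `span` are chosen for this example only.

```python
from spacy.lang.en import English

nlp_blank = English()
demo_doc = nlp_blank("I like tree kangaroos and narwhals.")

span = demo_doc[2:4]  # "tree kangaroos"

# A Span records its token boundaries in the parent Doc
# (start is inclusive, end is exclusive)
print(span.start, span.end)

# Tokens inside a Span still report their index in the Doc, not in the Span
print([(token.i, token.text) for token in span])

# The Span refers back to the Doc it was sliced from
print(span.doc is demo_doc)
```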


---

### Lexical attributes

In this example, you'll use spaCy's Doc and Token objects and lexical attributes to find percentages in a text. You'll be looking for two consecutive tokens: a number and a percent sign. The English nlp object has already been created.

#### Instructions

Use the like_num token attribute to check whether a token in the doc resembles a number.
Get the token following the current token in the document. The index of the next token in the doc is token.i + 1.
Check whether the next token's text attribute is a percent sign "%".

In [7]:
# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == '%':
            print('Percentage found:', token.text)

Percentage found: 60
Percentage found: 4
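
One caveat with `doc[token.i + 1]`: if the text ends with a number, the lookup runs past the end of the Doc and raises an IndexError. The sketch below adds a bounds check; it assumes a blank English pipeline (`like_num` is a lexical attribute, so no trained model is required), and the trailing "Total: 100" is added here only to trigger the edge case.

```python
from spacy.lang.en import English

nlp_blank = English()
demo_doc = nlp_blank(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are. Total: 100"
)

percentages = []
for token in demo_doc:
    # Only look ahead when a next token actually exists
    if token.like_num and token.i + 1 < len(demo_doc):
        if demo_doc[token.i + 1].text == "%":
            percentages.append(token.text)

print(percentages)
```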


---

### Loading models

Let's start by loading a model. spacy is already imported.

#### Instructions 1/2

1. Use spacy.load to load the small English model 'en_core_web_sm'. Process the text and print the document text.
2. Use spacy.load to load the small German model 'de_core_news_sm'. Process the text and print the document text.

In [8]:
import spacy

!python -m spacy download en_core_web_sm 
!python -m spacy download de_core_news_sm

Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting de-core-news-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.3.0/de_core_news_sm-3.3.0-py3-none-any.whl (14.6 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')


In [9]:
# Load the 'en_core_web_sm' model – spaCy is already imported
nlp = spacy.load('en_core_web_sm')

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


In [10]:
# Load the 'de_core_news_sm' model – spaCy is already imported
nlp = spacy.load('de_core_news_sm')

text = "Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht


---

### Predicting linguistic annotations

You'll now get to try one of spaCy's pre-trained model packages and see its predictions in action. Feel free to try it out on your own text! The small English model is already available as the variable nlp.

To find out what a tag or label means, you can call spacy.explain in the IPython shell. For example: spacy.explain('PROPN') or spacy.explain('GPE').

#### Instructions 1/2

Process the text with the nlp object and create a doc.
For each token, print the token text, the token's .pos_ (part-of-speech tag) and the token's .dep_ (dependency label).

In [11]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print('{:<12}{:<10}{:<10}'.format(token_text, token_pos, token_dep))

It          PROPN     pnc       
’s          PROPN     ROOT      
official    ADV       mnr       
:           PUNCT     punct     
Apple       X         pnc       
is          X         uc        
the         X         uc        
first       X         pnc       
U.S.        X         app       
public      X         punct     
company     X         uc        
to          X         uc        
reach       X         uc        
a           PROPN     pnc       
$           PROPN     ag        
1           NUM       pnc       
trillion    NOUN      ROOT      
market      PROPN     pnc       
value       PROPN     nk        

Note: the labels above (pnc, nk, ag, uc, mnr, app) belong to the German dependency scheme. When this cell ran, nlp still pointed to 'de_core_news_sm', which was loaded in the previous cell, so the English sentence was analyzed by the German model. Reload the English model with nlp = spacy.load('en_core_web_sm') before processing the text to get English part-of-speech tags and dependency labels.


In [12]:
spacy.explain('PROPN')

'proper noun'

In [13]:
nlp = spacy.load('en_core_web_sm')

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


---

### Predicting named entities in context

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you're processing. Let's take a look at an example. The small English model is available as the variable nlp.

#### Instructions 1/2

Process the text with the nlp object.
Iterate over the entities with the iterator ent and print the entity text and label.



In [14]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # print the entity text and label
    print(ent.text, ent.label_)

Apple ORG


In [15]:
# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print('Missing entity:', iphone_x.text)

Missing entity: iPhone X


---

### Using the Matcher

Let's try spaCy's rule-based Matcher. You'll be using the example from the previous exercise and write a pattern that can match the phrase "iPhone X" in the text. The nlp object and a processed doc are already available.

#### Instructions 1/3

Import the Matcher from spacy.matcher.
Initialize it with the nlp object's shared vocab.

In [16]:
# Import the Matcher and initialize it with the shared vocabulary
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# Add the pattern to the matcher
matcher.add('IPHONE_X_PATTERN', [pattern], on_match = None)

# Use the matcher on the doc
matches = matcher(doc)
print('Matches:', [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


---

Below is an example of using a regex in spaCy's Matcher. It comes from this [Stack Overflow question](https://stackoverflow.com/questions/63280191/how-can-i-use-spacy-matcher-or-phrasematcher-class-for-the-extracting-the-sequ).

In [17]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)

colors=['red','gray','black','white','brown']
animals=['fox','bear','hare','squirrel','wolf']
pattern = [
   {'TEXT': {"REGEX": 
             fr"(?i)^({'|'.join(colors)})$"}},
   {'TEXT': {"REGEX": fr"(?i)^({'|'.join(animals)})$"}}
]
matcher.add("ColoredAnimals", [pattern], on_match = None)

doc = nlp("Hello, red fox! Hello Black Hare! What's up whItE sQuirrel, brown wolf and gray bear!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

6184782843368750095 ColoredAnimals 2 4 red fox
6184782843368750095 ColoredAnimals 6 8 Black Hare
6184782843368750095 ColoredAnimals 12 14 whItE sQuirrel
6184782843368750095 ColoredAnimals 15 17 brown wolf
6184782843368750095 ColoredAnimals 18 20 gray bear
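
For fixed word lists like these, the regex can be swapped for the Matcher's `IN` set operator on the `LOWER` attribute. This is an alternative sketch, not what the linked answer uses. It assumes a blank English pipeline, since `LOWER` is a lexical attribute and the tokenizer is enough; `nlp_blank`, `color_matcher`, `color_doc` and `found` are names made up for this example.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp_blank = English()
color_matcher = Matcher(nlp_blank.vocab)

colors = ["red", "gray", "black", "white", "brown"]
animals = ["fox", "bear", "hare", "squirrel", "wolf"]

# LOWER compares against the lowercased token text, so casing is ignored
pattern = [{"LOWER": {"IN": colors}}, {"LOWER": {"IN": animals}}]
color_matcher.add("ColoredAnimals", [pattern])

color_doc = nlp_blank(
    "Hello, red fox! Hello Black Hare! What's up whItE sQuirrel, brown wolf and gray bear!"
)

# Sort by start index so the spans come out in document order
matches = sorted(color_matcher(color_doc), key=lambda match: match[1])
found = [color_doc[start:end].text for _, start, end in matches]
print(found)
```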


---

### Writing match patterns

In this exercise, you'll practice writing more complex match patterns using different token attributes and operators. A matcher is already initialized and available as the variable matcher.

#### Instructions 1/3

1. Write one pattern that only matches mentions of the full iOS versions: "iOS 7", "iOS 11" and "iOS 10".
2. Write one pattern that only matches forms of "download" (tokens with the lemma "download"), followed by a token with the part-of-speech tag 'PROPN' (proper noun).
3. Write one pattern that matches adjectives ('ADJ') followed by one or two 'NOUN's (one noun and one optional noun).

In [18]:
doc = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', [pattern], on_match = None)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [19]:
doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when \
I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I \
used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', [pattern], on_match = None)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [20]:
doc = nlp("Features of the app include a beautiful design, smart search, \
automatic labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern], on_match = None)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
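
The last two matches overlap: the optional third token lets the pattern match both "optional voice" and "optional voice responses". In spaCy v3, `Matcher.add` accepts `greedy="LONGEST"` to keep only the longest of such overlapping matches. The sketch below assumes the part-of-speech tags are set by hand on a three-token Doc, so it runs without a trained model; all variable names are illustrative.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-annotated Doc standing in for the model's predictions
demo_doc = Doc(
    vocab,
    words=["optional", "voice", "responses"],
    pos=["ADJ", "NOUN", "NOUN"],
)

pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Default behaviour: every possible match is returned
all_matcher = Matcher(vocab)
all_matcher.add("ADJ_NOUN_PATTERN", [pattern])
all_spans = sorted(demo_doc[start:end].text for _, start, end in all_matcher(demo_doc))

# greedy="LONGEST": overlapping matches are reduced to the longest one
longest_matcher = Matcher(vocab)
longest_matcher.add("ADJ_NOUN_PATTERN", [pattern], greedy="LONGEST")
longest_spans = [demo_doc[start:end].text for _, start, end in longest_matcher(demo_doc)]

print(all_spans)
print(longest_spans)
```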


---

# Chapter 2: Large-scale data analysis with spaCy

### Strings to hashes

The nlp object has already been created for you.

#### Instructions 1/2

1. Look up the string "cat" in nlp.vocab.strings to get the hash. Look up the hash to get back the string.
2. Look up the string label "PERSON" in nlp.vocab.strings to get the hash. Look up the hash to get back the string.

In [21]:
# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings['cat']
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


In [22]:
# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings['PERSON']
print(person_hash)

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON


---

### Vocab, hashes and lexemes

Why does this code throw an error?

```
from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()
```

```
# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings['Bowie']
print(bowie_id)

# Look up the ID for 'Bowie' in the vocab
print(nlp_de.vocab.strings[bowie_id])
```
The English language class is already available as the nlp object.

#### Instructions

Possible Answers

- The string 'Bowie' isn't present in the German vocab, so the hash can't be resolved in the string store.
- 'Bowie' is not a regular word in the English or German dictionary, so it can't be hashed.
- nlp_de is not a valid name. The vocab can only be shared if the nlp objects have the same name.

In [23]:
from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings['Bowie']
print(bowie_id)
print('')

# Look up the ID for 'Bowie' in the vocab
try:
    print(nlp_de.vocab.strings[bowie_id])
except KeyError:  # catch the error so the notebook keeps running past this cell
    print("[E018] Can't retrieve string for hash '2644858412616767388'. This usually refers to an issue with the `Vocab` or `StringStore`.'")

2644858412616767388

[E018] Can't retrieve string for hash '2644858412616767388'. This usually refers to an issue with the `Vocab` or `StringStore`.'
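
The hash itself is the same in every vocab; what differs is whether a given string store has the string on record. The sketch below shows that adding 'Bowie' to the German string store makes the reverse lookup work. It uses `StringStore.add`, and the names `known_before`, `de_id` and `resolved` are chosen for this example.

```python
from spacy.lang.en import English
from spacy.lang.de import German

nlp_en = English()
nlp_de = German()

# add() stores the string and returns its hash
bowie_id = nlp_en.vocab.strings.add("Bowie")

# The German string store has not seen the string yet
known_before = "Bowie" in nlp_de.vocab.strings

# Adding it yields the same hash, because hashing does not depend on the vocab
de_id = nlp_de.vocab.strings.add("Bowie")

# Now the hash can be resolved back to the string in the German vocab too
resolved = nlp_de.vocab.strings[bowie_id]
print(known_before, de_id == bowie_id, resolved)
```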


---

### Creating a Doc

Let's create some Doc objects from scratch! The nlp object has already been created for you.

By the way, if you haven't downloaded it already, check out the spaCy Cheat Sheet. It includes an overview of the most important concepts and methods and might come in handy if you ever need a quick refresher!

#### Instructions 1/3

1. Import the Doc from spacy.tokens. Create a Doc from the words and spaces. Don't forget to pass in the vocab!
2. Import the Doc from spacy.tokens. Create a Doc from the words and spaces. Don't forget to pass in the vocab!
3. Import the Doc from spacy.tokens. Complete the words and spaces to match the desired text and create a doc.

In [24]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ['spaCy', 'is', 'cool', '!']
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


In [25]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ['Go', ',', 'get', 'started', '!']
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


In [26]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ['Oh', ',', 'really', '?', '!']
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!


---

### Docs, spans and entities from scratch

In this exercise, you'll create the Doc and Span objects manually, and update the named entities – just like spaCy does behind the scenes. A shared nlp object has already been created.

#### Instructions 1/3

Import the Doc and Span classes from spacy.tokens.
Use the Doc class directly to create a doc from the words and spaces.

In [27]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ['I', 'like', 'David', 'Bowie']
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words = words, spaces = spaces)
print(doc.text)

I like David Bowie


In [28]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label='PERSON')
print(span.text, span.label_)

David Bowie PERSON


In [29]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label='PERSON')

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

[('David Bowie', 'PERSON')]


---

### Data structures best practices

The code in this example is trying to analyze a text and collect all proper nouns. If the token following the proper noun is a verb, it should also be extracted. A doc object has already been created.

```python
# Get all tokens and part-of-speech tags
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == 'PROPN':
        # Check if the next token is a verb
        if pos_tags[index + 1] == 'VERB':
            print('Found a verb after a proper noun!')
```

#### Instructions 1/2

Question
Why is the code bad?

Possible Answers

- The tokens in the result should be converted back to Token objects. This will let you reuse them in spaCy.
- It only uses lists of strings instead of native token attributes. This is often less efficient, and can't express complex relationships.
- pos_ is the wrong attribute to use for extracting proper nouns. You should use tag_ and the 'NNP' and 'NNS' labels instead.

In [30]:
nlp = spacy.load('en_core_web_sm')
text = 'Berlin is a nice city'

doc = nlp(text)

In [31]:
# Get all tokens and part-of-speech tags
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == 'PROPN':
        # Check if the next token is a verb
        if pos_tags[index + 1] == 'VERB':
            print('Found a verb after a proper noun!')

In [32]:
# pos_tags is a plain list of part-of-speech tag strings, detached from the tokens
print(pos_tags)

['PROPN', 'AUX', 'DET', 'ADJ', 'NOUN']


Re-write it!

In [33]:
for token in doc:
    # Check if the current token is a proper noun and is not the last token
    if token.pos_ == 'PROPN' and token.i + 1 < len(doc):
        # Check if the next token is a verb. This model tags the copula "is"
        # as AUX rather than VERB (unlike the class example), so accept both.
        if doc[token.i + 1].pos_ in ('VERB', 'AUX'):
            print('Found a verb after a proper noun!')

Found a verb after a proper noun!
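
The same check can also be written declaratively with the Matcher from Chapter 1, which keeps the logic on native token attributes. In this sketch the part-of-speech tags printed above are set by hand on a Doc, so no trained model is required; the pattern name PROPN_VERB and the variable names are made up for this example.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-annotated Doc with the tags the model predicted for this sentence
demo_doc = Doc(
    vocab,
    words=["Berlin", "is", "a", "nice", "city"],
    pos=["PROPN", "AUX", "DET", "ADJ", "NOUN"],
)

pos_matcher = Matcher(vocab)
# A proper noun directly followed by a verb or an auxiliary
pattern = [{"POS": "PROPN"}, {"POS": {"IN": ["VERB", "AUX"]}}]
pos_matcher.add("PROPN_VERB", [pattern])

hits = [demo_doc[start:end].text for _, start, end in pos_matcher(demo_doc)]
print(hits)
```

A pattern can never run past the end of the Doc, so no explicit bounds check is needed here.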


---

### Inspecting word vectors

In this exercise, you'll use a larger English model, which includes around 20,000 word vectors. Because vectors take a little longer to load, we're using a slightly more compressed version than the one you can download with spaCy. The model is already pre-installed, and spacy has already been imported for you.

#### Instructions

Load the medium 'en_core_web_md' model with word vectors.
Print the vector for "bananas" using the token.vector attribute.

In [34]:
!python -m spacy download en_core_web_md;

Collecting en-core-web-md==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.3.0/en_core_web_md-3.3.0-py3-none-any.whl (33.5 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [35]:
# Load the en_core_web_md model
nlp = spacy.load('en_core_web_md')

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[-0.6334     0.18981   -0.53544   -0.52658   -0.30001    0.30559
 -0.49303    0.14636    0.012273   0.96802    0.0040354  0.25234
 -0.29864   -0.014646  -0.24905   -0.67125   -0.053366   0.59426
 -0.068034   0.10315    0.66759    0.024617  -0.37548    0.52557
  0.054449  -0.36748   -0.28013    0.090898  -0.025687  -0.5947
 -0.24269    0.28603    0.686      0.29737    0.30422    0.69032
  0.042784   0.023701  -0.57165    0.70581   -0.20813   -0.03204
 -0.12494   -0.42933    0.31271    0.30352    0.09421   -0.15493
  0.071356   0.15022   -0.41792    0.066394  -0.034546  -0.45772
  0.57177   -0.82755   -0.27885    0.71801   -0.12425    0.18551
  0.41342   -0.53997    0.55864   -0.015805  -0.1074    -0.29981
 -0.17271    0.27066    0.043996   0.60107   -0.353      0.6831
  0.20703    0.12068    0.24852   -0.15605    0.25812    0.007004
 -0.10741   -0.097053   0.085628   0.096307   0.20857   -0.23338
 -0.077905  -0.030906   1.0494     0.55368   -0.10703    0.052234
  0.43407   -0.13926    0

---

### Comparing similarities

In this exercise, you'll be using spaCy's similarity methods to compare Doc, Token and Span objects and get similarity scores. The medium English model is already available as the nlp object.

#### Instructions 1/3

- Use the doc.similarity method to compare doc1 to doc2 and print the result.
- Use the token.similarity method to compare token1 to token2 and print the result.
- Create spans for "great restaurant"/"really nice bar".
- Use span.similarity to compare them and print the result.

In [36]:
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8456855743724329


In [37]:
doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books" 
similarity = token1.similarity(token2)
print(similarity)

0.18317238986492157


In [38]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[-4:-1]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.7541285157203674
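
By default, these scores are cosine similarities between vectors, and the vector of a Doc or Span is the average of its token vectors. The sketch below spells out that arithmetic in plain Python. The three-dimensional vectors are made up for illustration and are not real model output, and the helper names `cosine_similarity` and `mean_vector` are not spaCy functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise average of several equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up token vectors (hypothetical values)
great = [0.9, 0.1, 0.0]
restaurant = [0.2, 0.8, 0.1]
nice = [0.8, 0.2, 0.1]
bar = [0.1, 0.9, 0.2]

span1_vector = mean_vector([great, restaurant])
span2_vector = mean_vector([nice, bar])
print(cosine_similarity(span1_vector, span2_vector))
```

Because the score depends only on the angle between the vectors, two vectors pointing in the same direction score 1.0 regardless of their length.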


---

### Debugging patterns (1)

Why does this pattern not match the tokens "Silicon Valley" in the doc?

`pattern = [{'LOWER': 'silicon'}, {'TEXT': ' '}, {'LOWER': 'valley'}]`

`doc = nlp("Can Silicon Valley workers rein in big tech from within?")`

You can try it out in your IPython shell. The matcher with the added pattern and the doc are already created.

#### Instructions

Possible Answers

- The tokens "Silicon" and "Valley" are not lowercase, so the 'LOWER' attribute won't match.
- The tokenizer doesn't create tokens for single spaces, so there's no token with the value ' ' in between.
- The tokens are missing an operator 'OP' to indicate that they should be matched exactly once.

In [39]:
matcher = Matcher(nlp.vocab)

doc = nlp("Can Silicon Valley workers rein in big tech from within?")

pattern = [{'LOWER': 'silicon'}, {'TEXT': ' '}, {'LOWER': 'valley'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('SILICON_VALLEY_PATTERN', [pattern], on_match = None)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 0


The matcher finds no matches. Is the `LOWER` attribute to blame, or the `' '` token in the middle of the pattern? Let's rule out `LOWER` first.

The `LOWER` attribute in the pattern describes tokens whose lowercase form matches a given value. So {'LOWER': 'valley'} will match tokens like "Valley", "VALLEY", "valley" etc.

In [40]:
matcher = Matcher(nlp.vocab)

doc = nlp("Can Silicon Valley workers rein in big tech from within?")

pattern = [{'LOWER': 'silicon'}, {'LOWER': 'valley'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('SILICON_VALLEY_PATTERN', [pattern], on_match = None)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 1
Match found: Silicon Valley


The tokenizer already takes care of splitting off whitespace and each dictionary in the pattern describes one token.
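
Printing the tokens makes this visible: the space after a word is stored on the token as trailing whitespace, not as a token of its own. The sketch assumes a blank English pipeline, which uses the same tokenizer rules; `nlp_blank`, `sv_doc` and `token_texts` are names for this example only.

```python
from spacy.lang.en import English

nlp_blank = English()
sv_doc = nlp_blank("Can Silicon Valley workers rein in big tech from within?")

# token.whitespace_ holds the whitespace that follows the token, if any
for token in sv_doc[:3]:
    print(repr(token.text), repr(token.whitespace_))

token_texts = [token.text for token in sv_doc]
print(" " in token_texts)
```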

---

### Debugging patterns (2)

Both patterns in this exercise contain mistakes and won't match as expected. Can you fix them?

The nlp and a doc have already been created for you. If you get stuck, try printing the tokens in the doc to see how the text will be split and adjust the pattern so that each dictionary represents one token.

#### Instructions

Edit pattern1 so that it correctly matches all case-insensitive mentions of "Amazon" plus a title-cased proper noun.
Edit pattern2 so that it correctly matches all case-insensitive mentions of "ad-free", plus the following noun.

In [41]:
# re-initialize Matcher() again
matcher = Matcher(nlp.vocab)

doc = nlp("Twitch Prime, the perks program for Amazon Prime members offering free loot, \
games and other benefits, is ditching one of its best features: ad-free viewing. According \
to an email sent out to Amazon Prime members today, ad-free viewing will no longer be included \
as a part of Twitch Prime for new members, beginning on September 14. However, members \
with existing annual subscriptions will be able to continue to enjoy ad-free viewing until \
their subscription comes up for renewal. Those with monthly subscriptions will have access to \
ad-free viewing until October 15.")

Original code below:

In [42]:
# Create the match patterns
pattern1 = [{'LOWER': 'Amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
pattern2 = [{'LOWER': 'ad-free'}, {'POS': 'NOUN'}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', [pattern1], on_match = None)
matcher.add('PATTERN2', [pattern2], on_match = None)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

In [43]:
# our fix is below
# Create the match patterns
pattern1 = [{'LOWER': 'amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
pattern2 = [{'LOWER': 'ad'}, {'TEXT': '-'}, {'LOWER': 'free'}, {'POS': 'NOUN'}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', [pattern1], on_match = None)
matcher.add('PATTERN2', [pattern2], on_match = None)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


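Why does `ad-free` have to be split into three parts for the spaCy `Matcher`? The English tokenizer splits a hyphen between letters into a token of its own, so "ad-free" becomes the three tokens "ad", "-" and "free", and each dictionary in a pattern describes exactly one token.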
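
Listing the tokens is the quickest way to see how a hyphenated word is split before writing a pattern for it. The sketch below assumes a blank English pipeline (same tokenizer rules, no trained model) and a shortened example sentence; the pattern name AD_FREE and the variable names are made up here.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp_blank = English()
ad_doc = nlp_blank("Ad-free viewing and ad-free viewing")

# Inspect the tokenization first
tokens = [token.text for token in ad_doc]
print(tokens)

# One dictionary per token: "ad", "-", "free"
ad_matcher = Matcher(nlp_blank.vocab)
ad_matcher.add("AD_FREE", [[{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}]])

matches = sorted(ad_matcher(ad_doc), key=lambda match: match[1])
ad_hits = [ad_doc[start:end].text for _, start, end in matches]
print(ad_hits)
```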

---

### Efficient phrase matching

Sometimes it's more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world.

We already have a list of countries, so let's use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES. The nlp object and a test doc have already been created and the doc.text has been printed to the shell.

#### Instructions

Import the PhraseMatcher and initialize it with the shared vocab as the variable matcher.
Add the phrase patterns and call the matcher on the doc.

In [44]:
# read the countries from COUNTRIES.txt
with open('data/COUNTRIES.txt', 'r') as f:
    COUNTRIES = [c.rstrip('\n') for c in f.readlines()]
text = 'Czech Republic may help Slovakia protect its airspace'

doc = nlp(text)

In [45]:
# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))

# patterns = [nlp.make_doc(country) for country in COUNTRIES]

matcher.add('COUNTRY', patterns, on_match = None)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia]
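
By default the PhraseMatcher compares the exact token text, so "slovakia" or "SLOVAKIA" would be missed. Passing `attr="LOWER"` makes it compare lowercased text instead. In this sketch the two-item `country_names` list is a hypothetical stand-in for COUNTRIES, and a blank English pipeline is assumed.

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp_blank = English()
country_names = ["Czech Republic", "Slovakia"]  # stand-in for COUNTRIES

# attr="LOWER" matches on the lowercase form of each token
lower_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
# make_doc only runs the tokenizer, which is all the patterns need
lower_matcher.add("COUNTRY", [nlp_blank.make_doc(name) for name in country_names])

demo_doc = nlp_blank("czech republic may help SLOVAKIA protect its airspace")
matches = sorted(lower_matcher(demo_doc), key=lambda match: match[1])
found = [demo_doc[start:end].text for _, start, end in matches]
print(found)
```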


---

### Extracting countries and relationships

In the previous exercise, you wrote a script using spaCy's PhraseMatcher to find country names in text. Let's use that country matcher on a longer text, analyze the syntax and update the document's entities with the matched countries. The nlp object has already been created.

The text is available as the variable text, the PhraseMatcher with the country patterns as the variable matcher. The Span class has already been imported.

#### Instructions 1/2

- Iterate over the matches and create a Span with the label "GPE" (geopolitical entity).
- Overwrite the entities in doc.ents and add the matched span.

- Update the script and get the matched span's root head token.
- Print the text of the head token and the span.

In [46]:
# read the file
with open('data/long_text.txt', 'r') as f:
    text = f.read()

In [47]:
# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
nlp = spacy.load('en_core_web_md')
matcher = PhraseMatcher(nlp.vocab)

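# Create pattern Doc objects
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
# Note: the patterns are never passed to matcher.add here, so this new matcher is
# empty and the GPE entities printed below come from the model's own predictions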

# Create a doc and find matches in it
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label='GPE')
    print(span)
    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])

[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('US', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Yugoslavia', 'GPE'), ('Somalia', 'GPE'), ('US', 'GPE'), ('Mogadishu', 'GPE'), ('Bosnia', 'GPE'), ('Rwanda', 'GPE'), ('US', 'GPE'), ('Britain', 'GPE'), ('Singapore', 'GPE'), ('the United States', 'GPE'), ('Afghanistan', 'GPE'), ('the United States', 'GPE'), ('Iraq', 'GPE'), ('Darfur', 'GPE'), ('Sudan', 'GPE'), ('the Democratic Republic of Congo', 'GPE'), ('Haiti', 'GPE'), ('UN\\', 'GPE')]


In [48]:
# Create a doc and find matches in it
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE" and overwrite the doc.ents
    span = Span(doc, start, end, label='GPE')
    doc.ents = list(doc.ents) + [span]
    
    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, '-->', span.text)

---

# Chapter 3: Processing Pipelines

### What happens when you call nlp?

What does spaCy do when you call nlp on a string of text? The IPython shell has a pre-loaded nlp object that logs what's going on under the hood. Try processing a text with it!

doc = nlp("This is a sentence.")

#### Possible Answers

- Run the tagger, parser and entity recognizer and then the tokenizer.
- **Tokenize the text and apply each pipeline component in order.**
- Connect to the spaCy server to compute the result and return it.
- Initialize the language, add the pipeline and load in the binary model weights.

In [49]:
doc = nlp("This is a sentence.")

In [50]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x16cde46a0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x16cde47c0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x16cdc2ab0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x16ae5cd40>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x169f48a00>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x16cdc3220>)]
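
The idea of "tokenize, then apply each component in order" can be illustrated without spaCy at all. The following is only a conceptual sketch, not spaCy's actual implementation: the "doc" is a plain dictionary, and `run_pipeline`, `toy_tokenizer` and `make_logging_component` are hypothetical names.

```python
def run_pipeline(text, tokenizer, components):
    """Tokenize the text, then pass the result through each component in order."""
    doc = tokenizer(text)
    for name, component in components:
        doc = component(doc)
    return doc


# Toy stand-ins (not spaCy objects): a "doc" is just a dict here
def toy_tokenizer(text):
    return {"tokens": text.replace(".", " .").split(), "log": ["tokenizer"]}


def make_logging_component(name):
    def component(doc):
        # Record that this component has seen the doc, then hand it on
        doc["log"].append(name)
        return doc
    return component


names = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]
components = [(name, make_logging_component(name)) for name in names]

doc = run_pipeline("This is a sentence.", toy_tokenizer, components)
print(doc["tokens"])
print(doc["log"])
```

Each component receives the doc, modifies it and returns it, which is the same contract the custom components later in this chapter follow.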

---

### Inspecting the pipeline

Let's inspect the small English model's pipeline!

#### Instructions

- Load the en_core_web_sm model and create the nlp object.
- Print the names of the pipeline components using nlp.pipe_names.
- Print the full pipeline of (name, component) tuples using nlp.pipeline.

In [51]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Print the names of the pipeline components
print(nlp.pipe_names)

print('\n')

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x16bef7700>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x16bef70a0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x16f68eff0>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x170178fc0>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x1701a3100>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x16f68ec00>)]


---

### Use cases for custom components

Which of these problems can be solved by custom pipeline components? Choose all that apply!

1. updating the pre-trained models and improving their predictions
2. computing your own values based on tokens and their attributes
3. adding named entities, for example based on a dictionary
4. implementing support for an additional language

#### Possible Answers

- 1 and 2.
- 1 and 3.
- 1 and 4.
- **2 and 3.**
- 2 and 4.
- 3 and 4.

---

### Simple components

The example shows a custom component that prints the length of a document in tokens. Can you complete it? spacy has already been imported for you.

#### Instructions 1/3

- Complete the component function with the doc's length.
- Add the length_component to the existing pipeline as the first component.
- Try out the new pipeline and process any text with the nlp object – for example "This is a sentence.".

In [52]:
from spacy.language import Language

# Define the custom component
@Language.component('length_component')
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc
  
# Load the small English model and add the component first in the pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('length_component', first=True)

# Process a text
doc = nlp('This is a sentence.')

This document is 5 tokens long.
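
To check where a component ends up in the pipeline, `nlp.pipe_names` can be inspected after `add_pipe`. A minimal sketch, assuming a blank English pipeline with the built-in `sentencizer` standing in for the model's components; the component name `token_count_sketch` and the `user_data` key are hypothetical.

```python
import spacy
from spacy.language import Language


@Language.component("token_count_sketch")  # hypothetical component name
def token_count_sketch(doc):
    # Store the token count instead of printing it, so it can be inspected later
    doc.user_data["token_count"] = len(doc)
    return doc


# Blank pipeline: tokenizer only, no trained model required
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# first=True places the custom component before the sentencizer
nlp.add_pipe("token_count_sketch", first=True)

doc = nlp("This is a sentence.")
print(nlp.pipe_names)
print(doc.user_data["token_count"])
```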


---

### Complex components

In this exercise, you'll be writing a custom component that uses the PhraseMatcher to find animal names in the document and adds the matched spans to the doc.ents.

A PhraseMatcher with the animal patterns has already been created as the variable matcher. The small English model is available as the variable nlp. The Span object has already been imported for you.

#### Instructions 1/3

Define the custom component and apply the matcher to the doc.
Create a Span for each match, assign the label ID for 'ANIMAL' and overwrite the doc.ents with the new spans.

In [53]:
animal_patterns = ['Golden Retriever', 'cat', 'turtle', 'Rattus norvegicus']
nlp = spacy.load('en_core_web_sm')

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
patterns = list(nlp.pipe(animal_patterns))

matcher.add('ANIMALS', patterns, on_match = None)

In [54]:
# Define the custom component
@Language.component('animal_component')
def animal_component(doc):
    # Create a Span for each match and assign the label 'ANIMAL'
    # and overwrite the doc.ents with the matched spans
    doc.ents = [Span(doc, start, end, label='ANIMAL')
                for match_id, start, end in matcher(doc)]
    return doc
    
# Add the component to the pipeline after the 'ner' component 
nlp.add_pipe('animal_component', after='ner')

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


---

### Setting extension attributes (1)

Let's practice setting some extension attributes. The nlp object has already been created for you and the Doc, Token and Span classes are already imported.

Remember that if you run your code more than once, you might see an error message that the extension already exists. That's because DataCamp will re-run your code in the same session. To solve this, you can set force=True on set_extension, or reload to start a new Python session. None of this will affect the answer you submit.

#### Instructions 1/2

Use Token.set_extension to register is_country (default False).
Update it for "Spain" and print it for all tokens.

2

Use Token.set_extension to register 'reversed' (getter function get_reversed).
Print its value for each token.

In [55]:
from spacy.tokens import Doc, Span, Token

# Register the Token extension attribute 'is_country' with the default value False
Token.set_extension('is_country', default=False, force = True)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


In [56]:
# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]
  
# Register the Token property extension 'reversed' with the getter get_reversed
Token.set_extension('reversed', getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print('reversed:', token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .
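
Besides `force=True`, another way to avoid the "extension already exists" error when a cell is re-run is to check `Token.has_extension` first. A minimal sketch, assuming a blank English pipeline; `register_once` and the attribute name `is_country_demo` are hypothetical.

```python
import spacy
from spacy.tokens import Token


def register_once(name, **kwargs):
    """Register a Token extension only if it does not exist yet."""
    if not Token.has_extension(name):
        Token.set_extension(name, **kwargs)


# Calling this twice is safe, unlike calling Token.set_extension twice
register_once("is_country_demo", default=False)
register_once("is_country_demo", default=False)

nlp = spacy.blank("en")
doc = nlp("I live in Spain.")
doc[3]._.is_country_demo = True

flags = [(token.text, token._.is_country_demo) for token in doc]
print(flags)
```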


---

### Setting extension attributes (2)

Let's try setting some more complex attributes using getters and method extensions. The nlp object has already been created for you and the Doc, Token and Span classes are already imported.

Remember that if you run your code more than once, you might see an error message that the extension already exists. That's because DataCamp will re-run your code in the same session. To solve this, you can set force=True on set_extension, or reload to start a new Python session. None of this will affect the answer you submit.

#### Instructions 1/2

Complete the get_has_number function.
Use Doc.set_extension to register 'has_number' (getter get_has_number) and print its value.

2

Use Span.set_extension to register 'to_html' (method to_html).
Call it on doc[0:2] with the tag 'strong'.

In [57]:
# Define the getter function
def get_has_number(doc):
    # Return if any of the tokens in the doc return True for token.like_num
    return any(token.like_num for token in doc)

# Register the Doc property extension 'has_number' with the getter get_has_number
Doc.set_extension('has_number', getter=get_has_number)

# Process the text and check the custom has_number attribute 
doc = nlp("The museum closed for five years in 2012.")
print('has_number:', doc._.has_number)

has_number: True


In [58]:
# Define the method
def to_html(span, tag):
    # Wrap the span text in an HTML tag and return it
    return '<{tag}>{text}</{tag}>'.format(tag=tag, text=span.text)

# Register the Span method extension 'to_html' with the method to_html
Span.set_extension('to_html', method=to_html)

# Process the text and call the to_html method on the span with the tag name 'strong'
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html(tag='strong'))

<strong>Hello world</strong>


---

### Entities and extensions

In this exercise, you'll combine custom extension attributes with the model's predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.

The Span class is already imported and the nlp object has been created for you.

#### Instructions

Complete the get_wikipedia_url getter so it only returns the URL if the span's label is in the list of labels.
Set the Span extension 'wikipedia_url' using the getter get_wikipedia_url.
Iterate over the entities in the doc and output their Wikipedia URL.

In [59]:
nlp = spacy.load('en_core_web_sm')

def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    # ('LOC' is the location label used by spaCy's English models)
    if span.label_ in ('PERSON', 'ORG', 'GPE', 'LOC'):
        entity_text = span.text.replace(' ', '_')
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url, force = True)

doc = nlp("In over fifty years from his very first recordings right through to his last album, \
David Bowie was at the vanguard of contemporary culture.")
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

over fifty years None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie


This can be useful to create automated links for certain entities.
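
Entity text can contain characters that are not safe in a URL, such as accented letters. One possible refinement, sketched here with the standard library only, is to percent-encode the text with `urllib.parse.quote`. The helper name `build_wikipedia_search_url` and the second example name are assumptions for illustration.

```python
from urllib.parse import quote

WIKIPEDIA_SEARCH = "https://en.wikipedia.org/w/index.php?search="


def build_wikipedia_search_url(entity_text):
    """Build a search URL, percent-encoding characters that are unsafe in a URL."""
    return WIKIPEDIA_SEARCH + quote(entity_text.replace(" ", "_"))


url_plain = build_wikipedia_search_url("David Bowie")
url_accented = build_wikipedia_search_url("Beyoncé")
print(url_plain)
print(url_accented)
```

A helper like this could be called from inside the getter in place of the plain string concatenation.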

---

### Components with extensions

Extension attributes are especially powerful if they're combined with custom pipeline components. In this exercise, you'll write a pipeline component that finds country names and a custom extension attribute that returns a country's capital, if available.

The nlp object has already been created and the Span class is already imported. A phrase matcher with all countries is available as the variable matcher. A dictionary of countries mapped to their capital cities is available as the variable capitals.

#### Instructions 1/3

- Complete the countries_component and create a Span with the label 'GPE' (geopolitical entity) for all matches.
- Add the component to the pipeline.
- Register the Span extension attribute 'capital' with the getter get_capital.
- Process the text and print the entity text, entity label and entity capital for each entity span in doc.ents.

In [60]:
# read the capitals dictionary from json file
import json

with open('data/capitals.json', 'r') as f:
    capitals = json.load(f)

In [61]:
# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
nlp = spacy.load('en_core_web_md')
matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))

matcher.add('COUNTRY', patterns, on_match = None)

In [62]:
@Language.component('countries_component')
def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    doc.ents = [Span(doc, start, end, label='GPE')
                for match_id, start, end in matcher(doc)]
    return doc

# Add the component to the pipeline
nlp.add_pipe('countries_component')

# Register capital and getter that looks up the span text in country capitals
Span.set_extension('capital', getter=lambda span: capitals.get(span.text))

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]


One problem I ran into was forgetting to create the `PhraseMatcher` and add the country patterns to it. DataCamp provides this matcher for you, so if you replicate the exercise locally, you have to build it yourself.
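
On DataCamp the matcher and the `capitals` dictionary are provided, so a local copy has to build them first. A minimal sketch of the full setup, assuming a blank English pipeline and two-entry stand-ins for the data files; `SAMPLE_COUNTRIES`, `SAMPLE_CAPITALS`, `countries_sketch` and `capital_sketch` are hypothetical names.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained model required
nlp = spacy.blank("en")

# Hypothetical stand-ins for COUNTRIES and the capitals.json dictionary
SAMPLE_COUNTRIES = ["Czech Republic", "Slovakia"]
SAMPLE_CAPITALS = {"Czech Republic": "Prague", "Slovakia": "Bratislava"}

# Build the matcher AND add the patterns; without matcher.add it stays empty
country_matcher = PhraseMatcher(nlp.vocab)
country_matcher.add("COUNTRY", [nlp.make_doc(name) for name in SAMPLE_COUNTRIES])


@Language.component("countries_sketch")  # hypothetical component name
def countries_sketch(doc):
    # Create an entity Span with the label 'GPE' for every match
    doc.ents = [Span(doc, start, end, label="GPE")
                for match_id, start, end in country_matcher(doc)]
    return doc


nlp.add_pipe("countries_sketch")

# Getter that looks up the span text in the capitals dictionary
Span.set_extension(
    "capital_sketch",
    getter=lambda span: SAMPLE_CAPITALS.get(span.text),
    force=True,
)

doc = nlp("Czech Republic may help Slovakia protect its airspace")
result = [(ent.text, ent.label_, ent._.capital_sketch) for ent in doc.ents]
print(result)
```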

---

### Processing streams

In this exercise, you'll be using nlp.pipe for more efficient text processing. The nlp object has already been created for you. A list of tweets about a popular American fast food chain is available as the variable TEXTS.

#### Instructions 1/3

- Rewrite the example to use nlp.pipe. Instead of iterating over the texts and processing them, iterate over the doc objects yielded by nlp.pipe.
- Rewrite the example to use nlp.pipe. Don't forget to call list() around the result to turn it into a list.

In [63]:
TEXTS = ['McDonalds is my favorite restaurant.',
 'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..',
 'People really still eat McDonalds :(',
 'The McDonalds in Spain has chicken wings. My heart is so happy ',
 '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P',
 'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D',
 'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it']


In [64]:
# Process the texts and print the adjectives
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == 'ADJ'])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
['open']
['terrible', 'payin']


In [65]:
# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

() () () (Spain,) () () ()


In [66]:
people = ['David Bowie', 'Angela Merkel', 'Lady Gaga']

# Create a list of patterns for the PhraseMatcher
patterns = list(nlp.pipe(people))

---

### Processing data with context

In this exercise, you'll be using custom attributes to add author and book meta information to quotes.

A list of (text, context) examples is available as the variable DATA. The texts are quotes from famous books, and the contexts are dictionaries with the keys 'author' and 'book'. The nlp object has already been created for you.

#### Instructions 1/2

Import the Doc class and use the set_extension method to register the custom attributes 'author' and 'book', which default to None.

In [67]:
DATA = [('One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.',
  {'author': 'Franz Kafka', 'book': 'Metamorphosis'}),
 ("I know not all that may be coming, but be it what it will, I'll go to it laughing.",
  {'author': 'Herman Melville', 'book': 'Moby-Dick or, The Whale'}),
 ('It was the best of times, it was the worst of times.',
  {'author': 'Charles Dickens', 'book': 'A Tale of Two Cities'}),
 ('The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.',
  {'author': 'Jack Kerouac', 'book': 'On the Road'}),
 ('It was a bright cold day in April, and the clocks were striking thirteen.',
  {'author': 'George Orwell', 'book': '1984'}),
 ('Nowadays people know the price of everything and the value of nothing.',
  {'author': 'Oscar Wilde', 'book': 'The Picture Of Dorian Gray'})]

In [68]:
# Import the Doc class and register the extensions 'author' and 'book'
from spacy.tokens import Doc
Doc.set_extension('book', default=None)
Doc.set_extension('author', default=None)

for doc, context in nlp.pipe(DATA, as_tuples = True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']
    
    # Print the text and custom attribute data
    print(doc.text, '\n', "— '{}' by {}".format(doc._.book, doc._.author), '\n')

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

Nowadays people know the price of everything and the value of nothing. 
 — 'The Picture Of Dorian Gray' by Oscar Wilde 



---

### Selective processing

In this exercise, you'll use the nlp.make_doc and nlp.disable_pipes methods to only run selected components when processing a text. The small English model is already loaded in as the nlp object.

#### Instructions 1/2

- Rewrite the code to only tokenize the text using nlp.make_doc

In [69]:
text = "Chick-fil-A is an American fast food restaurant chain headquartered \
in the city of College Park, Georgia, specializing in chicken sandwiches."

# Only tokenize the text
doc = nlp.make_doc(text)

print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


In [70]:
text = "Chick-fil-A is an American fast food restaurant chain headquartered in the \
city of College Park, Georgia, specializing in chicken sandwiches."

# Disable the tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

(Georgia,)
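
In spaCy v3, `nlp.select_pipes` is the newer name for this mechanism, and `disable_pipes` is kept for backwards compatibility. A minimal sketch, assuming a blank English pipeline with the built-in `sentencizer` as the only component, so that no trained model is needed; the example text is shortened for illustration.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "Chick-fil-A is a restaurant chain. It specializes in chicken sandwiches."

# Temporarily disable the sentencizer; it is restored when the block exits
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)

names_after = list(nlp.pipe_names)
doc = nlp(text)
sentence_count = len(list(doc.sents))

print(names_inside, names_after)
print(sentence_count)
```

`select_pipes` also accepts `enable=[...]` to keep only the listed components active, which can be easier to read when most of the pipeline should be switched off.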




---

# Chapter 4: Training a neural network model

### Purpose of training

While spaCy comes with a range of pre-trained models to predict linguistic annotations, you almost always want to fine-tune them with more examples. You can do this by training them with more labelled data.

What does training not help with?

#### Possible Answers

- Improve model accuracy on your data.
- Learn new classification schemes.
- Discover patterns in unlabelled data.

---

### Creating training data (1)

spaCy's rule-based Matcher is a great way to quickly create training data for named entity models. A list of sentences is available as the variable TEXTS. You can print it in the IPython shell to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as 'GADGET'.

The nlp object has already been created for you and the Matcher is available as the variable matcher.

#### Instructions

- Write a pattern for two tokens whose lowercase forms match 'iphone' and 'x'.
- Write a pattern for two tokens: one token whose lowercase form matches 'iphone' and an optional digit using the '?' operator.

In [71]:
TEXTS = ['How to preorder the iPhone X',
 'iPhone X is coming',
 'Should I pay $1,000 for the iPhone X?',
 'The iPhone 8 reviews are here',
 'Your iPhone goes up to 11 today',
 'I need a new phone! Any tips?']

In [72]:
# Import the Matcher and load the model
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_md')

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

# Token whose lowercase form matches 'iphone', followed by one or more digit tokens
# (the exercise asks for the optional '?' operator; '+' is used here instead)
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '+'}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('GADGET', [pattern1, pattern2], on_match = None)

If I change `'OP'` from `+` to `?`, the matcher returns overlapping spans for the same text (for example, both "iPhone" and "iPhone 8"). Overlapping entities cause an error in the following exercises:

```
ValueError: [E103] Trying to set conflicting doc.ents: '(4, 10, 'GADGET')' and '(4, 12, 'GADGET')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap. To work with overlapping entities, consider using doc.spans instead.
```

Therefore, I used `+`, which requires the pattern to match one or more times.

See the [spaCy documentation on operators and quantifiers](https://spacy.io/usage/rule-based-matching#quantifiers) for more information.
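
Another possible way around the overlap error is to keep the `?` operator and drop the shorter of two overlapping matches with `spacy.util.filter_spans`. A minimal sketch, assuming a blank English pipeline and only the second pattern; this is an alternative to the `+` workaround, not what the notebook itself uses.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Blank pipeline: the Matcher only needs the tokenizer here
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# '?' makes the digit token optional, so "iPhone" and "iPhone 8" both match
matcher.add("GADGET", [[{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]])

doc = nlp("The iPhone 8 reviews are here")
spans = [doc[start:end] for match_id, start, end in matcher(doc)]

# filter_spans keeps the longest span among overlapping candidates
longest = filter_spans(spans)
entities = [(span.start_char, span.end_char, "GADGET") for span in longest]

print([span.text for span in spans])
print(entities)
```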

---

### Creating training data (2)

Let's use the match patterns we've created in the previous exercise to bootstrap a set of training examples. The nlp object has already been created for you and the Matcher with the added patterns pattern1 and pattern2 is available as the variable matcher. A list of sentences is available as the variable TEXTS.

#### Instructions 1/2

- Create a doc object for each text using nlp.pipe and find the matches in it.
- Create a list of (start, end, label) tuples for the matches.

- Match on the doc and create a list of matched spans.
- Format each example as a tuple of the text and a dict, mapping 'entities' to the entity tuples.
- Append the example to TRAINING_DATA and inspect the printed data.

In [73]:
# # Create a Doc object for each text in TEXTS
# for doc in nlp.pipe(TEXTS):
#     # Find the matches in the doc
#     matches = matcher(doc)
    
#     # Get a list of (start, end, label) tuples of matches in the text
#     entities = [(start, end, 'GADGET') for match_id, start, end in matches]
#     print(doc.text, entities)

In [74]:
TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, 'GADGET') for span in spans]
    
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {'entities': entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)
    
print(*TRAINING_DATA, sep='\n')

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': []})
('I need a new phone! Any tips?', {'entities': []})
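
Because the entity tuples are plain character offsets, they can be sanity-checked by slicing the text. A minimal sketch in pure Python, using a shortened copy of the printed data; `check_offsets` and `SAMPLE` are hypothetical names.

```python
def check_offsets(training_data):
    """Return (text, start, end, substring) for every annotated entity."""
    rows = []
    for text, annotations in training_data:
        for start, end, label in annotations["entities"]:
            rows.append((text, start, end, text[start:end]))
    return rows


# Shortened copy of the training data printed above
SAMPLE = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

substrings = [row[3] for row in check_offsets(SAMPLE)]
print(substrings)
```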


---

### Setting up the pipeline

In this exercise, you'll prepare a spaCy pipeline to train the entity recognizer to recognize 'GADGET' entities in a text – for example, "iPhone X".

spacy has already been imported for you.

#### Instructions 1/3

- Create a blank 'en' model, for example using the spacy.blank method.
- Create a new entity recognizer using nlp.create_pipe and add it to the pipeline.
- Add the new label 'GADGET' to the entity recognizer using the add_label method on the pipeline component.

In [75]:
# Create a blank 'en' model
nlp = spacy.blank('en')

# Create a new entity recognizer and add it to the pipeline
ner = nlp.add_pipe('ner')

# Add the label 'GADGET' to the entity recognizer
ner.add_label('GADGET')

1

---

### Building a training loop

Let's write a simple training loop from scratch!

The pipeline you've created in the previous exercise is available as the nlp object. It already contains the entity recognizer with the added label 'GADGET'.

The small set of labelled examples that you've created previously is available as the global variable TRAINING_DATA. To see the examples, you can print them in your script or in the IPython shell. spacy and random have already been imported for you.

#### Instructions 1/3

- Call nlp.begin_training, create a training loop for 10 iterations and shuffle the training data.
- Create batches of training data using spacy.util.minibatch and iterate over the batches.
- Convert the (text, annotations) tuples in the batch to lists of texts and annotations.
- For each batch, use nlp.update to update the model with the texts and annotations.

In [76]:
import random
from spacy.training.example import Example

In [77]:
# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        for text, annotations in batch:
            # create Example
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
        
            # Update the model
            nlp.update([example], losses=losses)
            print(losses)

{'ner': 3.3333334922790527}
{'ner': 8.180318057537079}
{'ner': 13.557176649570465}
{'ner': 20.873930156230927}
{'ner': 27.07882982492447}
{'ner': 30.862092196941376}
{'ner': 3.1524229645729065}
{'ner': 5.1253800094127655}
{'ner': 7.442873924970627}
{'ner': 9.033771596848965}
{'ner': 9.566545123234391}
{'ner': 10.793928442522883}
{'ner': 0.8937611355213448}
{'ner': 0.9122877491463441}
{'ner': 0.9184966012253426}
{'ner': 1.6712227741008974}
{'ner': 4.0508636953085215}
{'ner': 6.332585643911557}
{'ner': 1.8881516805537188}
{'ner': 1.8881595654577978}
{'ner': 1.8882107583402816}
{'ner': 2.8234261669064065}
{'ner': 3.4541588698815726}
{'ner': 4.205019054624225}
{'ner': 4.551585658926771e-05}
{'ner': 0.24552291906176235}
{'ner': 0.451009931903815}
{'ner': 0.48081180547336855}
{'ner': 0.4846418308257415}
{'ner': 0.48464183420350176}
{'ner': 9.40943683719591e-06}
{'ner': 3.735847416379192e-05}
{'ner': 4.8114050091441554e-05}
{'ner': 5.330914049146186e-05}
{'ner': 5.331062105000568e-05}
{'ner':
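
The loop above updates the model one example at a time, although the instructions describe updating once per batch. A minimal sketch of a batch-wise variant in the spaCy v3 style, where `nlp.initialize` is the v3 name for `begin_training`. It assumes a blank English pipeline and a shortened copy of the training data (`SAMPLE_TRAINING_DATA` is a stand-in); the loss values differ from run to run.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

# Shortened stand-in for TRAINING_DATA
SAMPLE_TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("GADGET")

# Convert the (text, annotations) tuples to Example objects once
examples = [Example.from_dict(nlp.make_doc(text), annotations)
            for text, annotations in SAMPLE_TRAINING_DATA]

# Initialize the model weights and get an optimizer
optimizer = nlp.initialize(lambda: examples)

for itn in range(10):
    random.shuffle(examples)
    losses = {}
    # Update the model once per batch instead of once per example
    for batch in minibatch(examples, size=2):
        nlp.update(batch, sgd=optimizer, losses=losses)
    print(itn, losses)
```

Printing the losses once per iteration, after all batches, gives one accumulated value per pass over the data.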

---

### Exploring the model

Let's see how the model performs on unseen data! To speed things up a little, here's a trained model for the label 'GADGET', using the examples from the previous exercise, plus a few hundred more. The loaded model is already available as the nlp object. A list of test texts is available as TEST_DATA.

#### Instructions 1/2

- Process each text in TEST_DATA using nlp.pipe.
- Print the document text and the entities in the text.

In [78]:
TEST_DATA = ['Apple is slowing down the iPhone 8 and iPhone X - how to stop it',
 "I finally understand what the iPhone X 'notch' is for",
 'Everything you need to know about the Samsung Galaxy S9',
 'Looking to compare iPad models? Here’s how the 2018 lineup stacks up',
 'The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple',
 'what is the cheapest ipad, especially ipad pro???',
 'Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics']

In [79]:
# Process each text in TEST_DATA
for doc in nlp.pipe(TEST_DATA):
    # Print the document text and entities
    print(doc.text)
    print(doc.ents, '\n\n')

Apple is slowing down the iPhone 8 and iPhone X - how to stop it
(iPhone 8, iPhone X) 


I finally understand what the iPhone X 'notch' is for
(iPhone X,) 


Everything you need to know about the Samsung Galaxy S9
() 


Looking to compare iPad models? Here’s how the 2018 lineup stacks up
() 


The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple
(iPhone 8, iPhone 8) 


what is the cheapest ipad, especially ipad pro???
() 


Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics
() 




---

### Good data vs. bad data

Here's an excerpt from a training set that labels the entity type TOURIST_DESTINATION in traveler reviews.

#### Instructions 1/2

Question
Why is this data and label scheme problematic?

#### Possible Answers

- **Whether a place is a tourist destination is a subjective judgement and not a definitive category. It will be very difficult for the entity recognizer to learn.**

- "Paris" and "Arkansas" should also be labelled as tourist destinations for consistency. Otherwise, the model will be confused.

- Rare out-of-vocabulary words like the misspelled "amsterdem" shouldn't be labelled as entities.

```python
('i went to amsterdem last year and the canals were beautiful', {'entities': [(10, 19, 'TOURIST_DESTINATION')]})

('You should visit Paris once in your life, but the Eiffel Tower is kinda boring', {'entities': [(17, 22, 'TOURIST_DESTINATION')]})

("There's also a Paris in Arkansas, lol", {'entities': []})

('Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!', {'entities': [(0, 6, 'TOURIST_DESTINATION')]})
```

- Rewrite the TRAINING_DATA to only use the label GPE (cities, states, countries) instead of TOURIST_DESTINATION.
- Don't forget to add tuples for the GPE entities that weren't labeled in the old data.

In [80]:
TRAINING_DATA = [
    ("i went to amsterdem last year and the canals were beautiful", {'entities': [(10, 19, 'GPE')]}),
    ("You should visit Paris once in your life, but the Eiffel Tower is kinda boring", {'entities': [(17, 22, 'GPE')]}),
    ("There's also a Paris in Arkansas, lol", {'entities': [(15, 20, 'GPE'), (24, 32, 'GPE')]}),
    ("Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!", {'entities': [(0, 6, 'GPE')]})
]
     
print(*TRAINING_DATA, sep='\n')

('i went to amsterdem last year and the canals were beautiful', {'entities': [(10, 19, 'GPE')]})
('You should visit Paris once in your life, but the Eiffel Tower is kinda boring', {'entities': [(17, 22, 'GPE')]})
("There's also a Paris in Arkansas, lol", {'entities': [(15, 20, 'GPE'), (24, 32, 'GPE')]})
('Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!', {'entities': [(0, 6, 'GPE')]})
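
A quick sanity check for hand-written offsets is to slice each text with its `(start, end)` pair and confirm that the result is the intended span. The sketch below is plain Python and does not need spaCy; the helper name `extract_spans` is hypothetical and only two of the examples are repeated.

```python
# Hypothetical helper: recover the labeled substrings from character offsets.
def extract_spans(example):
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]

TRAINING_DATA = [
    ("i went to amsterdem last year and the canals were beautiful",
     {"entities": [(10, 19, "GPE")]}),
    ("There's also a Paris in Arkansas, lol",
     {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]}),
]

for example in TRAINING_DATA:
    # Prints [('amsterdem', 'GPE')] and then [('Paris', 'GPE'), ('Arkansas', 'GPE')]
    print(extract_spans(example))
```

If a slice comes back with a stray space or a clipped letter, the offsets are off by one and should be fixed before training.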


Great work! Once the model achieves good results on detecting `GPE` entities in the traveler reviews, you could add a rule-based component to determine whether the entity is a tourist destination in this context. For example, you could resolve the entity types back to a knowledge base or look them up in a travel wiki.
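
As a rough sketch of that rule-based step, the snippet below checks predicted `GPE` strings against a lookup set. It is plain Python rather than a spaCy pipeline component, and `KNOWN_DESTINATIONS` and `tag_destinations` are hypothetical names standing in for a real knowledge base or travel wiki lookup.

```python
# Hypothetical stand-in for a knowledge base or travel wiki lookup.
KNOWN_DESTINATIONS = {"amsterdam", "paris", "berlin"}

def tag_destinations(gpe_spans, known=KNOWN_DESTINATIONS):
    """Pair each predicted GPE string with whether the lookup lists it as a destination."""
    return [(span, span.lower() in known) for span in gpe_spans]

# Assume an entity recognizer returned these GPE strings for some reviews.
predicted_gpes = ["Berlin", "Arkansas"]
print(tag_destinations(predicted_gpes))  # [('Berlin', True), ('Arkansas', False)]
```

A lookup like this ignores context, so it could not tell Paris, France from the Paris in Arkansas; a real component would need more than string matching.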

---

### Training multiple labels

Here's a small sample of a dataset created to train a new entity type WEBSITE. The original dataset contains a few thousand sentences. In this exercise, you'll be doing the labeling by hand. In real life, you probably want to automate this and use an annotation tool – for example, [Brat](http://brat.nlplab.org/), a popular open-source solution, or [Prodigy](https://prodi.gy/), our own annotation tool that integrates with spaCy.


#### Instructions 1/3

- Complete the entity offsets for the WEBSITE entities in the data. Feel free to use len() if you don't want to count the characters.

In [81]:
TRAINING_DATA = [
    ("Reddit partners with Patreon to help creators build communities", 
     {'entities': [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),
  
    ("PewDiePie smashes YouTube record", 
     {'entities': [(18, 25, 'WEBSITE')]}),
  
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans", 
     {'entities': [(0, 6, 'WEBSITE')]}),
    # And so on...
]
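
The offsets in the cell above can also be computed instead of counted by hand, as the `len()` hint suggests. A minimal sketch, assuming each phrase occurs only once in its text; `make_entity` is a hypothetical helper, not part of spaCy.

```python
# Hypothetical helper: locate the first occurrence of `phrase` in `text`
# and return a (start, end, label) tuple in the format used above.
def make_entity(text, phrase, label):
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)

text = "Reddit partners with Patreon to help creators build communities"
entities = [
    make_entity(text, "Reddit", "WEBSITE"),
    make_entity(text, "Patreon", "WEBSITE"),
]
print(entities)  # [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]
```

Because `str.find` returns only the first match, a text that mentions the same phrase twice would need extra handling.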

#### Question

A model was trained with the data you just labelled, plus a few thousand similar examples. After training, it's doing great on WEBSITE, but doesn't recognize PERSON anymore. Why could this be happening?

#### Possible Answers

- It's very difficult for the model to learn about different categories like PERSON and WEBSITE.
- **The training data included no examples of PERSON, so the model learned that this label is incorrect.**
- The hyperparameters need to be retuned so that both entity types can be recognized.

- Update the training data to include annotations for the PERSON entities "PewDiePie" and "Alexis Ohanian".

In [82]:
TRAINING_DATA = [
    ("Reddit partners with Patreon to help creators build communities", 
     {'entities': [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),
  
    ("PewDiePie smashes YouTube record", 
     {'entities': [(0, 9, 'PERSON'), (18, 25, 'WEBSITE')]}),
  
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans", 
     {'entities': [(0, 6, 'WEBSITE'), (15, 29, 'PERSON')]}),
    # And so on...
]
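
One way to catch the missing-label problem before training is to count how often each label appears in the data. A label the model should keep predicting but whose count is zero is a warning sign. A minimal sketch in plain Python, with `count_labels` as a hypothetical helper and only two of the examples repeated:

```python
from collections import Counter

def count_labels(training_data):
    """Count how often each entity label is annotated across all examples."""
    return Counter(
        label
        for _text, annotations in training_data
        for _start, _end, label in annotations["entities"]
    )

TRAINING_DATA = [
    ("PewDiePie smashes YouTube record",
     {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]}),
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
     {"entities": [(0, 6, "WEBSITE"), (15, 29, "PERSON")]}),
]

label_counts = count_labels(TRAINING_DATA)
print(label_counts)

# A Counter returns 0 for labels that never occur, which makes gaps easy to spot.
print(label_counts["ORG"])
```

Run against the earlier version of the data, which had no `PERSON` annotations, the same check would report a count of zero for that label.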