# Finding words, phrases, names and concepts

## Introduction to spaCy

### The nlp object

At the center of spaCy is the object containing the processing pipeline. We usually call this variable **nlp**.

You can use the **nlp** object like a function to analyze text.
- It contains all the different components in the processing pipeline.
- It also includes language-specific rules used for tokenizing the text into words and punctuation.

In [1]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

### The Doc object

When you process a text with the nlp object, spaCy creates a **Doc** object, short for "document".

The **Doc** lets you access information about the text in a structured way and no information is lost.

The **Doc** behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index.

In [2]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!


### The Token object

**Token** objects represent the tokens in a document, for example a word or a punctuation character.

To get a **Token** at a specific position, you can index into the **Doc**.

**Token** objects also provide various attributes that let you access more information about the tokens. For example, the **.text** attribute returns the verbatim token text.

In [3]:
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


### The Span object

A **Span** object is a slice of the document consisting of one or more tokens. It's only a view of the **Doc** and doesn't contain any data itself.

To create a **Span**, you can use Python's slice notation.

In [4]:
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

world!


### Lexical attributes

**Lexical attributes** refer to the entry in the vocabulary and don't depend on the token's context.

In [5]:
doc = nlp("It costs $5.")

print('Index:   ', [token.i for token in doc]) # "i" is the index of the token within the parent document

print('Text:    ', [token.text for token in doc]) # "text" returns the token text

print('is_alpha:', [token.is_alpha for token in doc]) # return boolean values indicating whether
                                                      # the token consists of alphabetic characters

print('is_punct:', [token.is_punct for token in doc]) # return boolean values indicating whether
                                                      # the token is punctuation

print('like_num:', [token.like_num for token in doc]) # return boolean values indicating whether
                                                      # the token resembles a number (e.g. '10', '1', '0', 'ten')

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]
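Because lexical attributes don't depend on context, they can be combined into simple checks without a trained model. The following is a minimal sketch, assuming spaCy v3 is installed; the example sentence and the variable names are illustrative and not part of the notes above. It collects number-like tokens that are directly followed by a percent sign.

```python
from spacy.lang.en import English

# Tokenizer-only pipeline: no trained model is needed for lexical attributes
nlp = English()
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

percentages = []
for token in doc:
    # like_num is a lexical attribute, so it is available without a model
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        # Keep the number only if the following token is a percent sign
        if next_token.text == "%":
            percentages.append(token.text)

print(percentages)
```

Note that "1990" also resembles a number, but it is skipped because the token after it is a comma.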


## Statistical models

### What are statistical models?

Statistical models:
- Enable spaCy to make predictions _in context_
    - Part-of-speech tags
    - Syntactic dependencies
    - Named entities
- Trained on large datasets of labeled example texts
- Can be updated with more examples to fine-tune predictions

### Model Packages

spaCy provides a number of pre-trained model packages you can download.

For example, the "**en_core_web_sm**" package is a small English model that supports all core capabilities and is trained on web text.

The **spacy.load()** function loads a model package by name and returns an nlp object.

The package provides:
- **Binary weights** that enable spaCy to make predictions.
- **Vocabulary**
- **Meta information** (language, pipeline) to tell spaCy which language class to use and how to configure the processing pipeline.

In [6]:
import spacy

nlp = spacy.load('en_core_web_sm')

### Predicting Part-of-speech Tags

In [7]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


In spaCy, attributes that return strings usually end with an underscore (e.g. **.pos_**). Attributes without the underscore return an integer ID.
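The underscore convention can be seen without a trained model by looking at a token's text. This is a small sketch, assuming spaCy v3; it uses **.orth** and **.orth_** rather than **.pos** and **.pos_**, because the blank `English()` pipeline used here has no tagger and would leave the part-of-speech attributes empty.

```python
from spacy.lang.en import English

nlp = English()  # tokenizer only
doc = nlp("She ate the pizza")
token = doc[3]

# Without the underscore: the integer hash ID of the token text
orth_id = token.orth
# With the underscore: the string that the ID stands for
orth_text = token.orth_
print(orth_id, orth_text)

# The shared string store resolves the ID back to the string
resolved = nlp.vocab.strings[orth_id]
print(resolved)
```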

### Predicting Syntactic Dependencies

The "**.dep_**" attribute returns the predicted dependency label.

The "**.head**" attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [8]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


To describe syntactic dependencies, spaCy uses a standardized label scheme.

| Label | Description | Example |
| --- | --- | --- |
| nsubj | nominal subject | She |
| dobj | direct object | pizza |
| det | determiner (article) | the |

### Predicting Named Entities

**Named entities** are "real world objects" that are assigned a name. For example, a person, an organization, or a country.

The **doc.ents** property lets you access the named entities predicted by the model. It returns a tuple of **Span** objects, so we can print the entity text (**ent.text**) and the entity label (**ent.label_**).

In [9]:
# Process a text
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


### Tip: the explain method

To get definitions for the most common tags and labels, you can use the **spacy.explain()** helper function.

In [10]:
spacy.explain('GPE')

'Countries, cities, states'

In [11]:
spacy.explain('NNP')

'noun, proper singular'

In [12]:
spacy.explain('dobj')

'direct object'

## Rule-based matching

spaCy's **Matcher** lets you write rules to find words and phrases in text.

### Why not just regular expressions?

- Compared to regular expressions, the **Matcher** works with **Doc** and **Token** objects instead of only strings.
- It is also more flexible: you can search for texts, but also other lexical attributes.
- You can even write rules that use the model's predictions. For example, find the word "duck" only if it is a verb, not a noun.

### Match patterns

**Match patterns** are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

- Match exact token texts:
    - <code style="background:#RRGGBB">\[{'ORTH': 'iPhone'}, {'ORTH': 'X'}\]</code>
- Match lexical attributes:
    - <code style="background:#RRGGBB">\[{'LOWER': 'iphone'}, {'LOWER': 'x'}\]</code>
- Match any token attributes:
    - <code style="background:#RRGGBB">\[{'LEMMA': 'buy'}, {'POS': 'NOUN'}\]</code>

### Using the Matcher

In [13]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
matcher.add('IPHONE_PATTERN', [pattern], on_match=None)

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


- <code style="background:#RRGGBB">match_id:</code> hash value of the pattern name
- <code style="background:#RRGGBB">start:</code> start index of matched span
- <code style="background:#RRGGBB">end:</code> end index of matched span

### Matching lexical attributes

In [14]:
matcher = Matcher(nlp.vocab)

pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
matcher.add('FIFA_PATTERN', [pattern], on_match=None)

doc = nlp("2018 FIFA World Cup: France won!")

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


### Matching other token attributes

In [15]:
matcher = Matcher(nlp.vocab)

pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
matcher.add('LOVE_PATTERN', [pattern], on_match=None)

doc = nlp("I loved dogs but now I love cats more.")

matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


### Using operators and quantifiers

**Operators** and **quantifiers** let you define how often a token should be matched. They can be added using the "**OP**" key.

In [16]:
matcher = Matcher(nlp.vocab)

pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'}, # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
matcher.add('BUY_PATTERN', [pattern], on_match=None)

doc = nlp("I bought a smartphone. Now I'm buying apps.")

matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps


Here, the "**?**" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article, and a noun.

"**OP**" can have one of four values:

| | Description |
| --- | --- |
| <code style="background:#RRGGBB">{'OP': '!'}</code> | Negation: match 0 times |
| <code style="background:#RRGGBB">{'OP': '?'}</code> | Optional: match 0 or 1 times |
| <code style="background:#RRGGBB">{'OP': '+'}</code> | Match 1 or more times |
| <code style="background:#RRGGBB">{'OP': '*'}</code> | Match 0 or more times|
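Operators also work on lexical attributes, so their effect can be tried without a trained model. The sketch below assumes spaCy v3; the pattern name `HELLO_WORLD` and the example text are made up for illustration. The optional punctuation token lets one pattern cover both spellings.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # tokenizer only; the pattern uses lexical attributes
matcher = Matcher(nlp.vocab)

pattern = [
    {"LOWER": "hello"},
    {"IS_PUNCT": True, "OP": "?"},  # optional: match 0 or 1 punctuation tokens
    {"LOWER": "world"},
]
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world! Hello world!")

# Collect the text of every matched span
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```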

### Exercise: Writing match patterns

In [17]:
doc = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [18]:
doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [19]:
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


# Large-scale data analysis with spaCy

## Data Structures: Vocab, Lexemes and StringStore

### Shared vocab and string store

spaCy stores all shared data in a vocabulary, the **Vocab**. This includes words, but also the label schemes for tags and entities.

To save memory, spaCy encodes all strings to **hash values**. If a word occurs more than once, we don't need to save it every time. Instead, spaCy uses a hash function to generate an ID and stores the string only once in the **StringStore**, which is available as **nlp.vocab.strings**.

The **StringStore** is a **lookup table** that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs.

In [20]:
coffee_hash = nlp.vocab.strings['coffee']
coffee_string = nlp.vocab.strings[coffee_hash]

print(coffee_hash)
print(coffee_string)

3197928453018144401
coffee


Hash IDs can't be reversed, though. If a word is not in a vocabulary, there's no way to get its string. That's why we always need to pass around the shared vocab.

In [21]:
# Raises an error if we haven't seen the string before
string = nlp.vocab.strings[3197928453018144401]
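The one-way nature of the hash can be reproduced with two separate string stores. This is a minimal sketch, assuming spaCy v3; it uses standalone `StringStore` objects instead of a loaded pipeline, and the variable names are illustrative.

```python
from spacy.strings import StringStore

# A store that has seen the string: add() saves it and returns its hash
seen = StringStore()
coffee_hash = seen.add("coffee")
print(seen[coffee_hash])

# A separate, empty store knows the same hash function but not the string
unseen = StringStore()
try:
    unseen[coffee_hash]
    lookup_failed = False
except KeyError:
    # The hash cannot be turned back into text without the stored string
    lookup_failed = True

print(lookup_failed)
```

The hash for a given string is the same in both stores; only the reverse lookup depends on what the store has seen.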

To get the hash for a string, we can look it up in **nlp.vocab.strings**.

To get a string representation of a hash, we can look up the hash.

In [22]:
doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


A **Doc** object also exposes its vocab and strings.

In [23]:
doc = nlp("I love coffee")
print('hash value:', doc.vocab.strings['coffee'])

hash value: 3197928453018144401


### Lexemes: entries in the vocabulary

**Lexemes** are context-independent entries in the vocabulary.

You can get a **lexeme** by looking up a string or a hash ID in the vocab.

**Lexemes** expose attributes, just like tokens.

They hold **context-independent** information about a word:
- Word text: **lexeme.text** and **lexeme.orth** (the hash)
- Lexical attributes like **lexeme.is_alpha**
- **Not** context-dependent part-of-speech tags, dependencies or entity labels.

In [24]:
doc = nlp("I love coffee")
lexeme = nlp.vocab['coffee']

# print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


## Data Structures: Doc, Span and Token

### The Doc object

The **Doc** is one of the central data structures in spaCy.

It is created automatically when you process a text with the **nlp** object. But you can also instantiate the class manually. After creating the **nlp** object, we can import the **Doc** class from **spacy.tokens**.

The **Doc** class takes three arguments: the shared vocab, the words and the spaces.

In [25]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
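Each entry in `spaces` says whether the word at the same position is followed by a space. A short sketch, assuming spaCy v3, shows how that reconstructs the original text; the example words are illustrative, and an empty `Vocab()` stands in for `nlp.vocab`.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
# Only the comma is followed by a space
spaces = [False, True, False, False, False]

doc = Doc(Vocab(), words=words, spaces=spaces)
print(doc.text)
```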

### The Span object

A **Span** is a slice of a **Doc** consisting of one or more tokens.

The **Span** takes at least three arguments: the **Doc** it refers to, and the start and end (exclusive) index of the **Span**.

To create a **Span** manually, we can also import the class from **spacy.tokens**. We can then instantiate it with the **Doc** and the **Span**'s start and end index.

To add an **entity label** to the **Span**, we can pass in the label name as the **label** argument. For consistency, we usually write label names in capital letters.

**doc.ents** is writable, so we can add entities manually by overwriting it with a list of spans.

In [26]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]
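To check that the manually added entity is really there, the entities can be read back from the **Doc**. A small sketch under the same setup, assuming spaCy v3:

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()
doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# Overwrite doc.ents with a single labelled span covering "Hello world"
doc.ents = [Span(doc, 0, 2, label="GREETING")]

# Read the entities back as (text, label) pairs
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```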

### Best practices

- <code style="background:#RRGGBB">Doc</code> and <code style="background:#RRGGBB">Span</code> are very powerful and hold references and relationships of words and sentences
    - Convert result to strings as late as possible
    - Use token attributes if available - for example, <code style="background:#RRGGBB">token.i</code> for the token index
- Don't forget to pass in the shared <code style="background:#RRGGBB">vocab</code>

## Word vectors and semantic similarity

### Comparing semantic similarity
- <code style="background:#RRGGBB">spaCy</code> can compare two objects (documents, spans, or single tokens) and predict similarity
- <code style="background:#RRGGBB">Doc.similarity()</code>, <code style="background:#RRGGBB">Span.similarity()</code>, <code style="background:#RRGGBB">Token.similarity()</code>
    - Take another object and return a similarity score (<code style="background:#RRGGBB">0</code> to <code style="background:#RRGGBB">1</code>)
- **Important:** needs a model that has word vectors included, for example:
    - **YES**: <code style="background:#RRGGBB">en_core_web_md</code> (medium model)
    - **YES**: <code style="background:#RRGGBB">en_core_web_lg</code> (large model)
    - **NO**: <code style="background:#RRGGBB">en_core_web_sm</code> (small model)

So, if you want to use vectors, always go with a model that ends in "md" or "lg".

### Similarity examples

In [27]:
# Load a larger model with vectors
nlp = spacy.load('en_core_web_md')

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")

print(doc1.similarity(doc2))

0.8627204117787385


In [28]:
# Compare two tokens
doc = nlp("I like pizza and pasta")

token1 = doc[2]
token2 = doc[4]

print(token1.similarity(token2))

0.7369546


You can also use the similarity methods to compare different types of objects. For example, a document and a token.

In [29]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

0.32531983166759537


In [30]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.6199092090831612


### How does spaCy predict similarity?

- Similarity is determined using **word vectors**
- Multi-dimensional meaning representations of words
- Generated using an algorithm like **Word2Vec** and lots of text
- Can be added to spaCy's statistical models
- Default: **cosine similarity**, but can be adjusted
- <code style="background:#RRGGBB">Doc</code> and <code style="background:#RRGGBB">Span</code> vectors default to the average of their token vectors
- Short phrases are better than long documents with many irrelevant words
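The two ideas in the list above, averaging token vectors and comparing them with cosine similarity, can be sketched in plain Python. The three-dimensional vectors below are invented for illustration (real word vectors have hundreds of dimensions), and the helper names `average` and `cosine` are hypothetical, not spaCy functions.

```python
import math

def average(vectors):
    """Element-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "word vectors"
toy_vectors = {
    "pizza": [1.0, 0.0, 1.0],
    "pasta": [1.0, 0.0, 0.8],
    "soap": [0.0, 1.0, 0.0],
}

# A phrase vector as the average of its token vectors
phrase_vector = average([toy_vectors["pizza"], toy_vectors["pasta"]])

food_similarity = cosine(toy_vectors["pizza"], toy_vectors["pasta"])
soap_similarity = cosine(toy_vectors["pizza"], toy_vectors["soap"])
print(phrase_vector, food_similarity, soap_similarity)
```

Averaging also explains why long documents with many irrelevant words compare poorly: every extra token pulls the average away from the words that matter.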

### Word vectors in spaCy

In [31]:
# Load a larger model with vectors
nlp = spacy.load('en_core_web_md')
doc = nlp("I have a banana")

# Access the vector via the token.vector attribute
print(doc[3].vector)

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

### Similarity depends on the application context

- Useful for many applications: recommendation systems, flagging duplicates, etc.
- There's no objective definition of "similarity"
- Depends on the context and what the application needs to do

In [32]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.9501447503553421


## Combining models and rules

### Statistical predictions vs. rules

|  | Statistical models | Rule-based systems |
| --- | --- | --- |
| **Use cases** | application needs to _generalize_ based on examples | dictionary with finite number of examples |
| **Real-world examples** | product names, person names, subject/object relationships | countries of the world, cities, drug names, dog breeds |
| **spaCy features** | entity recognizer, dependency parser, part-of-speech tagger | tokenizer, <code style="background:#RRGGBB">Matcher</code>, <code style="background:#RRGGBB">PhraseMatcher</code> |

## Recap: Rule-based Matching

In [33]:
# Initialize with the shared vocab
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
matcher.add('LOVE_CATS', [pattern])

# Operators can specify how often a token should be matched
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]
matcher.add('VERY_HAPPY', [pattern])

# Calling a matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

# Iterate over the matches returned by Matcher
for match_id, start, end in matches:
    span = doc[start:end]
    print("Matched span:", span.text)

Matched span: love cats
Matched span: very happy
Matched span: very very happy


### Adding statistical predictions

**Span** objects give us access to the original document and all other token attributes and linguistic features predicted by the model.

For example, we can get the **Span**'s **root token**. If the **Span** consists of more than one token, this will be the token that decides the category of the phrase. For example, the root of "Golden Retriever" is "Retriever".

We can also find the **head token** of the root. This is the syntactic "**parent**" that governs the phrase - in this case, the verb "have".

Finally, we can look at the previous token and its attributes. In this case, it's a determiner, the article "a".

In [34]:
matcher = Matcher(nlp.vocab)
matcher.add('DOG', [[{'LOWER': 'golden'}, {'LOWER': 'retriever'}]])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span:", span.text)
    # Get the span's root token and root head token
    print("Root token:", span.root.text)
    print("Root head token:", span.root.head.text)
    # Get the previous token and its POS tag
    print("Previous token:", doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


### Efficient phrase matching

The **PhraseMatcher** is another helpful tool to find sequences of words in your data. It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context.

- <code style="background:#RRGGBB">PhraseMatcher</code> works like regular expressions or keyword search - but with access to the tokens!
- Takes <code style="background:#RRGGBB">Doc</code> objects as patterns
- More efficient and faster than the <code style="background:#RRGGBB">Matcher</code>
- Great for matching large word lists

In [35]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
pattern = nlp("Golden Retriever")
matcher.add('DOG', [pattern])

doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever
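For a large word list, the patterns can be created in one go. The sketch below assumes spaCy v3 and a tokenizer-only pipeline; the `COUNTRIES` list and the example sentence are illustrative stand-ins for a real dictionary.

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()  # tokenizer only; phrase patterns just need tokenized Docs

# Illustrative word list; in practice this could hold thousands of entries
COUNTRIES = ["Czech Republic", "Slovakia", "Namibia"]

# nlp.pipe processes the list as a stream and yields one Doc per entry
patterns = list(nlp.pipe(COUNTRIES))

matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", patterns)

doc = nlp("Czech Republic may help Slovakia protect its airspace")
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```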


# Processing Pipelines

## Processing Pipelines

### What happens when you call nlp?

- First, the **tokenizer** is applied to turn the string of text into a **Doc** object.
- Next, a series of pipeline components is applied to the **Doc** in order:
    - **Tagger**
    - **Parser**
    - **Entity recognizer**
- Finally, the processed **Doc** is returned so you can work with it.

In [36]:
doc = nlp("This is a sentence.")

### Built-in pipeline components

| Name | Description | Creates |
| --- | --- | --- |
| **tagger** | Part-of-speech tagger | <code style="background:#RRGGBB">Token.tag</code> |
| **parser** | Dependency parser | <code style="background:#RRGGBB">Token.dep</code>, <code style="background:#RRGGBB">Token.head</code>, <code style="background:#RRGGBB">Doc.sents</code>, <code style="background:#RRGGBB">Doc.noun_chunks</code> |
| **ner** | Named entity recognizer | <code style="background:#RRGGBB">Doc.ents</code>, <code style="background:#RRGGBB">Token.ent_iob</code>, <code style="background:#RRGGBB">Token.ent_type</code> |
| **textcat** | Text classifier | <code style="background:#RRGGBB">Doc.cats</code>|

### Under the hood

All models you can load into spaCy include several files and a **meta.json**.

The meta defines things like the language and pipeline. This tells spaCy which components to instantiate.

- Pipeline defined in model's <code style="background:#RRGGBB">meta.json</code> in order
- Built-in components need binary data to make predictions. The data is included in the model package and loaded into the component when you load the model.

**meta.json** for **en_core_web_sm**
```python
{
    "lang":"en",
    "name":"core_web_sm",
    "pipeline": ["tagger", "parser", "ner"]
}
```

### Pipeline attributes

- <code style="background:#RRGGBB">nlp.pipe_names</code>: list of pipeline components
- <code style="background:#RRGGBB">nlp.pipeline</code>: list of <code style="background:#RRGGBB">(name, component)</code> tuples

In [37]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']


In [38]:
print(nlp.pipeline)

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x00000123FC839720>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x00000123FE70A220>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x00000123FE73D8E0>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x00000123FE73D460>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x0000012393A00E40>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x0000012393A05840>)]


## Custom pipeline components

**Custom pipeline components** let you add your own function to the spaCy pipeline that is executed when you call the **nlp** object on a text - for example, to modify the **Doc** and add more data to it.

### Why custom components?

- Make a function execute automatically when you call <code style="background:#RRGGBB">nlp</code>
- Add your own metadata to documents and tokens
- Update built-in attributes like <code style="background:#RRGGBB">doc.ents</code>

### Anatomy of a component
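- Function that takes a <code style="background:#RRGGBB">doc</code>, modifies it and returns it
- Registered under a name with the <code style="background:#RRGGBB">@Language.component</code> decorator
- Can then be added by that name using the <code style="background:#RRGGBB">nlp.add_pipe</code> method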

In [39]:
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("my_component")
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe("my_component")
#nlp.add_pipe("my_component", last=True)
#nlp.add_pipe("my_component", first=True)
#nlp.add_pipe("my_component", before="ner")
#nlp.add_pipe("my_component", after="tagger")

<function __main__.custom_component(doc)>

To specify where to add the component in the pipeline, you can use the following keyword arguments:

| Argument | Description | Examples |
| --- | --- | --- |
| <code style="background:#RRGGBB">last</code> | If <code style="background:#RRGGBB">True</code>, add last | <code style="background:#RRGGBB">nlp.add_pipe("my_component", last=True)</code> |
| <code style="background:#RRGGBB">first</code> | If <code style="background:#RRGGBB">True</code>, add first | <code style="background:#RRGGBB">nlp.add_pipe("my_component", first=True)</code> |
| <code style="background:#RRGGBB">before</code> | Add before component | <code style="background:#RRGGBB">nlp.add_pipe("my_component", before='ner')</code> |
| <code style="background:#RRGGBB">after</code> | Add after component | <code style="background:#RRGGBB">nlp.add_pipe("my_component", after='tagger')</code> |
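The effect of the positional arguments is easiest to see on an empty pipeline. This is a sketch assuming spaCy v3; the component names `demo_a` to `demo_d` are hypothetical no-op components registered only for this illustration.

```python
import spacy
from spacy.language import Language

# Four no-op components, registered under hypothetical names
@Language.component("demo_a")
def demo_a(doc):
    return doc

@Language.component("demo_b")
def demo_b(doc):
    return doc

@Language.component("demo_c")
def demo_c(doc):
    return doc

@Language.component("demo_d")
def demo_d(doc):
    return doc

nlp = spacy.blank("en")                  # empty pipeline, tokenizer only
nlp.add_pipe("demo_a")                   # default: add last
nlp.add_pipe("demo_b", before="demo_a")  # insert before an existing component
nlp.add_pipe("demo_c", after="demo_b")   # insert after an existing component
nlp.add_pipe("demo_d", first=True)       # insert at the very start

pipe_names = nlp.pipe_names
print(pipe_names)
```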

### Example: a simple component

In [40]:
# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
@Language.component("custom_component")
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe("custom_component", first=True)

# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

# Process a text
doc = nlp("Hello world!")

Pipeline: ['custom_component', 'tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
Doc length: 3
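A component can also change the **Doc**, for example by writing to **doc.ents**. The sketch below assumes spaCy v3 and a blank pipeline with no entity recognizer; the component name `animal_component`, the `ANIMAL` label and the word list are invented for illustration.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only, so nothing else sets doc.ents

# Illustrative word list turned into phrase patterns
animals = ["Golden Retriever", "cat", "turtle"]
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", list(nlp.pipe(animals)))

@Language.component("animal_component")
def animal_component(doc):
    # Turn every phrase match into a labelled span
    spans = [
        Span(doc, start, end, label="ANIMAL")
        for match_id, start, end in matcher(doc)
    ]
    # Overwrite the entities and hand the doc to the next component
    doc.ents = spans
    return doc

nlp.add_pipe("animal_component")

doc = nlp("I have a cat and a Golden Retriever")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

In a pipeline that already contains an entity recognizer, overwriting **doc.ents** like this would discard its predictions, so the position of the component matters.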


## Extension attributes

### Setting custom attributes

- Add custom metadata to documents, tokens and spans
- The data can be added once, or it can be computed dynamically.
- Accessible via <code style="background:#RRGGBB">._</code> property
```python
doc._.title = 'My document'
token._.is_color = True
span._.has_color = False
```
- Attributes can be registered on the global <code style="background:#RRGGBB">Doc</code>, <code style="background:#RRGGBB">Token</code>, or <code style="background:#RRGGBB">Span</code> using the <code style="background:#RRGGBB">set_extension</code> method
    - The first argument is the attribute name.
    - Keyword arguments let you define how the value should be computed. In this case, it has a default value and can be overwritten.

In [41]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)

### Extension attribute types
1. Attribute extensions
2. Property extensions
3. Method extensions

### Attribute extensions

**Attribute extensions** set a default value that can be overwritten

In [42]:
from spacy.tokens import Token

nlp = spacy.load('en_core_web_sm')

# Set extension on the Token with default value
Token.set_extension('is_color', default=False, force=True)
doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True

### Property extensions

**Property extensions** work like properties in Python.
- Define a getter function and an optional setter function.
- The getter is only called when you _retrieve_ the attribute value. This lets you compute the value dynamically and even take other custom attributes into account.

In [43]:
from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color, force=True)
doc = nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)

True - blue


- <code style="background:#RRGGBB">Span</code> extensions should almost always use a getter. Otherwise, you'd have to set the value by hand on every possible span.

In [44]:
from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color, force=True)
doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)

True - sky is blue
False - The sky
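
The optional setter mentioned above is not shown in the cells so far. Here is a minimal sketch of a property extension with both a getter and a setter. The attribute name **norm_title** and the choice to store the value in **doc.user_data** under the key 'title' are assumptions made for illustration, not something the course prescribes.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

# A blank English pipeline (tokenizer only) is enough for this sketch
nlp = English()

# Getter: called every time doc._.norm_title is read
def get_title(doc):
    return doc.user_data.get('title', 'Untitled')

# Setter: called every time doc._.norm_title is assigned
def set_title(doc, value):
    doc.user_data['title'] = value.strip().title()

# Register the property extension with both functions
Doc.set_extension('norm_title', getter=get_title, setter=set_title, force=True)

doc = nlp("The sky is blue.")
print(doc._.norm_title)

# Assignment goes through the setter, which normalizes the value
doc._.norm_title = '  my document '
print(doc._.norm_title)
```

Without a setter, assigning to a property extension is not supported, so a setter is only needed when the value should be writable.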


### Method extensions

**Method extensions** make the extension attribute a callable method.

- Assign a **function** that becomes available as an object method.
- Let you pass **arguments** to the extension function.

In [45]:
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc
    
# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)
doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

True - blue
False - cloud


### Exercise: Setting extension attributes

In [46]:
# Register the Token extension attribute 'is_country' with the default value False
Token.set_extension('is_country', default=False)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


In [47]:
# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]
  
# Register the Token property extension 'reversed' with the getter get_reversed
Token.set_extension('reversed', getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print('reversed:', token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


In [48]:
# Define the getter function
def get_has_number(doc):
    # Return whether any token in the doc returns True for token.like_num
    return any(token.like_num for token in doc)

# Register the Doc property extension 'has_number' with the getter get_has_number
Doc.set_extension('has_number', getter=get_has_number)

# Process the text and check the custom has_number attribute 
doc = nlp("The museum closed for five years in 2012.")
print('has_number:', doc._.has_number)

has_number: True


In [49]:
# Define the method
def to_html(span, tag):
    # Wrap the span text in an HTML tag and return it
    return '<{tag}>{text}</{tag}>'.format(tag=tag, text=span.text)

# Register the Span method extension 'to_html' with the method to_html
Span.set_extension('to_html', method=to_html)

# Process the text and call the to_html method on the span with the tag name 'strong'
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html('strong'))

<strong>Hello world</strong>


### Exercise: Entities and extensions

In [50]:
def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ('PERSON', 'ORG', 'GPE', 'LOC'):
        entity_text = span.text.replace(' ', '_')
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url)

doc = nlp("In over fifty years from his very first recordings right through to his last album, David Bowie was at the vanguard of contemporary culture.")
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

over fifty years None
first None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie


## Scaling and performance

### Processing large volumes of texts

If you need to process a lot of texts and create a lot of **Doc** objects in a row, the <code style="background:#RRGGBB">nlp.pipe</code> method can speed this up significantly.
- Processes texts as a stream, yields <code style="background:#RRGGBB">Doc</code> objects.
- Much faster than calling <code style="background:#RRGGBB">nlp</code> on each text, because it batches up the texts.

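Because <code style="background:#RRGGBB">nlp.pipe</code> yields the docs one at a time, remember to wrap the call in <code style="background:#RRGGBB">list</code> if you need a list of docs.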

**BAD:**
```python
docs = [nlp(text) for text in LOTS_OF_TEXTS]
```

**GOOD:**
```python
docs = list(nlp.pipe(LOTS_OF_TEXTS))
```
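
When the docs are only needed one at a time, the stream can also be consumed directly instead of being turned into a list. The following sketch assumes a blank English pipeline and made-up texts. The **batch_size** value of 50 is an arbitrary choice for illustration.

```python
from spacy.lang.en import English

# A blank English pipeline (tokenizer only) is enough for this sketch
nlp = English()

# Made-up input texts
LOTS_OF_TEXTS = ['This is text number {}.'.format(i) for i in range(500)]

# nlp.pipe yields Doc objects as you iterate over it
doc_stream = nlp.pipe(LOTS_OF_TEXTS, batch_size=50)

# Consume the stream without keeping all docs in memory
n_docs = 0
total_tokens = 0
for doc in doc_stream:
    n_docs += 1
    total_tokens += len(doc)

print(n_docs, total_tokens)
```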

### Passing in context

- Setting <code style="background:#RRGGBB">as_tuples=True</code> on <code style="background:#RRGGBB">nlp.pipe</code> lets you pass in <code style="background:#RRGGBB">(text, context)</code> tuples
- Yields <code style="background:#RRGGBB">(doc, context)</code> tuples
- Useful for associating metadata with the <code style="background:#RRGGBB">doc</code>

In [51]:
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('Add another text', {'id': 2, 'page_number': 16})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

This is a text 15
Add another text 16


You can even add the context metadata to custom attributes.

In [52]:
from spacy.tokens import Doc

Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('Add another text', {'id': 2, 'page_number': 16})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']

### Using only the tokenizer

Another common scenario: Sometimes you already have a model loaded to do other processing, but you only need the tokenizer for one particular text. Running the whole pipeline is unnecessarily slow, because you'll be getting a bunch of predictions from the model that you don't need.

If you only need a tokenized **Doc** object, you can use the <code style="background:#RRGGBB">nlp.make_doc</code> method instead, which takes the text and returns the <code style="background:#RRGGBB">Doc</code>.

In [53]:
doc = nlp.make_doc("Hello world!")

### Disabling pipeline components

- Use <code style="background:#RRGGBB">nlp.disable_pipes</code> to temporarily disable one or more pipes. It takes a variable number of arguments: the string names of the pipeline components to disable.

For example, if you only want to use the entity recognizer to process a document, you can temporarily disable the tagger and parser.
- In the <code style="background:#RRGGBB">with</code> block, spaCy will only run the remaining components.
- After the <code style="background:#RRGGBB">with</code> block, the disabled pipeline components are automatically restored.


```python
# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)
```

# Training a neural network model

## Training and updating models

### Why update the model?

- Better results on your specific domain
- Learn classification schemes specifically for your problem
- Essential for text classification
- Very useful for named entity recognition
- Less critical for part-of-speech tagging and dependency parsing

### How training works
1. **Initialize** the model weights randomly with <code style="background:#RRGGBB">nlp.begin_training</code>
2. **Predict** a few examples with the current weights by calling <code style="background:#RRGGBB">nlp.update</code>
3. **Compare** prediction with true labels
4. **Calculate** how to change weights to improve predictions
5. **Update** weights slightly
6. Go back to step 2.

<img src="Training and Updating Models.png">

- **Training data**: Examples and their annotations.
- **Text**: The input text the model should predict a label for.
- **Label**: The label the model should predict.
- **Gradient**: How to change the weights.
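
The numbered steps above can be illustrated with a toy model that has a single weight. This is only an analogy under simplifying assumptions (one weight, squared error, a hand-picked learning rate of 0.05, made-up data following y = 2x). It is not how spaCy implements training internally.

```python
import random

random.seed(0)  # fixed seed only to make the sketch reproducible

# Training data: (input, true label) pairs following the hidden rule y = 2 * x
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# 1. Initialize the weight randomly
weight = random.uniform(-1.0, 1.0)
learning_rate = 0.05

for step in range(200):
    for x, label in examples:
        # 2. Predict with the current weight
        prediction = weight * x
        # 3. Compare the prediction with the true label
        error = prediction - label
        # 4. Calculate the gradient of the squared error w.r.t. the weight
        gradient = 2 * error * x
        # 5. Update the weight slightly in the opposite direction
        weight -= learning_rate * gradient
    # 6. Go back to step 2 for the next pass over the data

print(round(weight, 4))
```

A real model has many weights and a more complex loss, but the loop has the same shape: predict, compare, compute a gradient, update.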

### Example: Training the entity recognizer
- The **entity recognizer** tags words and phrases in context
    - This means that the training data needs to include texts, the entities they contain, and the entity labels
- Each token can only be part of one entity
- Examples need to come with context
    - The easiest way to do this is to show the model a text and a list of character offsets.
        ```python
        ("iPhone X is coming", {'entities': [(0, 8, 'GADGET')]})
        ```
- Texts with no entities are also important
    ```python
    ("I need a new phone! Any tips?", {'entities': []})
    ```
- **Goal**: teach the model to generalize
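
Character offsets are easy to get wrong by one, so it can help to check them before training. The sketch below assumes a blank English pipeline and uses **Doc.char_span**, which returns **None** when the offsets do not line up with token boundaries. The deliberately wrong offsets **(0, 4)** are made up for the demonstration.

```python
from spacy.lang.en import English

# A blank English pipeline (tokenizer only) is enough for this sketch
nlp = English()

examples = [
    ("iPhone X is coming", {'entities': [(0, 8, 'GADGET')]}),
    ("I need a new phone! Any tips?", {'entities': []}),
]

checked = []
for text, annotations in examples:
    doc = nlp.make_doc(text)
    for start, end, label in annotations['entities']:
        # char_span returns None if the offsets don't align with tokens
        span = doc.char_span(start, end, label=label)
        checked.append((span.text, span.label_) if span is not None else None)

print(checked)

# Offsets that cut a token in half do not map to a span
bad_span = nlp.make_doc("iPhone X is coming").char_span(0, 4)
print(bad_span)
```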

### The training data
- Examples of what we want the model to predict in context
- Update an **existing model**: a few hundred to a few thousand examples
- Train a **new category**: a few thousand to a million examples
    - spaCy's English models: 2 million words
- Usually created manually by human annotators
- Can be semi-automated - for example, using spaCy's <code style="background:#RRGGBB">Matcher</code>!

### Exercise: Creating training data

In [54]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
TEXTS = ['How to preorder the iPhone X',
         'iPhone X is coming',
         'Should I pay $1,000 for the iPhone X?',
         'The iPhone 8 reviews are here',
         'Your iPhone goes up to 11 today',
         'I need a new phone! Any tips?']

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}]

# Add patterns to the matcher
matcher.add('GADGET', patterns=[pattern1, pattern2])

In [55]:
# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Find the matches in the doc
    matches = matcher(doc)
    
    # Get a list of (start, end, label) tuples of matches in the text
    entities = [(start, end, 'GADGET') for match_id, start, end in matches]
    print(doc.text, entities)

How to preorder the iPhone X [(4, 6, 'GADGET'), (4, 5, 'GADGET')]
iPhone X is coming [(0, 2, 'GADGET'), (0, 1, 'GADGET')]
Should I pay $1,000 for the iPhone X? [(7, 9, 'GADGET'), (7, 8, 'GADGET')]
The iPhone 8 reviews are here [(1, 2, 'GADGET'), (1, 3, 'GADGET')]
Your iPhone goes up to 11 today [(1, 2, 'GADGET')]
I need a new phone! Any tips? []


In [56]:
TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, 'GADGET') for span in spans]
    
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {'entities': entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)
    
print(*TRAINING_DATA, sep='\n')    

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 10, 'GADGET'), (4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
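
The two patterns produce overlapping matches for the same tokens, while each token can only be part of one entity. One possible way to resolve this, sketched here with a blank English pipeline and two of the texts from above, is **spacy.util.filter_spans**, which keeps the longest span wherever spans overlap. The variable name **FILTERED_DATA** is made up for this sketch.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.util import filter_spans

# A blank English pipeline (tokenizer only) is enough for these patterns
nlp = English()
matcher = Matcher(nlp.vocab)
matcher.add('GADGET', [
    [{'LOWER': 'iphone'}, {'LOWER': 'x'}],
    [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}],
])

TEXTS = ['iPhone X is coming', 'The iPhone 8 reviews are here']
FILTERED_DATA = []

for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Keep only the longest span wherever matches overlap
    spans = filter_spans(spans)
    entities = [(span.start_char, span.end_char, 'GADGET') for span in spans]
    FILTERED_DATA.append((doc.text, {'entities': entities}))

print(*FILTERED_DATA, sep='\n')
```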


## The training loop

### The steps of a training loop

The **training loop** is a series of steps that's performed to train or update a model.

1. **Loop** for a number of times.
2. **Shuffle** the training data.
3. **Divide** the data into batches.
4. **Update** the model for each batch.
5. **Save** the updated model.

### Example loop

```python
import random
import spacy

# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, 6):
        # Split the batch into texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
        
# Save the model
nlp.to_disk(path_to_model)
```
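
To see what the shuffle and batching steps do on their own, the sketch below runs them without a model. It assumes five examples based on the matcher output above (with one entity per text), a batch size of 2 and a fixed random seed. The **nlp.update** call is left as a comment because no pipeline is loaded here.

```python
import random
from spacy.util import minibatch

TRAINING_DATA = [
    ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]}),
    ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]}),
    ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]}),
    ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]}),
    ('I need a new phone! Any tips?', {'entities': []}),
]

random.seed(0)  # fixed seed only to make the sketch reproducible
batch_sizes = []

# Loop for 2 iterations
for i in range(2):
    # Shuffle so the batches differ between iterations
    random.shuffle(TRAINING_DATA)
    # Create batches of up to 2 examples and iterate over them
    for batch in minibatch(TRAINING_DATA, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        batch_sizes.append(len(texts))
        # nlp.update(texts, annotations) would go here

print(batch_sizes)
```

The last batch of each iteration is smaller because five examples do not divide evenly into batches of two.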

### Updating an existing model

- Improve the predictions on new data
- Especially useful to improve existing categories, like <code style="background:#RRGGBB">PERSON</code>
- Also possible to add new categories
- Be careful and make sure the model doesn't "forget" the old ones

### Setting up a new pipeline from scratch

In this example, we start off with a blank English model created with the **spacy.blank()** function. The blank model doesn't have any pipeline components, only the language data and tokenization rules.

```python
import random
import spacy

# Start with blank English model
nlp = spacy.blank('en')

# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

# Add a new label
ner.add_label('GER')

# Start the training
nlp.begin_training()

# Train for 10 iterations
for itn in range(10):
    random.shuffle(examples)
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
```

## Training best practices

### Problem 1: Models can "forget" things
- An existing model can overfit on new data
    - e.g.: if you only update it with <code style="background:#RRGGBB">WEBSITE</code>, it can "unlearn" what a <code style="background:#RRGGBB">PERSON</code> is
- Also known as the "catastrophic forgetting" problem

### Solution 1: Mix in previously correct predictions
- For example, if you're training <code style="background:#RRGGBB">WEBSITE</code>, also include examples of <code style="background:#RRGGBB">PERSON</code>
- Run the existing spaCy model over the data and extract all other relevant entities

**BAD:**
```python
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
]
```

**GOOD:**
```python
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})
]
```
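
Mixing in previous predictions can be sketched as a merge of two entity lists. In the example below, the helper **mix_in_predictions** and the example text are hypothetical, and the "existing model" output is hard-coded rather than produced by a real pipeline. In practice it would come from the entities of a doc processed by the model you are updating.

```python
def mix_in_predictions(new_entities, predicted_entities):
    """Add previously predicted entities that don't overlap the new annotations."""
    merged = list(new_entities)
    for start, end, label in predicted_entities:
        # Two character ranges overlap if each starts before the other ends
        overlaps = any(start < new_end and new_start < end
                       for new_start, new_end, _ in new_entities)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)

text = 'Obama posted on Reddit'

# New annotation for the category being trained
new_entities = [(16, 22, 'WEBSITE')]

# Hypothetical output of the existing model, hard-coded for this sketch
predicted_entities = [(0, 5, 'PERSON'), (16, 22, 'ORG')]

entities = mix_in_predictions(new_entities, predicted_entities)
training_example = (text, {'entities': entities})
print(training_example)
```

The new annotation wins where both lists cover the same characters, so each token still belongs to at most one entity.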

### Problem 2: Models can't learn everything
- spaCy's models make predictions based on **local context**
- The model can struggle to learn if the decision is difficult to make based on context
- Label scheme needs to be consistent and not too specific
    - For example: <code style="background:#RRGGBB">CLOTHING</code> is better than <code style="background:#RRGGBB">ADULT_CLOTHING</code> and <code style="background:#RRGGBB">CHILDRENS_CLOTHING</code>

### Solution 2: Plan your label scheme carefully
- Pick categories that are reflected in local context
- More generic is better than too specific
- Use rules to go from generic labels to specific categories

**BAD:**
```python
LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']
```

**GOOD:**
```python
LABELS = ['CLOTHING', 'BANDS']
```
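
Going from a generic label to a specific category with rules could look like the sketch below. The function **refine_label** and the keyword set **CHILD_HINTS** are made up for illustration. A real project would need rules that fit its own data.

```python
# Hypothetical keywords that hint at children's clothing
CHILD_HINTS = {'kids', 'children', 'toddler', 'baby'}

def refine_label(label, sentence):
    """Map a generic predicted label to a more specific category via simple rules."""
    if label != 'CLOTHING':
        # Only CLOTHING is refined in this sketch
        return label
    # Lowercase the words and strip surrounding punctuation
    words = {word.strip('.,!?').lower() for word in sentence.split()}
    if words & CHILD_HINTS:
        return 'CHILDRENS_CLOTHING'
    return 'ADULT_CLOTHING'

print(refine_label('CLOTHING', 'These kids sneakers are on sale.'))
print(refine_label('CLOTHING', 'I bought a new jacket.'))
print(refine_label('BANDS', 'I saw Radiohead live.'))
```

This keeps the statistical model focused on the generic distinction it can learn from local context, while the fine-grained split is handled by rules that are easy to inspect and change.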