# Chapter 1: Finding Words, Phrases, Names, and Concepts

## 1. Intro to spaCy

In [1]:
# Import spaCy
import spacy

# Create a blank English nlp object
nlp = spacy.blank("en")

The nlp object:
- contains the processing pipeline
- includes language-specific rules for tokenization etc.
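
To make these two points concrete, here is a small sketch (assuming spaCy v3 is installed) that inspects a blank pipeline. It has no processing components yet, but the English tokenization rules are already in place.

```python
import spacy

nlp = spacy.blank("en")

# A blank pipeline has no components (no tagger, parser or entity recognizer)
print(nlp.pipe_names)

# The language-specific tokenizer still splits off symbols and punctuation
doc = nlp("It costs $5.")
tokens = [token.text for token in doc]
print(tokens)
```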

### Doc object

In [2]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!


### Token object

In [3]:
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


### Span object

In [4]:
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

world!


### Lexical attributes

In [5]:
doc = nlp("It costs $5.")

print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]
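
As a small illustration of how these attributes differ, the sketch below compares is_digit with like_num on a sentence of our own choosing (not from the course material). A number word such as "ten" resembles a number but does not consist of digits.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I have ten apples and 5 pears.")

# is_digit: the token text consists only of digits
digits = [token.text for token in doc if token.is_digit]
# like_num: the token resembles a number, including number words
numbers = [token.text for token in doc if token.like_num]

print("is_digit:", digits)
print("like_num:", numbers)
```

Lexical attributes like these depend only on the token text, so they work with a blank pipeline.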


## 2. Getting Started

In [8]:
import spacy

# Load blank English model
nlp = spacy.blank("en")
doc = nlp("She ate the pizza")
print(doc.text)

# Load blank German model
nlp = spacy.blank("de")
doc = nlp("Sie aß die Pizza.")
print(doc.text)

# Load blank Spanish model
nlp = spacy.blank("es")
doc = nlp("Ella comió la pizza.")
print(doc.text)

She ate the pizza
Sie aß die Pizza.
Ella comió la pizza.


## 3. Documents, spans and tokens  

In [11]:
import spacy

nlp = spacy.blank('en')
doc = nlp('I like tree kangaroos and narwhals.')
# First token
first_token = doc[0]
print(first_token.text)

# Slice "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# Slice "tree kangaroos and narwhals"
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

I
tree kangaroos
tree kangaroos and narwhals


## 4. Lexical attributes

In [12]:
# Process the text
doc = nlp(
    r"In 1990, more than 60% of people in East Asia were in extreme poverty. "
    r"Now less than 4% are."
)

# Iterate over the tokens
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
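
The loop above looks one token ahead, which would raise an IndexError if a number-like token happened to be the last token in the doc. A minimal sketch of the same check wrapped in a function with a bounds guard follows; find_percentages is a hypothetical helper name, not part of spaCy.

```python
import spacy

nlp = spacy.blank("en")


def find_percentages(doc):
    """Return the text of every number-like token directly followed by "%"."""
    found = []
    for token in doc:
        # Only look ahead if there actually is a next token
        if token.like_num and token.i + 1 < len(doc):
            if doc[token.i + 1].text == "%":
                found.append(token.text)
    return found


text = (
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
percentages = find_percentages(nlp(text))
print(percentages)

# A number at the very end of the text no longer causes an error
trailing = find_percentages(nlp("The final count was 5"))
print(trailing)
```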


## 5. Trained pipelines
What are trained pipelines?
- Models that enable spaCy to predict linguistic attributes in context
    - Part-of-speech tags
    - Syntactic dependencies
    - Named entities
- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions
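
To see why a trained pipeline is needed for these attributes, the following sketch (assuming spaCy v3) processes a text with a blank pipeline. Without trained components, no part-of-speech tags or entities are predicted.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("She ate the pizza")

# Without a trained tagger, every part-of-speech tag is an empty string
pos_tags = [token.pos_ for token in doc]
print(pos_tags)

# Without a trained entity recognizer, there are no entities
print(doc.ents)

# has_annotation reports whether an attribute has been set on the doc
print(doc.has_annotation("POS"))
```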

## 6. Pipeline packages

```bash
$ pip install -U spacy
$ python -m spacy download en_core_web_sm
```
A pipeline package includes:
- Binary weights
- Vocabulary
- Meta information
- Configuration file
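
Some of these parts can be inspected on any loaded nlp object. The sketch below uses a blank English pipeline so that it runs without a downloaded package; a blank pipeline carries meta information and a config but no trained weights.

```python
import spacy

nlp = spacy.blank("en")

# Meta information: a dictionary describing the pipeline
meta_lang = nlp.meta["lang"]
print(meta_lang)

# Configuration: the settings used to construct the pipeline
config_lang = nlp.config["nlp"]["lang"]
print(config_lang)

# Vocabulary: shared by the pipeline and the docs it creates
print(type(nlp.vocab).__name__)
```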

### Predicting part-of-speech tags

In [5]:
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("she ate the pizza")

# Iterate over the tokens in the doc
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

she PRON
ate VERB
the DET
pizza NOUN


### Predicting syntactic dependencies

In [6]:
for token in doc: 
    print(token.text, token.pos_, token.dep_, token.head.text)

she PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


### Predicting named entities

In [7]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities   
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


In [8]:
spacy.explain("GPE")

'Countries, cities, states'

In [9]:
spacy.explain("NNP")

'noun, proper singular'

In [10]:
spacy.explain("dobj")

'direct object'
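
spacy.explain returns None for a term it does not know, which matters when you call it in a loop over arbitrary labels. A small sketch with a fallback string follows; describe is a hypothetical helper name and "MY_CUSTOM_LABEL" is a made-up label used only for illustration.

```python
import warnings

import spacy


def describe(label):
    """Return spaCy's description of a tag or label, or a fallback string."""
    with warnings.catch_warnings():
        # Some spaCy versions warn about unknown terms; silence that here
        warnings.simplefilter("ignore")
        description = spacy.explain(label)
    return description if description is not None else "(no description available)"


for label in ["GPE", "dobj", "MY_CUSTOM_LABEL"]:
    print(f"{label:<18}{describe(label)}")
```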

## 7. Loading pipelines
The pipelines we’re using in this course are pre-installed. For more details on spaCy’s trained pipelines and how to install them on your machine, see the documentation.

- Use spacy.load to load the small English pipeline "en_core_web_sm".
- Process the text and print the document text.

In [11]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("It's official: Apple is the first U.S. public company to reach a $1 trillion market value")

doc = nlp(text)

print(doc.text)

It's official: Apple is the first U.S. public company to reach a $1 trillion market value


## 8. Predicting linguistic annotations

You’ll now get to try one of spaCy’s trained pipeline packages and see its predictions in action. Feel free to try it out on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain("PROPN") or spacy.explain("GPE").

Part 1

- Process the text with the nlp object and create a doc.
- For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label).

In [13]:
import spacy
nlp = spacy.load("en_core_web_sm")
text = ("It's official: Apple is the first U.S. public company to reach a $1 trillion market value")
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
's          AUX       ccomp     
official    ADJ       acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


Part 2

- Process the text and create a doc object.
- Iterate over the doc.ents and print the entity text and label_ attribute.

In [14]:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "It's official: Apple is the first U.S. public company to reach a $1 trillion market value"
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


## 9. Predicting named entities in context

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

- Process the text with the nlp object.
- Iterate over the entities and print the entity text and label.
- Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

In [15]:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
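
A slice like doc[1:3] has no label. If you want the manually created span to behave like an entity, one option is to build a labelled Span and assign it to doc.ents. The sketch below assumes spaCy v3 and uses a blank pipeline so that it runs without a trained package; the "PRODUCT" label is our own choice, not a model prediction.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Tokens 1 and 2 are "iPhone" and "X"; the label is chosen by hand
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Overwrite the doc's entities with the manually created span
doc.ents = [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```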


## 10. Rule-based matching

In [31]:
import spacy
# Import the Matcher
from spacy.matcher import Matcher

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_X_PATTERN", [pattern])

# Process the text
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")
# Call the matcher on the doc
matches = matcher(doc)

# Collect the text of each matched span
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']
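
Each match is a tuple of (match_id, start, end), where match_id is the hash of the pattern name. As a small sketch, the name can be looked up again in the vocab's string store. A blank pipeline is used here because the "TEXT" attribute does not need a trained model.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_X_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

results = []
for match_id, start, end in matcher(doc):
    # Resolve the hash back to the name the pattern was added under
    rule_name = nlp.vocab.strings[match_id]
    results.append((rule_name, start, end, doc[start:end].text))

print(results)
```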


## 11. Using the matcher
Here's an example of a more complex pattern that matches on lexical attributes.

We're looking for five tokens:
- A token consisting of only digits.
- Three case-insensitive tokens for "fifa", "world" and "cup".
- A token that consists of punctuation.

The pattern matches the tokens "2018 FIFA World Cup:".

In [32]:
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]
matcher.add("FIFA WORLD CUP",[pattern])

doc = nlp("2018 FIFA World Cup: France won!")
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['2018 FIFA World Cup:']
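
To illustrate what "case-insensitive" means here, the sketch below runs the same pattern over two differently cased versions of the sentence. The lowercase variant is our own test input; a blank pipeline is enough because only lexical attributes are involved.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
matcher.add("FIFA_WORLD_CUP", [pattern])

texts = ["2018 FIFA World Cup: France won!", "2018 fifa world cup: france won!"]
found = []
for text in texts:
    doc = nlp(text)
    # "LOWER" compares against the lowercased token text
    found.append([doc[start:end].text for _, start, end in matcher(doc)])

print(found)
```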


In this example, we're looking for two tokens: a verb with the lemma "love", followed by a noun. This pattern will match "loved dogs" and "love cats".

In [33]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

matcher.add("LOVE", [pattern])

doc = nlp("I loved dogs but now I love cats more.")
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['loved dogs', 'love cats']
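
"LEMMA" and "POS" are normally predicted by a trained pipeline. To show the matching logic in isolation, the sketch below builds a Doc by hand with the annotations supplied explicitly (assuming the spaCy v3 Doc constructor). The tags and lemmas are hand-written assumptions, not model output.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations (assumptions, not predictions)
words = ["I", "loved", "dogs", "but", "now", "I", "love", "cats", "more", "."]
pos = ["PRON", "VERB", "NOUN", "CCONJ", "ADV", "PRON", "VERB", "NOUN", "ADV", "PUNCT"]
lemmas = ["I", "love", "dog", "but", "now", "I", "love", "cat", "more", "."]
doc = Doc(nlp.vocab, words=words, pos=pos, lemmas=lemmas)

matcher = Matcher(nlp.vocab)
matcher.add("LOVE", [[{"LEMMA": "love", "POS": "VERB"}, {"POS": "NOUN"}]])

matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```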


Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.

Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.

In [30]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

matcher.add("BUY", [pattern])

doc = nlp("I bought a smartphone. Now I'm buying apps.")
# Call the matcher on the new doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps
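
The "?" operator also works with purely lexical attributes. As a sketch that runs on a blank pipeline, the pattern below matches "iOS", an optional token "version", and a number. The sentence and the pattern name are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [
    {"LOWER": "ios"},
    {"LOWER": "version", "OP": "?"},  # optional: match 0 or 1 times
    {"IS_DIGIT": True},
]
matcher.add("IOS_OPTIONAL_VERSION", [pattern])

doc = nlp("I upgraded from iOS 7 to iOS version 11.")

# Sort by start index so the order is predictable
spans = sorted((start, end) for _, start, end in matcher(doc))
texts = [doc[start:end].text for start, end in spans]
print(texts)
```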


## 12. Writing match patterns
In this exercise, you’ll practice writing more complex match patterns using different token attributes and operators.

Part 1

Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [35]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]
matcher.add("IOS_VERSION_PATTERN", [pattern])
print("Matches:", [doc[start:end] for match_id, start, end in matcher(doc)])

Matches: [iOS 7, iOS 11, iOS 10]


Part 2

Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag "PROPN" (proper noun).

In [36]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]
matcher.add("DOWNLOAD", [pattern])

print("Matches:", [doc[start:end] for match_id, start, end in matcher(doc)])

Matches: [downloaded Fortnite, downloading Minecraft, download Winzip]


Part 3

Write one pattern that matches adjectives ("ADJ") followed by one or two "NOUN"s (one noun and one optional noun).

In [38]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
matcher.add("ADJ_NOUN", [pattern])
print("Matches:", [doc[start:end] for match_id, start, end in matcher(doc)])

Matches: [beautiful design, smart search, automatic labels, optional voice, optional voice responses]
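
Because the second noun is optional, the matcher returns both "optional voice" and "optional voice responses". If only the longest of the overlapping matches is wanted, one option is spacy.util.filter_spans. The sketch below uses a hand-annotated Doc (assuming the spaCy v3 Doc constructor) so that it runs without a trained pipeline; the part-of-speech tags are supplied by hand, not predicted.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

nlp = spacy.blank("en")

# Hand-written part-of-speech tags (assumptions, not predictions)
words = ["optional", "voice", "responses"]
pos = ["ADJ", "NOUN", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
matcher.add("ADJ_NOUN", [pattern])

# All matches, including the overlapping shorter one
spans = [doc[start:end] for _, start, end in matcher(doc)]
print("All matches:", [span.text for span in spans])

# filter_spans keeps the longest span when spans overlap
longest = filter_spans(spans)
print("Longest only:", [span.text for span in longest])
```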
