### Readme:

This notebook documents what I learned from https://course.spacy.io/en/. It contains notes, sample code, and sample problems from the course.

Special thanks to the content creators and the presenter, Ines.

This notebook is intended for self-study, not for redistributing the course content.
If you want to learn more about spaCy, please visit https://spacy.io/ or https://course.spacy.io/en/

Thank you!

### Chapter 1: Finding words, phrases, names and concepts

#### 1. Documents/Spans/Tokens

In [19]:
from spacy.lang.en import English
nlp = English()
doc = nlp("Hello World!")
for token in doc:
    print(token.text)

Hello
World
!


In [20]:
print(doc.text)

Hello World!


In [21]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()
# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [22]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?


In [23]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")
# Select the first token
first_token = doc[0]
# Print the first token's text
print(first_token.text)

I


In [25]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")
# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)
# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:-1]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


In [26]:
token = doc[1]
token

like

#### 2. Lexical Attributes

In [27]:
doc = nlp("It costs $5.")

In [28]:
print([token.i for token in doc])  

[0, 1, 2, 3, 4]


In [29]:
print([token.text for token in doc])

['It', 'costs', '$', '5', '.']


In [30]:
print([token.is_alpha for token in doc])  # is_alpha is True only if the token consists solely of alphabetic characters (no digits, symbols, or punctuation)

[True, True, False, False, False]


In [31]:
print([token.is_punct for token in doc])

[False, False, False, False, True]


In [12]:
print([token.like_num for token in doc])

[False, False, False, True, False]


In [34]:
from spacy.lang.en import English
nlp = English()
# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
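
The loop above reads `doc[token.i + 1]`, which would raise an `IndexError` if a number happened to be the last token in the text. A minimal sketch of a guarded version follows. The helper name `find_percentages` and the second sample text are made up for illustration, and `spacy.blank("en")` is assumed as a tokenizer-only stand-in for `English()`.

```python
import spacy

# Tokenizer-only English pipeline; no trained model is needed for like_num
nlp = spacy.blank("en")


def find_percentages(doc):
    """Return the text of every number-like token that is followed by "%"."""
    found = []
    for token in doc:
        # Only look ahead when a next token actually exists
        if token.like_num and token.i + 1 < len(doc):
            if doc[token.i + 1].text == "%":
                found.append(token.text)
    return found


course_doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
print(find_percentages(course_doc))

# Hypothetical edge case: the number is the last token in the text
print(find_percentages(nlp("The total was 42")))
```

The bounds check costs one comparison per number-like token and keeps the loop safe on arbitrary input.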


#### 3. Statistical Models & Model Packages

- enable spaCy to predict linguistic attributes in context
    * POS tags
    * Syntactic dependencies
    * Named entities

- trained on labeled example texts
- can be updated with more examples to fine-tune predictions
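
To make the first point concrete, here is a small sketch of what a pipeline looks like without any trained components. It assumes `spacy.blank("en")` as the model-free baseline (not part of the original notes): the tokenizer still works, but the context-dependent attributes stay empty because nothing predicts them.

```python
import spacy

# Blank English pipeline: tokenizer and language data only, no trained components
nlp = spacy.blank("en")
doc = nlp("She ate the pizza")

print(nlp.pipe_names)                 # no pipeline components
print([token.text for token in doc])  # tokenization is rule-based and still works
print([token.pos_ for token in doc])  # empty strings: no tagger has run
print([token.dep_ for token in doc])  # empty strings: no parser has run
print(len(doc.ents))                  # no entities: no entity recognizer has run
```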

In [39]:
# Model packages
# Load a pre-trained model package to see its predictions in action
import spacy
nlp = spacy.load("en_core_web_sm")
# A model package contains:
# - binary weights used to make statistical predictions
# - the strings of the model's vocabulary and their hashes
# - meta information (language, pipeline, license)
# Statistical models generalize from a set of training examples. Once trained,
# they use the binary weights to make predictions, so the labeled data the
# model was trained on is not included in the model package.
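
The notes above mention that a package stores the vocabulary's strings together with their hashes, plus meta information. A rough sketch of both lookups follows, assuming a blank English pipeline so that no download is needed. A loaded package such as `en_core_web_sm` exposes the same `nlp.vocab.strings` and `nlp.meta` attributes.

```python
import spacy

nlp = spacy.blank("en")

# Processing a text adds its words to the shared string store
doc = nlp("I love coffee")

# Look up the hash for a string, then the string for that hash
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_string)

# Meta information travels with the pipeline; "lang" is the language code
print(nlp.meta["lang"])
```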

In [42]:
doc = nlp("She ate the pizza")
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)  # text, predicted POS tag, dependency label, head token

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


In [48]:
doc = nlp("Apple is looking to buy a U.K. startup for $1B dollar!")
for ent in doc.ents:
    print(ent.text, ent.label_, spacy.explain(ent.label_))

Apple ORG Companies, agencies, institutions, etc.
U.K. GPE Countries, cities, states
$1B dollar MONEY Monetary values, including unit


In [49]:
import spacy

# Load the "en_core_web_sm" model
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


In [50]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
’s          VERB      ccomp     
official    ADJ       acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [52]:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
# Process the text
doc = nlp(text)
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


In [53]:
# Models are statistical and not always right
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"
# Process the text
doc = nlp(text)
# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)
# Get the span for "iPhone X"
iphone_x = doc[1:3]
# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
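
When the model misses an entity like this, one option is to attach it by hand. The sketch below is only an illustration: it assumes a blank pipeline (so `doc.ents` starts out empty) and picks the label `PRODUCT` as a plausible choice. Neither comes from the original cell.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Wrap tokens 1 and 2 ("iPhone X") in a Span carrying an entity label
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Overwrite the document's entities with the hand-made span
doc.ents = [iphone_x]
print([(ent.text, ent.label_) for ent in doc.ents])
```

With a trained pipeline you would typically extend the predicted entities instead, for example `doc.ents = list(doc.ents) + [iphone_x]`, provided the spans do not overlap.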


#### 4. Rule-based Matching
- Why not just regex?
     * match on Doc objects, not just strings
     * match on tokens and token attributes
     * use the model's predictions
     * example: tell verbs from nouns (e.g. "duck")
- Matching patterns
     * lists of dicts, one per token
     * match exact token texts
         * [{"TEXT": "iPhone"}, {"TEXT": "X"}]
     * match lexical attributes
         * [{"LOWER": "iphone"}, {"LOWER": "x"}]
     * match any token attribute
         * POS/lemma
         * [{"LEMMA": "buy"}, {"POS": "NOUN"}]
             * buy apples
             * bought computers
     * quantifiers
         * {"OP": "!"} (negation: match 0 times)
         * {"OP": "?"} (optional: match 0 or 1 times)
         * {"OP": "+"} (match 1 or more times)
         * {"OP": "*"} (match 0 or more times)
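
A minimal sketch of how exact-text, lowercase, and optional-token patterns behave. It uses a blank English pipeline because `TEXT`, `LOWER`, and `OP` do not need model predictions (attributes such as `POS` or `LEMMA` do). The sample sentences, the helper `match_texts`, and the name `DEMO_PATTERN` are invented for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")


def match_texts(pattern, text):
    """Run a single pattern over a text and return the matched span texts."""
    matcher = Matcher(nlp.vocab)
    matcher.add("DEMO_PATTERN", [pattern])
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


text = "My iPhone X is newer than your IPHONE X model"

# TEXT compares the exact token text, so only the first spelling matches
exact = match_texts([{"TEXT": "iPhone"}, {"TEXT": "X"}], text)

# LOWER compares the lowercased token text, so both spellings match
lower = match_texts([{"LOWER": "iphone"}, {"LOWER": "x"}], text)

# "OP": "?" makes the middle token optional (0 or 1 times)
optional = match_texts(
    [{"LOWER": "buy"}, {"LOWER": "a", "OP": "?"}, {"LOWER": "phone"}],
    "buy a phone or buy phone",
)

print(exact)
print(lower)
print(optional)
```

Note that each pattern is a list with one dict per token. Putting two keys with the same name into a single dict would silently keep only the last one.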

In [56]:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)  # initialize the matcher with the shared vocab

In [59]:
pattern = [{"TEXT":"iPhone"}, {"TEXT":"X"}]

In [63]:
matcher.add("IPHONE_PATTERN",[pattern])

In [64]:
doc = nlp("I have iPhone X")
matches = matcher(doc)

In [68]:
for match_id, start, end in matches:  # match_id is the hash value of the pattern name
    word = doc[start:end]
    print(match_id)
    print(word.text)  # "iPhone" followed by "X"

9528407286733565721
iPhone X
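
The long integer printed above is the hash of the pattern name. As a small sketch (using a blank English pipeline, an assumption made so the example needs no model), the name can be recovered from the shared string store:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("I have iPhone X")
for match_id, start, end in matcher(doc):
    # The string store maps the hash back to the name passed to matcher.add
    pattern_name = nlp.vocab.strings[match_id]
    print(match_id, pattern_name, doc[start:end].text)
```

Looking the name up this way is handy once several patterns share one matcher and the output needs to say which one fired.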


In [100]:
pattern1 = [
    {"IS_DIGIT": True},
    {"LOWER": "cat"},
    {"IS_PUNCT": True},
]

doc = nlp("we adopted 1 cat!")
# The pattern only needs to be added once, so the call is commented out here
# matcher.add("CoCo_Pattern", [pattern1])
matches = matcher(doc)
for match_id, start, end in matches:  # match_id is the hash value of the pattern name
    word = doc[start:end]
    # print(match_id)
    # If the pattern was added under several pattern names, a match is
    # returned for each of them (hence the repeated output)
    print(word.text)

1 cat!
1 cat!
1 cat!
1 cat!
1 cat!
1 cat!
1 cat!
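
The repeated output above is what happens when the same pattern ends up registered under several names in one notebook session. The sketch below reproduces the effect with two names and shows two possible guards. The names `CAT_A` and `CAT_B` and the helper `add_once` are hypothetical, and a blank English pipeline is assumed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
cat_pattern = [{"IS_DIGIT": True}, {"LOWER": "cat"}, {"IS_PUNCT": True}]


def add_once(matcher, name, patterns):
    """Add patterns only if the name is not registered yet (safe to re-run)."""
    if name not in matcher:
        matcher.add(name, patterns)


# Registering the same pattern under two names yields one match per name
add_once(matcher, "CAT_A", [cat_pattern])
add_once(matcher, "CAT_B", [cat_pattern])
add_once(matcher, "CAT_A", [cat_pattern])  # re-running the cell adds nothing new

doc = nlp("we adopted 1 cat!")
matches = matcher(doc)
print(len(matches))

# Collapse matches that cover the same tokens, whatever their name
unique_spans = sorted({(start, end) for _, start, end in matches})
print([doc[start:end].text for start, end in unique_spans])
```

Creating a fresh `Matcher` at the top of each cell, as the later cells in this notebook do, avoids the problem altogether.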


In [98]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"},
]
doc = nlp("We loved dogs and we loved cats")
matcher.add("Love_Pattern", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:  # match_id is the hash value of the pattern name
    word = doc[start:end]
    # print(match_id)
    print(word.text)

loved dogs
loved cats


In [99]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"},
]

doc = nlp("bought a phone. buy apps")
matcher.add("buy_Pattern", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:  # match_id is the hash value of the pattern name
    word = doc[start:end]
    # print(match_id)
    print(word.text)

bought a phone
buy apps


In [102]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT":"iPhone"},
           {"TEXT":"X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


In [103]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [106]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [107]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
