## Introduction to spaCy

In [1]:
#Import the language class
from spacy.lang.en import English

#Create an NLP object
nlp = English()

- The `nlp` object contains all the components of the processing pipeline.
- It includes language-specific rules for tokenization.
- spaCy supports multiple languages, available in `spacy.lang`.
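
To illustrate the language-specific tokenization rules, here is a minimal sketch that builds two blank pipelines with `spacy.blank`, which creates the same kind of object as `English()`. The example sentences are made up for illustration, and the sketch assumes the German language data ships with the installed spaCy package.

```python
import spacy

# spacy.blank(code) returns a pipeline with tokenization rules only,
# no trained components.
nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")

# Illustrative sentences (not from the original notes)
en_tokens = [token.text for token in nlp_en("Don't panic!")]
de_tokens = [token.text for token in nlp_de("Das ist ein Satz.")]

print(en_tokens)
print(de_tokens)
print(nlp_en.pipe_names)  # empty list: no pipeline components yet
```

The English rules split the contraction into two tokens, which is the kind of language-specific behaviour the `nlp` object carries.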

In [2]:
#"Doc" object is created by processing a string of text with an nlp object
doc = nlp("Hello world!")

#iterate over tokens in a doc
for token in doc:
    print(token.text)

Hello
world
!


- A `Doc` object behaves like a normal Python sequence: you can iterate over it, index it, and slice it.
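
A short sketch of the sequence behaviour, reusing the "Hello world!" text from above. It only uses standard Python sequence operations on the `Doc`.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")

print(len(doc))                       # number of tokens
print(doc[-1].text)                   # negative indices count from the end
print(doc[0:2].text)                  # a slice is a Span, a view of the Doc
print([token.text for token in doc])  # iteration yields Token objects
```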

In [3]:
#get token at specific position 
token = doc[1]

print(token.text)

# A slice of the Doc is a Span object (a view of the Doc, not a copy)
span = doc[1:3]

print(span.text)

world
world!


In [4]:
doc = nlp("It costs $five.")
print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])  # True for "5" as well as "five"

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', 'five', '.']
is_alpha: [True, True, False, True, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]


> "These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context."
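
The quoted claim can be checked with a small sketch: the same word in two different sentences, plus a direct vocabulary lookup, should all report the same lexical attributes. The two sentences are made up for illustration.

```python
from spacy.lang.en import English

nlp = English()

# The same word in two different contexts (illustrative sentences)
doc_a = nlp("It costs five dollars.")
doc_b = nlp("I counted five.")

# A Lexeme is the context-free vocabulary entry for a word
lexeme = nlp.vocab["five"]

print(doc_a[2].text, doc_a[2].like_num, doc_a[2].is_alpha)
print(doc_b[2].text, doc_b[2].like_num, doc_b[2].is_alpha)
print(lexeme.text, lexeme.like_num, lexeme.is_alpha)
```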

In [5]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:-1]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


In [6]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token, if there is one (avoids an IndexError at the end of the doc)
        if token.i + 1 < len(doc):
            next_token = doc[token.i + 1]
            # Check if the next token's text equals "%"
            if next_token.text == "%":
                print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4


## Statistical Models

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.

Statistical models enable spaCy to make predictions in context. This usually includes:
- part-of-speech tags
- syntactic dependencies
- named entities

Models are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

In [7]:
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

spaCy provides a number of pre-trained model packages that you can download with the `spacy download` command.

About **en_core_web_sm**: <br>
The package provides the binary weights that enable spaCy to make predictions. <br>
It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.
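
If the package has not been installed yet (typically with `python -m spacy download en_core_web_sm`), `spacy.load` raises an `OSError`. The sketch below shows one way to handle that. The helper name `load_pipeline` and the fallback to a blank pipeline are assumptions made for illustration, not part of spaCy.

```python
import spacy

def load_pipeline(name, fallback_lang="en"):
    """Hypothetical helper: load an installed package or fall back to a blank pipeline."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the package cannot be found
        print(f"'{name}' is not installed; using a blank '{fallback_lang}' pipeline")
        return spacy.blank(fallback_lang)

nlp = load_pipeline("en_core_web_sm")
print(nlp.pipe_names)
```

A blank pipeline can tokenize but makes no predictions, so the fallback is only useful for code that needs tokens and lexical attributes.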

In [21]:
# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text, the predicted part-of-speech tag, the dependency label and the head
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


In spaCy, attributes that return strings usually end with an underscore; attributes without the underscore return an integer ID value. <br>
In the example above: <br>
- The `.pos_` attribute returns the predicted part-of-speech tag of each token.
- The `.dep_` attribute returns the predicted dependency label.
- The `.head` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.
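
The underscore convention can be seen without a trained model by using attributes that every token has, such as `orth`/`orth_` and `lower`/`lower_`. This is a minimal sketch on a blank pipeline; `pos`/`pos_` and `dep`/`dep_` follow the same convention but need a model's predictions.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("She ate the pizza")
token = doc[1]

print(token.orth_)                # string form of the token text
print(token.orth)                 # integer ID (hash) of the same text
print(token.lower_, token.lower)  # same pairing for the lowercase form

# The shared string store maps the ID back to the string
print(nlp.vocab.strings[token.orth])
```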


In [10]:
spacy.explain("DET")

'determiner'

In [17]:
spacy.explain("dobj")

'direct object'
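
`spacy.explain` works for any label in spaCy's glossary, so it can be looped over several tags at once. A small sketch, using labels that appear in the outputs of this section plus one made-up label to show the fallback:

```python
import spacy

# Labels that appear in the outputs of this section
for label in ["PRON", "VERB", "DET", "nsubj", "dobj"]:
    print(f"{label:<6} {spacy.explain(label)}")

# A label that is not in the glossary returns None
unknown = spacy.explain("NOT_A_LABEL")
print(unknown)
```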

Named entities are **real world objects** that are assigned a name – for example, a person, an organization or a country. <br>

The `doc.ents` property lets you access the named entities predicted by the model. <br>

It returns a tuple of `Span` objects, so we can print the entity text using `.text` and the entity label using the `.label_` attribute. <br>

In the example below, the model correctly predicts **Apple** as an organization, **U.K.** as a geopolitical entity and **$1 billion** as money.

In [22]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


Quick Question : <br>
**What’s included in a model package that you can load into spaCy?**

- A meta file including the language, pipeline and license. (Model packages include a `meta.json` that defines the language to initialize, the pipeline component names to load, and general meta information such as the model name, version and license.)

- Binary weights to make statistical predictions. (To predict linguistic annotations like part-of-speech tags, dependency labels or named entities, models include binary weights.)

- Strings of the model's vocabulary and their hashes. (Model packages include a `strings.json` that stores the entries in the model’s vocabulary and the mapping to hashes. This allows spaCy to only communicate in hashes and look up the corresponding string if needed.)

The labelled data that the model was trained on **is not included**. (Statistical models allow you to generalize based on a set of training examples. Once they’re trained, they use binary weights to make predictions. That’s why it’s not necessary to ship them with their training data.)
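
The string-to-hash mapping and the meta information can be inspected from Python. The following is a minimal sketch on a blank English pipeline, so it runs without a downloaded model; the example sentence is made up.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I love coffee")  # processing a text adds its words to the string store

coffee_hash = nlp.vocab.strings["coffee"]       # string -> hash
coffee_string = nlp.vocab.strings[coffee_hash]  # hash -> string
print(coffee_hash, coffee_string)

# The meta information is exposed as a dictionary
print(nlp.meta["lang"])
```

With a loaded package such as `en_core_web_sm`, the same `nlp.meta` dictionary would also describe the pipeline components.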

**Example**

In [25]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    token_head = token.head.text
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}{token_head:<10}")

It          PRON      nsubj     ’s        
’s          VERB      ccomp     is        
official    ADJ       dobj      ’s        
:           PUNCT     punct     is        
Apple       PROPN     nsubj     is        
is          AUX       ROOT      is        
the         DET       det       company   
first       ADJ       amod      company   
U.S.        PROPN     nmod      company   
public      ADJ       amod      company   
company     NOUN      attr      is        
to          PART      aux       reach     
reach       VERB      relcl     company   
a           DET       det       value     
$           SYM       quantmod  trillion  
1           NUM       compound  trillion  
trillion    NUM       nummod    value     
market      NOUN      compound  value     
value       NOUN      dobj      reach     


## Rule-Based Matching

In [28]:
spacy.explain("POS")

'possessive ending'

*Note: Read about stemming and lemmatization* <br>
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html <br>
https://towardsdatascience.com/stemming-of-words-in-natural-language-processing-what-is-it-41a33e8996e2 <br>
https://towardsdatascience.com/lemmatization-in-natural-language-processing-nlp-and-machine-learning-a4416f69a7b6 <br>

__Unrelated__ : <br>
https://towardsdatascience.com/building-a-topic-modeling-pipeline-with-spacy-and-gensim-c5dc03ffc619 <br>


Why not just regular expressions? <br>
- Match on `Doc` objects, not just strings
- Match on tokens and token attributes, including lexical attributes
- Write rules that use the model's predictions, for example "duck" (verb) vs. "duck" (noun)

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values. <br>
Examples: <br>
- Match exact token texts (case sensitive):
`[{"TEXT": "iPhone"}, {"TEXT": "X"}]`
- Match lexical attributes (case insensitive):
`[{"LOWER": "iphone"}, {"LOWER": "x"}]`
- Match any token attributes, including those predicted by the model:
`[{"LEMMA": "buy"}, {"POS": "NOUN"}]` (would match phrases like "buying milk" or "bought flowers")
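
As a sketch of the case-insensitive variant, the pattern below uses only lexical attributes, so it runs on a blank pipeline without a model. It assumes spaCy v2.2.2 or later, where `matcher.add` accepts a list of patterns; the pattern name and the sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dictionary per token; LOWER compares the lowercased token text
pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# List-of-patterns signature (assumes spaCy v2.2.2 or later, including v3)
matcher.add("IPHONE_X_LOWER", [pattern])

doc = nlp("The iPhone X and the IPHONE X are the same phone.")
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

A `TEXT` pattern for "iPhone" would only find the first mention here, which is the difference between the first two bullet points.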

In [33]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab (this argument is always required)
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# spaCy v2 signature: matcher.add(match_name, optional_callback, *patterns)
# In spaCy v3 the call becomes: matcher.add("IPHONE_PATTERN", [pattern])
matcher.add("IPHONE_PATTERN", None, pattern)

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)  # returns a list of (match_id, start, end) tuples, one per matched span

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print("MatchedText: " +matched_span.text)
    print("Match_ID of pattern: " + str(match_id))

    

MatchedText: iPhone X
Match_ID of pattern: 9528407286733565721


Function: <br>
`matcher(doc)` <br>

Returns: <br>
- match_id: hash value of the pattern name
- start: token index where the matched span starts
- end: token index where the matched span ends (exclusive, so `doc[start:end]` is the match)
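
Because `match_id` is a hash, the pattern name can be recovered from the shared string store. A minimal sketch, using a blank pipeline since the `TEXT` pattern needs no model, and assuming the list-of-patterns form of `matcher.add` (spaCy v2.2.2 or later):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# List-of-patterns signature (assumes spaCy v2.2.2 or later, including v3)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("Upcoming iPhone X release date leaked")

results = []
for match_id, start, end in matcher(doc):
    pattern_name = nlp.vocab.strings[match_id]  # hash -> pattern name
    results.append((pattern_name, start, end, doc[start:end].text))

print(results)
```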

In [37]:
#Matching lexical attributes
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]
doc = nlp("2018 FIFA World Cup: France won!")

#add pattern to matcher
matcher.add("FIFA_CUP_PATTERN", None, pattern)

# Call the matcher on the doc
matches = matcher(doc) 
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print("MatchedText: " +matched_span.text)
    print("Match_ID of pattern: " + str(match_id))


MatchedText: 2018 FIFA World Cup:
Match_ID of pattern: 10682975751454897768


In [38]:
#Matching other token attributes
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

doc = nlp("I loved dogs but now I love cats more.")

#add pattern to matcher
matcher.add("LOVE_NOUN_PATTERN", None, pattern)

# Call the matcher on the doc
matches = matcher(doc) 
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print("MatchedText: " +matched_span.text)
    print("Match_ID of pattern: " + str(match_id))

MatchedText: loved dogs
Match_ID of pattern: 7253449245750226361
MatchedText: love cats
Match_ID of pattern: 7253449245750226361


| Example | Description |
| :---: | :---: |
| `{"OP": "!"}` | Negation: match 0 times |
| `{"OP": "?"}` | Optional: match 0 or 1 times |
| `{"OP": "+"}` | Match 1 or more times |
| `{"OP": "*"}` | Match 0 or more times |
In [39]:
#Using operators and quantifiers 
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]
doc = nlp("I bought a smartphone. Now I'm buying apps.")

matcher.add("BUY_NOUN_PATTERN",None, pattern)

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print("Matched Text : " + matched_span.text)

Matched Text : bought a smartphone
Matched Text : buying apps


**Examples**

In [42]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [43]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

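# Note: "download Winzip" is missing from the output, presumably because
# the model did not tag "Winzip" as a proper noun (PROPN)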

Total matches found: 2
Match found: downloaded Fortnite
Match found: downloading Minecraft


In [45]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


<hr>

#### Chapter 1.9

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

- Process the text with the `nlp` object.
- Iterate over the entities and print the entity text and label.
- It looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

For example:

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)
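
One possible next step is to turn the manual span into an entity by giving it a label and adding it to `doc.ents`. The sketch below assumes a blank pipeline, so it runs without a model and starts with no predicted entities. The label `"PRODUCT"` is chosen by hand for illustration.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenization only, so no entities are predicted
nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Tokens 1 and 2 are "iPhone" and "X"; the label is an assumption chosen by hand
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Keep any existing entities and append the manual one
doc.ents = list(doc.ents) + [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a loaded model, `list(doc.ents)` would also carry over whatever the model predicted, as long as the new span does not overlap an existing entity.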