### Introduction to spaCy

At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".
For example, to create an English "nlp" object, you can import the English language class from "spacy.lang.en" and instantiate it. You can then call the "nlp" object like a function to analyze text.
It contains all the different components in the pipeline.

In [1]:
# Import spaCy's English language class

from spacy.lang.en import English

# Create the nlp object

nlp = English()
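As a small sketch of what "all the different components in the pipeline" means here: a bare `English()` object ships with a tokenizer but, as far as this example assumes, no further pipeline components. The variable name `blank_nlp` is only illustrative.

```python
from spacy.lang.en import English

# A bare language class: tokenizer only, no trained components
blank_nlp = English()

# pipe_names lists the components that run after tokenization
component_names = blank_nlp.pipe_names
print(component_names)

# Tokenization still works, because the tokenizer is part of the language class
doc = blank_nlp("Hello world!")
token_texts = [token.text for token in doc]
print(token_texts)
```

Components such as a tagger or parser only appear in this list once a trained model package is loaded.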

In [2]:
nlp

<spacy.lang.en.English at 0x13131d1d0>

When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

The Doc behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index.

In [3]:
doc = nlp('The world is affected by COVID19')

for token in doc:
    print(token.text)

The
world
is
affected
by
COVID19
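To illustrate the claims that the Doc behaves like a sequence and that no information is lost, the following sketch checks its length, uses a negative index, and rebuilds the original string from the tokens. It assumes the blank English pipeline from above; `rebuilt` is a hypothetical variable name.

```python
from spacy.lang.en import English

nlp = English()
text = 'The world is affected by COVID19'
doc = nlp(text)

# Sequence behaviour: length and (negative) indexing
n_tokens = len(doc)
last_token = doc[-1].text
print(n_tokens, last_token)

# Each token remembers its trailing whitespace, so the
# original text can be reconstructed exactly
rebuilt = ''.join(token.text_with_ws for token in doc)
print(rebuilt == text)
```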


**Token objects** represent the tokens in a document – for example, a word or a punctuation character.
To get a token at a specific position, you can index into the Doc.

In [4]:
token_1 = doc[1]

print(token_1)

world


A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

To create a Span, you can use Python's slice notation:

In [6]:
doc = nlp('Liverpool will be Premier League Champions soon!')

# A slice of a Doc object is a Span object

span = doc[1:4]

print(span.text)

will be Premier
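The idea that a Span is only a view of the Doc can be checked with a short sketch: the span stores its start and end offsets and a reference to the parent Doc, and its tokens keep their original indices. The sentence is the one used above; nothing else is assumed beyond the blank English pipeline.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp('Liverpool will be Premier League Champions soon!')
span = doc[1:4]

# Token offsets of the span inside the parent Doc
print(span.start, span.end)

# The span points back to the Doc it was sliced from
print(span.doc is doc)

# Tokens in the span keep their index in the Doc
print(span[0].text, span[0].i)
```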


"is_alpha", "is_punct" and "like_num" return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation or whether it resembles a number, for example the token "10" or the word "ten".

These attributes are called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.

In [11]:
doc = nlp('Liverpool Football Club are the 2019 World Champions!!')

print('Index: ', [token.i for token in doc])
print('Text: ', [token.text for token in doc])

print('is_alpha: ', [token.is_alpha for token in doc])
print('is_punct: ', [token.is_punct for token in doc])
print('like_num: ', [token.like_num for token in doc])

Index:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Text:  ['Liverpool', 'Football', 'Club', 'are', 'the', '2019', 'World', 'Champions', '!', '!']
is_alpha:  [True, True, True, True, True, False, True, True, False, False]
is_punct:  [False, False, False, False, False, False, False, False, True, True]
like_num:  [False, False, False, False, False, True, False, False, False, False]
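A short sketch of the "10" versus "ten" case mentioned above: both resemble a number according to `like_num`, but only one of them is made of digits. The example sentence is made up for illustration.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp('I have 10 apples and ten pears')

texts = [token.text for token in doc]
like_num = [token.like_num for token in doc]
is_digit = [token.is_digit for token in doc]
is_alpha = [token.is_alpha for token in doc]

print('Text:     ', texts)
print('like_num: ', like_num)
print('is_digit: ', is_digit)
print('is_alpha: ', is_alpha)
```

Here `like_num` is true for both "10" and "ten", while `is_digit` is true only for "10" and `is_alpha` only for the words.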



Example: Finding numbers followed by a % symbol in a text

In [12]:
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        # Check if the next token is equal to '%'
        if next_token.text == '%':
            print('Percentage Found:', token.text)

Percentage Found: 60
Percentage Found: 4
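The same check can be wrapped in a small reusable function. This is only a sketch: `find_percentages` is a hypothetical helper name, and it stops one token before the end so that looking at the next token can never run past the Doc.

```python
from spacy.lang.en import English

def find_percentages(doc):
    """Return the text of every number-like token directly followed by '%'."""
    found = []
    # Stop at the second-to-last token so doc[i + 1] always exists
    for i in range(len(doc) - 1):
        if doc[i].like_num and doc[i + 1].text == '%':
            found.append(doc[i].text)
    return found

nlp = English()
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
percentages = find_percentages(doc)
print(percentages)

# A number at the very end of the text does not raise an error
no_percent = find_percentages(nlp("The total was 7"))
print(no_percent)
```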


### Statistical Models

Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.
Models are trained on large datasets of labeled example texts.
They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

#### Model Packages

spaCy provides a number of pre-trained model packages you can download using the "spacy download" command. For example, the "en_core_web_sm" package is a small English model that supports all core capabilities and is trained on web text.

The package provides the binary weights that enable spaCy to make predictions.


In [14]:
# Load the pre-trained small English model

import spacy

nlp = spacy.load('en_core_web_sm')
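Because `spacy.load` fails when the package has not been downloaded, it can help to check for it first. The sketch below assumes `spacy.util.is_package` is available in the installed spaCy version; `MODEL_NAME` and `model_available` are illustrative names.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = 'en_core_web_sm'

# True if the model is installed as a Python package
model_available = is_package(MODEL_NAME)

if model_available:
    nlp = spacy.load(MODEL_NAME)
    # A trained model adds components on top of the tokenizer
    print('Pipeline components:', nlp.pipe_names)
else:
    print('Model not installed. Try: python -m spacy download', MODEL_NAME)
```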

##### Predicting the Part of Speech

For each token in the Doc, we can print the text and the "pos_" attribute, the predicted part-of-speech tag.
In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an ID.

In [16]:
doc = nlp('Liverpool will win the Premier League soon')

for token in doc:
    print(token.text, token.pos_)

Liverpool PROPN
will VERB
win VERB
the DET
Premier PROPN
League PROPN
soon ADV
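The underscore convention can be tried without a trained model by using `orth` and `orth_` instead of `pos` and `pos_`. That substitution is an assumption made only so the sketch runs on a blank pipeline; the naming rule is the same. The ID is a hash that can be looked up in the shared string store.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp('Liverpool will win')
token = doc[0]

# With underscore: the string value
print(token.orth_)

# Without underscore: an integer ID (a hash of the string)
print(token.orth)

# The vocab's string store maps the ID back to the string, and vice versa
print(nlp.vocab.strings[token.orth])
print(nlp.vocab.strings['Liverpool'] == token.orth)
```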


##### Predicting Syntactic Dependencies

In [17]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Liverpool PROPN nsubj win
will VERB aux win
win VERB ROOT win
the DET det League
Premier PROPN compound League
League PROPN dobj win
soon ADV advmod win


In [19]:
from spacy import displacy

displacy.render(doc, style='dep')

In [25]:
spacy.explain('advmod')

'adverbial modifier'

##### Predicting Named Entities

In [50]:
doc= nlp("The CEO of Apple is Tim Cook. "
        "Its official: Apple is the first U.S. public company to reach a $1 trillion market value")

for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
Tim Cook PERSON
Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


In [51]:
spacy.explain('GPE')

'Countries, cities, states'

In [52]:
displacy.render(doc, style="ent")

In [53]:
doc = nlp("The CEO of Apple is Tim Cook. "
        "Its official: Apple is the first U.S. public company to reach a $1 trillion market value")

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))

The         DET       det       
CEO         NOUN      nsubj     
of          ADP       prep      
Apple       PROPN     pobj      
is          VERB      ROOT      
Tim         PROPN     compound  
Cook        PROPN     attr      
.           PUNCT     punct     
Its         DET       poss      
official    NOUN      dep       
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          VERB      ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

In [54]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
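When the model misses an entity like this, one option is to add it by hand as a labeled Span. The sketch below uses a blank English pipeline so it runs without a model, which means it has no predicted entities to preserve. The label 'PRODUCT' is an assumption chosen for illustration.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp('New iPhone X release date leaked as Apple reveals pre-orders by mistake')

# Create a labeled span over the tokens "iPhone X"
iphone_x = Span(doc, 1, 3, label='PRODUCT')

# Overwrite the document's entities with the manual span
doc.ents = [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained model you would typically combine the existing `doc.ents` with the new span rather than replace them.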


### Rule-Based Matching

Now we'll take a look at spaCy's matcher, which lets you write rules to find words and phrases in text.
Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.
It's also more flexible: you can search for texts but also other lexical attributes.
You can even write rules that use the model's predictions.

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.


In [61]:
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
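# spaCy v2 signature: add(key, on_match, *patterns).
# In spaCy v3 this becomes matcher.add('IPHONE_PATTERN', [pattern]).
matcher.add('IPHONE_PATTERN', None, pattern)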

In [62]:
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

In [63]:
matches

[(9528407286733565721, 1, 3)]

When you call the matcher on a doc, it returns a list of tuples. \
Each tuple consists of three values: *the match ID, the start index and the end index of the matched span.*
This means we can iterate over the matches and create a Span object: a slice of the doc at the start and end index.

In [64]:
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
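The match ID in each tuple is the hash of the pattern name, so it can be turned back into a string through the vocab's string store. The sketch below uses the list-of-patterns form of `matcher.add`, which is the spaCy v3 signature (assumed here; later v2 releases also accept it), and a blank pipeline, since the "ORTH" attribute needs no model.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
# List-of-patterns form of add()
matcher.add('IPHONE_PATTERN', [pattern])

doc = nlp('New iPhone X release date leaked')

results = []
for match_id, start, end in matcher(doc):
    # Look up the pattern name that produced this match
    pattern_name = nlp.vocab.strings[match_id]
    results.append((pattern_name, start, end, doc[start:end].text))

print(results)
```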


##### Matching Lexical Attributes

In [67]:
pattern = [
    #{'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    #{'IS_PUNCT': True}
]

doc = nlp("2018 FIFA World Cup: France won!")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
matcher.add('FIFA', None, pattern)

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

FIFA World Cup
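For comparison, here is a sketch with the two commented-out tokens enabled, so the pattern also requires a digit token before and a punctuation token after the phrase. Because "LOWER" compares lowercased text, the same pattern also matches a lowercase mention. The second sentence is made up for illustration, and the list form of `matcher.add` is assumed.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True},
]
matcher.add('FIFA', [pattern])

doc = nlp('2018 FIFA World Cup: France won!')
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)

# LOWER makes the match case-insensitive
doc_lower = nlp('we watched the 2022 fifa world cup, then went home')
matched_lower = [doc_lower[start:end].text for _, start, end in matcher(doc_lower)]
print(matched_lower)
```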


Matching pattern: a verb with the lemma "love", followed by a noun.

In [68]:
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]

doc = nlp("I loved dogs but now I love cats more.")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
matcher.add('PETS', None, pattern)

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


##### Using Operators and Quantifiers

Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key. \
Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.

In [69]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
matcher.add('CONSUME', None, pattern)

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps


{'OP': '!'} Negation: match 0 times  \
{'OP': '?'} Optional: match 0 or 1 times \
{'OP': '+'} Match 1 or more times \
{'OP': '\*'} Match 0 or more times
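As a sketch of the "+" operator from the list above: a token that may repeat can produce several overlapping candidate matches. One way to keep only the longest is `spacy.util.filter_spans`, assumed to be available in the installed version. The sentence and pattern name are made up for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = English()
matcher = Matcher(nlp.vocab)

# One or more "very" tokens followed by "good"
pattern = [{'LOWER': 'very', 'OP': '+'}, {'LOWER': 'good'}]
matcher.add('VERY_GOOD', [pattern])

doc = nlp('The food was very very good')

# Every candidate span the matcher reports
candidates = [doc[start:end] for _, start, end in matcher(doc)]
print([span.text for span in candidates])

# Keep only the longest non-overlapping spans
longest = [span.text for span in filter_spans(candidates)]
print(longest)
```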

Example: Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [70]:
pattern = [
    {'ORTH' : 'iOS'},
    {'IS_DIGIT' : True}
]

doc = nlp(
     "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

matcher = Matcher(nlp.vocab)

matcher.add('iOS', None, pattern)

matches = matcher(doc)

In [71]:
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


Example: Write one pattern that matches a token with the lemma "download" followed by a proper noun (POS tag "PROPN").

In [73]:
pattern = [
    {'LEMMA': 'download'},
    {'POS' : 'PROPN'}
]

doc = nlp(
     "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

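# Use a fresh matcher so the 'iOS' pattern from the previous example is not reused
matcher = Matcher(nlp.vocab)

matcher.add('Download_Pattern', None, pattern)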

matches = matcher(doc)

In [74]:
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip
