# Vocabulary and Matching
So far we've seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.

In this section we will identify and label specific phrases that match patterns we can define ourselves. 

## Rule-based Matching
spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [0]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [0]:
# Import the Matcher class
from spacy.matcher import Matcher
# The matcher shares the pipeline's Vocab object.
# Named sets of patterns can be added to it and removed from it as needed.
matcher = Matcher(nlp.vocab)

### Creating patterns
In the literature, the phrase 'solar power' might appear as one word or two, with or without a hyphen. In this section we'll add a set of patterns named 'SolarPower_matcher' that finds all three forms:

In [0]:
# SolarPower
pattern1 = [{'LOWER': 'solarpower'}]
# Solar power
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
# Solar-power
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

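# spaCy v2 signature: name, optional callback, then the patterns.
# In spaCy v3 the patterns are passed as a list instead:
# matcher.add('SolarPower_matcher', [pattern1, pattern2, pattern3])
matcher.add('SolarPower_matcher', None, pattern1, pattern2, pattern3)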

### Applying the matcher to a Doc object

In [0]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity. Solar power is the future.')

In [0]:
found_matches = matcher(doc)
print(found_matches)

[(18108190749154820851, 1, 3), (18108190749154820851, 10, 11), (18108190749154820851, 13, 16), (18108190749154820851, 21, 23)]


In [0]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

18108190749154820851 SolarPower_matcher 1 3 Solar Power
18108190749154820851 SolarPower_matcher 10 11 solarpower
18108190749154820851 SolarPower_matcher 13 16 Solar-power
18108190749154820851 SolarPower_matcher 21 23 Solar power


The `match_id` is simply the hash value of the `string_id` 'SolarPower_matcher'.
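The link between the name and its hash can be checked directly. The following is a minimal sketch, assuming spaCy is installed. It uses a blank English pipeline (tokenizer only) rather than `en_core_web_sm`, and the list-based `matcher.add(name, [pattern])` call, which is the spaCy v3 signature and not the one used in the cells above.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: a tokenizer-only pipeline is enough here
matcher = Matcher(nlp.vocab)
matcher.add("SolarPower_matcher", [[{"LOWER": "solarpower"}]])

doc = nlp("Demand for solarpower increases.")
match_id, start, end = matcher(doc)[0]

# The StringStore maps in both directions: hash -> string and string -> hash
name = nlp.vocab.strings[match_id]
same_hash = nlp.vocab.strings["SolarPower_matcher"] == match_id
print(name, same_hash)
```

Because the lookup works in both directions, the integer returned by the matcher never needs to be stored alongside the name.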

### Setting pattern options and quantifiers
You can make a token rule optional by adding an `'OP'` key to it. With `'OP': '*'` the punctuation token may appear zero or more times, which lets us merge the two-word and hyphenated patterns into one:

In [0]:
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower_matcher')

# Add the new set of patterns under the name 'Another_SolarPower_matcher':
matcher.add('Another_SolarPower_matcher', None, pattern1, pattern2)

In [0]:
found_matches = matcher(doc)
print(found_matches)

[(3248665728714946131, 1, 3), (3248665728714946131, 10, 11), (3248665728714946131, 13, 16), (3248665728714946131, 21, 23)]


In [0]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

3248665728714946131 Another_SolarPower_matcher 1 3 Solar Power
3248665728714946131 Another_SolarPower_matcher 10 11 solarpower
3248665728714946131 Another_SolarPower_matcher 13 16 Solar-power
3248665728714946131 Another_SolarPower_matcher 21 23 Solar power


The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >`!`</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >`?`</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >`+`</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >`*`</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
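To see a quantifier other than `*` in action, here is a small sketch using `?`. It is an illustration under assumptions, not part of the original notebook: it relies on a blank English pipeline (tokenizer only), the pattern name `OptionalPunct` is made up for this example, and it uses the spaCy v3 list-based `matcher.add` signature.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer only, no statistical model needed
matcher = Matcher(nlp.vocab)

# '?' lets the punctuation token appear zero times or once
optional_punct = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "power"}]
matcher.add("OptionalPunct", [optional_punct])  # hypothetical name

doc = nlp("Solar power and solar-power")
found = sorted(doc[start:end].text for _, start, end in matcher(doc))
print(found)
```

Using `?` instead of `*` is the stricter choice when at most one separator is expected between the two words.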


### Be careful with lemmas!
If we wanted to match both 'solar power' and 'solar powered', it might be tempting to match on the *lemma* of 'powered' and expect it to be 'power'. This is not always the case! The lemma depends on the part-of-speech tag the model assigns: when 'powered' is tagged as an *adjective*, its lemma is still 'powered':

In [0]:
# Match on the lemma of the final token instead of its lowercase text:
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}]

# Add the pattern under a new name, 'SolarPower2'.
# 'Another_SolarPower_matcher' from above stays registered.
matcher.add('SolarPower2', None, pattern2)

In [0]:
doc3 = nlp(u'Solar-powered energy runs solar-powering vehicles.')

In [0]:
found_matches = matcher(doc3)
print(found_matches)

[(3587797832942597511, 0, 3), (3587797832942597511, 5, 8)]


In [0]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc3[start:end]                   # get the matched span from doc3, not doc
    print(match_id, string_id, start, end, span.text)

3587797832942597511 SolarPower2 0 3 Solar-powered
3587797832942597511 SolarPower2 5 8 solar-powering


## Other token attributes
Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>

### Token wildcard
You can pass an empty dictionary `{}` as a wildcard to represent **any token**. For example, you might want to retrieve hashtags without knowing what might follow the `#` character:
>`[{'ORTH': '#'}, {}]`
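The wildcard pattern above can be tried on a short sentence. This is a minimal sketch, assuming a blank English pipeline whose tokenizer splits `#` from the word that follows it, and the spaCy v3 list-based `matcher.add` signature. The pattern name `Hashtag` and the sample sentence are invented for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: default English tokenizer rules
matcher = Matcher(nlp.vocab)

# '#' followed by any single token
matcher.add("Hashtag", [[{"ORTH": "#"}, {}]])  # hypothetical name

doc = nlp("Reading about #nlp and #spacy today")
hashtags = sorted(doc[start:end].text for _, start, end in matcher(doc))
print(hashtags)
```

The empty dictionary places no constraint on the second token, so it would also match a number or a punctuation mark that follows `#`.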

___
## PhraseMatcher
In the section above we used token patterns to perform rule-based matching. An alternative, and often more efficient, method is to match against terminology lists. In this case we use `PhraseMatcher`: each phrase in the list is turned into a Doc object, and those Doc objects are added to the matcher as patterns.

In [0]:
# Perform standard imports, reset nlp
import spacy
nlp = spacy.load('en_core_web_sm')

In [0]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

For this exercise we're going to import the Wikipedia article on Reaganomics.
Source: https://en.wikipedia.org/wiki/Reaganomics

In [0]:
!pip install html2text

In [0]:
# Get the HTML of the Wikipedia article
import requests

link = "https://en.wikipedia.org/wiki/Reaganomics"
html = requests.get(link).text

In [0]:
html[:100]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'

In [0]:
import html2text
text = html2text.html2text(html)
print(text[11000:12000])

Nixon")'s [wage and price
controls](/wiki/Wage_and_price_controls#United_States "Wage and price
controls") were phased out.[7] The [federal oil
reserves](/wiki/Strategic_Petroleum_Reserve_\(United_States\) "Strategic
Petroleum Reserve \(United States\)") were created to ease any future short
term shocks. President [Jimmy Carter](/wiki/Jimmy_Carter "Jimmy Carter") had
begun phasing out price controls on petroleum while he created the Department
of Energy. Much of the credit for the resolution of the stagflation is given
to two causes: a three-year contraction of the money supply by the [Federal
Reserve Board](/wiki/Federal_Reserve_Board "Federal Reserve Board") under
[Paul Volcker](/wiki/Paul_Volcker "Paul Volcker"), initiated in the last year
of Carter's presidency, and long-term easing of supply and pricing in oil
during the [1980s oil glut](/wiki/1980s_oil_glut "1980s oil glut").[
_[citation needed](/wiki/Wikipedia:Citation_needed "Wikipedia:Citation
needed")_ ]

In stating that his 

In [0]:
# An example from the spaCy documentation: https://spacy.io/usage/rule-based-matching#phrasematcher-attrs

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

Angela Merkel
Barack Obama
Washington, D.C.


In [0]:
# Reset nlp and the matcher for another attempt.
# Passing attr="LOWER" makes the search case-insensitive.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["supply-side economics", "trickle-down economics"]

# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("oooo", None, *patterns)

# Parse the article text; later cells refer to this Doc as doc_reaganomics
doc_reaganomics = nlp(text)
matches = matcher(doc_reaganomics)
for match_id, start, end in matches:
    span = doc_reaganomics[start:end]
    print(span.text)

Supply-side economics
Supply-side economics
Trickle-Down Economics
Supply-side economics


In [0]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc_reaganomics[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

231001732085949855 oooo 1218 1222 Supply-side economics
231001732085949855 oooo 2647 2651 Supply-side economics
231001732085949855 oooo 13645 13649 Trickle-Down Economics
231001732085949855 oooo 20710 20714 Supply-side economics


## Viewing Matches
There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match:

In [0]:
doc_reaganomics[2620:2670]

Deal") policies. At the same time he attracted a
following from the [supply-side economics](/wiki/Supply-side_economics
"Supply-side economics") movement, which formed in opposition to
[Keynesian](/wiki/Keynesian "Keynesian") demand

Another way is to use the sentence boundaries in `doc.sents` (set here by the pipeline's parser; a `sentencizer` component would also provide them), then iterate through the sentences up to the match point:

In [0]:
sents = [sent for sent in doc_reaganomics.sents]
print(sents[0].start, sents[0].end)

0 19


In [0]:
for sent in sents:
    if matches[3][1] < sent.end:  # the fourth match, which starts at doc_reaganomics[20710]
        print(sent)
        break

[Supply-side](/wiki/Supply-side_economics "Supply-side economics")
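The sentence lookup in the loop above can be wrapped in a function that works for any match. This is a sketch under assumptions: `sentence_bounds_for` is a hypothetical helper, and the boundary list is made-up data of the kind produced by `[(s.start, s.end) for s in doc.sents]`. spaCy also exposes the enclosing sentence of a span directly as `span.sent` when sentence boundaries are available.

```python
def sentence_bounds_for(sent_bounds, token_index):
    """Return the (start, end) pair of the sentence containing token_index.

    `sent_bounds` is a list of (start, end) token offsets, end exclusive.
    Returns None if no sentence contains the index.
    """
    for sent_start, sent_end in sent_bounds:
        if sent_start <= token_index < sent_end:
            return (sent_start, sent_end)
    return None


# Hypothetical boundaries for a three-sentence document
bounds = [(0, 19), (19, 42), (42, 60)]
containing = sentence_bounds_for(bounds, 25)
print(containing)
```

Checking both ends of the range makes the lookup independent of the order in which the sentences are visited.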
  
