# Vocabulary and Matching

So far we've seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.

Now, we will identify and label specific phrases that match patterns we can define ourselves. 

## Rule-based Matching
spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.


In [1]:
import spacy
nlp = spacy.load('en')

In [2]:
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

Here `matcher` is an object paired with the current `Vocab` object. We can add named sets of patterns to `matcher`, and remove them again, as needed.

In [3]:
pattern1 = [{'LOWER':'solarpower'}]
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True},{'LOWER':'power'}]
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]

Let's break this down:
* `pattern1` looks for a single token whose lowercase text reads 'solarpower'
* `pattern3` looks for two adjacent tokens that read 'solar' and 'power' in that order
* `pattern2` looks for three adjacent tokens, with a middle token that can be any punctuation.

Remember that single spaces are not tokenized, so they don't count as punctuation.
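To make the matching logic concrete, here is a minimal pure-Python sketch of how fixed-length token patterns like these could be checked against a token sequence. It is a toy model, not spaCy's implementation: the helper names `token_matches` and `find_matches` are hypothetical, the sentence is tokenized by hand, and only the `LOWER` and `IS_PUNCT` keys are supported.

```python
import string

def token_matches(token, spec):
    """Check one token string against one pattern dict (toy subset of keys)."""
    for key, expected in spec.items():
        if key == 'LOWER':
            actual = token.lower()
        elif key == 'IS_PUNCT':
            # Rough stand-in for spaCy's flag: every character is ASCII punctuation
            actual = bool(token) and all(ch in string.punctuation for ch in token)
        else:
            raise ValueError(f'unsupported key in this sketch: {key}')
        if actual != expected:
            return False
    return True

def find_matches(tokens, patterns):
    """Return (start, end) index pairs where any pattern matches adjacent tokens."""
    found = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            window = tokens[start:end]
            if len(window) == len(pattern) and all(
                token_matches(tok, spec) for tok, spec in zip(window, pattern)
            ):
                found.append((start, end))
    return found

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]

# Hand-tokenized; note that the spaces between words are not tokens
tokens = ['Solar', '-', 'power', 'beats', 'solar', 'power', 'and', 'SolarPower']

toy_matches = find_matches(tokens, [pattern1, pattern2, pattern3])
print(toy_matches)
```

Each result is a `(start, end)` pair with an exclusive end index, the same convention the real matcher uses for the spans it reports.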

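Once we define our patterns, we pass them into `matcher` under the name 'SolarPower' and set the *callbacks* argument to `None`. (This is the spaCy v2 call signature; v3 instead takes the patterns as a single list in the second argument.)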

In [4]:
# Register the three patterns defined above under the name 'SolarPower'
matcher.add('SolarPower',None,pattern1,pattern2,pattern3)

In [5]:
doc = nlp(u"The Solar Power industry continues to grow as solarpower increases. Solar-power is amazing.")

In [6]:
found_matches = matcher(doc)

In [7]:
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 8, 9), (8656102463236116519, 11, 14)]


In [8]:
for match_id,start,end in found_matches:
    string_id = nlp.vocab.strings[match_id]   # look up the name behind the hash
    span = doc[start:end]                     # the matched tokens
    print(match_id,string_id,start,end,span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 9 solarpower
8656102463236116519 SolarPower 11 14 Solar-power
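
The long integer in each tuple is a hash of the name 'SolarPower', and `nlp.vocab.strings` maps it back to the string. The sketch below illustrates that two-way lookup idea only. `ToyStringStore` is a hypothetical class, and `hashlib.blake2b` is a stand-in hash chosen for this example; spaCy uses its own hash function, so the numbers will differ from the output above.

```python
import hashlib

class ToyStringStore:
    """Hypothetical two-way string <-> 64-bit integer lookup."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(text):
        # Stand-in hash: an 8-byte BLAKE2b digest read as an unsigned integer
        digest = hashlib.blake2b(text.encode('utf8'), digest_size=8).digest()
        return int.from_bytes(digest, 'big')

    def add(self, text):
        """Store the string and return its integer key."""
        key = self._hash(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        # A string returns its hash; an integer returns the stored string
        if isinstance(item, str):
            return self._hash(item)
        return self._by_hash[item]

strings = ToyStringStore()
match_id = strings.add('SolarPower')
name = strings[match_id]
print(match_id, name)
```

Storing integers rather than repeated strings keeps match results compact; the string is only looked up when something needs to be displayed.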


In [9]:
# Remove the 'SolarPower' entry, along with the three patterns registered under it
matcher.remove('SolarPower')

In [10]:
pattern1 = [{'LOWER':'solarpower'}]
# 'OP':'*' belongs in the same dict as IS_PUNCT: the punctuation token may appear zero or more times
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True,'OP':'*'},{'LOWER':'power'}]

In [11]:
matcher.add('SolarPower',None,pattern1,pattern2)

In [12]:
doc2 = nlp(u"Solar--power is solarpower yay!")

In [13]:
found_matches = matcher(doc2)

In [14]:
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 4, 5)]


In [15]:
for match,s,e in found_matches:
    string_id = nlp.vocab.strings[match]
    span = doc2[s:e]    # slice doc2, the document these matches came from
    print(match,string_id,s,e,span.text)

8656102463236116519 SolarPower 0 3 Solar--power
8656102463236116519 SolarPower 4 5 solarpower


This found both the punctuated 'Solar--power' and the single-token 'solarpower'.

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
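
To see how the quantifiers change what a pattern accepts, here is a small backtracking sketch in plain Python. It is a toy model under stated assumptions, not spaCy's algorithm: `token_ok`, `match_ends` and `with_op` are hypothetical helpers, only `LOWER` and `IS_PUNCT` are checked, and the `!` operator is deliberately left out.

```python
import string

def token_ok(token, spec):
    """Toy attribute check: only LOWER and IS_PUNCT are understood; 'OP' is ignored here."""
    if 'LOWER' in spec and token.lower() != spec['LOWER']:
        return False
    if 'IS_PUNCT' in spec:
        is_punct = bool(token) and all(ch in string.punctuation for ch in token)
        if is_punct != spec['IS_PUNCT']:
            return False
    return True

def match_ends(tokens, pattern, pos=0):
    """Return every index at which `pattern` can end when it starts at `pos`."""
    if not pattern:
        return {pos}
    spec, rest = pattern[0], pattern[1:]
    op = spec.get('OP', '1')          # '1' is this sketch's label for "exactly once"
    if op not in ('1', '?', '+', '*'):
        raise ValueError(f'unsupported OP in this sketch: {op}')
    ends = set()
    if op in ('?', '*'):
        # Zero occurrences are allowed: skip this token rule entirely
        ends |= match_ends(tokens, rest, pos)
    max_repeats = 1 if op in ('1', '?') else len(tokens)
    i = pos
    while i < len(tokens) and i - pos < max_repeats and token_ok(tokens[i], spec):
        i += 1
        ends |= match_ends(tokens, rest, i)
    return ends

def with_op(op):
    """Build solar / punctuation / power with the given quantifier on the middle rule."""
    middle = {'IS_PUNCT': True}
    if op is not None:
        middle['OP'] = op
    return [{'LOWER': 'solar'}, middle, {'LOWER': 'power'}]

no_punct = ['solar', 'power']
one_punct = ['solar', '-', 'power']
two_punct = ['solar', '-', '-', 'power']

results = {}
for op in (None, '?', '+', '*'):
    results[op] = [bool(match_ends(toks, with_op(op)))
                   for toks in (no_punct, one_punct, two_punct)]
    print(op, results[op])
```

Each printed row shows whether the pattern accepts zero, one and two punctuation tokens between 'solar' and 'power', which lines up with the descriptions in the table.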


___
## PhraseMatcher
In the above section we used token patterns to perform rule-based matching. An alternative - and often more efficient - method is to match on terminology lists. In this case we convert each phrase in a list to a Doc object with `nlp`, and pass those Doc objects into a `PhraseMatcher` instead of token patterns.

In [16]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [17]:
with open('C:/Users/Admin/Desktop/Python Notebooks/Machine Learning Theory/Natural Language Processing/NLP with Python/reaganomics.txt') as f:
    doc3 = nlp(f.read())

In [18]:
phrase_list = ['voodoo economics','supply-chain economics','trickle-down economics','free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# The output below looks like a list of strings, but each element is a Doc
phrase_patterns


[voodoo economics,
 supply-chain economics,
 trickle-down economics,
 free-market economics]

In [19]:
# Each element is a Doc object, not a string
type(phrase_patterns[0])

spacy.tokens.doc.Doc

In [20]:
# Pass each Doc object into matcher (note the use of the asterisk!).
# The asterisk unpacks the list so that every Doc is passed as its own pattern argument:
matcher.add('VoodooEconomics', None, *phrase_patterns)

# Build a list of matches:
matches = matcher(doc3)
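
The asterisk matters because the patterns are collected from separate positional arguments. A minimal sketch of the difference, using a hypothetical stand-in function `add` that only mimics the argument shape of the call above (it is not spaCy's method) and plain strings in place of Doc objects:

```python
def add(key, on_match, *patterns):
    """Hypothetical stand-in with the same argument shape as the call above."""
    return {'key': key, 'on_match': on_match, 'patterns': patterns}

phrases = ['voodoo economics', 'trickle-down economics', 'free-market economics']

unpacked = add('VoodooEconomics', None, *phrases)   # three separate pattern arguments
packed = add('VoodooEconomics', None, phrases)      # one argument: the list itself

print(len(unpacked['patterns']), len(packed['patterns']))
```

Without the asterisk the function receives a single argument, the whole list, rather than one pattern per phrase.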

In [21]:
# (match_id, start, end)
matches

[(3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 2987, 2991)]

In [23]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc3[start-5:end+5]               # matched span plus 5 tokens of context on each side
    print(match_id, string_id, start, end, span.text)

3473369816841043438 VoodooEconomics 49 53 economics, referred to as trickle-down economics or voodoo economics by political
3473369816841043438 VoodooEconomics 54 56 trickle-down economics or voodoo economics by political opponents, and
3473369816841043438 VoodooEconomics 61 65 by political opponents, and free-market economics by political advocates.


3473369816841043438 VoodooEconomics 2987 2991 became widely known as "trickle-down economics", due to the
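
One caveat with `doc3[start-5:end+5]`: if a match begins within the first five tokens, `start-5` goes negative, and Python-style slicing then counts from the end of the sequence. Below is a minimal sketch of a clamped context window. It uses a plain list of strings in place of a Doc and a hypothetical helper named `context_window`, and it assumes Doc slicing follows the same negative-index convention as lists.

```python
def context_window(tokens, start, end, width=5):
    """Return tokens[start:end] widened by up to `width` tokens on each side."""
    left = max(start - width, 0)            # never let the left edge go negative
    right = min(end + width, len(tokens))   # explicit upper bound, for symmetry
    return tokens[left:right]

tokens = 'voodoo economics was a label used by political opponents of the policy'.split()

start, end = 0, 2   # a match at the very start of the text
naive = tokens[start - 5:end + 5]           # the start index -5 counts from the end
clamped = context_window(tokens, start, end)

print(naive)
print(' '.join(clamped))
```

The naive slice comes back empty for this early match, while the clamped version returns the match plus whatever context is available.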


## Other token attributes
Beyond the `LOWER` and `IS_PUNCT` attributes used above, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>
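
Because a pattern is just a list of dictionaries, these attributes can be combined freely. The sketch below defines two illustrative patterns and checks their keys against the attribute names in the table, which can catch a misspelled key before a pattern is registered. The names `percent_pattern`, `shouted_pattern` and `unknown_keys` are made up for this example, and nothing here calls spaCy; registering the patterns would use the same `matcher.add` call shown earlier.

```python
# Attribute names copied from the table above, plus the 'OP' quantifier key
KNOWN_KEYS = {
    'ORTH', 'LOWER', 'LENGTH', 'IS_ALPHA', 'IS_ASCII', 'IS_DIGIT',
    'IS_LOWER', 'IS_UPPER', 'IS_TITLE', 'IS_PUNCT', 'IS_SPACE', 'IS_STOP',
    'LIKE_NUM', 'LIKE_URL', 'LIKE_EMAIL', 'POS', 'TAG', 'DEP', 'LEMMA',
    'SHAPE', 'ENT_TYPE', 'OP',
}

# A number-like token followed by the word "percent", e.g. "10 percent"
percent_pattern = [{'LIKE_NUM': True}, {'LOWER': 'percent'}]

# One or more all-uppercase tokens, optionally followed by one punctuation token
shouted_pattern = [{'IS_UPPER': True, 'OP': '+'}, {'IS_PUNCT': True, 'OP': '?'}]

def unknown_keys(pattern):
    """Return, sorted, any keys in the pattern that are not in KNOWN_KEYS."""
    return sorted({key for spec in pattern for key in spec} - KNOWN_KEYS)

print(unknown_keys(percent_pattern), unknown_keys(shouted_pattern))
print(unknown_keys([{'LOWR': 'solar'}]))   # a deliberately misspelled key
```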