# Vocabulary and Matching

This notebook shows how tokens or groups of tokens can be found (matched) in a text. The approach is similar to applying regular expressions, but patterns are defined with dictionaries instead, which makes them more expressive and usually less cryptic.

Overview of contents:
1. Rule-Based Matching: like regex to find tokens, but with rules defined using dictionaries and pre-defined keys.
    - 1.1 Pattern Options and Further Keys
2. Phrase Matching: same as before, but applied to groups of words (i.e., phrases), not just single tokens.

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided via a download link; I haven't found a repository to fork from.*

## 1. Rule-Based Matching

Rule-based matching works like a more powerful regex; however, instead of cryptic symbols, patterns are defined as dictionaries with pre-defined keys. This kind of matching can be used to find tokens or lemmas. Note that it is usually safer to match on the token text, since the same token string can have different lemmas (e.g., depending on whether the word is a noun, verb, or adjective).

In [34]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [35]:
# Import the Matcher class
# The matcher is bound to the Vocab of the current pipeline
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [36]:
# The phrase 'solar power' might appear
# as one word or two, with or without a hyphen.
# In this section we'll develop a matcher named 'SolarPower' that finds all three
pattern1 = [{'LOWER': 'solarpower'}] # 'solarpower'
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}] # 'solar' 'power'
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}] # 'solar' any punctuation (-) 'power'
# Arguments of matcher.add:
# - key: name of the matcher rule
# - list of patterns
# - optional callback: on_match=None
matcher.add('SolarPower', [pattern1, pattern2, pattern3], on_match=None)

In [37]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

In [38]:
# List of tuples of matches returned: (match_id, start token pos in Doc, end token pos in Doc)
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


In [39]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation of match_id: matcher Key
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power
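
The `on_match` argument of `matcher.add` was left as `None` above. As a minimal sketch of what a callback could look like, the example below collects every matched span in a list. It assumes a blank English pipeline (`spacy.blank('en')`) so that no trained model is needed, and the names `collect_span` and `collected` are illustrative only; they do not come from the course material.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is enough here
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

collected = []  # hypothetical container filled by the callback

def collect_span(matcher, doc, i, matches):
    # spaCy calls the callback once per match; i indexes into matches
    match_id, start, end = matches[i]
    collected.append(doc[start:end].text)

pattern = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
matcher.add('SolarPower', [pattern], on_match=collect_span)

doc = nlp('Solar power is cheap. Many homes use solar power.')
matches = matcher(doc)  # running the matcher triggers the callback
print(collected)
```

A callback like this is mainly useful when something should happen as a side effect of matching (e.g., logging or tagging spans); for simply listing matches, iterating over the returned tuples as above is enough.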


### 1.1 Pattern Options and Further Keys

With `OP`, we can attach a quantifier to a token pattern. For instance, `'OP': '*'` allows that token pattern to match zero or more times.

In [47]:
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}] # solarpower
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}] # 'solar', zero or more punctuation tokens, 'power'

In [48]:
# Remove the old patterns to avoid duplication
matcher.remove('SolarPower')

In [49]:
# Add the new set of patterns to the 'SolarPower' matcher
matcher.add('SolarPower', [pattern1, pattern2])

In [50]:
doc2 = nlp(u"Solar--power is solarpower!")

In [51]:
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 4, 5)]


The following quantifiers can be passed to the `'OP'` key:

<table><tr><th>OP</th><th>Description</th></tr>
<tr ><td><span >!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
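
To make the difference between the quantifiers more concrete, the sketch below applies `?`, `+` and `*` to the punctuation slot of the 'solar ... power' pattern. It assumes a blank English pipeline, and `matched_texts` is a hypothetical helper written for this illustration; the `!` quantifier is left out.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is enough here
nlp = spacy.blank('en')

def matched_texts(op, text):
    """Hypothetical helper: match 'solar' <punct with quantifier op> 'power'."""
    matcher = Matcher(nlp.vocab)
    pattern = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': op}, {'LOWER': 'power'}]
    matcher.add('SolarPower', [pattern])
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]

for op in ['?', '+', '*']:
    for text in ['solar power', 'solar-power', 'solar - - power']:
        print(op, repr(text), matched_texts(op, text))
```

With `?`, at most one punctuation token may sit between the two words; with `+`, at least one is required; with `*`, any number (including none) is accepted.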


Besides `OP`, the dictionaries that define a pattern can use the following token attributes as keys (the list includes `LOWER` and `IS_PUNCT`, which were used above):

<table><tr><th>Attribute</th><th>Description</th></tr>
<tr ><td><span >`LEMMA`</span></td><td>The lemma of a token; be careful: the same token might have different lemmas (e.g., depending on whether it is a verb, noun, or adjective)</td></tr>
<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>
</table>
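
As a small illustration of how these keys combine, the sketch below uses `LIKE_NUM` and `SHAPE`, two lexical attributes that work without a trained model (keys such as `POS`, `TAG`, `DEP`, `LEMMA` and `ENT_TYPE` need a pipeline that actually sets them). The blank pipeline, the rule names `Percentage` and `Year`, and the sentence are assumptions made for this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is enough here
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

# A number-like token followed by the word 'percent'
matcher.add('Percentage', [[{'LIKE_NUM': True}, {'LOWER': 'percent'}]])
# Any token whose shape is four digits, e.g. a year
matcher.add('Year', [[{'SHAPE': 'dddd'}]])

doc = nlp('In 2020 output grew ten percent, and in 2021 it grew 5 percent.')

# Group the matched texts by rule name, in document order
found = {}
for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
    key = nlp.vocab.strings[match_id]
    found.setdefault(key, []).append(doc[start:end].text)
print(found)
```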

#### Wildcards: Hashtag Searching Example

We can pass an empty dictionary `{}` as a wildcard to represent **any token**. For example, to retrieve hashtags without knowing what might follow the `#` character:

```python
[{'ORTH': '#'}, {}]
```
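
A minimal runnable sketch of the hashtag pattern could look as follows. It assumes a blank English pipeline, whose tokenizer splits `#` from the word that follows it; the rule name `Hashtag` and the sentence are made up for the example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is enough here
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

# '#' followed by any single token (the empty dict is the wildcard)
matcher.add('Hashtag', [[{'ORTH': '#'}, {}]])

doc = nlp('Follow #spacy and #nlp today')
hashtags = [doc[start:end].text for _, start, end in matcher(doc)]
print(hashtags)
```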

## 2. Phrase Matching

Instead of describing each token with a dictionary, we can match groups of words (i.e., phrases) directly. For long lists of terms, this is typically more efficient than writing a token pattern for every phrase.

Text used in this section: [Reaganomics from Wikipedia](https://en.wikipedia.org/wiki/Reaganomics).

In [82]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [83]:
# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [84]:
with open('../data/reaganomics.txt', encoding='cp1252') as f:
    doc3 = nlp(f.read())

In [85]:
# Create a list of match phrases
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [86]:
# Convert each phrase to a Doc object
phrase_patterns = [nlp(text) for text in phrase_list]

In [87]:
phrase_patterns

[voodoo economics,
 supply-side economics,
 trickle-down economics,
 free-market economics]

In [88]:
type(phrase_patterns[0])

spacy.tokens.doc.Doc

In [89]:
# Pass the list of Doc objects to the matcher
# (older spaCy v2 releases used: matcher.add('VoodooEconomics', None, *phrase_patterns))
matcher.add('VoodooEconomics', phrase_patterns)

In [90]:
# Build a list of matches
matches = matcher(doc3)

In [91]:
# Display matches
# (match_id, start, end)
matches

[(3473369816841043438, 41, 45),
 (3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 673, 677),
 (3473369816841043438, 2987, 2991)]

In [92]:
# The first 4 are in the first 70 tokens
doc3[:70]

REAGANOMICS
https://en.wikipedia.org/wiki/Reaganomics

Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.


In [93]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation of match_id: matcher Key
    span = doc3[max(start - 2, 0):end + 2]   # matched span, widened by two tokens per side for context
    print(match_id, string_id, start, end, span.text)

3473369816841043438 VoodooEconomics 41 45 associated with supply-side economics, referred
3473369816841043438 VoodooEconomics 49 53 to as trickle-down economics or voodoo
3473369816841043438 VoodooEconomics 54 56 economics or voodoo economics by political
3473369816841043438 VoodooEconomics 61 65 , and free-market economics by political
3473369816841043438 VoodooEconomics 673 677 from the supply-side economics movement,
3473369816841043438 VoodooEconomics 2987 2991 as "trickle-down economics",
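
By default, `PhraseMatcher` compares the verbatim token text, so a capitalized variant such as 'Voodoo Economics' would not be found by the patterns above. As a hedged sketch that goes beyond the course notebook, the example below passes `attr='LOWER'` to match case-insensitively and builds the patterns with `nlp.make_doc`, which only runs the tokenizer. The blank pipeline, the rule name `EconTerms` and the sentence are assumptions for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline (tokenizer only) is enough here
nlp = spacy.blank('en')

# attr='LOWER' compares lowercase forms instead of the verbatim text
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
terms = ['voodoo economics', 'supply-side economics']
matcher.add('EconTerms', [nlp.make_doc(term) for term in terms])

doc = nlp('Critics called it Voodoo Economics, while advocates '
          'preferred supply-side economics.')

# Sort by start position to list the spans in document order
spans = [doc[start:end].text
         for _, start, end in sorted(matcher(doc), key=lambda m: m[1])]
print(spans)
```

Note that 'supply-side economics' spans four tokens (`supply`, `-`, `side`, `economics`), which is why the matches in the Reaganomics text above are mostly four tokens long.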
