# Rule-based Matching
spaCy offers a rule-matching tool called `Matcher` that lets you build a library of token patterns. Using it is straightforward: first define the patterns you want to match, then add them to the `Matcher`, and finally apply the `Matcher` to the document you want to search.
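
As a compact illustration of those three steps, here is a minimal sketch. It assumes spaCy 3.x, where `Matcher.add` takes a list of patterns, and it uses a blank English pipeline (tokenizer only) so no model download is needed. The `GREETING` name and the sample sentence are made up for the example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is enough for token-text patterns.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Step 1: define a pattern, one dict per token.
greeting = [{"LOWER": "hello"}, {"LOWER": "world"}]

# Step 2: register the pattern under a name (list form, spaCy 3.x).
matcher.add("GREETING", [greeting])

# Step 3: apply the matcher to a Doc and read back the matched spans.
doc = nlp("Hello world! Then someone typed hello World again.")
matches = sorted(matcher(doc), key=lambda m: m[1])  # order by start token
found = [doc[start:end].text for _, start, end in matches]
print(found)
```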

#### Import spaCy

In [2]:
import spacy

In [3]:
nlp = spacy.load('en_core_web_sm')

#### Import the Matcher class

In [4]:
from spacy.matcher import Matcher

In [5]:
matcher = Matcher(nlp.vocab)

#### Define Pattern

In [7]:
p1 = [{'LOWER':'quickbrownfox'}]
p2 = [{'LOWER':'quick'},{'IS_PUNCT':True},{'LOWER':'brown'},{'IS_PUNCT':True},{'LOWER':'fox'}]
p3 = [{'LOWER':'quick'},{'LOWER':'brown'},{'LOWER':'fox'}]
p4 = [{'LOWER':'quick'},{'LOWER':'brownfox'}]

**Take note:**
1. p1 looks for the single token "quickbrownfox"
2. p2 looks for "quick-brown-fox", where each hyphen is a separate punctuation token matched by `IS_PUNCT`
3. p3 looks for the phrase "quick brown fox"
4. p4 looks for the phrase "quick brownfox"

**The token attribute `LOWER` compares against the lowercase form of each token's text, so the match is case-insensitive.**
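
To make the effect of `LOWER` concrete, the sketch below compares it with `ORTH`, which matches the exact token text. It assumes spaCy 3.x and a blank English pipeline; the rule names `ANY_CASE` and `EXACT` are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp("Fox fox FOX")

# LOWER compares the lowercased token text, so case is ignored.
lower_matcher = Matcher(nlp.vocab)
lower_matcher.add("ANY_CASE", [[{"LOWER": "fox"}]])

# ORTH compares the verbatim token text, so case must agree.
orth_matcher = Matcher(nlp.vocab)
orth_matcher.add("EXACT", [[{"ORTH": "fox"}]])

lower_hits = [doc[s:e].text for _, s, e in sorted(lower_matcher(doc), key=lambda m: m[1])]
orth_hits = [doc[s:e].text for _, s, e in sorted(orth_matcher(doc), key=lambda m: m[1])]
print(lower_hits)
print(orth_hits)
```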

In [8]:
matcher.add('QBF',None,p1,p2,p3,p4)
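
The call above uses the spaCy 2.x signature, in which the second positional argument is an optional `on_match` callback. In spaCy 3.x the patterns are passed as a single list instead. The sketch below shows that form under the assumption of spaCy 3.x and a blank English pipeline; the sentence is the same one used in this walkthrough, written on one line.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

p1 = [{"LOWER": "quickbrownfox"}]
p2 = [{"LOWER": "quick"}, {"IS_PUNCT": True}, {"LOWER": "brown"},
      {"IS_PUNCT": True}, {"LOWER": "fox"}]
p3 = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]
p4 = [{"LOWER": "quick"}, {"LOWER": "brownfox"}]

# spaCy 3.x form: one list holding all patterns for the 'QBF' rule.
matcher.add("QBF", [p1, p2, p3, p4])

doc = nlp("The quick-brown-fox jumps over the lazy dog. The quick brown fox eats well. "
          "the quickbrownfox is dead. the dog misses the quick brownfox")
matches = sorted(matcher(doc), key=lambda m: m[1])  # order by start token
qbf_hits = [doc[start:end].text for _, start, end in matches]
print(qbf_hits)
```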

#### QBF is the name of our rule. Test the matcher on a document.

In [10]:
sentence = nlp(u'The quick-brown-fox jumps over the lazy dog. The quick brown fox eats well. \
               the quickbrownfox is dead. the dog misses the quick brownfox')

In [11]:
phrase_matches = matcher(sentence)

In [12]:
print(phrase_matches)

[(12825528024649263697, 1, 6), (12825528024649263697, 13, 16), (12825528024649263697, 21, 22), (12825528024649263697, 29, 31)]


From the output, you can see that four phrases have been matched. The first long number in each tuple is the hash ID of the rule that matched (here `QBF`); the second and third numbers are the start and end token indices of the matched span, where the end index is exclusive.

#### Print the matches in a more readable form

In [14]:
for match_id, start, end in phrase_matches:
    rule_name = nlp.vocab.strings[match_id]  # look up the rule name from its hash
    span = sentence[start:end]               # matched tokens; end index is exclusive
    print(match_id, rule_name, span, start, end, span.text)

12825528024649263697 QBF quick-brown-fox 1 6 quick-brown-fox
12825528024649263697 QBF quick brown fox 13 16 quick brown fox
12825528024649263697 QBF quickbrownfox 21 22 quickbrownfox
12825528024649263697 QBF quick brownfox 29 31 quick brownfox
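
If you only need the matched spans, spaCy 3.x can also return `Span` objects directly through the `as_spans=True` argument, with the rule name available as `span.label_`. The following is a small sketch under that assumption (spaCy 3.x, blank English pipeline), using a shortened sample sentence.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)
matcher.add("QBF", [[{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]])

doc = nlp("The quick brown fox eats well.")

# as_spans=True returns Span objects instead of (match_id, start, end) tuples.
spans = matcher(doc, as_spans=True)
rows = [(span.label_, span.text, span.start, span.end) for span in spans]
print(rows)
```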


#### Use regex-style operators

Token patterns can also carry regex-style quantifiers through the `OP` key. For instance, the `*` operator matches a token zero or more times, while `+` matches it one or more times.

Let's write a simple pattern that can identify the phrase "quick--brown--fox" or "quick-brown---fox".
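
To see how the quantifiers differ, the sketch below runs a `*` pattern and a `+` pattern over the same text. With `*` the punctuation is optional, so the plain phrase also matches; with `+` at least one punctuation token is required. It assumes spaCy 3.x and a blank English pipeline, and the rule names `STAR` and `PLUS` are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp("quick brown fox and quick--brown fox")

# '*': zero or more punctuation tokens between the words.
star_matcher = Matcher(nlp.vocab)
star_matcher.add("STAR", [[{"LOWER": "quick"}, {"IS_PUNCT": True, "OP": "*"},
                           {"LOWER": "brown"}, {"IS_PUNCT": True, "OP": "*"},
                           {"LOWER": "fox"}]])

# '+': at least one punctuation token must follow "quick".
plus_matcher = Matcher(nlp.vocab)
plus_matcher.add("PLUS", [[{"LOWER": "quick"}, {"IS_PUNCT": True, "OP": "+"},
                           {"LOWER": "brown"}, {"LOWER": "fox"}]])

star_hits = [doc[s:e].text for _, s, e in sorted(star_matcher(doc), key=lambda m: m[1])]
plus_hits = [doc[s:e].text for _, s, e in sorted(plus_matcher(doc), key=lambda m: m[1])]
print(star_hits)
print(plus_hits)
```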

In [15]:
matcher.remove('QBF')

In [17]:
p1 = [{'LOWER': 'quick'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'brown'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'fox'}]
matcher.add('QBF', None, p1)

In [20]:
sentence = nlp(u'The quick--brown--fox jumps over the  quick-brown---fox')

In [21]:
phrase_matches = matcher(sentence)

In [22]:
for match_id, start, end in phrase_matches:
    rule_name = nlp.vocab.strings[match_id]  # look up the rule name from its hash
    span = sentence[start:end]               # matched tokens; end index is exclusive
    print(match_id, rule_name, span, start, end, span.text)

12825528024649263697 QBF quick--brown--fox 1 6 quick--brown--fox
12825528024649263697 QBF quick-brown---fox 10 15 quick-brown---fox


# PhraseMatcher
In the section above we used token patterns to perform rule-based matching. An alternative, and often more efficient, method is to match on terminology lists. In this case we create a `Doc` object for each phrase in the list and pass those objects to a `PhraseMatcher` instead of writing token patterns.
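
Since the text file used in this section may not be at hand, here is a self-contained sketch of the same idea on a made-up sentence. It assumes spaCy 3.x and a blank English pipeline, and it builds the phrase `Doc` objects with `nlp.make_doc`, which only runs the tokenizer. The rule name `ECON_TERMS` and the sample text are illustrative.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = PhraseMatcher(nlp.vocab)

terms = ["voodoo economics", "free-market economics"]
# make_doc tokenizes each phrase without running other pipeline components.
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("ECON_TERMS", patterns)

doc = nlp("Critics called it voodoo economics, while supporters "
          "preferred free-market economics.")
matches = sorted(matcher(doc), key=lambda m: m[1])  # order by start token
hits = [doc[start:end].text for _, start, end in matches]
print(hits)
```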

In [23]:
import spacy
from spacy.matcher import PhraseMatcher

In [31]:
matcher = PhraseMatcher(nlp.vocab)

In [26]:
with open('reaganomics.txt', encoding= 'unicode_escape') as f:
    doc2 = nlp(f.read())

In [29]:
phrase_list = ['vodoo economics','supply-side economics','trickle-down economics','free-market economics']

In [30]:
phrase_pattern = [nlp(text) for text in phrase_list]

In [32]:
phrase_pattern

[vodoo economics,
 supply-side economics,
 trickle-down economics,
 free-market economics]

In [35]:
matcher.add('EMatcher',None,*phrase_pattern)

In [37]:
found_matches = matcher(doc2)

In [39]:
for match_id, start, end in found_matches:
    rule_name = nlp.vocab.strings[match_id]  # look up the rule name from its hash
    span = doc2[start:end]                   # matched tokens; end index is exclusive
    print(match_id, rule_name, span, start, end, span.text)

4361118297309102001 EMatcher supply-side economics 41 45 supply-side economics
4361118297309102001 EMatcher trickle-down economics 49 53 trickle-down economics
4361118297309102001 EMatcher free-market economics 61 65 free-market economics
4361118297309102001 EMatcher supply-side economics 673 677 supply-side economics
4361118297309102001 EMatcher trickle-down economics 2986 2990 trickle-down economics
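
By default `PhraseMatcher` compares the exact token text, so a capitalized variant of a phrase is missed. One way around this, sketched below, is to construct the matcher with `attr="LOWER"`. The example assumes spaCy 3.x and a blank English pipeline; the sample sentence and rule names are made up.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp("Supply-Side Economics is another name for supply-side economics.")
pattern = nlp.make_doc("supply-side economics")

# Default behaviour: exact token text must agree.
exact_matcher = PhraseMatcher(nlp.vocab)
exact_matcher.add("EXACT", [pattern])

# attr="LOWER": compare lowercased token text instead.
lower_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
lower_matcher.add("ANY_CASE", [pattern])

exact_hits = [doc[s:e].text for _, s, e in sorted(exact_matcher(doc), key=lambda m: m[1])]
lower_hits = [doc[s:e].text for _, s, e in sorted(lower_matcher(doc), key=lambda m: m[1])]
print(exact_hits)
print(lower_hits)
```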
