## Rule-based Matching
spaCy provides a rule-based matching tool called `Matcher`. You build a library of token patterns, then run the matcher on a `Doc` object to get back a list of matches. A pattern can test any token attribute, including the text and its annotations, and you can add multiple patterns to the same matcher.

In [1]:
import spacy.cli
spacy.cli.download('en_core_web_sm')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [3]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

## Defining Patterns
The next step is to define the patterns the matcher should look for. Suppose we want to find the phrase written as "quick-brown-fox", "quick brown fox", "quickbrownfox" or "quick brownfox". Each spelling splits into a different sequence of tokens, so we need one pattern per spelling, four in total:

In [4]:
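# One dict per token; 'LOWER' compares against the lowercased token text.
p1 = [{'LOWER': 'quickbrownfox'}]
# "quick-brown-fox": the hyphens are separate punctuation tokens.
p2 = [{'LOWER': 'quick'}, {'IS_PUNCT': True}, {'LOWER': 'brown'}, {'IS_PUNCT': True}, {'LOWER': 'fox'}]
# "quick brown fox": three word tokens.
p3 = [{'LOWER': 'quick'}, {'LOWER': 'brown'}, {'LOWER': 'fox'}]
# "quick brownfox": two word tokens.
p4 = [{'LOWER': 'quick'}, {'LOWER': 'brownfox'}]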
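
The four patterns above have different lengths because each spelling breaks into a different number of tokens. The sketch below makes that visible with a regular expression as a rough stand-in for tokenization. `rough_tokens` is a hypothetical helper written for this illustration, not spaCy's tokenizer, which applies more elaborate rules.

```python
import re

def rough_tokens(text):
    # Hypothetical stand-in for a tokenizer: words stay whole and each
    # punctuation character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

# Same order as p1..p4 above.
variants = ["quickbrownfox", "quick-brown-fox", "quick brown fox", "quick brownfox"]

for v in variants:
    toks = rough_tokens(v)
    print(len(toks), toks)
# 1 ['quickbrownfox']
# 5 ['quick', '-', 'brown', '-', 'fox']
# 3 ['quick', 'brown', 'fox']
# 2 ['quick', 'brownfox']
```

The token counts 1, 5, 3 and 2 line up with the number of dicts in `p1`, `p2`, `p3` and `p4`.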

In [5]:
# spaCy v2 signature; in spaCy v3 use matcher.add('QBF', [p1, p2, p3, p4])
matcher.add('QBF', None, p1, p2, p3, p4)

In [6]:
doc = nlp('The quick-brown-fox jumps over the lazy dog. The quick brown fox eats well. \
               the quickbrownfox is dead. the dog misses the quick brownfox')

In [7]:
found_matches = matcher(doc)
print(found_matches)

[(12825528024649263697, 1, 6), (12825528024649263697, 13, 16), (12825528024649263697, 21, 22), (12825528024649263697, 29, 31)]


In [8]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # look up the pattern name from its hash
    span = doc[start:end]                    # end is exclusive
    print(match_id, string_id, start, end, span.text)

12825528024649263697 QBF 1 6 quick-brown-fox
12825528024649263697 QBF 13 16 quick brown fox
12825528024649263697 QBF 21 22 quickbrownfox
12825528024649263697 QBF 29 31 quick brownfox
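
Each entry in `found_matches` is a `(match_id, start, end)` tuple. `start` and `end` are token offsets and `end` is exclusive, which is why `doc[start:end]` recovers the matched span. The sketch below shows the same slicing on a plain list that stands in for the first sentence. The token list is hand-written for illustration and is an assumption, not spaCy output.

```python
# Hand-written stand-in for the tokens of the first sentence (an assumption
# for illustration; a real Doc also tracks the whitespace between tokens).
tokens = ["The", "quick", "-", "brown", "-", "fox",
          "jumps", "over", "the", "lazy", "dog", "."]

# First match from the output above: (match_id, start, end).
match_id, start, end = (12825528024649263697, 1, 6)

span_tokens = tokens[start:end]  # end is exclusive, as in any Python slice
print(span_tokens)               # ['quick', '-', 'brown', '-', 'fox']
print(end - start)               # 5
```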


In [9]:
matcher.remove('QBF')

In [10]:
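# 'OP': '*' lets a token match zero or more times, so one pattern covers any amount of punctuation between the words
p1 = [{'LOWER': 'quick'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'brown'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'fox'}]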

In [11]:
# spaCy v2 signature; in spaCy v3 use matcher.add('QBF', [p1])
matcher.add('QBF', None, p1)

In [15]:
doc = nlp('The quick--brown--fox jumps over the  quick-brown---fox quick brown fox')

In [16]:
found_matches = matcher(doc)

In [17]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # look up the pattern name from its hash
    span = doc[start:end]                    # end is exclusive
    print(match_id, string_id, start, end, span.text)

12825528024649263697 QBF 1 6 quick--brown--fox
12825528024649263697 QBF 10 15 quick-brown---fox
12825528024649263697 QBF 15 18 quick brown fox

