# 1. Rule-based matching

spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for, they also give you access to the tokens within the document and their relationships.

This means you can easily access and analyze the surrounding tokens, merge spans into single tokens, or add entries to the named entities in doc.ents.
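
As a rough illustration of the last two points, the sketch below registers a match as an entity and then merges it into a single token. It assumes a blank English pipeline (`spacy.blank("en")`) so that no trained model is needed, and it uses the list-of-patterns form of `matcher.add` available in spaCy 2.2.2 and later. The rule name and entity label `GREETING` are made up for this example.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Assumption: a tokenizer-only pipeline is enough for this illustration.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "GREETING" is a hypothetical rule name, also reused as the entity label.
matcher.add("GREETING", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Every tutorial starts with Hello World somewhere")
match_id, start, end = matcher(doc)[0]

# Add the matched span to the named entities in doc.ents.
entity = Span(doc, start, end, label="GREETING")
doc.ents = list(doc.ents) + [entity]
entity_texts = [(ent.text, ent.label_) for ent in doc.ents]

# Merge the matched span into a single token.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[start:end])

print(entity_texts)
print([token.text for token in doc])
```

After the merge, the two matched tokens are exposed as one token, so the document is one token shorter than before.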


In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [14]:
# Import the Matcher class
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab) # create a matcher object and pass it nlp.vocab

# The matcher is now tied to the current Vocab object
# We can add and remove named match rules on it as needed

## Creating patterns

In [15]:
# A pattern is a list of dictionaries, one dictionary per token

# "Hello World" can appear in the following ways:
# 1) Hello World, hello world, HellO WORLd
# 2) Hello-World

pattern_1 = [{'LOWER': 'hello'}, {'LOWER': 'world'}]
pattern_2 = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]  # exactly one punctuation token between the words

# 'LOWER' and 'IS_PUNCT' are token attributes
# The attribute names must be written exactly as shown

In [16]:
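# Add the patterns to the matcher object

# A match rule consists of:
# 1) an ID key
# 2) an on_match callback (None if no callback is needed)
# 3) one or more patterns

# This is the spaCy v2 call style; in spaCy v3 the patterns are passed as a list:
# matcher.add('find_hello_world', [pattern_1, pattern_2])
matcher.add('find_hello_world', None, pattern_1, pattern_2)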
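
The rule above passes None for the on_match callback. As a hedged sketch of what a callback could look like, the example below collects the text of every match as it is found. It assumes a blank English pipeline and the keyword form `on_match=` (spaCy 2.2.2 and later); `collect_match` and `matched_texts` are hypothetical names chosen for this illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is enough for this illustration.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

matched_texts = []  # hypothetical container filled by the callback


def collect_match(matcher, doc, i, matches):
    """Called once per match; i is the index of the current match in matches."""
    match_id, start, end = matches[i]
    matched_texts.append(doc[start:end].text)


pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("find_hello_world", [pattern], on_match=collect_match)

doc = nlp("She typed hello world and then HELLO WORLD again")
matches = matcher(doc)  # the callback runs while the matcher is applied

print(matched_texts)
```

A callback is useful when something should happen for each match as a side effect, while the matcher still returns the usual list of tuples.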

In [21]:
# create a document
doc = nlp(" 'Hello World' are the first two printed HellO WORLD words for hello, world most of the programmers, printing 'Hello-World' is most common for beginners")

In [22]:
doc

 'Hello World' are the first two printed HellO WORLD words for hello, world most of the programmers, printing 'Hello-World' is most common for beginners

## Finding matches

In [23]:
find_matches = matcher(doc) # pass the doc to the matcher and store the result
print(find_matches)

# The result is a list of tuples
# Each tuple holds the match ID (the hash of the string ID), the start token index and the end token index

[(8752807494209775780, 2, 4), (8752807494209775780, 10, 12), (8752807494209775780, 14, 17), (8752807494209775780, 24, 27)]
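
The large integer in each tuple is the hash that the vocabulary's string store assigns to the rule name. A minimal sketch of that round trip, assuming a blank English pipeline, is shown here; the variable names are illustrative only.

```python
import spacy

# Assumption: a tokenizer-only pipeline is enough for this illustration.
nlp = spacy.blank("en")

rule_name = "find_hello_world"

# Adding a string to the StringStore returns its integer hash.
match_id = nlp.vocab.strings.add(rule_name)

# Looking up the hash returns the string, and looking up the string returns the hash.
recovered_name = nlp.vocab.strings[match_id]
same_hash = nlp.vocab.strings[rule_name]

print(match_id, recovered_name)
```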


In [24]:
# Loop over the matches and print the details of each one

for match_id, start, end in find_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8752807494209775780 find_hello_world 2 4 Hello World
8752807494209775780 find_hello_world 10 12 HellO WORLD
8752807494209775780 find_hello_world 14 17 hello, world
8752807494209775780 find_hello_world 24 27 Hello-World
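
Because start and end are token indices, they can also be used to look at the tokens around each match. The following sketch assumes a blank English pipeline and the list-of-patterns form of `matcher.add` (spaCy 2.2.2 and later); the `window` size and the example sentence are arbitrary choices for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is enough for this illustration.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("find_hello_world", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Most beginners print Hello World before anything else")

window = 2  # hypothetical context size, in tokens
contexts = []
for match_id, start, end in matcher(doc):
    # Clamp the slice boundaries so they stay inside the document.
    left = doc[max(start - window, 0):start].text
    right = doc[end:min(end + window, len(doc))].text
    contexts.append((left, doc[start:end].text, right))

print(contexts)
```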


## Setting pattern options and quantifiers

In [32]:
# Redefine the patterns:
pattern_3 = [{'LOWER': 'hello'}, {'LOWER': 'world'}]
pattern_4 = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'world'}]
# 'OP':'*' ----> lets the punctuation token match zero or more times

# Add the new set of patterns to the matcher under the 'Hello World' key:
matcher.add('Hello World', None, pattern_3, pattern_4)

In [49]:
doc_2 = nlp("You can print Hello World or hello,& world or Hello-World")

In [29]:
# Remove the earlier 'find_hello_world' rule from the matcher
matcher.remove('find_hello_world')

In [50]:
find_matches = matcher(doc_2)
print(find_matches)

[(8585552006568828647, 3, 5), (8585552006568828647, 6, 10), (8585552006568828647, 11, 14)]


In [51]:
for match_id, start, end in find_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc_2[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8585552006568828647 Hello World 3 5 Hello World
8585552006568828647 Hello World 6 10 hello,& world
8585552006568828647 Hello World 11 14 Hello-World
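
To compare the quantifiers side by side, the sketch below runs the same three-token pattern with '?', '+' and '*' on the punctuation token. It assumes a blank English pipeline and the list-of-patterns form of `matcher.add` (spaCy 2.2.2 and later); `spans_for` and the rule name are hypothetical helpers written for this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is enough for this illustration.
nlp = spacy.blank("en")
doc = nlp("You can print Hello World or hello,& world or Hello-World")


def spans_for(op):
    """Return the matched texts when the punctuation token uses operator op."""
    matcher = Matcher(nlp.vocab)
    pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": op}, {"LOWER": "world"}]
    matcher.add("hello_world_demo", [pattern])
    # Sort by start index so the results are listed in document order.
    matches = sorted(matcher(doc), key=lambda m: m[1])
    return [doc[start:end].text for _, start, end in matches]


optional = spans_for("?")      # zero or one punctuation token
one_or_more = spans_for("+")   # at least one punctuation token
zero_or_more = spans_for("*")  # any number of punctuation tokens

print(optional)
print(one_or_more)
print(zero_or_more)
```

With '?' the two-punctuation variant is skipped, with '+' the plain "Hello World" is skipped, and with '*' all three variants are found.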


# 2. Phrase matching

In the section above we used token patterns to perform rule-based matching. An alternative and more efficient method is to match on terminology lists.

In this case we create a Doc object from each phrase in a list and pass those Doc objects to a PhraseMatcher instead.


In [52]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [53]:
# Import the PhraseMatcher class
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [54]:
phrase_list = ["Barack Obama", "Angela Merkel", "Washington, D.C."]

In [55]:
# Convert each phrase to a Doc object
phrase_patterns = [nlp(text) for text in phrase_list] # using a list comprehension

In [56]:
phrase_patterns
# the patterns are Doc objects, not strings

[Barack Obama, Angela Merkel, Washington, D.C.]

In [57]:
type(phrase_patterns[0])
# they are spaCy Doc objects
# that's why they are displayed without quotes

spacy.tokens.doc.Doc

In [58]:
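# Pass each Doc object into the matcher
matcher.add("TerminologyList", None, *phrase_patterns)
# The asterisk unpacks the phrase_patterns list into separate arguments
# (in spaCy v3, pass the list itself instead: matcher.add("TerminologyList", phrase_patterns))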

In [59]:
doc_3 = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")

In [60]:
find_matches = matcher(doc_3) # pass the doc to the matcher and store the result
print(find_matches)

[(3766102292120407359, 2, 4), (3766102292120407359, 7, 9), (3766102292120407359, 19, 22)]


In [61]:
# Loop over the matches and print the details of each one

for match_id, start, end in find_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc_3[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

3766102292120407359 TerminologyList 2 4 Angela Merkel
3766102292120407359 TerminologyList 7 9 Barack Obama
3766102292120407359 TerminologyList 19 22 Washington, D.C.
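
By default the PhraseMatcher compares the exact token text, so "angela merkel" in lowercase would not be found. One possible variation, sketched here under the assumption of a blank English pipeline and the list form of `matcher.add` (spaCy 2.2.2 and later), is to build the matcher with `attr="LOWER"` and to create the patterns with `nlp.make_doc`, which only runs the tokenizer. The example sentence is made up for this illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a tokenizer-only pipeline is enough for this illustration.
nlp = spacy.blank("en")

# attr="LOWER" makes the matcher compare lowercase token forms.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

phrase_list = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# make_doc only tokenizes, which is all the PhraseMatcher needs here.
phrase_patterns = [nlp.make_doc(text) for text in phrase_list]
matcher.add("TerminologyList", phrase_patterns)

doc = nlp("angela merkel met BARACK OBAMA in Washington, D.C. last week")
# Sort by start index so the results are listed in document order.
matches = sorted(matcher(doc), key=lambda m: m[1])
found = [doc[start:end].text for _, start, end in matches]

print(found)
```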
