# Rule-based Matching

Rule-based matching lets you write token-level rules to find words and phrases in text.

Why would I use this over RegEx?

1. It matches on `Doc` objects, not just raw strings
2. It is more flexible: you can search on the token text or on other lexical attributes
3. It can use model predictions for better rules, e.g. find "duck" as a verb, not as a noun
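
The third point can be illustrated with a small sketch. It assumes spaCy v3 (for the `pos=` argument of `Doc` and the list-of-patterns form of `matcher.add`), and the part-of-speech tags are assigned by hand so that no trained model is needed. In practice they would come from a loaded pipeline.

```python
import re

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer-only pipeline, no trained model required

# Hand-assigned tags (an assumption for illustration); a trained pipeline
# would normally predict these.
words = ["Ducks", "duck", "when", "a", "duck", "flies", "by"]
pos = ["NOUN", "VERB", "SCONJ", "DET", "NOUN", "VERB", "ADV"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# A regular expression only sees characters, so it finds both "duck" tokens.
regex_hits = re.findall(r"\bduck\b", doc.text)

# A token pattern can combine the text with the part-of-speech tag.
matcher = Matcher(nlp.vocab)
matcher.add("DUCK_VERB", [[{"LOWER": "duck", "POS": "VERB"}]])
verb_hits = [(start, doc[start:end].text) for _, start, end in matcher(doc)]

print(len(regex_hits))  # 2
print(verb_hits)        # [(1, 'duck')]
```

The regex cannot tell the noun from the verb, while the token pattern keeps only the occurrence tagged as a verb.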

The token attributes you can use in a pattern are listed in the [spaCy rule-based matching guide](https://spacy.io/usage/rule-based-matching).

| Attribute                             | Description                                                                         |
| ------------------------------------- | ----------------------------------------------------------------------------------- |
| `ORTH`                                | The exact verbatim text of a token.                                                 |
| `TEXT`                                | The exact verbatim text of a token.                                                 |
| `LOWER`                               | The lowercase form of the token text.                                               |
| `LENGTH`                              | The length of the token text.                                                       |
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`    | Token text consists of alphabetic characters, ASCII characters, digits.             |
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE`    | Token text is in lowercase, uppercase, titlecase.                                   |
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP`     | Token is punctuation, whitespace, stop word.                                        |
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`  | Token text resembles a number, URL, email.                                          |
| `POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE` | The token’s simple and extended part-of-speech tag, dependency label, lemma, shape. |
| `ENT_TYPE`                            | The token’s entity label.                                                           |
| `_`                                   | Properties in custom extension attributes.                                          |
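
As a rough sketch of how these attributes combine in a pattern, the example below matches a number-like token followed by the word "dollars" in any casing. It uses a blank English pipeline (tokenizer and lexical attributes only), so it needs no trained model. The pattern name `AMOUNT` and the sentence are made up for illustration, and `matcher.add` is called in the list-of-patterns form, which is assumed to be available (spaCy v2.2.2 or later, including v3).

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # lexical attributes work without a trained model
matcher = Matcher(nlp.vocab)

# LIKE_NUM covers digits ("20") as well as number words ("ten");
# LOWER compares against the lowercased token text.
pattern = [{"LIKE_NUM": True}, {"LOWER": "dollars"}]
matcher.add("AMOUNT", [pattern])

doc = nlp("She paid ten dollars and 20 Dollars.")
amounts = [doc[start:end].text for _, start, end in matcher(doc)]
print(amounts)
```

Attributes such as `POS`, `TAG`, `DEP`, `LEMMA` and `ENT_TYPE` depend on pipeline predictions, so patterns that use them need a trained pipeline such as `en_core_web_sm`.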

## Example patterns

In [None]:
# Exact match
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# Match lowercase form of token text
pattern = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

# Match w/ other token attributes
pattern = [{'LEMMA': 'buy'}, {'POS': 'NOUN'}]

## Set up the Matcher

In [2]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

## Add a pattern to the matcher

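In spaCy v2, which this notebook uses, the call takes the form `matcher.add(unique_ID, optional_callback, pattern)`. spaCy v3 changed the signature to `matcher.add(unique_ID, [pattern], on_match=optional_callback)`, so the v2-style calls shown here need adapting when run on v3.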

In [3]:
# Add the pattern to the matcher
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)
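
The optional callback is easier to understand with an example. The sketch below registers a function that spaCy calls once per match. The name `collect_match` and the `found` list are hypothetical helpers, and the list-of-patterns form of `matcher.add` with the `on_match` keyword is assumed (spaCy v2.2.2 or later, including v3).

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # TEXT patterns only need the tokenizer
matcher = Matcher(nlp.vocab)

found = []  # hypothetical container filled by the callback


def collect_match(matcher, doc, i, matches):
    """Called by the matcher for each match; `i` indexes into `matches`."""
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)


pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern], on_match=collect_match)

matcher(nlp("New iPhone X release date leaked"))
print(found)  # ['iPhone X']
```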

## Use the Matcher

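Calling the matcher on a `Doc` returns a list of `(match_id, start, end)` tuples:

- `match_id`: hash value of the pattern name
- `start`: token index where the matched span starts
- `end`: token index where the matched span ends (exclusive)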

In [4]:
# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
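
Because `match_id` is a hash, it is often useful to turn it back into the pattern name. A minimal sketch, assuming a blank English pipeline and the list-of-patterns form of `matcher.add` (spaCy v2.2.2 or later), looks the hash up in the shared string store:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("New iPhone X release date leaked")

results = []
for match_id, start, end in matcher(doc):
    # The string store maps the hash back to the name passed to matcher.add
    name = nlp.vocab.strings[match_id]
    results.append((name, start, end, doc[start:end].text))

print(results)  # [('IPHONE_PATTERN', 1, 3, 'iPhone X')]
```

The span covers tokens 1 and 2; `end` is 3 because it is exclusive, as in ordinary Python slicing.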


## Operators and Quantifiers

Operators and quantifiers let you define how often a token should be matched. They can be added using the `"OP"` key.

- `{'OP': '!'}` - Negation: match 0 times
- `{'OP': '?'}` - Optional: match 0 or 1 times
- `{'OP': '+'}` - Match 1 or more times
- `{'OP': '*'}` - Match 0 or more times
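
Quantifiers such as `+` and `*` can produce overlapping matches, because the matcher reports a match for every token position where the pattern can start. The sketch below shows this with `+` and then keeps only the longest span using `spacy.util.filter_spans`. The sentence and the pattern name `VERY_GOOD` are made up for illustration, and the list-of-patterns form of `matcher.add` is assumed (spaCy v2.2.2 or later).

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One or more "very" tokens followed by "good"
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]
matcher.add("VERY_GOOD", [pattern])

doc = nlp("The food was very very good")
spans = [doc[start:end] for _, start, end in matcher(doc)]

# Every valid start position yields its own match
all_texts = {span.text for span in spans}
print(all_texts)

# filter_spans drops overlapping spans, preferring the longest one
longest = [span.text for span in filter_spans(spans)]
print(longest)  # ['very very good']
```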

In [54]:
pattern = [
           {'LOWER': 'ejection'},
           {'LOWER': 'fraction'},
           {'POS': 'VERB', 'OP': '?'},
           {'IS_DIGIT': True}
          ]

matcher.add('EF_PATTERN', None, pattern)

# Process some text
doc = nlp("The patient's ejection fraction is 65%")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

fraction
ejection fraction is 65
65


In [52]:
# List comprehension
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['fraction', 'ejection fraction is 65', '65']


## Example - World Cup

In [14]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]

doc = nlp("2018 FIFA World Cup: France won!")
matcher.add('WC_PATTERN', None, pattern)

matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


In [15]:
# More compact approach - list comprehension

print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['2018 FIFA World Cup:']
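
If spaCy v3 is available, the slicing step can be skipped: calling the matcher with `as_spans=True` returns `Span` objects directly, with the pattern name as the span label. This is a sketch under that version assumption. It uses a blank English pipeline, which is enough here because the pattern relies only on lexical attributes.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
matcher.add("WC_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")

# as_spans=True (spaCy v3) returns Span objects instead of tuples
spans = matcher(doc, as_spans=True)
texts = [span.text for span in spans]
labels = [span.label_ for span in spans]

print(texts)   # ['2018 FIFA World Cup:']
print(labels)  # ['WC_PATTERN']
```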


## Example - Operators/Quantifiers

In [17]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")
matcher.add('BUYING', None, pattern)

matches = matcher(doc)

print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['bought a smartphone', 'buying apps']


## Example - "iOS <number>" matching

In [21]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": 'iOS'}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

Total matches found: 3


In [22]:
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [23]:
# List comprehension
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iOS 7', 'iOS 11', 'iOS 10']
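
The pattern finds "iOS 11" even though the text reads "iOS 11's", because patterns are applied to tokens and the tokenizer splits the possessive off. A quick way to check this, sketched here with a blank English pipeline (tokenizer only), is to print each token next to its `is_digit` flag:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Most of iOS 11's furniture remains the same as in iOS 10.")

# Each pattern dict is compared against one token, so it helps to look
# at the tokens and the attribute the pattern relies on.
rows = [(token.text, token.is_digit) for token in doc]
for text, is_digit in rows:
    print(f"{text!r:12} is_digit={is_digit}")
```

Inspecting tokens this way is a useful first step whenever a pattern does not match as expected.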


## Example - "download something"

In [25]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [26]:
# List comprehension
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['downloaded Fortnite', 'downloading Minecraft', 'download Winzip']


## Example - adjective followed by 1 (or 2) nouns

In [27]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": 'ADJ'}, {"POS": 'NOUN'}, {"POS": 'NOUN', "OP": '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 4
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice responses


In [28]:
# List comprehension
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['beautiful design', 'smart search', 'automatic labels', 'optional voice responses']
