## **Combining predictions and rules**

|                         | Statistical models                                | Rule-based systems                               |
|-------------------------|---------------------------------------------------|--------------------------------------------------|
| **Use cases**           | application needs to generalize based on examples | dictionary with finite number of examples        |
| **Real-world examples** | product & person names, subject/object relations  | countries of the world, cities, drug names, etc. |
| **spaCy features**      | entity recognizer, dependency parser, POS tagger  | tokenizer, `Matcher`, `PhraseMatcher`            |

**Recap: rule-based matching**

In [1]:
# Initialize with the shared vocab
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

In [2]:
# Patterns are lists of dictionaries describing the tokens
pattern = [
  { "LEMMA": "love", "POS": "VERB" },
  { "LOWER": "cats" }
]
matcher.add("LOVE_CATS", [pattern])

In [3]:
# Operators can specify how often a token should be matched
pattern = [
  { "TEXT": "very", "OP": "+" },
  { "TEXT": "happy" }
]
matcher.add("VERY_HAPPY", [pattern])

In [4]:
# Calling the matcher on a doc returns a list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

In [6]:
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['love cats', 'very very happy', 'very happy']
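Because each dictionary in a pattern describes exactly one token, an operator such as `"OP": "+"` has to sit in the same dictionary as the attribute it repeats. The sketch below, which assumes spaCy v3 and uses a blank English pipeline so that no trained model is needed, shows how the `+` operator can report several overlapping spans and how `greedy="LONGEST"` narrows them down. It also looks up the pattern name from the `match_id` hash through `nlp.vocab.strings`.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline suffices: TEXT and LOWER patterns only need the tokenizer
nlp = spacy.blank("en")
doc = nlp("I'm very very happy")
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "happy"}]

# Default behaviour: every way of satisfying the "+" operator is reported
all_matcher = Matcher(nlp.vocab)
all_matcher.add("VERY_HAPPY", [pattern])
all_matches = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in all_matcher(doc)
]

# greedy="LONGEST" keeps only the longest of the overlapping candidates
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("VERY_HAPPY", [pattern], greedy="LONGEST")
longest_matches = [doc[start:end].text for _, start, end in longest_matcher(doc)]

print(all_matches)
print(longest_matches)
```

Whether overlapping matches are wanted depends on the task: counting every candidate span calls for the default, while extracting one phrase per mention is usually easier with the greedy filter.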


**Adding statistical predictions**

In [8]:
matcher = Matcher(nlp.vocab)
matcher.add("DOG", [[{ "LOWER": "golden" }, { "LOWER": "retriever" }]])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
  span = doc[start:end]
  print("Matched span:", span.text)

  # Get the span's root token and root head token
  print("Root token:", span.root.text)
  print("Root head token:", span.root.head.text)

  # Get the previous token and its POS tag
  print("Previous token:", doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET
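One detail worth guarding against in the loop above: when a match starts at the first token, `doc[start - 1]` becomes `doc[-1]` and silently returns the last token of the document. The following sketch wraps the lookups in a helper that returns `None` in that case. The helper name `describe_match` is hypothetical, and the part-of-speech tags and dependency heads are written by hand as a stand-in for a trained pipeline's predictions, so the example runs without downloading a model (spaCy v3 assumed).

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer and vocab only, no trained components

# Hand-written annotations standing in for what a trained pipeline would predict
words = ["I", "have", "a", "Golden", "Retriever"]
pos = ["PRON", "VERB", "DET", "PROPN", "PROPN"]
heads = [1, 1, 4, 4, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "compound", "dobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

matcher = Matcher(nlp.vocab)
matcher.add("DOG", [[{"LOWER": "golden"}, {"LOWER": "retriever"}]])


def describe_match(doc, start, end):
    """Collect the annotations around doc[start:end] (hypothetical helper)."""
    span = doc[start:end]
    # doc[-1] would return the last token, so guard against start == 0
    previous = doc[start - 1] if start > 0 else None
    return {
        "span": span.text,
        "root": span.root.text,
        "root_head": span.root.head.text,
        "previous": (previous.text, previous.pos_) if previous is not None else None,
    }


summaries = [describe_match(doc, start, end) for _, start, end in matcher(doc)]
print(summaries)
```

With a real pipeline such as `en_core_web_sm`, the same helper can be called on `nlp(text)` directly, since the model fills in the attributes that are set by hand here.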


### **Efficient phrase matching**

- `PhraseMatcher` works like a regular expression or keyword search, but with access to the tokens
  - Useful for finding sequences of words in your data
- Takes `Doc` objects as patterns
- More efficient and faster than the `Matcher`
- Great for matching large word lists

In [9]:
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_lg")
matcher = PhraseMatcher(nlp.vocab)

In [10]:
pattern = nlp("Golden Retriever")
matcher.add("DOG", [pattern])
doc = nlp("I have a Golden Retriever")

In [11]:
# Iterate over the matches
for match_id, start, end in matcher(doc):
  # Get the matched span
  span = doc[start:end]
  print("Matched span:", span.text)

Matched span: Golden Retriever
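For a larger word list, running the full pipeline on every entry is unnecessary, because the `PhraseMatcher` only needs tokenized `Doc` objects. A minimal sketch, assuming spaCy v3: the patterns are built with `nlp.make_doc`, which runs only the tokenizer, and `attr="LOWER"` makes the comparison case-insensitive. The `COUNTRIES` list and the example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Tokenizer-only pipeline; phrase matching does not need trained components
nlp = spacy.blank("en")

# Hypothetical word list; in practice this might be loaded from a file
COUNTRIES = ["Czech Republic", "Slovakia", "New Zealand"]

# attr="LOWER" compares lowercase token text instead of the exact text
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# make_doc only tokenizes, which is much cheaper than calling nlp() per entry
patterns = [nlp.make_doc(name) for name in COUNTRIES]
matcher.add("COUNTRY", patterns)

doc = nlp("Flights from new zealand to the Czech Republic")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

The matched spans keep the casing of the input text, so lowercase and capitalized mentions are both found and returned as written.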
