#### Rule-based matching

In this lesson, we'll take a look at spaCy's matcher, which lets you write rules to find words and phrases in text.

Why not just regular expressions?

- Match on Doc objects, not just strings
- Match on tokens and token attributes
- Use a model's predictions
- Example: "duck" (verb) vs. "duck" (noun)

Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It's also more flexible: you can search for exact texts as well as other lexical attributes.

You can even write rules that use a model's predictions.

For example, you can find the word "duck" only when it's a verb, not a noun.
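As a rough illustration of the "duck" case, the sketch below compares a regular expression with a matcher pattern. It assumes a blank English pipeline and a hand-annotated `Doc`: the part-of-speech tags are typed in manually to stand in for a model's predictions, and the sentence and the pattern name `DUCK_VERB` are made up for this example.

```python
import re

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Blank pipeline: tokenizer and vocabulary only, no trained model required.
nlp = spacy.blank("en")

# Assumption: these POS tags are written by hand to imitate model output.
words = ["I", "duck", "when", "a", "duck", "flies", "by"]
pos = ["PRON", "VERB", "SCONJ", "DET", "NOUN", "VERB", "ADP"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# A regular expression only sees the string, so it finds both occurrences.
regex_hits = [m.start() for m in re.finditer(r"\bduck\b", doc.text)]

# The matcher can additionally require that the token is tagged as a verb.
matcher = Matcher(nlp.vocab)
matcher.add("DUCK_VERB", [[{"LOWER": "duck", "POS": "VERB"}]])
verb_spans = [(start, end) for _, start, end in matcher(doc)]

print(len(regex_hits), verb_spans)
```

With a trained pipeline such as `en_core_web_sm`, the tags would come from the model instead, so the result depends on how the model tags each sentence.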

#### Match patterns

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

In this example, we're looking for two tokens with the text "iPhone" and "X".

We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".

We can even write patterns using attributes predicted by a model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".


- Lists of dictionaries, one per token
- Match exact token texts: ```[{"TEXT": "iPhone"}, {"TEXT": "X"}]```

- Match lexical attributes: ```[{"LOWER": "iphone"}, {"LOWER": "x"}]```

- Match any token attributes: ```[{"LEMMA": "buy"}, {"POS": "NOUN"}]```
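To make the difference between `TEXT` and `LOWER` concrete, here is a minimal sketch that runs both patterns over three spellings of the same phrase. It assumes a blank English pipeline, which is enough here because neither attribute needs a trained model. The matcher names `EXACT` and `LOWER` and the sample texts are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# One matcher per pattern so the results can be compared side by side.
exact = Matcher(nlp.vocab)
exact.add("EXACT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

lower = Matcher(nlp.vocab)
lower.add("LOWER", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

texts = ["iPhone X", "iphone x", "IPHONE X"]

# Count the matches each pattern finds in each text.
exact_counts = [len(exact(nlp(text))) for text in texts]
lower_counts = [len(lower(nlp(text))) for text in texts]

for text, n_exact, n_lower in zip(texts, exact_counts, lower_counts):
    print(f"{text!r}: TEXT={n_exact}, LOWER={n_lower}")
```

The `TEXT` pattern only accepts the exact capitalization, while the `LOWER` pattern compares lowercased token text and so accepts all three spellings.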

#### Using the Matcher

1. To use a pattern, we first import the matcher from spacy.matcher.

2. We also load a pipeline and create the nlp object.

3. The matcher is initialized with the shared vocabulary, nlp.vocab. You'll learn more about this later – for now, just remember to always pass it in.

4. The matcher.add method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is a list of patterns.

5. To match the pattern on a text, we can call the matcher on any doc.

6. This will return the matches.

In [1]:
import spacy


In [2]:
from spacy.matcher import Matcher

In [3]:
nlp = spacy.load("en_core_web_sm")

In [6]:
matcher = Matcher(nlp.vocab)

In [8]:
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

In [9]:
doc = nlp("Upcoming iPhone X release date leaked")

In [10]:
matches = matcher(doc)

When you call the matcher on a doc, it returns a list of tuples.

Each tuple consists of three values: the match ID, the start token index and the end token index of the matched span. The end index is exclusive, as in a Python slice.

This means we can iterate over the matches and create a Span object: a slice of the doc at the start and end index.

- match_id: hash value of the pattern name
- start: start index of matched span
- end: end index of matched span

In [13]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text) 

iPhone X
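Because `match_id` is a hash, printing it directly shows only a large integer. The sketch below shows one way to recover the pattern name by looking the hash up in the shared string store, `nlp.vocab.strings`. It assumes a blank English pipeline so that it runs without a trained model; the `results` list is just a convenience for this example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("Upcoming iPhone X release date leaked")

results = []
for match_id, start, end in matcher(doc):
    # Look up the hash in the string store to get the pattern name back.
    pattern_name = nlp.vocab.strings[match_id]
    # Slice the doc with the token indices to get the matched Span.
    results.append((pattern_name, start, end, doc[start:end].text))

print(results)
```

This becomes useful once several patterns are registered on the same matcher and you need to know which one produced a given match.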


#### Matching lexical attributes

Here's an example of a more complex pattern using lexical attributes.

We're looking for five tokens:

- A token consisting of only digits.

- Three case-insensitive tokens for "fifa", "world" and "cup".

- And a token that consists of punctuation.

The pattern matches the tokens "2018 FIFA World Cup:".

In [14]:
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]

In [15]:
matcher.add("FIFA_WORLD_CUP", [pattern])

In [16]:
doc = nlp("2018 FIFA World Cup: France won!")

In [17]:
matches = matcher(doc)

In [18]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:
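To see why each token satisfies its dictionary, it can help to print the lexical attributes the pattern refers to. The following is a small sketch using a blank English pipeline, which is an assumption made so that it runs without a trained model. The `rows` list is only a helper for this illustration.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("2018 FIFA World Cup: France won!")

# Collect the attributes behind IS_DIGIT, LOWER and IS_PUNCT for each token.
rows = [(token.text, token.is_digit, token.lower_, token.is_punct) for token in doc]

for text, is_digit, lower, is_punct in rows:
    print(f"{text:<8} is_digit={is_digit!s:<6} lower={lower:<8} is_punct={is_punct}")
```

Pattern keys such as `IS_DIGIT` and `LOWER` correspond to the token attributes `is_digit` and `lower_`, so an inspection like this is a quick way to debug a pattern that does not match as expected.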


#### Matching other token attributes

In this example, we're looking for two tokens:

- A verb with the lemma "love".

- A noun that follows it.

This pattern will match "loved dogs" and "love cats".

In [19]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

In [20]:
matcher.add("LOVE_ANYBODY", [pattern])

In [21]:
doc = nlp("I loved dogs but now I love cats more.")

In [22]:
matches = matcher(doc)

In [23]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats
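When a dictionary has several keys, all of them must hold for the same token. The sketch below illustrates this with two hand-annotated docs. It assumes a blank English pipeline, and the lemmas and part-of-speech tags are written by hand in place of a model's predictions. The example sentences are invented for this purpose.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("LOVE_ANYBODY", [[{"LEMMA": "love", "POS": "VERB"}, {"POS": "NOUN"}]])

# Assumption: lemmas and POS tags are supplied manually, not predicted.
verb_doc = Doc(
    nlp.vocab,
    words=["She", "loved", "dogs"],
    lemmas=["she", "love", "dog"],
    pos=["PRON", "VERB", "NOUN"],
)
noun_doc = Doc(
    nlp.vocab,
    words=["A", "love", "story"],
    lemmas=["a", "love", "story"],
    pos=["DET", "NOUN", "NOUN"],
)

# "love" as a verb followed by a noun matches the pattern.
verb_matches = [verb_doc[start:end].text for _, start, end in matcher(verb_doc)]
# "love" as a noun has the right lemma but the wrong POS, so nothing matches.
noun_matches = [noun_doc[start:end].text for _, start, end in matcher(noun_doc)]

print(verb_matches, noun_matches)
```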
