# Pattern matching in spaCy

A look at spaCy's token matcher and phrase matcher, plus some tricks for speeding up matching.

This notebook is based on https://spacy.io/usage/rule-based-matching, which also goes into the dependency matcher and the entity ruler.

In [1]:
import spacy
from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher

In [2]:
nlp = spacy.load("en_core_web_sm")

## Matching on tokens

Here we define patterns using one dictionary per token. The following pattern matches 'iPhone X':

```python
[{"TEXT": "iPhone"}, {"TEXT": "X"}]
```

Instead of the text, you can match on many other token attributes. The following matches "2018 FIFA World Cup":

```python
[{"IS_DIGIT": True}, {"LOWER": "fifa"}, {"LOWER": "world"}, {"LOWER": "cup"}]
```

You can access parts of speech and lemmas:

```python
[{"LEMMA": "love", "POS": "VERB"}, {"POS": "NOUN"}]
```
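
Patterns on `LEMMA` and `POS` need a `Doc` that carries those annotations. As a minimal sketch, assuming spaCy v3 (where `Doc` accepts `pos` and `lemmas` keyword arguments), the annotations can be set by hand so that the pattern can be tried without running a tagger. The names `demo_vocab`, `demo_doc`, `pos_matcher` and `pos_hits` are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank English pipeline is only used here to obtain a vocab.
demo_vocab = spacy.blank("en").vocab

# Hand-annotated Doc: the POS tags and lemmas are supplied manually,
# they are not predicted by a model.
demo_doc = Doc(
    demo_vocab,
    words=["I", "loved", "dogs"],
    pos=["PRON", "VERB", "NOUN"],
    lemmas=["I", "love", "dog"],
)

pos_matcher = Matcher(demo_vocab)
pos_matcher.add("LOVE_THING", [[{"LEMMA": "love", "POS": "VERB"}, {"POS": "NOUN"}]])

# Collect the matched spans as plain strings.
pos_hits = [demo_doc[start:end].text for _, start, end in pos_matcher(demo_doc)]
print(pos_hits)
```

In the rest of this notebook the annotations come from `en_core_web_sm` instead, so the results depend on what the model predicts.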

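You can also use Kleene-style operators through the `OP` key. Possible values are "!", "?", "*" and "+": "!" negates the token pattern (it must match zero times), "?" makes it optional, "*" allows zero or more matches and "+" requires one or more:

```python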
[{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
```
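
To see the effect of `OP` in isolation, here is a minimal sketch that assumes a blank English pipeline (`spacy.blank("en")`), so only lexical attributes such as `LOWER` are available. The names `nlp_blank`, `op_matcher` and `matched_texts` are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline, so no POS tags or lemmas are available.
nlp_blank = spacy.blank("en")

op_matcher = Matcher(nlp_blank.vocab)
op_matcher.add(
    "BUY_PHONE",
    [[
        {"LOWER": "buy"},
        {"LOWER": {"IN": ["a", "the"]}, "OP": "?"},  # optional determiner
        {"LOWER": "phone"},
    ]],
)

def matched_texts(text):
    """Return the text of every span matched in `text`."""
    doc = nlp_blank(text)
    return [doc[start:end].text for _, start, end in op_matcher(doc)]

print(matched_texts("Buy a phone"))     # the determiner is present
print(matched_texts("buy phone"))       # the determiner is skipped
print(matched_texts("buy two phones"))  # no match: "two" is not allowed here
```

The first two calls each return a single span, the third returns an empty list because neither "two" nor "phones" fits the pattern.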

In [3]:
class SimpleMatcher:
    """A simple matcher built from a list of (name, pattern) pairs."""

    def __init__(self, patterns):
        self.matcher = Matcher(nlp.vocab)
        for pattern_name, pattern in patterns:
            self.matcher.add(pattern_name, [pattern])

    def run(self, sentence):
        doc = nlp(sentence)
        spacy.displacy.render(doc, options={"word_spacing": 30, "distance": 100})
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            pattern_name = self.matcher.vocab.strings[match_id]
            print(f"{pattern_name:12} {start:2} {end:2}  [{doc[start:end]}]")
            
patterns = [

    ("ROOT", [{"DEP": "ROOT"}]),
    ("IPHONE", [{"TEXT": "iPhone"}, 
                {"TEXT": "X"}]),
    ("WORLD_CUP", [{"IS_DIGIT": True},
                   {"LOWER": "fifa"},
                   {"LOWER": "world"},
                   {"LOWER": "cup"},
                   {"IS_PUNCT": True}]),
    ("LOVE_THING", [{"LEMMA": "love", "POS": "VERB"},
                    {"POS": "NOUN"}]),
    ("BUY_STUFF", [{"LEMMA": "buy"},
                   {"POS": "DET", "OP": "?"},
                   {"POS": "NOUN"}])
]

matcher = SimpleMatcher(patterns)

In [4]:
matcher.run("Upcoming iPhone X release date leaked")

IPHONE        1  3  [iPhone X]
ROOT          4  5  [date]


In [5]:
matcher.run("2018 FIFA World Cup: France won")

ROOT          3  4  [Cup]
WORLD_CUP     0  5  [2018 FIFA World Cup:]
ROOT          6  7  [won]


In [6]:
matcher.run("I loved dogs but now I love cats more.")

ROOT          1  2  [loved]
LOVE_THING    1  3  [loved dogs]
LOVE_THING    6  8  [love cats]


In [7]:
matcher.run("I bought a smartphone. Now I am buying apps.")

ROOT          1  2  [bought]
BUY_STUFF     1  4  [bought a smartphone]
ROOT          8  9  [buying]
BUY_STUFF     8 10  [buying apps]


## Matching on phrases

https://spacy.io/usage/rule-based-matching#phrasematcher

If you need to match large terminology lists, you can also use the PhraseMatcher and create Doc objects instead of token patterns, which is much more efficient overall. The Doc patterns can contain single or multiple tokens.

In [8]:
# With attr="LOWER" the match is case-insensitive; without it we match on the exact token text.
# Note that the "D.C." part of the term cannot be lowercased because that changes the
# tokenization, so attr="LOWER" is of limited value here.
pmatcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["barack obama", "angela merkel", "washington, D.C."]

# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
pmatcher.add("TerminologyList", patterns)

# Same here: use make_doc if you do not need anything beyond tokenization
doc = nlp.make_doc(
    "German Chancellor Angela Merkel and US President Barack Obama "
    "converse in the Oval Office inside the White House in Washington, D.C.")

matches = pmatcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(start, end, span.text)

2 4 Angela Merkel
7 9 Barack Obama
19 22 Washington, D.C.
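
The difference between the default attribute and `attr="LOWER"` can be checked side by side. This is a minimal sketch that assumes a blank English pipeline, which is enough because the `PhraseMatcher` only needs tokenization here. The names `nlp_blank`, `sample_doc` and `phrase_hits` are made up for this illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline, no trained model required.
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank.make_doc("ANGELA MERKEL met Barack Obama")

def phrase_hits(attr):
    """Match the same two names using the given token attribute."""
    pm = PhraseMatcher(nlp_blank.vocab, attr=attr)
    pm.add("NAMES", [nlp_blank.make_doc(t) for t in ["Angela Merkel", "Barack Obama"]])
    return sorted(sample_doc[start:end].text for _, start, end in pm(sample_doc))

print(phrase_hits("ORTH"))   # exact text: the upper-cased name is missed
print(phrase_hits("LOWER"))  # lowercase comparison: both names are found
```

`"ORTH"` is the default value of `attr`, so the first call behaves like a `PhraseMatcher` created without the argument.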


### Some speed benchmarks

Using `make_doc()` gives a speed boost. For an additional boost, you can use the `nlp.tokenizer.pipe()` method, which processes the texts as a stream. That boost is not as spectacular as the one from bypassing the entire pipeline. In one run of the code below, the elapsed (wall) time went from 1.24 s with the full pipeline to 3.58 ms with `make_doc()` and 2.70 ms with `nlp.tokenizer.pipe()`; the actual values vary from run to run.

In [9]:
LOTS_OF_TERMS = ["Barack Obama", "Angela Merkel", "Washington, D.C."] * 100

In [10]:
%%time
patterns = [nlp(term) for term in LOTS_OF_TERMS]

CPU times: user 1.26 s, sys: 13.3 ms, total: 1.27 s
Wall time: 1.3 s


In [11]:
%%time
patterns = [nlp.make_doc(term) for term in LOTS_OF_TERMS]

CPU times: user 3.89 ms, sys: 808 µs, total: 4.7 ms
Wall time: 4.79 ms


In [12]:
%%time
patterns = list(nlp.tokenizer.pipe(LOTS_OF_TERMS))

CPU times: user 2.85 ms, sys: 63 µs, total: 2.91 ms
Wall time: 2.94 ms
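
The `%%time` magic only works inside IPython or Jupyter. Outside a notebook, a comparison along the same lines can be sketched with `time.perf_counter()`. The sketch below assumes a blank English pipeline, so it only compares the two tokenizer-only variants and not the full `nlp(term)` call. The helper `timed` and the other names are made up for this illustration.

```python
import time

import spacy

# Assumption: tokenizer-only pipeline; the full-pipeline case needs a trained model.
nlp_blank = spacy.blank("en")
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."] * 100

def timed(label, build):
    """Run `build`, print the elapsed wall time, and return the Docs it built."""
    start = time.perf_counter()
    docs = build()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label:16} {elapsed_ms:8.2f} ms for {len(docs)} docs")
    return docs

docs_make = timed("make_doc", lambda: [nlp_blank.make_doc(t) for t in terms])
docs_pipe = timed("tokenizer.pipe", lambda: list(nlp_blank.tokenizer.pipe(terms)))

# Both variants should produce the same tokenization.
same_tokens = all(
    [tok.text for tok in a] == [tok.text for tok in b]
    for a, b in zip(docs_make, docs_pipe)
)
print(same_tokens)
```

A single timed run is noisy, so for anything beyond a rough impression it is worth repeating the measurement several times.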
