# spaCy Matcher

* [Rule-based matching](https://spacy.io/usage/rule-based-matching)
* [Matcher](https://spacy.io/api/matcher)
* [Rule-based Matcher Explorer](https://demos.explosion.ai/matcher)

In [16]:
%%html
<style>
table {float:left}
</style>

In [31]:
from typing import (
    List, 
    Dict,
    Tuple
)
import json

import spacy
from spacy.matcher import Matcher

# Language Model

In [2]:
nlp = spacy.load("en_core_web_lg")
vocabulary: spacy.vocab.Vocab = nlp.vocab

# Matcher

The matcher must share the same ```vocab``` as the documents it operates on, so create it with the vocabulary of the language model.

In [None]:
matcher = Matcher(vocabulary)
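One way to confirm the shared vocabulary is to check that a parsed document and the pipeline refer to the same `Vocab` object. The sketch below assumes a blank English pipeline (`spacy.blank("en")`) in place of `en_core_web_lg` so that it runs without downloading a model; the variable names are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) stands in for en_core_web_lg.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

doc = nlp("A match starts a fire.")

# Documents created by the pipeline carry the pipeline's vocabulary,
# so a matcher built from nlp.vocab can operate on them.
shares_vocab = doc.vocab is nlp.vocab
print(shares_vocab)
```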

## Pattern (Token Sequence)

A pattern describes a sequence of one or more tokens whose [attributes](https://spacy.io/api/attributes) match the rules defined.

* each **pattern** is a list of rules; the rules must match consecutive tokens in order (```AND``` logic).
* each **rule** is a dictionary that describes one token with one or more expressions ```{expression+}```; all expressions in a rule must hold for that token.
* each **expression** is a ```token-attribute: value``` pair. A rule may also carry an ```"OP": operator``` entry, where ```operator``` is a quantifier similar to a regular expression repetition operator.

```
pattern=[
    {expression},
    {expression},
    ...
]
```
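As a concrete instance of this structure, the sketch below builds a two-rule pattern in which the first rule carries two expressions. It assumes a blank English pipeline so that only lexical attributes (`LOWER`, `IS_TITLE`, `IS_DIGIT`) are needed; the key name and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline, no model download
matcher = Matcher(nlp.vocab)

pattern = [
    # Rule 1: both expressions must hold for the same token.
    {"LOWER": "version", "IS_TITLE": True},
    # Rule 2: the next token must consist of digits.
    {"IS_DIGIT": True},
]
matcher.add("titlecase_version_number", [pattern])  # hypothetical key

doc = nlp("Version 3 replaces version 2.")
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

Only `Version 3` is reported here: `version 2` satisfies `LOWER` but not `IS_TITLE`, so the first rule rejects it.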

### Operator

To match a token exactly once, omit the ```OP``` key.


| OP     | Description                                                         |
|:-------|:--------------------------------------------------------------------|
| !      | Negate the pattern, by requiring it to match exactly 0 times.       |
| ?      | Make the pattern optional, by allowing it to match 0 or 1 times.    |
| +      | Require the pattern to match 1 or more times.                       |
| *      | Allow the pattern to match zero or more times.                      |
| {n}    | Require the pattern to match exactly n times.                       |
| {n,m}  | Require the pattern to match at least n but not more than m times.  |
| {n,}   | Require the pattern to match at least n times.                      |
| {,m}   | Require the pattern to match at most m times.                       |

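The effect of two of these operators can be sketched as follows. The example assumes a blank English pipeline, and the keys `very_good` and `hello_world` are hypothetical names chosen for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

# "+": one or more "very" tokens followed by "good".
matcher.add("very_good", [[{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]])

# "?": the punctuation token between "hello" and "world" is optional.
matcher.add(
    "hello_world",
    [[{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]],
)

doc = nlp("This is very very good. Say hello world or hello, world.")

# Resolve each match_id to its key and collect the matched text.
found = sorted(
    {
        (nlp.vocab.strings[match_id], doc[start:end].text)
        for match_id, start, end in matcher(doc)
    }
)
for name, text in found:
    print(name, "->", text)
```

With `+`, the matcher reports every possible match, so both `very good` and `very very good` appear. Passing `greedy="LONGEST"` to `matcher.add` is one way to keep only the longest of the overlapping matches.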
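### Single Token Match
Example: find a single token whose ```POS``` is **NOUN**, whose ```LEMMA``` is **match**, and whose ```LOWER``` (lowercased text) is **matches**. The rule also sets ```OP``` to ```?```, which makes the token optional.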

In [56]:
rule = {
    'POS': 'NOUN', 
    'LEMMA': 'match',
    'LOWER': 'matches',
    'OP': '?',
}

pattern: List[Dict[str, str]] = [
    rule    
]
print(json.dumps(pattern, indent=4))

[
    {
        "POS": "NOUN",
        "LEMMA": "match",
        "LOWER": "matches",
        "OP": "?"
    }
]


In [21]:
matcher.add(
    "find_noun_matches", 
    [
        pattern
    ]
)

Find the matches.

In [53]:
# Raw string so that the backslash in \w is kept literally.
text: str = r"""
A match starts a fire. Modern matches are small wooden sticks.
Regex \w+es matches plurals.
Little Girl Selling Matches is about a girl selling matches dying.
"""
doc = nlp(text)
matches = matcher(doc)
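Each entry in `matches` is a `(match_id, start, end)` tuple, where `match_id` is the hash of the key passed to `matcher.add` and `end` is exclusive. The sketch below shows two ways to get back to readable values; it assumes a blank English pipeline and a simplified, hypothetical pattern that only checks `LOWER`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)
matcher.add("find_matches_token", [[{"LOWER": "matches"}]])  # hypothetical key

doc = nlp("Modern matches are small wooden sticks.")

# Look up the string key for each match_id in the vocabulary's string store.
for match_id, start, end in matcher(doc):
    key = nlp.vocab.strings[match_id]
    print(key, start, end, doc[start:end].text)

# as_spans=True returns Span objects whose label_ is the key.
spans = matcher(doc, as_spans=True)
labels_and_texts = [(span.label_, span.text) for span in spans]
print(labels_and_texts)
```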

In [55]:
# Collect the (start, end) token offsets of each match; end is exclusive.
matched_token_span_locations: List[Tuple[int, int]] = [
    (start, end) for match_id, start, end in matches
]
for start, end in matched_token_span_locations:
    print(f"start: {start} end: {end} match: {doc[start:end]}")

# Indices of all tokens covered by at least one match.
matched_token_indices = {
    i for start, end in matched_token_span_locations for i in range(start, end)
}
for token in doc:
    if token.i in matched_token_indices:
        print(f"token: {token.text:12} <----- found")
    elif token.is_space:
        print(f"token: {repr(token.text)}")
    else:
        print(f"token: {token.text:12}")

start: 8 end: 9 match: matches
start: 30 end: 31 match: matches
token: '\n'
token: A           
token: match       
token: starts      
token: a           
token: fire        
token: .           
token: Modern      
token: matches      <----- found
token: are         
token: small       
token: wooden      
token: sticks      
token: .           
token: '\n'
token: Regex       
token: \w+es       
token: matches     
token: plurals     
token: .           
token: '\n'
token: Little      
token: Girl        
token: Selling     
token: Matches     
token: is          
token: about       
token: a           
token: girl        
token: selling     
token: matches      <----- found
token: dying       
token: .           
token: '\n'


### Multi Token Match

Listing multiple rules in a pattern matches a specific sequence of consecutive tokens: every rule must match its token, in order (```AND``` logic). Example: find the token sequence ```hello, world``` or ```hello! world```.

In [59]:
pattern = [
    {"LOWER": "hello"},    # AND
    {"IS_PUNCT": True},    # AND
    {"LOWER": "world"}
]
matcher.add(
    "find_hello_punctuation_world", 
    [
        pattern
    ]
)
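Because the same `matcher` object is reused, it now holds both the single-token key and the multi-token key, and both are applied on the next call. The sketch below shows one way to inspect and prune the registered keys; it assumes a blank English pipeline, and the first pattern is a simplified stand-in for the `POS`/`LEMMA` rule used earlier.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)
matcher.add("find_matches_token", [[{"LOWER": "matches"}]])  # simplified stand-in
matcher.add(
    "find_hello_punctuation_world",
    [[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]],
)

keys_before = len(matcher)  # number of registered keys
has_hello = "find_hello_punctuation_world" in matcher

# Remove the earlier key so that only the multi-token pattern is applied.
matcher.remove("find_matches_token")
keys_after = len(matcher)

doc = nlp("Start learning with hello! world matches.")
texts = [span.text for span in matcher(doc, as_spans=True)]
print(keys_before, has_hello, keys_after, texts)
```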

In [62]:
text: str = "Start learning with hello! world is from The C Programming Language."
doc = nlp(text)
matches = matcher(doc)

In [63]:
# Collect the (start, end) token offsets of each match; end is exclusive.
matched_token_span_locations: List[Tuple[int, int]] = [
    (start, end) for match_id, start, end in matches
]

# Indices of all tokens covered by at least one match.
matched_token_indices = {
    i for start, end in matched_token_span_locations for i in range(start, end)
}
for token in doc:
    if token.i in matched_token_indices:
        print(f"token: {token.text:12} <----- found")
    elif token.is_space:
        print(f"token: {repr(token.text)}")
    else:
        print(f"token: {token.text:12}")

token: Start       
token: learning    
token: with        
token: hello        <----- found
token: !            <----- found
token: world        <----- found
token: is          
token: from        
token: The         
token: C           
token: Programming 
token: Language    
token: .           
