# spaCy Matcher

* [Rule-based matching](https://spacy.io/usage/rule-based-matching)
* [Matcher](https://spacy.io/api/matcher)
* [Rule-based Matcher Explorer](https://demos.explosion.ai/matcher)

In [1]:
%%html
<style>
table {float:left}
</style>

In [2]:
from typing import (
    List, 
    Dict,
    Tuple
)
import json

import pandas as pd
import spacy
from spacy.matcher import Matcher
from spacy.symbols import (
    nsubj, nsubjpass, dobj, iobj, pobj
)

# Language Model

In [3]:
# spacy.cli.download("en_core_web_lg")

nlp = spacy.load("en_core_web_lg")
vocabulary: spacy.vocab.Vocab = nlp.vocab

# Matcher

The matcher must always share the same ```vocab``` as the documents it operates on, so pass it the vocabulary of the language model.

In [4]:
matcher = Matcher(vocabulary)

## Pattern (Token Sequence)

A pattern finds a sequence of token**s** whose [attributes](https://spacy.io/api/attributes) match the rules defined.

* each **pattern** is a list of rules that must match consecutive tokens in order (```AND``` logic across the sequence).
* each **rule** is a dictionary describing one token. It lists one or more expressions, all of which must hold for that token.
* each **expression** is a ```token-attribute: value``` pair. The optional ```"OP": operator``` entry applies a regular-expression-style repetition operator to the whole rule.

```
pattern = [
    {expression, ...},   # rule for the first token
    {expression, ...},   # rule for the next token
    ...
]
```
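
As a minimal sketch of this structure: two expressions inside one dictionary must both hold for the same token, and the dictionaries must match consecutive tokens. It assumes a tokenizer-only pipeline from ```spacy.blank("en")``` is sufficient because only lexical attributes are used. The names ```demo_nlp```, ```demo_matcher```, and ```title_hello_world``` are illustrative and not part of the notebook.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: enough for lexical attributes such as LOWER and IS_TITLE.
demo_nlp = spacy.blank("en")
demo_matcher = Matcher(demo_nlp.vocab)

demo_pattern = [
    {"LOWER": "hello", "IS_TITLE": True},  # one token: lowercase form "hello" AND title-cased
    {"LOWER": "world"},                    # the following token: lowercase form "world"
]
demo_matcher.add("title_hello_world", [demo_pattern])

demo_doc = demo_nlp("Hello world and HELLO World")
demo_texts = [span.text for span in demo_matcher(demo_doc, as_spans=True)]
print(demo_texts)
```

Only ```Hello world``` should be reported here, because ```HELLO``` fails the ```IS_TITLE``` expression even though its lowercase form is ```hello```.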

### Operator

To match exactly once, omit the OP.


| OP     | Description                                                         |
|:-------|:--------------------------------------------------------------------|
| !      | Negate the pattern, by requiring it to match exactly 0 times.       |
| ?      | Make the pattern optional, by allowing it to match 0 or 1 times.    |
| +      | Require the pattern to match 1 or more times.                       |
| *      | Allow the pattern to match zero or more times.                      |
| {n}    | Require the pattern to match exactly n times.                       |
| {n,m}  | Require the pattern to match at least n but not more than m times.  |
| {n,}   | Require the pattern to match at least n times.                      |
| {,m}   | Require the pattern to match at most m times.                       |
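
A small sketch of the ```+``` and ```?``` operators, assuming a blank English pipeline and illustrative rule names (```very_plus_good```, ```is_optional_not_very```). Note that the matcher reports every span that satisfies a pattern, so a repetition operator can yield several overlapping matches.

```python
import spacy
from spacy.matcher import Matcher

op_nlp = spacy.blank("en")  # tokenizer only; the patterns use lexical attributes
op_matcher = Matcher(op_nlp.vocab)

# "+": one or more "very", followed by "good"
op_matcher.add("very_plus_good", [[{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]])
# "?": "is", an optional "not", then "very"
op_matcher.add(
    "is_optional_not_very",
    [[{"LOWER": "is"}, {"LOWER": "not", "OP": "?"}, {"LOWER": "very"}]],
)

op_doc = op_nlp("This is very very good.")

# Group the matched texts by rule name (the span label).
found = {}
for span in op_matcher(op_doc, as_spans=True):
    found.setdefault(span.label_, []).append(span.text)

for rule_name, texts in found.items():
    print(rule_name, texts)
```

The ```+``` rule should report both ```very very good``` and ```very good```, which is why overlapping matches usually need filtering afterwards.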

### Single Token Match 
Example: find a single token whose ```POS``` is **NOUN**, whose ```LEMMA``` is **match**, and whose lowercase form (```LOWER```) is **matches**.

In [5]:
# Raw string, so the backslash in \w+es is kept literally without an invalid-escape warning.
text: str = r"""
A match starts a fire. Modern matches are small wooden sticks.
Regex \w+es matches plurals.
Little Girl Selling Matches is about a girl selling matches dying.
"""
doc = nlp(text)

In [6]:
rule = {
    'POS': 'NOUN', 
    'LEMMA': 'match',
    'LOWER': 'matches',
    'OP': '?',
}

pattern: List[Dict[str, str]] = [
    rule    
]
print(json.dumps(pattern, indent=4))

[
    {
        "POS": "NOUN",
        "LEMMA": "match",
        "LOWER": "matches",
        "OP": "?"
    }
]


In [7]:
matcher.add(
    "find_noun_matches", 
    [
        pattern
    ]
)

#### as_spans

Run the matcher. With ```as_spans=True``` it returns a list of ```Span``` objects; by default it returns ```(match_id, start, end)``` tuples. Prefer spans so that tools such as [util.filter_spans](https://spacy.io/api/top-level#util.filter_spans) can be applied later to remove overlapping duplicates.

* [Matcher.__call__(doclike, as_spans)](https://spacy.io/api/matcher#call)

> ```as_spans```:  Instead of tuples, return a list of Span objects of the matches, with the match_id assigned as the span label. Defaults to False.


The ```Span``` class exposes character offsets through its ```start_char``` and ```end_char``` attributes. With tuple results, slice the document by token index to get the span first.

* [Spacy2 Matcher receiving position of match entity from text](https://github.com/explosion/spaCy/issues/2544)

```
for match_id, start, end in matches:
    span = doc[start: end]
    print(span.text, span.start_char, span.end_char)
```
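
The two return styles can be compared side by side. This is a sketch under the assumption that a blank English pipeline is enough for a ```LOWER``` pattern; ```span_nlp```, ```span_matcher```, and the rule name ```find_matches``` are illustrative names.

```python
import spacy
from spacy.matcher import Matcher

span_nlp = spacy.blank("en")
span_matcher = Matcher(span_nlp.vocab)
span_matcher.add("find_matches", [[{"LOWER": "matches"}]])
span_doc = span_nlp("Modern matches are small wooden sticks.")

# Default call: (match_id, start, end) tuples. match_id is the hash of the
# rule name, so look the string up in the vocabulary's StringStore.
tuple_results = [
    (span_nlp.vocab.strings[match_id], start, end)
    for match_id, start, end in span_matcher(span_doc)
]

# as_spans=True: Span objects labelled with the rule name, carrying both
# token indices and character offsets.
span_results = [
    (span.label_, span.start, span.end, span.start_char, span.end_char)
    for span in span_matcher(span_doc, as_spans=True)
]

print(tuple_results)
print(span_results)
```

The character offsets index into the original text, so ```span_doc.text[start_char:end_char]``` recovers the matched string.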

In [8]:
matches = matcher(doclike=doc, as_spans=True)

In [9]:
matched_token_span_locations: List[Tuple[int, int, int, int]] = [
    (span.start, span.end, span.start_char, span.end_char) for span in matches
]

for start, end, start_char, end_char in matched_token_span_locations:
    print(f"start token: {start}:{start_char} end token: {end}:{end_char} match: {doc[start:end]}")

# Token indices covered by any matched span (also safe when nothing matched).
matched_token_indices = {
    i for start, end, _, _ in matched_token_span_locations for i in range(start, end)
}
for token in doc:
    if token.i in matched_token_indices:
        print(f"token: {token.text:12} <----- found")
    elif token.is_space:
        print(f"token: {repr(token.text)}")
    else:
        print(f"token: {token.text:12}")

start token: 8:31 end token: 9:38 match: matches
start token: 30:145 end token: 31:152 match: matches
token: '\n'
token: A           
token: match       
token: starts      
token: a           
token: fire        
token: .           
token: Modern      
token: matches      <----- found
token: are         
token: small       
token: wooden      
token: sticks      
token: .           
token: '\n'
token: Regex       
token: \w+es       
token: matches     
token: plurals     
token: .           
token: '\n'
token: Little      
token: Girl        
token: Selling     
token: Matches     
token: is          
token: about       
token: a           
token: girl        
token: selling     
token: matches      <----- found
token: dying       
token: .           
token: '\n'


### Multi Token Match

Listing multiple rules defines a pattern in which every rule must match, in order, on consecutive tokens (```AND``` across the sequence). Example: find the token sequence ```hello, world``` or ```hello! world```.

In [10]:
text: str = "Start learning with hello! world is from The C Programming Language."
doc = nlp(text)

In [11]:
matcher = Matcher(nlp.vocab)
pattern = [
    {"LOWER": "hello"},    # AND
    {"IS_PUNCT": True},    # AND
    {"LOWER": "world"}
]
matcher.add(
    "find_hello_punctuation_world", 
    [
        pattern
    ]
)

In [12]:
matches = matcher(doclike=doc, as_spans=True)

In [13]:
matched_token_span_locations: List[Tuple[int, int]] = [
    (span.start, span.end) for span in matches
]

# Token indices covered by any matched span (also safe when nothing matched).
matched_token_indices = {
    i for start, end in matched_token_span_locations for i in range(start, end)
}
for token in doc:
    if token.i in matched_token_indices:
        print(f"token: {token.text:12} <----- found")
    elif token.is_space:
        print(f"token: {repr(token.text)}")
    else:
        print(f"token: {token.text:12}")

token: Start       
token: learning    
token: with        
token: hello        <----- found
token: !            <----- found
token: world        <----- found
token: is          
token: from        
token: The         
token: C           
token: Programming 
token: Language    
token: .           
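
A possible variation, sketched with a blank English pipeline and illustrative names (```punct_nlp```, ```hello_optional_punct_world```): marking the punctuation rule with ```"OP": "?"``` lets the same pattern also accept ```hello world``` with no punctuation in between.

```python
import spacy
from spacy.matcher import Matcher

punct_nlp = spacy.blank("en")
punct_matcher = Matcher(punct_nlp.vocab)
punct_matcher.add(
    "hello_optional_punct_world",
    [[
        {"LOWER": "hello"},
        {"IS_PUNCT": True, "OP": "?"},  # zero or one punctuation token
        {"LOWER": "world"},
    ]],
)

punct_doc = punct_nlp("hello world. Hello, world. HELLO! world")
# Sort by start token so the printed order follows the text.
punct_spans = sorted(punct_matcher(punct_doc, as_spans=True), key=lambda s: s.start)
punct_texts = [span.text for span in punct_spans]
print(punct_texts)
```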


---
# Handle duplicates

A pattern can produce multiple overlapping matches (including the same word at the same position). To keep only the longest non-overlapping spans, use [util.filter_spans](https://spacy.io/api/top-level#util.filter_spans).
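
The idea behind keeping the longest span can be sketched in plain Python on ```(start, end)``` token offsets. This is an illustration of the greedy longest-first strategy, not spaCy's implementation; ```keep_longest_spans``` is a hypothetical helper and the offsets are sample values.

```python
from typing import List, Set, Tuple


def keep_longest_spans(spans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Prefer longer spans (earlier start on ties) and drop any span that
    overlaps one already kept. Offsets are (start, end) with end exclusive."""
    kept: List[Tuple[int, int]] = []
    taken: Set[int] = set()
    # Longest first; ties are broken by the earlier start.
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)  # back to document order


sample_spans = [(3, 5), (4, 5), (6, 9), (7, 9), (8, 9)]
longest = keep_longest_spans(sample_spans)
print(longest)
```

For the sample offsets, the nested spans ```(4, 5)```, ```(7, 9)```, and ```(8, 9)``` are dropped because they overlap a longer span that was kept first.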

In [14]:
text: str = "He has his multiple guitars and beautiful old pianos."
doc = nlp(text)

The noun phrase pattern ```(ADJ*, NOUN)``` matches multiple overlapping spans.

In [15]:
matcher = Matcher(nlp.vocab)
pattern = [
    {"POS": "ADJ", "OP": '*'},{"POS": "NOUN", "OP": "{1}"}
]
matcher.add(
    "find_noun_phrases",
    [
        pattern
    ]
)

In [16]:
matches = matcher(doclike=doc, as_spans=True)
for match in matches:
    print(f"{match.text:30} start:[{match.start:<4}] end:[{match.end:<4}]")

multiple guitars               start:[3   ] end:[5   ]
guitars                        start:[4   ] end:[5   ]
beautiful old pianos           start:[6   ] end:[9   ]
old pianos                     start:[7   ] end:[9   ]
pianos                         start:[8   ] end:[9   ]


### Limit to the longest span

In [17]:
for span in spacy.util.filter_spans(matches):
    print(f"{span.start:<4}{span.end:<4}{span.text}")

3   5   multiple guitars
6   9   beautiful old pianos
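
An alternative worth knowing is the ```greedy``` argument of ```Matcher.add```, which filters overlapping matches for that rule inside the matcher itself. The sketch below assumes spaCy v3 and a blank English pipeline; the names ```greedy_nlp```, ```all_matcher```, and ```longest_matcher``` are illustrative.

```python
import spacy
from spacy.matcher import Matcher

greedy_nlp = spacy.blank("en")
greedy_doc = greedy_nlp("very very good")
greedy_pattern = [{"LOWER": "very", "OP": "*"}, {"LOWER": "good"}]

# Without greedy: every span that satisfies the pattern is returned.
all_matcher = Matcher(greedy_nlp.vocab)
all_matcher.add("all", [greedy_pattern])

# With greedy="LONGEST": overlapping matches are reduced to the longest one.
longest_matcher = Matcher(greedy_nlp.vocab)
longest_matcher.add("longest", [greedy_pattern], greedy="LONGEST")

all_texts = sorted(span.text for span in all_matcher(greedy_doc, as_spans=True))
longest_texts = [span.text for span in longest_matcher(greedy_doc, as_spans=True)]
print(all_texts)
print(longest_texts)
```

```greedy``` applies per rule, whereas ```util.filter_spans``` works on any list of spans, so the latter remains the more general tool when spans from several rules are combined.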


---

# Noun Phrase

## Using spaCy ```Doc.noun_chunks```

spaCy provides ```Doc.noun_chunks``` to extract base noun phrases without writing a matcher pattern.

In [18]:
doc = nlp("He has multiple guitar and five beautiful pianos in his collection.")

In [19]:
for n in doc.noun_chunks:
    print(n)

He
multiple guitar
five beautiful pianos
his collection


## Using spaCy syntactic context

* [Noun phrases with spacy](https://stackoverflow.com/a/33512175/4281353)

> the best way is to iterate over the words of the sentence and consider the syntactic context to determine whether the word governs the phrase-type you want. If it does, yield its subtree
> ```
> from spacy.symbols import *
> 
> np_labels = set([nsubj, nsubjpass, dobj, iobj, pobj]) # Probably others too
> def iter_nps(doc):
>     for word in doc:
>         if word.dep in np_labels:
>             yield word.subtree
> ```



In [20]:
# nsubj is not included here, so subject phrases such as "He" are not printed.
np_labels = {
    nsubjpass, dobj, iobj, pobj
}

for word in doc:
    if word.dep in np_labels:
        print(' '.join([token.lemma_ for token in word.subtree]))

multiple guitar and five beautiful piano in his collection
his collection


## Using matcher pattern

* [Noun phrases with spacy](https://stackoverflow.com/a/33512175/4281353)

>  to specify more exactly which kind of noun phrase you want to extract, you can use textacy's matches function. You can pass any combination of POS tags.  
> ```
> textacy.extract.matches(doc, "POS:ADP POS:DET:? POS:ADJ:? POS:NOUN:+")
> ```

textacy is a wrapper around spaCy, so an equivalent spaCy matcher pattern does the same job. For example, to get nouns that are preceded by a preposition and optionally by a determiner and/or an adjective:
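
To make the correspondence concrete, the pattern string from the quote can be translated into matcher token dictionaries. This is a simplified sketch: ```pattern_string_to_token_dicts``` is a hypothetical helper that handles only the ```ATTR:VALUE[:OP]``` form shown in the quote, and it is not textacy's own parser.

```python
from typing import Dict, List


def pattern_string_to_token_dicts(pattern: str) -> List[Dict[str, str]]:
    """Translate space-separated 'ATTR:VALUE[:OP]' specs into Matcher token dicts."""
    token_dicts: List[Dict[str, str]] = []
    for token_spec in pattern.split():
        parts = token_spec.split(":")
        token: Dict[str, str] = {parts[0]: parts[1]}  # e.g. {"POS": "ADP"}
        if len(parts) == 3:
            token["OP"] = parts[2]                    # e.g. "?" or "+"
        token_dicts.append(token)
    return token_dicts


converted = pattern_string_to_token_dicts("POS:ADP POS:DET:? POS:ADJ:? POS:NOUN:+")
for token_dict in converted:
    print(token_dict)
```

The resulting list can be passed to ```Matcher.add``` like any handwritten pattern.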

In [21]:
doc = nlp("He hit at the old guitar case twice.")

matcher = Matcher(nlp.vocab)
# Preposition, optional determiner, optional adjective, then one or more nouns.
pattern = [
    {"POS":"ADP"}, 
    {"POS": "DET", "OP":"?"}, 
    {"POS":"ADJ", "OP":"?"}, 
    {"POS":"NOUN", "OP":"+"}
]
matcher.add(
    "find_complex", 
    [
        pattern
    ]
)

In [22]:
matches = spacy.util.filter_spans(matcher(doclike=doc, as_spans=True))
for span in matches:
    print(span)

at the old guitar case
