# Spaczz FuzzyMatcher

* [spaczz: Fuzzy matching and more for spaCy](https://github.com/gandersen101/spaczz)

> spaczz provides fuzzy matching and additional regex matching functionality for spaCy. spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.

* [RapidFuzz](https://github.com/maxbachmann/rapidfuzz)

> RapidFuzz is a fast string matching library for Python and C++, which is using the string similarity calculations from FuzzyWuzzy. However there are a couple of aspects that set RapidFuzz apart from FuzzyWuzzy:
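
To make "string similarity" concrete, here is a minimal sketch of a 0-100 similarity score. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, not RapidFuzz itself, and the helper name `similarity` is hypothetical. RapidFuzz's own scorers are implemented differently and may return different values.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (hypothetical helper, stdlib only)."""
    # ratio() is 2 * M / T, where M is the number of matched characters
    # and T is the total number of characters in both strings.
    return 100 * SequenceMatcher(None, a, b).ratio()


score = similarity("Nashville", "Nashv1le")
print(round(score))
```

A threshold on a score like this is what decides whether a misspelled candidate counts as a match.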

In [1]:
# !pip install spaczz

In [2]:
%%html
<style>
table {float:left}
</style>

In [3]:
from typing import (
    List, 
    Dict,
    Tuple
)
import json
import spacy
from spaczz.matcher import FuzzyMatcher

# Language Model

In [4]:
# spacy.cli.download("en_core_web_lg")

nlp = spacy.blank("en")
vocabulary: spacy.vocab.Vocab = nlp.vocab

# FuzzyMatcher


The matcher must share the same ```vocab``` as the documents it operates on, so pass it the vocabulary of the language model (```nlp.vocab```).

In [5]:
text = """Grint M Anderson created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""  # Spelling errors intentional.
doc = nlp(text)

In [39]:
matcher = FuzzyMatcher(nlp.vocab)

matcher.add("NAME", [nlp("Grant Andersen")])
matcher.add("GPE", [nlp("Nashville")])
matches = matcher(doc)

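for match_id, start, end, ratio, pattern in matches:
    start_char = doc[start].idx
    # `end` is exclusive, so doc[end] is the token after the match and its
    # offset keeps any whitespace that follows the match. Fall back to
    # len(text) when the match reaches the end of the doc, where doc[end]
    # would raise IndexError.
    end_char = doc[end].idx if end < len(doc) else len(text)
    print(start_char, end_char, text[start_char:end_char])
    print(f"match:[{doc[start:end]}], ratio:{ratio}")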

0 17 Grint M Anderson 
match:[Grint M Anderson], ratio:80
69 77 Nashv1le
match:[Nashv1le], ratio:82


In [36]:
text[0:17]

'Grint M Anderson '
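
The trailing space in the slice above comes from how the end offset is derived: the match's end token index is exclusive, so the offset of the token at that index is where the next token starts, not where the match stops. The sketch below reproduces this with plain Python. The `(token_text, start_offset)` pairs are a hypothetical stand-in for spaCy tokens and their `idx` attribute, and the whitespace tokenizer is an assumption made only for illustration.

```python
import re

text = "Grint M Anderson created spaczz"

# Hypothetical stand-in for spaCy tokens: (token_text, start_offset) pairs.
tokens = [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]

start, end = 0, 3  # token span [start, end), with `end` exclusive

start_char = tokens[start][1]

# Offset of the token after the match: the slice keeps the whitespace before it.
loose_end = tokens[end][1]

# Offset just past the last matched token: the slice stops at the match itself.
last_text, last_start = tokens[end - 1]
tight_end = last_start + len(last_text)

loose = text[start_char:loose_end]
tight = text[start_char:tight_end]
print(repr(loose), repr(tight))
```

With spaCy itself, the `end_char` attribute of the span `doc[start:end]` should give the tight offset directly.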

In [7]:
len("Rdley Scott was the director of Alien.")

38

---

In [10]:
import sys
sys.path.append("/Users/oonisim/home/repository/git/oonisim/lib/code/python")

In [44]:
"""Module for utility with spaczz"""
import logging
from typing import (
    Tuple,
    List,
    Generator
)

import spacy
from spaczz.matcher import FuzzyMatcher


from lib.util_logging import (
    get_logger
)


# --------------------------------------------------------------------------------
# Logging
# --------------------------------------------------------------------------------
_logger: logging.Logger = get_logger(__name__)


# --------------------------------------------------------------------------------
# Spacy English
# --------------------------------------------------------------------------------
_nlp = spacy.blank("en")
# pylint: disable=no-name-in-module, unused-import
_vocabulary: spacy.vocab.Vocab = _nlp.vocab


# --------------------------------------------------------------------------------
# Utility
# --------------------------------------------------------------------------------
def fuzzy_match_sequence_generator(
        text: str,
        patterns: List[str],
        minimum_match_ratio: int = 75
) -> Generator[Tuple[int, int, int, str], None, None]:
    """Find the fuzzy pattern matches in the text and generate a sequence of
    (cursor, match_start_pos, match_end_pos, matched) tuples, where cursor is
    the char position at which the preceding unmatched text starts and
    match_end_pos is exclusive.
    For instance, if text is "Rdley Scott was the director of Alien." and pattern is
    "Ridley Scott", then the sequence is:
    (0, 0, 12, "Rdley Scott ")
    (12, 38, 38, '')
    The match end is the start of the token that follows the match, so the
    whitespace after the match is included.

    Args:
        text: text to search for fuzzy matches
        patterns: patterns to search for
        minimum_match_ratio: minimum match ratio (0-100) required

    Returns: generator of (cursor, match_start_pos, match_end_pos, matched).
    """
    _func_name: str = "fuzzy_match_sequence_generator()"
    if hasattr(_logger, "is_trace_enabled") and _logger.is_trace_enabled:
        _logger.debug(
            "%s: run FuzzyMatch patterns %s on text\n[%s].",
            _func_name, patterns, text
        )

    doc = _nlp(text)
    matcher: FuzzyMatcher = FuzzyMatcher(_nlp.vocab)
    for index, _pattern in enumerate(patterns):
        matcher.add(
            label=str(index),
            patterns=[_nlp.make_doc(_pattern)],
            kwargs=[{
                'min_r': minimum_match_ratio
            }]
        )

    # spaczz.readthedocs.io/en/latest/reference.html#spaczz.matcher.FuzzyMatcher.__call__
    matches = matcher(doc)
    cursor: int = 0     # char position in text to point to the next start.
    end: int = len(text)
    try:
        for label, start_token_pos, end_token_pos, match_ratio, pattern in matches:
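            start_char_pos: int = doc[start_token_pos].idx
            # end_token_pos is exclusive: doc[end_token_pos] is the token after
            # the match. Fall back to len(text) when the match ends the doc,
            # where indexing doc[end_token_pos] would raise IndexError.
            end_char_pos: int = (
                doc[end_token_pos].idx if end_token_pos < len(doc) else len(text)
            )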

            if hasattr(_logger, "is_trace_enabled") and _logger.is_trace_enabled:
                _logger.debug(
                    "%s: [%s]th pattern:[%s] matched with [%s]%% with text[%s:%s]:\n[%s].",
                    _func_name, label, pattern, match_ratio,
                    start_char_pos, end_char_pos, text[start_char_pos:end_char_pos]
                )

            yield cursor, start_char_pos, end_char_pos, text[start_char_pos:end_char_pos]
            cursor = end_char_pos

        # Collect the rest in the text
        if cursor < end:
            yield cursor, len(text), len(text), ''

    except StopIteration:
        return


In [45]:
text = """Grint M Anderson created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""

In [46]:
generator = fuzzy_match_sequence_generator(
    text=text,
    patterns=[
        "Grint M Anderson",
        "5555 Faker St"
    ]
)
for cursor, start_char_pos, end_char_pos, match in generator:
    print(cursor, start_char_pos, end_char_pos, match)

DEBUG:__main__:fuzzy_match_sequence_generator(): run FuzzyMatch patterns ['Grint M Anderson', '5555 Faker St'] on text
[Grint M Anderson created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US.].
DEBUG:__main__:fuzzy_match_sequence_generator(): [0]th pattern:[Grint M Anderson] matched with [100]% with text[0:17]:
[Grint M Anderson ].
DEBUG:__main__:fuzzy_match_sequence_generator(): [1]th pattern:[5555 Faker St] matched with [92]% with text[47:58]:
[555 Fake St].


0 0 17 Grint M Anderson 
17 47 58 555 Fake St
58 103 103 


In [47]:
text[58:103]

',\nApt 5 in Nashv1le, TN 55555-1234 in the US.'
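
One way to consume the generated tuples is to split the text into alternating unmatched and matched segments. The sketch below hard-codes the three `(cursor, start, end)` triples printed above instead of calling the generator, so it runs without spaCy or spaczz. The `segments` list and its `"gap"`/`"match"` labels are hypothetical names chosen for illustration.

```python
text = (
    "Grint M Anderson created spaczz in his home at 555 Fake St,\n"
    "Apt 5 in Nashv1le, TN 55555-1234 in the US."
)

# (cursor, match_start, match_end) triples copied from the output above.
records = [(0, 0, 17), (17, 47, 58), (58, 103, 103)]

segments = []
for cursor, start, end in records:
    if cursor < start:
        # Unmatched text between the previous match and this one.
        segments.append(("gap", text[cursor:start]))
    if start < end:
        # The matched text itself (empty for the final "rest of text" tuple).
        segments.append(("match", text[start:end]))

for kind, piece in segments:
    print(kind, repr(piece))
```

Because every character falls in exactly one segment, joining the pieces in order gives back the original text, which is a cheap sanity check on the offsets.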