# Choosing String Comparators

When building a Splink model, one of the most important aspects is defining the [`Comparisons`](../comparison.md) and [`Comparison Levels`](../comparison_level.md). When defining `Comparisons` and `Comparison Levels` it is helpful to consider string fuzzy matching.

This guide shows how Splink's string comparators perform in different situations, to help you choose the most appropriate comparator for a given column as well as the most appropriate threshold (or thresholds).

## What options are available when comparing strings?

There are three main classes of string comparator that are considered within Splink:
1. **String Similarity Scores**
2. **String Distance Scores**
3. **Phonetic Matching**

where  
**String Similarity Scores** are scores between 0 and 1 indicating how similar two strings are. 0 represents two completely dissimilar strings and 1 represents identical strings. E.g. [Jaro-Winkler Similarity](comparators.md#jaro-winkler-similarity).  

**String Distance Scores** are integer distances, counting the number of operations to convert one string into another. A lower string distance indicates more similar strings. E.g. [Levenshtein Distance](comparators.md#levenshtein-distance).  

**Phonetic Matching** tests whether two strings sound alike. The two strings are passed through a [phonetic transformation algorithm](phonetic.md) and the resulting phonetic codes are then compared. E.g. [Double Metaphone](phonetic.md#double-metaphone).
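To make the difference between a distance score and a similarity score concrete, here is a minimal pure-Python sketch of one of each. The function names `levenshtein` and `jaccard` are illustrative only and are not Splink functions. The Jaccard variant shown works on sets of distinct characters, which is an assumption that happens to agree with the scores shown later in this guide.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Shared distinct characters divided by all distinct characters."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


print(levenshtein("Richard", "Richar"))        # 1
print(round(jaccard("Richard", "Richar"), 3))  # 0.857
```

A distance counts edits, so lower means more similar, whereas a similarity is a ratio, so higher means more similar. Because this Jaccard variant ignores character order, a string and any rearrangement of its characters score 1.0.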

## Comparing String Similarity and Distance Scores

Splink's `comparison_helpers` module includes helper functions for comparing string similarity and distance scores, which can help when choosing the most appropriate fuzzy matching function.

For comparing two strings, the `comparator_score` function returns the scores for all of the available comparators. E.g. consider a simple transposition of the first two characters, "Richard" vs "iRchard":

In [2]:
import splink.comparison_helpers as ch

ch.comparator_score("Richard", "iRchard")

{'levenshtein_distance': 2,
 'damerau_levenshtein_distance': 1,
 'jaro_similarity': 0.952,
 'jaro_winkler_similarity': 0.952,
 'jaccard_similarity': 1.0}
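The gap between the Levenshtein distance (2) and the Damerau-Levenshtein distance (1) comes from how adjacent swaps are counted. The sketch below implements the restricted "optimal string alignment" variant of Damerau-Levenshtein. Whether the backend uses this variant or the unrestricted one is an assumption, and `osa_distance` is a hypothetical name, not a Splink function.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where swapping two adjacent characters costs one operation."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "Ri" <-> "iR" counts as a single edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("Richard", "iRchard"))  # 1
```

Plain Levenshtein has no transposition step, so it has to model the same swap as two substitutions.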

Now consider a collection of common variations of the name "Richard" - which comparators will catch these as the same?

In [3]:
import pandas as pd
data = {'string1': ['Richard', 'Richard', 'Richard', 'Richard', 'Richard', 'Richard', 'Richard', 'Richard', 'Richard', 'Richard'],
        'string2': ['Richard', 'ichard', 'Richar','iRchard', 'Richadr',  'Rich', 'Rick', 'Ricky', 'Dick', 'Rico'],
        'error_type': ['None', 'Deletion', 'Deletion', 'Transposition', 'Transposition', 'Shortening', 'Alias', 'Alias', 'Alias', 'Alias']}
df = pd.DataFrame(data)
df

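|   | string1 | string2 | error_type |
|---|---------|---------|------------|
| 0 | Richard | Richard | None |
| 1 | Richard | ichard | Deletion |
| 2 | Richard | Richar | Deletion |
| 3 | Richard | iRchard | Transposition |
| 4 | Richard | Richadr | Transposition |
| 5 | Richard | Rich | Shortening |
| 6 | Richard | Rick | Alias |
| 7 | Richard | Ricky | Alias |
| 8 | Richard | Dick | Alias |
| 9 | Richard | Rico | Alias |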


In [4]:
ch.comparator_score_df(df, "string1", "string2")

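|   | string1 | string2 | levenshtein_distance | damerau_levenshtein_distance | jaro_similarity | jaro_winkler_similarity | jaccard_similarity |
|---|---------|---------|----------------------|------------------------------|-----------------|-------------------------|--------------------|
| 0 | Richard | Richard | 0 | 0 | 1.0 | 1.0 | 1.0 |
| 1 | Richard | ichard | 1 | 1 | 0.952 | 0.952 | 0.857 |
| 2 | Richard | Richar | 1 | 1 | 0.952 | 0.971 | 0.857 |
| 3 | Richard | iRchard | 2 | 1 | 0.952 | 0.952 | 1.0 |
| 4 | Richard | Richadr | 2 | 1 | 0.952 | 0.971 | 1.0 |
| 5 | Richard | Rich | 3 | 3 | 0.857 | 0.914 | 0.571 |
| 6 | Richard | Rick | 4 | 4 | 0.726 | 0.808 | 0.375 |
| 7 | Richard | Ricky | 4 | 4 | 0.676 | 0.676 | 0.333 |
| 8 | Richard | Dick | 5 | 5 | 0.595 | 0.595 | 0.222 |
| 9 | Richard | Rico | 4 | 4 | 0.726 | 0.808 | 0.375 |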


Or, in a slightly more visual way:

In [5]:
ch.comparator_score_heatmap(df, "string1", "string2")
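Two patterns in the score table are worth unpacking: "Richar" scores higher than "ichard" on Jaro-Winkler even though their Jaro scores are equal, and "Ricky" gets no Jaro-Winkler boost at all. The sketch below is a textbook-style Jaro and Jaro-Winkler implementation. The prefix weight of 0.1, the 4-character prefix cap and the 0.7 boost threshold are conventional Winkler values assumed here; they reproduce the table, but the source does not confirm what the backend uses.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity: matching characters within a window, penalising transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, char_a in enumerate(a):
        start, end = max(0, i - window), min(len(b), i + window + 1)
        for j in range(start, end):
            if not b_matched[j] and b[j] == char_a:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1,
                 max_prefix: int = 4, boost_threshold: float = 0.7) -> float:
    """Jaro similarity with a bonus for a shared prefix (assumed parameter values)."""
    score = jaro(a, b)
    if score <= boost_threshold:
        return score  # weak matches receive no prefix bonus
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


for other in ["ichard", "Richar", "Rick", "Ricky"]:
    print(other, round(jaro("Richard", other), 3),
          round(jaro_winkler("Richard", other), 3))
```

Under these assumptions, "Richar" shares a four-character prefix with "Richard" and is boosted from 0.952 to 0.971, while "ichard" shares none. "Ricky" has a Jaro score of 0.676, below the 0.7 threshold, so its three-character shared prefix earns no bonus.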

## Choosing thresholds

We can add distance and similarity thresholds to the comparators to see what strings would be included in a given comparison level:

In [7]:
ch.comparator_score_threshold_heatmap(
    df, "string1", "string2",
    distance_threshold=2,
    similarity_threshold=0.9,
)
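As a rough illustration of what these thresholds mean, the sketch below applies them by hand to two columns of scores copied from the table earlier in this guide. It assumes that a pair passes when its distance is less than or equal to the distance threshold, or its similarity is greater than or equal to the similarity threshold. That inclusive convention is an assumption, not something the helper's output confirms here.

```python
# (string2, levenshtein_distance, jaro_winkler_similarity) against "Richard",
# copied from the score table earlier in this guide
scores = [
    ("Richard", 0, 1.0),
    ("ichard", 1, 0.952),
    ("Richar", 1, 0.971),
    ("iRchard", 2, 0.952),
    ("Richadr", 2, 0.971),
    ("Rich", 3, 0.914),
    ("Rick", 4, 0.808),
    ("Ricky", 4, 0.676),
    ("Dick", 5, 0.595),
    ("Rico", 4, 0.808),
]

distance_threshold = 2
similarity_threshold = 0.9

# Assumed convention: thresholds are inclusive
within_distance = [s for s, dist, _ in scores if dist <= distance_threshold]
within_similarity = [s for s, _, sim in scores if sim >= similarity_threshold]

print("Levenshtein <= 2:  ", within_distance)
print("Jaro-Winkler >= 0.9:", within_similarity)
```

With these values, both thresholds admit the typos, and the similarity threshold additionally admits the shortening "Rich". Neither admits the aliases, which suggests that aliases need a different approach from fuzzy string matching.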

## Phonetic Matching

Similar functions are available within Splink to help users get familiar with phonetic transformations. You can create visualisations similar to those shown above for string comparators.

To see the phonetic transformations for a single string, there is the `phonetic_transform` function:

In [8]:
import splink.comparison_helpers as ch

ch.phonetic_transform("Richard")

{'soundex': 'R02063', 'metaphone': 'RXRT', 'dmetaphone': ('RXRT', 'RKRT')}

In [9]:
ch.phonetic_transform("Robert")

{'soundex': 'R01063', 'metaphone': 'RPRT', 'dmetaphone': ('RPRT', '')}

Or, for comparing two strings for phonetic similarity:

In [10]:
ch.phonetic_match("Stephen", "Steven")

[True, True, True]
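To give a feel for what a phonetic transformation does, here is a minimal sketch of classic American Soundex, followed by a comparison of the resulting codes. It is not the implementation behind `phonetic_transform`: it produces the conventional four-character code (for example `R263`), whereas the codes printed above are formatted differently. The name `soundex` below is illustrative only.

```python
# Consonant groups that share a Soundex digit
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Classic four-character Soundex code: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is counted again
    return (first + "".join(digits) + "000")[:4]


print(soundex("Richard"))                       # R263
print(soundex("Stephen"), soundex("Steven"))    # S315 S315
print(soundex("Stephen") == soundex("Steven"))  # True
```

A phonetic match is then an exact comparison of the two codes, which is why spelling variants that sound alike, such as "ph" and "v" here, can still match.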

Consider a dataframe, similar to above:

In [11]:
df = pd.DataFrame({
    # Pairs of first names that may or may not sound alike
    'string1': ['Stephen', 'Stephen', 'Stephen', 'Stephen', 'Aidan', 'Lily', 'Caleb', 'Emma', 'Jason', 'Maya', 'Ryan', 'Ava', 'Leila', 'Oliver'],
    'string2': ['Stephen', 'Steven', 'Stephan', 'Steve', 'Hayden', 'Billy', 'Kaleb', 'Gemma', 'Mason', 'Kaya', 'Brian', 'Ada', 'Lila', 'Alexander'],
})

df.head()

|   | string1 | string2 |
|---|---------|---------|
| 0 | Stephen | Stephen |
| 1 | Stephen | Steven |
| 2 | Stephen | Stephan |
| 3 | Stephen | Steve |
| 4 | Aidan | Hayden |


In [12]:
ch.phonetic_match_df(df, "string1", "string2")
