Using Splink only for comparisons #1922

ikonstas-ds · 2024-02-02T16:36:09Z

Hello,

I am using Splink with PySpark. I have two datasets that I want to link (no deduplication). Initially I used the following blocking rules, but they take a long time to run and in some cases they also miss true candidate pairs:

blocking_rules = [
        "l.ZipCode = r.ZipCode and l.Country = r.Country",
        "substr(l.Name, 1, 3) = substr(r.Name, 1, 3) and l.Country = r.Country"
]

I have also implemented a trigram-based approach, using this Python UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType


def get_trigrams(text):
    """Return the distinct character n-gram tokens of every word in text."""
    trigrams_final = []
    for word in text.split(" "):
        word_len = len(word)
        # Leading prefixes: the first 1 and 2 characters (fewer for very short words)
        trigrams = [word[:i + 1] for i in range(min(word_len, 2))]
        if word_len > 2:
            # Every sliding 3-character window, plus the trailing 2 characters
            trigrams += [word[i:i + 3] for i in range(word_len - 2)]
            trigrams.append(word[word_len - 2:])
        trigrams_final += trigrams
    return list(set(trigrams_final))


get_trigrams_udf = udf(get_trigrams, ArrayType(StringType()))

Then, I am able to use a Pipeline from pyspark.ml like this:

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, MinHashLSH

model = Pipeline(stages=[
    HashingTF(inputCol="trigrams", outputCol="vectors"),
    MinHashLSH(inputCol="vectors", outputCol="hashes", numHashTables=10),
]).fit(df)
...
...
# Approximate similarity join: keep pairs whose Jaccard distance is below 0.6
df_matches = model.stages[-1].approxSimilarityJoin(ms_df_transformed, orbis_df_transformed, 0.6)
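
For reference, the 0.6 passed to the join is a threshold on Jaccard distance between the two trigram sets, so a lower value is stricter. The sketch below computes the exact distance in plain Python, using a compact copy of the tokenizer from the post; `jaccard_distance` is a helper introduced here for illustration only. Spark's result is approximate (MinHash, plus possible hash collisions in HashingTF), so it will not always agree with these exact values.

```python
def get_trigrams(text):
    # Compact copy of the tokenizer from the post (same output set)
    tokens = set()
    for word in text.split(" "):
        n = len(word)
        tokens.update(word[:i + 1] for i in range(min(n, 2)))
        if n > 2:
            tokens.update(word[i:i + 3] for i in range(n - 2))
            tokens.add(word[n - 2:])
    return tokens


def jaccard_distance(a, b):
    # Hypothetical helper: 1 - |A ∩ B| / |A ∪ B|
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


close = jaccard_distance(get_trigrams("acme"), get_trigrams("acmes"))
far = jaccard_distance(get_trigrams("acme"), get_trigrams("acne"))
print(round(close, 3), round(far, 3))  # 0.429 0.75
```

With a 0.6 threshold, the first pair would be kept as a candidate and the second would be dropped.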

This approach runs quite fast and retrieves candidate pairs with high recall. I have also trained an effective comparison model with Splink, so I want to combine the candidate pairs from the trigram approach with the Splink comparison model.

Is there a way to pass these candidate pairs directly to the comparison step, in place of the pairs that the blocking rules would generate?

RobinL (Maintainer) · 2024-02-06T15:39:53Z

Not easily. You'd want to resolve the candidate pairs into clusters, assign those cluster IDs to the input records you pass into Splink, and then use the cluster ID as the blocking rule.
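
A minimal sketch of that suggestion, assuming the candidate pairs are available as `(left_id, right_id)` tuples: a small union-find groups connected records, and each group gets an integer ID. The function name `resolve_clusters`, the `cluster_id` column name, and the sample IDs are all hypothetical. Note that clustering is transitive, so blocking on the cluster ID compares every left/right pair inside a cluster, which can be more pairs than the original candidate list.

```python
def resolve_clusters(candidate_pairs):
    """Group records connected by candidate pairs (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left_id, right_id in candidate_pairs:
        # Tag each ID with its dataset so equal IDs in both inputs don't collide
        root_a, root_b = find(("left", left_id)), find(("right", right_id))
        if root_a != root_b:
            parent[root_b] = root_a

    # Give every connected component a small integer cluster ID
    cluster_ids, assignment = {}, {}
    for node in list(parent):
        root = find(node)
        cluster_ids.setdefault(root, len(cluster_ids))
        assignment[node] = cluster_ids[root]
    return assignment


candidate_pairs = [(1, 10), (2, 10), (3, 11)]  # made-up example IDs
assignment = resolve_clusters(candidate_pairs)

# After joining `assignment` onto both inputs as a cluster_id column
# (an assumed column name), the blocking rule would be:
blocking_rules = ["l.cluster_id = r.cluster_id"]
```

This in-memory version is only meant to show the idea. At Spark scale a distributed connected-components routine (for example the one in GraphFrames) would likely be needed, and records that appear in no candidate pair would have no cluster ID and so would never be compared.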
