Using Splink only for comparisons #1922
ikonstas-ds
started this conversation in
General
Not easily. You'd want to resolve your candidate pairs into clusters, assign those cluster IDs to the input records you pass into Splink, and then use the cluster ID as the blocking rule.
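A rough sketch of the clustering step, in case it helps. This is plain Python with union-find, not Splink API; the function name `assign_cluster_ids`, the `(dataset, record_id)` keys and the `cluster_id` column name are all made up for illustration. At Spark scale you would need a distributed connected-components step instead, but the idea is the same.

```python
def assign_cluster_ids(candidate_pairs):
    """Group records connected by candidate pairs and label each group.

    candidate_pairs: iterable of (left_key, right_key) tuples, where each key
    is a hypothetical (dataset_name, record_id) tuple so that IDs from the
    two datasets cannot collide.
    Returns a dict mapping every record key to an integer cluster ID.
    """
    parent = {}

    def find(x):
        # Register unseen records as their own root, then walk to the root
        # while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller key as the root so the result is deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    for left, right in candidate_pairs:
        union(left, right)

    records = list(parent)
    roots = sorted({find(r) for r in records})
    root_to_id = {root: i for i, root in enumerate(roots)}
    return {r: root_to_id[find(r)] for r in records}


# Toy candidate pairs, e.g. the output of a trigram-based retrieval step.
pairs = [
    (("a", 1), ("b", 10)),
    (("a", 2), ("b", 10)),
    (("a", 3), ("b", 11)),
]
clusters = assign_cluster_ids(pairs)

# After joining the IDs onto both input tables as a `cluster_id` column
# (assumed name), the blocking rule is an ordinary SQL equality condition.
BLOCKING_RULE = "l.cluster_id = r.cluster_id"
```

One caveat: blocking on the cluster ID compares every cross-dataset pair inside a cluster, so you get a superset of the original candidate pairs. If A1 and A2 both pair with B10, and A2 also pairs with B11, then A1 and B11 will be compared as well.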
-
Hello,
I am using Splink with PySpark. I have two datasets that I want to link (no deduplication). Initially, I was using the following blocking rules, but they take a long time to run and in some cases I also lose candidate pairs:
I have also implemented a trigram-based method for retrieving candidate pairs, using this Python UDF:
Then, I am able to use a Pipeline from pyspark.ml like this:
This approach runs quite fast and retrieves candidate pairs with high recall. I have also trained an effective comparison model with Splink, so I want to combine the candidate pairs from my trigram approach with the Splink comparison model.
Is there a way to use these candidate pairs directly for comparisons as the outcome of the blocking rules?