-
Splink is really designed for record linkage problems where your data is split over multiple columns (e.g. name, date of birth, etc.). If there's a way of splitting your data, then it might be possible. But in general it's not intended to work on a single column that contains a big bag of words/document.
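To make "split over multiple columns" concrete, here is a minimal sketch of what a multi-column link job can look like, assuming Splink 3's Spark API. The column names, comparison choices, and blocking rule are hypothetical, and parameter estimation steps are omitted, so treat this as a shape rather than a recipe and check the names against the current docs:

```python
# Hedged sketch of a multi-column Splink link job (assumes Splink 3's Spark API).
# Column names, comparisons, and the blocking rule are illustrative only.
from splink.spark.linker import SparkLinker
import splink.spark.comparison_library as cl

settings = {
    "link_type": "link_only",  # link two datasets rather than dedupe one
    "comparisons": [
        cl.levenshtein_at_thresholds("first_name", 2),  # fuzzy name comparison
        cl.exact_match("dob"),                          # exact date of birth
    ],
    # Blocking keeps the number of candidate pairs tractable: only pairs
    # that share a date of birth are ever compared.
    "blocking_rules_to_generate_predictions": ["l.dob = r.dob"],
}

linker = SparkLinker([df_left, df_right], settings)
predictions = linker.predict()  # pairwise match scores for blocked candidate pairs
```

The key point is the blocking rule: without one, linking would require scoring every possible pair across the two datasets.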
-
Theoretically you could use the Spark Splink linker with one of the comparator functions. I believe Jaccard similarity would be the most useful, as it calculates the intersection over the union of substrings in your "bag of words" columns. But I say theoretically because, given the nature of your data, it would still need to perform a huge number of pairwise comparisons. All is not lost, however, as you can also use another method (not Splink related): here is a Scala blog post, but I'm sure you can find a similar one for PySpark.
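The blog post isn't linked above, but approaches of this kind typically use Spark ML's MinHash LSH to do an approximate similarity join on character n-grams, which avoids scoring every pair. A minimal PySpark sketch, assuming two DataFrames `ref_df` and `owner_df` each with an `own_name` column (the n-gram size, number of hash tables, and distance threshold are illustrative):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

# Represent each name as a set of character 3-grams hashed into a sparse
# vector; MinHashLSH then builds signatures approximating Jaccard similarity.
pipeline = Pipeline(stages=[
    RegexTokenizer(pattern="", inputCol="own_name", outputCol="chars"),
    NGram(n=3, inputCol="chars", outputCol="ngrams"),
    HashingTF(inputCol="ngrams", outputCol="vectors"),
    MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=5),
])

model = pipeline.fit(ref_df)           # fit on the smaller (110K-row) side
ref_t = model.transform(ref_df)
owner_t = model.transform(owner_df)    # 12M-row side

# Approximate join: only candidate pairs with Jaccard distance <= 0.5 are
# returned, so the full 110K x 12M cross product is never scored.
matches = model.stages[-1].approxSimilarityJoin(
    ref_t, owner_t, 0.5, distCol="jaccard_distance"
)
```

One caveat: names shorter than the n-gram size produce empty vectors, which MinHashLSH rejects, so filter those rows out before fitting.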
-
Hi,
I am using difflib.get_close_matches to find the closest match between two sets of strings, but it's too slow; retrieving even one record takes 5 min 24 s.
I have 12 million records in one data frame column and 110K records in another.
I want to find close matches for the 110K strings from the 12-million-record column.
Can I use Splink for this problem? It seems that Splink needs at least one column that matches exactly in order to link records.
If not, do you know of any other solution fast enough to process this data and find close matches?
Code:
```python
import difflib

# Find the closest match in owner_df for a single name from ref_df.
# get_close_matches returns a (possibly empty) list, best match first.
matches = difflib.get_close_matches(ref_df['own_name'][1], owner_df['own_name'])
best_match = matches[0] if matches else None
```