-
Splink is really designed for record linkage problems where your data is split over multiple columns (e.g. name, date of birth, etc.). If there's a way of splitting your data, then it might be possible. But in general it's not intended to work on a single column that contains a big bag of words/document.
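To make "split over multiple columns" concrete, here is a minimal sketch of what a multi-column link job can look like, assuming Splink 3's Spark API. The column names, comparison choices, and blocking rule are hypothetical, and parameter estimation steps are omitted, so treat this as a shape rather than a recipe and check the names against the current docs:

```python
# Hedged sketch of a multi-column Splink link job (assumes Splink 3's Spark API).
# Column names, comparisons, and the blocking rule are illustrative only.
from splink.spark.linker import SparkLinker
import splink.spark.comparison_library as cl

settings = {
    "link_type": "link_only",  # link two datasets rather than dedupe one
    "comparisons": [
        cl.levenshtein_at_thresholds("first_name", 2),  # fuzzy name comparison
        cl.exact_match("dob"),                          # exact date of birth
    ],
    # Blocking keeps the number of candidate pairs tractable: only pairs
    # that share a date of birth are ever compared.
    "blocking_rules_to_generate_predictions": ["l.dob = r.dob"],
}

linker = SparkLinker([df_left, df_right], settings)
predictions = linker.predict()  # pairwise match scores for blocked candidate pairs
```

The key point is the blocking rule: without one, linking would require scoring every possible pair across the two datasets.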
-
Theoretically you could use the Spark Splink linker with one of the comparator functions. I believe Jaccard similarity would be the most useful, as it calculates the intersection over the union of substrings in your "bag of words" columns. But I say theoretically because, given the nature of your data, it would still need to perform a huge number of pairwise comparisons. All is not lost, however, as you can also use another method (not Splink related): here is a Scala blog post, but I'm sure you can find a similar one for PySpark.
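The blog post isn't linked above, but approaches of this kind typically use Spark ML's MinHash LSH to do an approximate similarity join on character n-grams, which avoids scoring every pair. A minimal PySpark sketch, assuming two DataFrames `ref_df` and `owner_df` each with an `own_name` column (the n-gram size, number of hash tables, and distance threshold are illustrative):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

# Represent each name as a set of character 3-grams hashed into a sparse
# vector; MinHashLSH then builds signatures approximating Jaccard similarity.
pipeline = Pipeline(stages=[
    RegexTokenizer(pattern="", inputCol="own_name", outputCol="chars"),
    NGram(n=3, inputCol="chars", outputCol="ngrams"),
    HashingTF(inputCol="ngrams", outputCol="vectors"),
    MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=5),
])

model = pipeline.fit(ref_df)           # fit on the smaller (110K-row) side
ref_t = model.transform(ref_df)
owner_t = model.transform(owner_df)    # 12M-row side

# Approximate join: only candidate pairs with Jaccard distance <= 0.5 are
# returned, so the full 110K x 12M cross product is never scored.
matches = model.stages[-1].approxSimilarityJoin(
    ref_t, owner_t, 0.5, distCol="jaccard_distance"
)
```

One caveat: names shorter than the n-gram size produce empty vectors, which MinHashLSH rejects, so filter those rows out before fitting.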
-
Hi,
I am using difflib.get_close_matches to find the closest match between two sets of strings, but it's too slow; retrieving even one record takes 5 min 24 s.
I have 12 million records in one data frame column and 110K records in another.
I want to find close matches for the 110K strings from the 12-million-record column.
Can I use Splink for this problem? It seems that Splink needs at least one column that matches exactly in order to link records.
If not, do you know of any other solution fast enough to process this data and find close matches?
Code:
```python
import difflib

# Find the closest match in owner_df for a single name from ref_df.
# get_close_matches returns a (possibly empty) list, best match first.
matches = difflib.get_close_matches(ref_df['own_name'][1], owner_df['own_name'])
best_match = matches[0] if matches else None
```