Question on using Splink for Large Data Fuzzy Matching #2005

leroymaxxus · 2024-02-28T01:31:49Z

leroymaxxus
Feb 28, 2024

Hi!

I'm currently using a combination of Excel and Alteryx to run fuzzy matching on millions of rows of data, which is excruciatingly slow and time-consuming. I came upon this page when searching for an alternative that runs more efficiently.

For my use case, I'm only looking at two columns: 1/ Account and 2/ Account Name. Column 1 contains the raw data from applications, and column 2 contains the Account Names in our system. In short, we just need to match column 1 to column 2 so that we can tag each entry to the right account afterwards.

May I check how I can run this using Splink, and what the expected runtime would be for e.g. 750K rows of data?

Thanks in advance.

RobinL · 2024-02-28T09:25:18Z

RobinL
Feb 28, 2024
Maintainer

Thanks for the question. It very much depends on what your data looks like. Would you be able to provide a couple of example rows (feel free to fake them, they don't need to be real) and what you expect the result to be? I just need a sense of the format of the data and the kind of values it contains. That'll help me advise on whether Splink is appropriate or could help.

2 replies

leroymaxxus Mar 6, 2024
Author

Hi @RobinL, I hope to get your advice on whether Splink is applicable to my use case. Previously, I was looking at using Fuzzywuzzy, but I understand that it may not be efficient for large datasets of more than 500K records.

RobinL Mar 6, 2024
Maintainer

Sorry for the delay. Unfortunately I think your data doesn't really have enough information in it to be a good fit for Splink - see

https://github.com/moj-analytical-services/splink?tab=readme-ov-file#what-data-does-splink-work-best-with

For company names, you typically need auxiliary columns in addition to the name, e.g. address and/or email address and/or phone number.

leroymaxxus · 2024-02-28T12:40:14Z

leroymaxxus
Feb 28, 2024
Author

For Testing_Fake Account Names.csv

Thanks for the reply. Basically, I'm trying to fuzzy match Column A (Account) to Column B (Account_Name). The first column holds values entered by individuals, and each one needs to be matched to its closest match in Column B, which comes from our database.

The database contains more than 350K account names, and a submission often contains 100K-200K values. Hence, I'm exploring alternatives that can ease the processing strain of matching 100K+ records against a 350K+ database.
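For illustration, the task described above (matching each submitted value to its closest database name) can be sketched with Python's standard library alone. This is only a minimal sketch under stated assumptions and does not use Splink: the function names, the example values, the 0.8 similarity cutoff, and the first-character blocking key are all hypothetical choices. The blocking step is there because comparing every one of 100K submitted values against every one of 350K database names would mean about 35 billion comparisons.

```python
from collections import defaultdict
from difflib import get_close_matches


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())


def build_index(account_names):
    """Group database names by first character (a crude, assumed blocking key)."""
    index = defaultdict(dict)
    for name in account_names:
        key = normalise(name)
        if key:
            # Map the normalised form back to the original database spelling.
            index[key[0]][key] = name
    return index


def closest_account(raw_value, index, cutoff=0.8):
    """Return the closest database name for raw_value, or None if none is close enough."""
    key = normalise(raw_value)
    if not key:
        return None
    # Only compare against names that share the same first character.
    block = index.get(key[0], {})
    matches = get_close_matches(key, list(block), n=1, cutoff=cutoff)
    return block[matches[0]] if matches else None


# Made-up example values, not taken from the attached CSV.
database = ["Acme Corporation", "Globex Industries", "Initech LLC"]
index = build_index(database)
for raw in ["acme corporaton", "Globex  Industries", "Unknown Trading Co"]:
    print(raw, "->", closest_account(raw, index))
```

First-character blocking will miss any value whose first letter is mistyped, and `difflib` is likely to be slow at the volumes mentioned here, so this only shows the shape of the approach rather than a production-ready solution.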

0 replies
