Question on using Splink for Large Data Fuzzy Matching #2005
-
Thanks for the question. It very much depends on what your data looks like. Could you provide a couple of example rows (feel free to fake them; they don't need to be real) and what you expect the result to be? I just need a sense of the format of the data and the kind of values it contains. That will help me advise on whether Splink is appropriate or could help.
-
For Testing_Fake Account Names.csv
Thanks for the reply. Basically, I'm trying to fuzzy match Column A (Account) to Column B (Account_Name). Column A holds values entered by individuals, and each one needs to be matched to its closest match in Column B, which comes from our database. The database contains more than 350K account names, and the submitted values often number 100K-200K. Hence, I'm exploring alternatives that can ease the processing strain of matching 100K+ records against a database of 350K+ names.
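To make the scale concrete: comparing every submitted value with every database name would mean roughly 100K × 350K = 35 billion pairs, so the usual way to ease the strain is to compare only candidates that share a cheap "blocking" key. The sketch below illustrates that idea with the Python standard library only. It is not Splink's API, and the sample names, the three-character prefix key, and the 0.8 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def build_index(account_names):
    """Group database names by a blocking key (first 3 normalised characters)."""
    index = defaultdict(list)
    for name in account_names:
        norm = normalise(name)
        index[norm[:3]].append((norm, name))
    return index


def best_match(account, index, threshold=0.8):
    """Return (database_name, score) for the closest candidate in the same block, or None."""
    norm = normalise(account)
    best = None
    # Only names sharing the blocking key are compared, not the whole database.
    for cand_norm, cand in index.get(norm[:3], []):
        score = SequenceMatcher(None, norm, cand_norm).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (cand, score)
    return best


# Hypothetical sample data standing in for Column B (database) and Column A (submitted).
database = ["Acme Corporation", "Globex Industries", "Initech LLC"]
index = build_index(database)
submitted = ["ACME Corporaton", "Globex Industires", "Unknown Co"]
matches = {account: best_match(account, index) for account in submitted}
```

A prefix key like this will miss a submitted value whose typo falls in the first three characters, so in practice several blocking keys are usually combined. Splink has its own blocking rules that serve the same purpose; they are not shown here.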
-
Hi!
I'm currently using a combination of Excel and Alteryx to run fuzzy matching on around a million rows of data, which is excruciatingly slow. I came across this page while searching for a more efficient alternative.
For my use case, I'm only looking at two columns: 1/ Account and 2/ Account Name. Column 1 contains the raw data from applications, and column 2 contains the Account Names in our system. In short, we just need to match column 1 to column 2 so that we can then tag each record to the correct account.
May I check how I can run this using Splink, and what the expected runtime would be for, e.g., 750K rows of data?
Thanks in advance.