-
Hiya. Happy to try and help with this. Would it be possible to post a reproducible example using fake data that illustrates the issues? It's quite hard to debug without seeing the code and the data. I'm guessing your script looks something like this (obviously much simplified). Are you able to reproduce the problems by adapting this example?

```python
import pandas as pd

from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

source_data = [
    {
        "unique_id": 1,
        "name": "John Doe",
        "address": "123 Elm St",
        "city": "Springfield",
        "state": "IL",
        "zip_code": "62701",
    },
    {
        "unique_id": 2,
        "name": "Jane Smith",
        "address": "456 Oak Ave",
        "city": "Lincoln",
        "state": "NE",
        "zip_code": "68502",
    },
]

destination_data = [
    {
        "unique_id": 103,
        "name": "Alice Johnson",
        "address": "789 Pine Rd",
        "city": "Madison",
        "state": "WI",
        "zip_code": "53703",
    },
    {
        "unique_id": 104,
        "name": "Bob Brown",
        "address": "101 Maple Blvd",
        "city": "Dover",
        "state": "DE",
        "zip_code": "19901",
    },
]

source_df = pd.DataFrame(source_data)
destination_df = pd.DataFrame(destination_data)

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [],
    "comparisons": [
        levenshtein_at_thresholds("name", 2),
        exact_match("zip_code"),
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker([source_df, destination_df], settings)
linker.predict().as_pandas_dataframe()
```
-
Hi Robin and team,
Thank you for developing such an extensive package for performing & visualizing record linkage at scale.
I am using it to find potential pairs in a medical-entities setting, using data from two different sources. The key features used are: name, address, city, state, and a 'split' zip code.
Additionally, I standardize the text by expanding abbreviations to full words (e.g. ST. JOHN --> SAINT JOHN, STRT --> STREET) and so on. The zip code is first padded to a standard length with trailing zeros and then broken down into smaller groups (e.g. the 1st digit, the 2nd and 3rd, the 4th and 5th, and so on), and the first group is used for blocking. The model is then trained by configuring the settings dictionary as shown in Splink's documentation, followed by some post-processing on the results dataframe.
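For concreteness, the zip-code padding and splitting described above could be sketched roughly like this (a minimal illustration, not my actual code; the function name, group sizes, and key names are assumptions):

```python
def split_zip(zip_code: str, length: int = 5) -> dict:
    """Pad a zip code with trailing zeros to a fixed length,
    then split it into groups: 1st digit, 2nd-3rd, 4th-5th."""
    padded = zip_code.ljust(length, "0")
    return {
        "zip_1": padded[0],      # first group, used for blocking
        "zip_23": padded[1:3],
        "zip_45": padded[3:5],
    }
```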
The results dataframe has 5.1K records in total, of which 2.2K have different source and destination references (the ones of interest), while the remaining ~3K have the same source and destination reference.
Following are the questions/observations given the above context:
In the settings dictionary I am using "link_type": "link_only", but the results also seem to include pairs from de-duplication, as explained above. Duplicates are indeed possible in one of the data sources, but they are not of primary interest.
Using threshold_match_probability = 0.85 in the linker.cluster_pairwise_predictions_at_threshold() method also yields pairs whose match_probability (extracted from pairwise_predictions.as_record_dict(limit=None)) is < 0.85. Is this expected, or am I missing something here? I am using city, state, and split zip code as features, so some correlation can be expected. Note: there are around 300 such records.
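One thing worth noting on this point: threshold-based clustering is a transitive-closure (connected-components) operation, so records A and C can land in the same cluster via a shared neighbour B even when the A-C pair itself scores below the threshold. A generic pandas sanity check (a toy sketch, not Splink API) is to filter the pairwise predictions yourself and compare counts:

```python
import pandas as pd

# Toy stand-in for linker.predict().as_pandas_dataframe()
predictions = pd.DataFrame({
    "unique_id_l": [1, 2, 1],
    "unique_id_r": [103, 104, 104],
    "match_probability": [0.91, 0.88, 0.40],
})

# Keep only pairs that individually clear the threshold;
# pairs dropped here can still end up co-clustered transitively
above = predictions[predictions["match_probability"] >= 0.85]
```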
Is there a way to extract feature-level match probabilities instead of the overall match probability? In my view this would help with further understanding of the results and with post-processing. I believe this capability exists, but it is included under the visualizations Splink offers (can the waterfall chart's back-end data be extracted for this purpose? If yes, how can that be accomplished?).
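For what it's worth: with "retain_intermediate_calculation_columns": True, the predictions dataframe carries per-comparison Bayes factor columns, which is the same information the waterfall chart is built from. In Splink 3 these are typically named bf_<comparison_name> (the exact names below, bf_name and bf_zip_code, are assumptions; inspect your own dataframe's columns). The per-feature match weight is then just log2 of the Bayes factor:

```python
import numpy as np
import pandas as pd

# Toy predictions dataframe with per-comparison Bayes factor columns,
# as produced when retain_intermediate_calculation_columns=True
# (the bf_* column names are assumptions - check df.columns in practice)
df = pd.DataFrame({
    "unique_id_l": [1],
    "unique_id_r": [103],
    "bf_name": [8.0],
    "bf_zip_code": [0.5],
})

bf_cols = [c for c in df.columns if c.startswith("bf_")]
for c in bf_cols:
    # Match weight contributed by each comparison = log2(Bayes factor):
    # positive means evidence for a match, negative means evidence against
    df[c.replace("bf_", "mw_")] = np.log2(df[c])
```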
Can a guideline be provided on whether any of the current features should be dropped given that the split zip codes are also included, or whether staggered blocking should be used instead?
I am using the following settings dictionary:
Thanks for going through this post; I look forward to hearing from you and other users based on prior experience/best practices.
Regards,
AJ.