-
Hiya. Happy to try and help with this. Would it be possible to post a reproducible example using fake data that illustrates the issues? It's quite hard to debug without seeing the code and the data. I'm guessing your script looks something like this (obviously much simplified). Are you able to reproduce the problems by adapting this example?

```python
import pandas as pd

from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

source_data = [
    {
        "unique_id": 1,
        "name": "John Doe",
        "address": "123 Elm St",
        "city": "Springfield",
        "state": "IL",
        "zip_code": "62701",
    },
    {
        "unique_id": 2,
        "name": "Jane Smith",
        "address": "456 Oak Ave",
        "city": "Lincoln",
        "state": "NE",
        "zip_code": "68502",
    },
]

destination_data = [
    {
        "unique_id": 103,
        "name": "Alice Johnson",
        "address": "789 Pine Rd",
        "city": "Madison",
        "state": "WI",
        "zip_code": "53703",
    },
    {
        "unique_id": 104,
        "name": "Bob Brown",
        "address": "101 Maple Blvd",
        "city": "Dover",
        "state": "DE",
        "zip_code": "19901",
    },
]

source_df = pd.DataFrame(source_data)
destination_df = pd.DataFrame(destination_data)

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [],
    "comparisons": [
        levenshtein_at_thresholds("name", 2),
        exact_match("zip_code"),
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker([source_df, destination_df], settings)
linker.predict().as_pandas_dataframe()
```
-
Hi Robin and team,
Thank you for developing such an extensive package for performing & visualizing record linkage at scale.
I am using it to find potential pairs in a medical-entities setting, using data from two different sources. The key features used are: name, address, city, state, and a 'split' zip code.
Additionally, I standardize the text by expanding abbreviations to full words (e.g. ST. JOHN --> SAINT JOHN, STRT --> STREET) and so on. The zip code is first padded to a standard length with trailing zeros and then broken down into smaller groups (e.g. the 1st digit, the 2nd and 3rd, the 4th and 5th, and so on), and the first group is used for blocking. The model is then trained by configuring the settings dictionary as shown in Splink's documentation, followed by some post-processing on the results dataframe.
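For concreteness, the zip-code padding and splitting described above could be sketched roughly like this (a minimal illustration, not my actual code; the function name, group sizes, and key names are assumptions):

```python
def split_zip(zip_code: str, length: int = 5) -> dict:
    """Pad a zip code with trailing zeros to a fixed length,
    then split it into groups: 1st digit, 2nd-3rd, 4th-5th."""
    padded = zip_code.ljust(length, "0")
    return {
        "zip_1": padded[0],      # first group, used for blocking
        "zip_23": padded[1:3],
        "zip_45": padded[3:5],
    }
```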
The results dataframe has 5.1K records in total, of which 2.2K have different source and destination references (the ones of interest), while the remaining ~3K have the same source and destination reference.
Following are the questions/observations given the above context:
In the settings dictionary I am using "link_type": "link_only", but the results also seem to include pairs from de-duplication, as explained above. Duplicates are indeed possible in one of the data sources, but they are not of primary interest.
Using threshold_match_probability = 0.85 in the linker.cluster_pairwise_predictions_at_threshold() method also yields pairs whose match_probability (extracted from pairwise_predictions.as_record_dict(limit=None)) is < 0.85. Is this expected, or am I missing something here? I am using city, state, and split zip code as features, so some correlation can be expected. Note: there are around 300 such records.
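One thing worth noting on this point: threshold-based clustering is a transitive-closure (connected-components) operation, so records A and C can land in the same cluster via a shared neighbour B even when the A-C pair itself scores below the threshold. A generic pandas sanity check (a toy sketch, not Splink API) is to filter the pairwise predictions yourself and compare counts:

```python
import pandas as pd

# Toy stand-in for linker.predict().as_pandas_dataframe()
predictions = pd.DataFrame({
    "unique_id_l": [1, 2, 1],
    "unique_id_r": [103, 104, 104],
    "match_probability": [0.91, 0.88, 0.40],
})

# Keep only pairs that individually clear the threshold;
# pairs dropped here can still end up co-clustered transitively
above = predictions[predictions["match_probability"] >= 0.85]
```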
Is there a way to extract feature-level match probabilities instead of the overall match probability? In my view this would help with further understanding of the results and with post-processing. I believe this capability exists, but it is included under the visualizations Splink offers (can the waterfall chart's back-end data be extracted for this purpose? If yes, how can that be accomplished?).
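For what it's worth: with "retain_intermediate_calculation_columns": True, the predictions dataframe carries per-comparison Bayes factor columns, which is the same information the waterfall chart is built from. In Splink 3 these are typically named bf_<comparison_name> (the exact names below, bf_name and bf_zip_code, are assumptions; inspect your own dataframe's columns). The per-feature match weight is then just log2 of the Bayes factor:

```python
import numpy as np
import pandas as pd

# Toy predictions dataframe with per-comparison Bayes factor columns,
# as produced when retain_intermediate_calculation_columns=True
# (the bf_* column names are assumptions - check df.columns in practice)
df = pd.DataFrame({
    "unique_id_l": [1],
    "unique_id_r": [103],
    "bf_name": [8.0],
    "bf_zip_code": [0.5],
})

bf_cols = [c for c in df.columns if c.startswith("bf_")]
for c in bf_cols:
    # Match weight contributed by each comparison = log2(Bayes factor):
    # positive means evidence for a match, negative means evidence against
    df[c.replace("bf_", "mw_")] = np.log2(df[c])
```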
Can a guideline be provided on whether any of the current features should be dropped given that the split zip codes are also included, or whether staggered blocking should be used instead?
I am using the following settings dictionary:
Thanks for going through this post; I look forward to hearing from you and other users based on prior experience/best practices.
Regards,
AJ.