## Linking the pseudopeople Census and ACS datasets

In this tutorial we will configure and link two realistic, simulated datasets generated by the [pseudopeople Python package](https://pseudopeople.readthedocs.io/en/latest/index.html). These datasets reflect a [fictional sample population](https://pseudopeople.readthedocs.io/en/latest/simulated_populations/index.html) of ~10,000 simulants living in Anytown, Washington, USA, but pseudopeople can also generate [datasets](https://pseudopeople.readthedocs.io/en/latest/datasets/index.html) about two larger fictional populations, one simulating the state of Rhode Island, and the other simulating the entire United States.

Here we will link Anytown's 2020 [Decennial Census dataset](https://pseudopeople.readthedocs.io/en/latest/datasets/index.html#us-decennial-census) to its 2020 [American Community Survey (ACS) dataset](https://pseudopeople.readthedocs.io/en/latest/datasets/index.html#american-community-survey-acs). We expect every person surveyed in the smaller ACS dataset to also be counted in the census data, so we will aim to link every simulated person in the ACS to the same simulated person in the Census.

This tutorial is adapted from the [Febrl4 linking example](https://moj-analytical-services.github.io/splink/demos/examples/duckdb/febrl4.html).

<a target="_blank" href="https://colab.research.google.com/github/moj-analytical-services/splink/blob/master/docs/demos/examples/duckdb/pseuodopeople-acs.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


In [137]:
# Uncomment and run this cell if you're running in Google Colab.
# !pip install splink
# !pip install pseudopeople

### Configuring pseudopeople, exploring data and defining the model


pseudopeople is designed to generate realistic datasets which are challenging to link to one another in many of the same ways that actual datasets are challenging to link. This requires adding [noise](https://pseudopeople.readthedocs.io/en/latest/noise/index.html) to the data in the form of various types of errors that occur in real data collection and entry. The frequencies of each type of noise in the dataset can be configured for each column.

Because the ACS dataset is small and therefore has fewer opportunities for noise to create linkage challenges, let's increase the noise for the `race_ethnicity` and `last_name` columns in both datasets from their default values. For `race_ethnicity` we increase the frequency with which respondents choose the wrong response option to that survey question, and for `last_name` we increase both the frequency of respondents who are prone to last name typos, and the probability that each character will be typed incorrectly for those respondents. See [here](https://pseudopeople.readthedocs.io/en/latest/noise/column_noise.html#id16) for more details.
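To make the two `make_typos` settings concrete, here is a deliberately simplified sketch of two-level noise. It is not pseudopeople's implementation: the function name `add_typos`, the random-replacement rule, and the use of Python's `random` module are all assumptions made for illustration. The idea is that `cell_probability` decides whether a value is eligible for typos at all, and `token_probability` decides whether each character in an eligible value is corrupted.

```python
import random
import string


def add_typos(values, cell_probability, token_probability, seed=0):
    """Toy two-level typo noise (hypothetical, not pseudopeople's algorithm)."""
    rng = random.Random(seed)
    noised = []
    for value in values:
        # First level: is this cell selected for noising at all?
        if rng.random() >= cell_probability:
            noised.append(value)
            continue
        # Second level: each character is independently replaced.
        chars = []
        for ch in value:
            if rng.random() < token_probability:
                # Pick a lowercase letter that differs from the original.
                options = [c for c in string.ascii_lowercase if c != ch.lower()]
                chars.append(rng.choice(options))
            else:
                chars.append(ch)
        noised.append("".join(chars))
    return noised


names = ["Kofron", "Butler", "Carley", "Jones"]
print(add_typos(names, cell_probability=0.1, token_probability=0.15))
```

With the values used in this tutorial, roughly one in ten last names would be eligible, and each character of an eligible name would have a 15% chance of being altered.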


In [1]:
import pseudopeople as psp
import numpy as np
import pandas as pd

In [2]:
config_census = {
    "decennial_census": {  # Dataset
        "column_noise": {  # "Choose the wrong option" is in the column-based noise category
            "race_ethnicity": {  # Column
                "choose_wrong_option": {  # Noise type
                    "cell_probability": 0.3,  # Default = .01
                },
            },
            "last_name": {  # Column
                "make_typos": {  # Noise type
                    "cell_probability": 0.1,  # Default = .01
                    "token_probability": 0.15,  # Default = .10
                },
            },
        },
    },
}
config_acs = {
    "american_community_survey": {  # Dataset
        "column_noise": {  # "Choose the wrong option" is in the column-based noise category
            "race_ethnicity": {  # Column
                "choose_wrong_option": {  # Noise type
                    "cell_probability": 0.3,  # Default = .01
                },
            },
            "last_name": {  # Column
                "make_typos": {  # Noise type
                    "cell_probability": 0.1,  # Default = .01
                    "token_probability": 0.15,  # Default = .10
                },
            },
        },
    },
}

Next, let's get the data ready for Splink. Note that the Census data has about 10,000 rows, while the ACS data has only about 200. Each dataset has a column called `simulant_id`, which uniquely identifies a simulated person in our fictional population. The `simulant_id` is consistent across datasets, and can be used to check the accuracy of our model. Because it represents the "truth", we will not use it for blocking, comparisons, or any other part of our model, except to check the accuracy of our predictions at the end.

In [67]:
census = psp.generate_decennial_census(config=config_census)
acs = psp.generate_american_community_survey(config=config_acs, year=None)

# rename survey date/year columns to match for exploratory analysis charts
census = census.rename({"year": "survey_date"}, axis=1)
census.survey_date = pd.to_datetime(census.survey_date, format="%Y")

# uniquely identify each row in the datasets (regardless of simulant_id)
census["id"] = census.index
acs["id"] = len(census.index) + acs.index

# create row ID to simulant_id lookup table
census_ids = census[["id", "simulant_id"]]
acs_ids = acs[["id", "simulant_id"]]
simulant_lookup = (
    pd.concat([census_ids, acs_ids]).reset_index(drop=True).set_index("id")
)


def prepare_data(data):
    # concatenate address fields, setting the new field to NaN
    # if any address fields besides unit number are missing
    data.unit_number = data.unit_number.fillna("")

    columns_to_concat = [
        "street_number",
        "street_name",
        "unit_number",
        "city",
        "state",
        "zipcode",
    ]

    has_addr_nan = data[columns_to_concat].isna().any(axis=1)

    address_data = data[columns_to_concat].astype(str)
    data["address"] = address_data.agg(" ".join, axis=1)
    data.loc[has_addr_nan, "address"] = np.nan

    return data


dfs = [prepare_data(dataset) for dataset in [census, acs]]

dfs[0]  # Census

                                                                 

Unnamed: 0,simulant_id,household_id,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,housing_type,relationship_to_reference_person,sex,race_ethnicity,survey_date,id,address
0,0_2,0_7,Diana,P,Kofron,25,05/06/1994,5112,145th st,,Anytown,WA,00000,Household,Reference person,Female,Asian,2020-01-01,0,5112 145th st Anytown WA 00000
1,0_3,0_7,Anna,A,Kofron,25,09/29/1994,5112,145th st,,Anytown,WA,00000,Household,Other relative,Female,White,2020-01-01,1,5112 145th st Anytown WA 00000
2,0_923,0_8033,Gerald,R,Butler,76,11/03/1943,1130,mallory ln,,Anytown,WA,00000,Household,Reference person,Male,Black,2020-01-01,2,1130 mallory ln Anytown WA 00000
3,0_2641,0_1066,Loretta,T,Carley,61,07/71/1958,,delacorte dr,,Anytown,WA,00000,Household,Reference person,Female,White,2020-01-01,3,
4,0_2801,0_1138,Richard,R,Jones,73,03/03/1947,950,caribou lane,,Anytown,WA,00000,Household,Reference person,Male,White,2020-01-01,4,950 caribou lane Anytown WA 00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10226,0_11994,0_8051,Lauren,H,Consul,17,10/25/2002,3304,ethan allen way,unit 200,Anytown,WA,00000,Household,Other relative,Female,White,2020-01-01,10226,3304 ethan allen way unit 200 Anytown WA 00000
10227,0_19693,0_6152,Johana,M,Huang,20,08/04/1999,1095,ernst st,,Anytown,WA,00000,Household,Other relative,Female,Asian,2020-01-01,10227,1095 ernst st Anytown WA 00000
10228,0_19556,0_2064,Benjamin,F,Allen,19,02/26/Z001,2002,203rd pl se,,Anytown,WA,00000,Household,Other relative,Male,AIAN,2020-01-01,10228,2002 203rd pl se Anytown WA 00000
10229,0_19579,0_1802,Brielle,L,Gonzalez,19,11/27/2000,233,saint peters road,,Anytown,WA,00000,Household,Other relative,Female,Latino,2020-01-01,10229,233 saint peters road Anytown WA 00000


In [69]:
dfs[1]  # ACS

Unnamed: 0,simulant_id,household_id,survey_date,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,housing_type,relationship_to_reference_person,sex,race_ethnicity,id,address
0,0_10873,0_4411,2019-01-29,Betty,P,Todd,86,09/03/1932,2403,magnolia park rd,,Anytown,WA,00000,Household,Opposite-sex spouse,Female,White,10231,2403 magnolia park rd Anytown WA 00000
1,0_10344,0_4207,2019-03-26,Dina,P,Thomas,46,09/19/1972,4826,stone ridge ln,,Anytown,WA,00000,Household,Opposite-sex spouse,Female,Black,10232,4826 stone ridge ln Anytown WA 00000
2,0_10345,0_4207,2019-03-26,Fiona,M,Thomas,12,,4826,stone ridge ln,,Anytown,WA,00000,Household,Biological child,Female,Multiracial or Other,10233,4826 stone ridge ln Anytown WA 00000
3,0_10346,0_4207,2019-03-26,Molly,A,Thomas,8,03/15/2010,4826,stone ridge ln,,Anytown,WA,00000,Household,Biological child,Female,Asian,10234,4826 stone ridge ln Anytown WA 00000
4,0_10347,0_4207,2019-03-26,Daniel,M,Thomas,18,11/29/2000,4826,stone ridge ln,,Anytown,WA,00000,Household,Stepchild,Male,AIAN,10235,4826 stone ridge ln Anytown WA 00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
206,0_18143,0_14148,2040-06-19,Bentley,E,Baran,27,12/29/2012,2040,magnolia court,,Anytown,WA,00000,Household,Other nonrelative,Male,White,10437,2040 magnolia court Anytown WA 00000
207,0_24866,0_14148,2040-06-19,Jordan,A,Kute,45,12/18/1994,2040,magnolia court,,Anytown,WA,00000,Household,Reference person,Female,White,10438,2040 magnolia court Anytown WA 00000
208,0_18238,0_14832,2040-08-14,Sarah,L,Magana,52,07/14/1988,,26th street northeast,,Anytown,WA,00000,Household,Reference person,Female,Latino,10439,
209,0_9712,0_15932,2041-03-26,Mom,S,Dixon,86,07/06/1954,3619,hanging moss loop,,Anytown,WA,00000,Household,Reference person,Female,AIAN,10440,3619 hanging moss loop Anytown WA 00000


Because we are using all years of ACS surveys, and only the 2020 Decennial Census, there will be respondents to ACS surveys who were not in the 2020 Census because they moved to Anytown or were born after the 2020 Census was conducted. Let's check how many of these there are and display some of them.

In [70]:
not_in_census = acs[~acs.simulant_id.isin(census.simulant_id)]
display(len(not_in_census.index))
not_in_census.head(5)

54

Unnamed: 0,simulant_id,household_id,survey_date,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,housing_type,relationship_to_reference_person,sex,race_ethnicity,id,address
4,0_10347,0_4207,2019-03-26,Daniel,M,Thomas,18,11/29/2000,4826,stone ridge ln,,Anytown,WA,0,Household,Stepchild,Male,AIAN,10235,4826 stone ridge ln Anytown WA 00000
16,0_2713,0_14807,2037-10-13,Kristi,J,Bajn,53,08/06/1984,6045,s patterson pl,,Anytown,WA,0,Household,Reference person,Female,,10247,6045 s patterson pl Anytown WA 00000
73,0_19708,0_3,2020-04-21,Ariana,E,Camarillo Leyva,20,08/05/1999,8203,west farwell avenue,,Anytown,WA,0,Carceral,Noninstitutionalized group quarters population,Female,Asian,10304,8203 west farwell avenue Anytown WA 00000
93,0_16379,0_2246,2037-03-31,Gregory,J,Mustafa,73,05/30/1963,12679,kingston ave,,Anytown,WA,0,Household,Other nonrelative,Male,Latino,10324,12679 kingston ave Anytown WA 00000
101,0_1251,0_9630,2023-10-31,,M,Cotfmaj,19,05/06/2004,17947,newman dr,,Anytown,WA,0,Household,Reference person,Male,White,10332,17947 newman dr Anytown WA 00000


Next, to better understand which variables will prove useful in linking, we look at how populated each column is, as well as the distribution of unique values within each. It's usually a good idea to perform this kind of exploratory analysis so that you understand what's in each column and how often it's missing.


In [71]:
from splink import DuckDBAPI
from splink.exploratory import completeness_chart

completeness_chart(dfs, db_api=DuckDBAPI(), table_names_for_chart=["census", "acs"])

In [72]:
from splink.exploratory import profile_columns

profile_columns(dfs, db_api=DuckDBAPI())

You will notice that six addresses have many more simulants living at them than the others. Simulants in our fictional population may live either in a residential household, or in group quarters (GQ), which models institutional and non-institutional GQ establishments: carceral, nursing homes, and other institutional, and college, military, and other non-institutional. Each of these addresses simulates one of these six types of GQ "households".

In the ACS data there are 103 unique street names (including typos), with 58 residents of the West Farwell Avenue GQ represented.

In [73]:
acs.street_name.value_counts()

street_name
west farwell avenue    58
grove street            6
n holman st             4
stone ridge ln          4
glenview rd             3
                       ..
black mountain rd       1
ecton la                1
w 47th st               1
kelton ave              1
b pl nw                 1
Name: count, Length: 103, dtype: int64

Next let's come up with some candidate blocking rules, which define the record comparisons to generate, and have a look at how many comparisons each rule will generate.

For blocking rules that we use in prediction, our aim is to have the union of all rules cover all true matches, whilst avoiding generating so many comparisons that it becomes computationally intractable - i.e. each true match should have at least _one_ of the following conditions holding.
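The coverage requirement can be sketched in plain Python. This is only an illustration on made-up record pairs: `blocked_by`, `coverage`, and the example records are hypothetical, and in practice you rarely know the true matches in advance. Note that a missing value never satisfies an equality rule, mirroring how `NULL = NULL` is not true in SQL.

```python
def blocked_by(rule, left, right):
    """True if the two records agree (and are non-null) on every column in the rule."""
    return all(
        left.get(col) is not None and left.get(col) == right.get(col)
        for col in rule
    )


def coverage(rules, true_pairs):
    """Fraction of true-match pairs captured by at least one blocking rule."""
    covered = sum(
        any(blocked_by(rule, left, right) for rule in rules)
        for left, right in true_pairs
    )
    return covered / len(true_pairs)


# Hypothetical true-match pairs (ACS record, census record) containing noise.
true_pairs = [
    (
        {"first_name": "Rmily", "last_name": "Todd", "age": 33},
        {"first_name": "Emily", "last_name": "Todd", "age": 33},
    ),
    (
        {"first_name": "Mark", "last_name": "Tomas", "age": 65},
        {"first_name": "Mark", "last_name": "Thomas", "age": 66},
    ),
    (
        {"first_name": None, "last_name": "Cotfmaj", "age": 19},
        {"first_name": "Colin", "last_name": "Coffman", "age": 20},
    ),
]

print(coverage([("first_name",)], true_pairs))
print(coverage([("first_name",), ("last_name",)], true_pairs))
```

Adding a second rule recovers the pair with the misspelled first name, while the third pair is missed by both rules. This is why several rules on different columns are usually combined.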


In [74]:
from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

blocking_rules = [
    block_on("first_name"),
    block_on("last_name"),
    block_on("date_of_birth"),
    block_on("street_name"),
    block_on("age", "sex"),
    block_on("age", "race_ethnicity"),
    block_on("age", "middle_initial"),
    block_on("street_number"),
    block_on("middle_initial", "sex", "race_ethnicity"),
]


db_api = DuckDBAPI()
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=dfs,
    blocking_rules=blocking_rules,
    db_api=db_api,
    link_type="link_only",
    unique_id_column_name="id",
    source_dataset_column_name="source_dataset",
)

For columns like `age` that would create too many comparisons, we combine them with one or more other columns which would also create too many comparisons if blocked on alone.
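As a rough illustration of why combining columns helps, the number of cross-dataset comparisons generated by an equality rule can be counted by grouping each dataset on the blocking key. The helper `count_comparisons` and the toy rows below are assumptions for this sketch, and missing values are ignored for simplicity.

```python
from collections import Counter


def count_comparisons(left_rows, right_rows, columns):
    """Number of cross-dataset pairs that agree on all of the given columns."""

    def key(row):
        return tuple(row[c] for c in columns)

    left_counts = Counter(key(r) for r in left_rows)
    right_counts = Counter(key(r) for r in right_rows)
    # Each key contributes (rows on the left) x (rows on the right) pairs.
    return sum(n * right_counts[k] for k, n in left_counts.items())


# Hypothetical miniature datasets.
census_rows = [
    {"age": 25, "sex": "Female"},
    {"age": 25, "sex": "Female"},
    {"age": 25, "sex": "Male"},
    {"age": 76, "sex": "Male"},
]
acs_rows = [
    {"age": 25, "sex": "Female"},
    {"age": 76, "sex": "Female"},
]

print(count_comparisons(acs_rows, census_rows, ["age"]))
print(count_comparisons(acs_rows, census_rows, ["age", "sex"]))
```

Blocking on `age` alone generates more pairs than blocking on `age` and `sex` together, at the cost of missing the pair whose `sex` values disagree.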


Now we get our model's settings by including the blocking rules, as well as deciding the actual comparisons we will be including in our model.
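Two of the comparisons in this model use a Damerau-Levenshtein distance, which counts insertions, deletions, substitutions and transpositions of adjacent characters. The sketch below implements the restricted "optimal string alignment" variant in plain Python purely to show the idea. It is an assumption that this agrees with the database's built-in function on every input, and the "clean" dates used here are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,  # deletion
                d[i][j - 1] + 1,  # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# A swapped pair of digits and a single mistyped character are both one edit.
print(osa_distance("07/17/1958", "07/71/1958"))
print(osa_distance("02/26/2001", "02/26/Z001"))
# Two unrelated dates are further apart.
print(osa_distance("05/06/1994", "09/29/1994"))
```

A distance threshold of 1 on `date_of_birth` therefore tolerates a single typo or digit swap while still treating clearly different dates as non-matching.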

Since we know there should exist one simulant in the census for each simulant in the ACS, we can set `probability_two_random_records_match` using the sizes of each dataset.

In [75]:
from splink import Linker, SettingsCreator
import splink.comparison_library as cl


settings = SettingsCreator(
    unique_id_column_name="id",
    link_type="link_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.NameComparison("first_name", jaro_winkler_thresholds=[0.9]).configure(
            term_frequency_adjustments=True
        ),
        cl.ExactMatch("middle_initial").configure(term_frequency_adjustments=True),
        cl.NameComparison("last_name", jaro_winkler_thresholds=[0.9]).configure(
            term_frequency_adjustments=True
        ),
        cl.DamerauLevenshteinAtThresholds(
            "date_of_birth", distance_threshold_or_thresholds=[1]
        ),
        cl.DamerauLevenshteinAtThresholds("address").configure(
            term_frequency_adjustments=True
        ),
        cl.ExactMatch("sex"),
        cl.ExactMatch("race_ethnicity").configure(term_frequency_adjustments=True),
    ],
    retain_intermediate_calculation_columns=True,
    # ratio of ACS rows to census rows (len counts rows; .size counts cells)
    probability_two_random_records_match=len(dfs[1]) / len(dfs[0]),
)


linker = Linker(
    dfs, settings, db_api=DuckDBAPI(), input_table_aliases=["census", "acs"]
)

### Estimating model parameters


Next we estimate `u` and `m` values for each comparison, so that we can move to generating predictions.
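As a reminder of what these parameters mean: for a given comparison level, `m` is the probability of observing that level among true matches, and `u` is the probability of observing it among non-matches. The sketch below computes both directly on a toy example where the truth is known. This is only to illustrate the definitions: the records and the helper `agreement_rates` are hypothetical, and Splink estimates these values without labels.

```python
from itertools import product

# Hypothetical records; "person" plays the role of the hidden true identity.
census_rows = [
    {"person": "a", "sex": "Female"},
    {"person": "b", "sex": "Female"},
    {"person": "c", "sex": "Male"},
    {"person": "d", "sex": "Male"},
]
acs_rows = [
    {"person": "a", "sex": "Female"},
    {"person": "c", "sex": "Male"},
    {"person": "d", "sex": "Female"},  # noisy copy of person d
]


def agreement_rates(left, right, column, label="person"):
    """Return (m, u): agreement rate among true matches and among non-matches."""
    match_agree = match_total = non_agree = non_total = 0
    for l, r in product(left, right):
        agrees = l[column] == r[column]
        if l[label] == r[label]:
            match_total += 1
            match_agree += agrees
        else:
            non_total += 1
            non_agree += agrees
    return match_agree / match_total, non_agree / non_total


m, u = agreement_rates(acs_rows, census_rows, "sex")
print(m, u)
```

Because almost all randomly chosen pairs are non-matches, the agreement rate over random pairs is a good approximation of `u`, which is the idea behind estimating `u` by random sampling.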


In [76]:
# We generally recommend setting max pairs higher (e.g. 1e7 or more)
# But this will run faster for the purpose of this demo
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - middle_initial (no m values are trained).
    - last_name (no m values are trained).
    - date_of_birth (no m values are trained).
    - address (no m values are trained).
    - sex (no m values are trained).
    - race_ethnicity (no m values are trained).


When training the `m` values using expectation maximisation, we need some more blocking rules to reduce the total number of comparisons. For each rule, we want to ensure that we have neither proportionally too many matches, nor too few.

We must run this multiple times using different rules so that we can obtain estimates for all comparisons - if we block on e.g. `date_of_birth`, then we cannot compute the `m` values for the `date_of_birth` comparison, as we have only looked at records where these match.
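The following toy example, using made-up labelled pairs, shows why the blocked column carries no information during that training session. Once only pairs with equal `date_of_birth` are kept, that column agrees for matches and non-matches alike, whereas another column such as `last_name` still separates them. The tuple layout and the helper `agreement_rate` are assumptions for this sketch.

```python
# Hypothetical pairs: (is_true_match, dob_l, dob_r, last_name_l, last_name_r)
pairs = [
    (True, "09/19/1972", "09/19/1972", "Thomas", "Thomas"),
    (True, "03/15/2010", "03/15/2010", "Thomas", "Tomas"),
    (False, "09/19/1972", "09/19/1972", "Thomas", "Butler"),
    (False, "11/29/2000", "11/29/2000", "Jones", "Allen"),
    (True, "05/06/1994", "05/06/1949", "Kofron", "Kofron"),  # typo in dob
]

# Training pairs are only those that satisfy the blocking rule on date_of_birth.
blocked = [p for p in pairs if p[1] == p[2]]


def agreement_rate(rows, left_idx, right_idx, is_match):
    """Share of pairs with the given match status that agree on one column."""
    subset = [r for r in rows if r[0] == is_match]
    return sum(r[left_idx] == r[right_idx] for r in subset) / len(subset)


dob_m = agreement_rate(blocked, 1, 2, True)
dob_u = agreement_rate(blocked, 1, 2, False)
name_m = agreement_rate(blocked, 3, 4, True)
name_u = agreement_rate(blocked, 3, 4, False)
print(dob_m, dob_u, name_m, name_u)
```

The true match with a mistyped date of birth never enters this training sample, which is another reason to train with more than one blocking rule.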


In [77]:
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("date_of_birth"), estimate_without_term_frequencies=True
)
session_ln = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("last_name"), estimate_without_term_frequencies=True
)


----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."date_of_birth" = r."date_of_birth"

Parameter estimates will be made for the following comparison(s):
    - first_name
    - middle_initial
    - last_name
    - address
    - sex
    - race_ethnicity

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - date_of_birth

Iteration 1: Largest change in params was -0.444 in the m_probability of race_ethnicity, level `Exact match on race_ethnicity`
Iteration 2: Largest change in params was -0.00386 in probability_two_random_records_match
Iteration 3: Largest change in params was -0.000347 in the m_probability of sex, level `All other comparisons`
Iteration 4: Largest change in params was 4.33e-05 in the m_probability of sex, level `Exact match on sex`

EM converged after 4 iterations

Your model is not yet fully trained. Missing estimates for:
    - date_of_birth (no m valu

If we wish, we can have a look at how our parameter estimates change over these training sessions.


In [78]:
session_dob.m_u_values_interactive_history_chart()

In [79]:
session_ln.m_u_values_interactive_history_chart()

For variables that aren't used in the `m`-training blocking rules, we have two estimates --- one from each of the training sessions (see for example `address`). We can have a look at how the values compare between them, to ensure that we don't have drastically different values, which may be indicative of an issue.


In [80]:
linker.visualisations.parameter_estimate_comparisons_chart()

In [81]:
import json

# we can have a look at the full settings if we wish, including the values of our estimated parameters:
print(json.dumps(linker._settings_obj.as_dict(), indent=2))
# we can also get a handy summary of the model in an easily readable format if we wish:
# print(linker._settings_obj.human_readable_description)
# (we suppress output here for brevity)

{
  "link_type": "link_only",
  "probability_two_random_records_match": 0.020623594956504742,
  "retain_matching_columns": true,
  "retain_intermediate_calculation_columns": true,
  "additional_columns_to_retain": [],
  "sql_dialect": "duckdb",
  "linker_uid": "onk9cuw1",
  "em_convergence": 0.0001,
  "max_iterations": 25,
  "bayes_factor_column_prefix": "bf_",
  "term_frequency_adjustment_column_prefix": "tf_",
  "comparison_vector_value_column_prefix": "gamma_",
  "unique_id_column_name": "id",
  "source_dataset_column_name": "source_dataset",
  "blocking_rules_to_generate_predictions": [
    {
      "blocking_rule": "l.\"first_name\" = r.\"first_name\"",
      "sql_dialect": "duckdb"
    },
    {
      "blocking_rule": "l.\"last_name\" = r.\"last_name\"",
      "sql_dialect": "duckdb"
    },
    {
      "blocking_rule": "l.\"date_of_birth\" = r.\"date_of_birth\"",
      "sql_dialect": "duckdb"
    },
    {
      "blocking_rule": "l.\"street_name\" = r.\"street_name\"",
      "sql_di

We can now visualise some of the details of our model. We can look at the match weights, which tell us the relative importance for/against a match for each of our comparison levels.
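In the Fellegi-Sunter framework, the match weight of a comparison level is the base-2 logarithm of its Bayes factor `m / u`, and the overall match weight of a record pair is the prior match weight plus the weights of the levels observed for that pair. The sketch below uses hypothetical `m` and `u` values, not the ones estimated in this tutorial, and leaves out term frequency adjustments.

```python
from math import log2

# Hypothetical m and u probabilities for a few comparison levels.
levels = {
    "first_name: exact match": (0.80, 0.005),
    "sex: exact match": (0.95, 0.50),
    "date_of_birth: all other comparisons": (0.05, 0.99),
}


def match_weight(m, u):
    """log2 of the Bayes factor m / u for one comparison level."""
    return log2(m / u)


def total_match_weight(prior_probability, observed_levels):
    """Prior match weight plus the weights of the observed comparison levels."""
    prior_weight = log2(prior_probability / (1 - prior_probability))
    return prior_weight + sum(match_weight(m, u) for m, u in observed_levels)


for name, (m, u) in levels.items():
    print(f"{name}: {match_weight(m, u):+.2f}")
print(total_match_weight(0.02, levels.values()))
```

An exact match on a rarely coinciding field such as `first_name` contributes a large positive weight, an exact match on `sex` contributes little, and a disagreement on `date_of_birth` contributes a negative weight.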


In [82]:
linker.visualisations.match_weights_chart()

As well as the match weights, which give us an idea of the overall effect of each comparison level, we can also look at the individual `u` and `m` parameter estimates, which tell us about the prevalence of coincidences and mistakes (for further details/explanation about this see [this article](https://www.robinlinacre.com/maths_of_fellegi_sunter/)). We might want to revise aspects of our model based on the information we ascertain here.

Note however that some of these values are very small, which is why the match weight chart is often more useful for getting a decent picture of things.


In [83]:
linker.visualisations.m_u_parameters_chart()

It is also useful to have a look at unlinkable records - these are records which do not contain enough information to be linked at some match probability threshold. We can figure this out by seeing whether records are able to be matched with themselves.
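The self-matching idea can be sketched as follows. A record compared with an identical copy of itself scores an exact match on every populated column, while missing columns add no evidence. The exact-match weights, the prior weight and the helper `self_match_probability` are all hypothetical, and term frequency adjustments are ignored.

```python
def self_match_probability(record, exact_match_weights, prior_weight):
    """Match probability when a record is compared with an identical copy of itself."""
    weight = prior_weight
    for column, column_weight in exact_match_weights.items():
        # A missing value falls into the null level, which adds no evidence.
        if record.get(column) is not None:
            weight += column_weight
    return 2**weight / (1 + 2**weight)


# Hypothetical exact-match weights and prior match weight.
weights = {"first_name": 7.0, "last_name": 8.0, "date_of_birth": 10.0, "sex": 1.0}
prior_weight = -13.0

complete = {
    "first_name": "Sarah",
    "last_name": "Magana",
    "date_of_birth": "07/14/1988",
    "sex": "Female",
}
sparse = {"first_name": None, "last_name": None, "date_of_birth": None, "sex": "Female"}

print(self_match_probability(complete, weights, prior_weight))
print(self_match_probability(sparse, weights, prior_weight))
```

Under these made-up weights, the sparse record could never exceed a high match threshold even against a perfect copy of itself, so it would be counted as unlinkable.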

We have low column missingness, so all of our records are linkable for almost all match thresholds.


In [84]:
linker.evaluation.unlinkables_chart()

### Predictions

In [85]:
predictions = linker.inference.predict()  # include all match_probabilities
df_predictions = predictions.as_pandas_dataframe()
df_predictions

Blocking time: 0.09 seconds


Predict time: 2.33 seconds


Unnamed: 0,match_weight,match_probability,source_dataset_l,source_dataset_r,id_l,id_r,first_name_l,first_name_r,gamma_first_name,tf_first_name_l,...,tf_race_ethnicity_r,bf_race_ethnicity,bf_tf_adj_race_ethnicity,street_number_l,street_number_r,age_l,age_r,street_name_l,street_name_r,match_key
0,46.335765,1.000000,acs,census,10243,5787,Clarence,Clarence,2,0.000873,...,,1.000000,1.000000,2540,2540,59,60,grand river boulevard east,grand river boulevard east,0
1,-7.897828,0.004175,acs,census,10246,4264,Mark,Mark,2,0.007952,...,0.452418,1.944362,0.573729,84,2044,65,57,nichole dr,heyden,0
2,-9.719486,0.001185,acs,census,10248,3323,Edward,Edward,2,0.003588,...,0.452418,1.944362,0.573729,20,1129,58,63,oakbridge py,sanford ave,0
3,-11.269564,0.000405,acs,census,10251,9053,Joyce,Joyce,2,0.000873,...,0.452418,1.944362,0.573729,8203,8603,75,55,west farwell avenue,nw 302nd st,0
4,-8.059410,0.003735,acs,census,10252,2305,Joseph,Joseph,2,0.005334,...,0.452418,0.668947,1.000000,8203,69,37,53,west farwell avenue,s hwy 701,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52168,-14.947738,0.000032,acs,census,10237,829,Karla,Tracy,0,0.000291,...,0.452418,1.944362,0.573729,5101,25846,46,50,garden st ext,s 216th ave,8
52169,-14.947738,0.000032,acs,census,10237,747,Karla,Jacqueline,0,0.000291,...,0.452418,1.944362,0.573729,5101,113,46,90,garden st ext,east 219 street,8
52170,-14.947738,0.000032,acs,census,10237,618,Karla,Jessica,0,0.000291,...,0.452418,1.944362,0.573729,5101,122,46,38,garden st ext,rue royale,8
52171,-14.947738,0.000032,acs,census,10237,490,Karla,Ruth,0,0.000291,...,0.452418,1.944362,0.573729,5101,1231,46,77,garden st ext,riverside dr,8


We can see how our model performs at different probability thresholds, with a couple of options depending on the space in which we wish to view things. The chart below shows that for a match weight of 14.4 (probability 99.995%), our model has 0 false positives and 2 false negatives.
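Match weight and match probability are two views of the same quantity: the weight is the base-2 logarithm of the odds of a match. The small helpers below (names chosen for this sketch) convert between the two, and can be checked against the `match_weight` and `match_probability` columns in the predictions table above.

```python
from math import log2


def weight_to_probability(weight):
    """Convert a match weight (log2 odds) to a match probability."""
    odds = 2**weight
    return odds / (1 + odds)


def probability_to_weight(probability):
    """Convert a match probability to a match weight (log2 odds)."""
    return log2(probability / (1 - probability))


# Values taken from the predictions shown in this tutorial.
print(weight_to_probability(14.4))
print(weight_to_probability(7.313436))
print(weight_to_probability(-7.897828))
print(probability_to_weight(0.5))
```

A probability of 0.5 corresponds to a match weight of 0, and each additional unit of match weight doubles the odds of a match.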


In [86]:
linker.evaluation.accuracy_analysis_from_labels_column(
    "simulant_id", output_type="accuracy"
)

Blocking time: 0.08 seconds


Predict time: 2.30 seconds


We can also easily see how many individuals we identify and link by looking at the clusters generated at some threshold match probability of interest. Here we use 99.9%.
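Conceptually, clustering at a threshold keeps only the pairwise predictions at or above that probability and then groups records into connected components. The union-find sketch below illustrates this on four pairs taken from the tables in this tutorial. It is a simplified stand-in, not Splink's implementation, and the function name `cluster_at_threshold` is an assumption.

```python
from collections import Counter


def cluster_at_threshold(node_ids, scored_pairs, threshold):
    """Assign each record a cluster id via connected components over strong links."""
    parent = {n: n for n in node_ids}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, probability in scored_pairs:
        if probability >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                # Use the smaller id as the cluster representative.
                parent[max(root_l, root_r)] = min(root_l, root_r)
    return {n: find(n) for n in node_ids}


# (id_l, id_r, match_probability) for a handful of predicted pairs.
scored_pairs = [
    (10243, 5787, 1.000000),
    (10354, 3240, 0.993752),
    (10240, 7574, 0.9997749),
    (10246, 4264, 0.004175),
]
node_ids = sorted({n for l, r, _ in scored_pairs for n in (l, r)})

clusters = cluster_at_threshold(node_ids, scored_pairs, threshold=0.999)
# Distribution of cluster sizes, analogous to the value_counts() output.
print(Counter(Counter(clusters.values()).values()))
```

At this threshold the true link between records 10354 and 3240 is left unclustered, while the non-link between 10240 and 7574 is joined, which shows how the choice of threshold trades false negatives against false positives.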


In [87]:
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.999
)
df_clusters = clusters.as_pandas_dataframe().sort_values("cluster_id")
df_clusters.groupby("cluster_id").size().value_counts()

Completed iteration 1, num representatives needing updating: 0


1    10127
2      150
3        5
Name: count, dtype: int64

In [88]:
linker.visualisations.cluster_studio_dashboard(
    predictions,
    clusters,
    "cluster_studio.html",
    sampling_method="by_cluster_size",
    overwrite=True,
)

# You can view the cluster_studio.html file in your browser, or inline in a notebook as follows
from IPython.display import IFrame

IFrame(src="./cluster_studio.html", width="100%", height=1000)

In [89]:
linker.visualisations.comparison_viewer_dashboard(
    predictions, "scv.html", overwrite=True
)
IFrame(src="./scv.html", width="100%", height=1000)

In this case, we happen to know what the true links are, so we can manually inspect the ones that are doing worst to see what our model is not capturing - i.e. where we have false negatives.

Similarly, we can look at the non-links which are performing the best, to see whether we have an issue with false positives.

Ordinarily we would not have this luxury, and so would need to dig a bit deeper for clues as to how to improve our model, such as manually inspecting records across threshold probabilities.


In [90]:
# add simulant_id_l and simulant_id_r columns by looking up id_l and id_r in the simulant id lookup table
df_predictions = pd.merge(
    df_predictions, simulant_lookup, left_on="id_l", right_on="id", how="left"
)
df_predictions = df_predictions.rename(columns={"simulant_id": "simulant_id_l"})
df_predictions = pd.merge(
    df_predictions, simulant_lookup, left_on="id_r", right_on="id", how="left"
)
df_predictions = df_predictions.rename(columns={"simulant_id": "simulant_id_r"})

In [91]:
# sort links by lowest match_probability to see if we missed any
links = df_predictions[
    df_predictions["simulant_id_l"] == df_predictions["simulant_id_r"]
].sort_values("match_weight")
# sort nonlinks by highest match_probability to see if we matched any
nonlinks = df_predictions[
    df_predictions["simulant_id_l"] != df_predictions["simulant_id_r"]
].sort_values("match_weight", ascending=False)
links

Unnamed: 0,match_weight,match_probability,source_dataset_l,source_dataset_r,id_l,id_r,first_name_l,first_name_r,gamma_first_name,tf_first_name_l,...,bf_tf_adj_race_ethnicity,street_number_l,street_number_r,age_l,age_r,street_name_l,street_name_r,match_key,simulant_id_l,simulant_id_r
5041,7.313436,0.993752,acs,census,10354,3240,Rmily,Emily,0,0.000097,...,0.573729,8,610,33,27,kline st,n 54th ln,1,0_4752,0_4752
916,7.530625,0.994621,acs,census,10286,10228,Benjamin,Benjamin,2,0.005625,...,1.000000,8203,2002,19,19,west farwell avenue,203rd pl se,0,0_19556,0_19556
6175,14.331459,0.999951,acs,census,10323,1151,Fxigy,Faigy,0,0.000097,...,0.573729,12679,12679,56,39,kingston ave,kingston ave,2,0_5501,0_5501
1421,14.977692,0.999969,acs,census,10246,7275,Mark,Mark,2,0.007952,...,0.573729,84,84,65,66,nichole dr,nichole dr,0,0_4493,0_4493
5701,14.990600,0.999969,acs,census,10297,1982,Harold,Harry,0,0.001261,...,0.573729,8203,8203,37,37,,west farwell avenue,1,0_19681,0_19681
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,46.335765,1.000000,acs,census,10243,5787,Clarence,Clarence,2,0.000873,...,1.000000,2540,2540,59,60,grand river boulevard east,grand river boulevard east,0,0_13545,0_13545
67,46.770588,1.000000,acs,census,10378,6006,Eniya,Eniya,2,0.000194,...,2.045657,617,617,27,17,cadillac st,cadillac st,0,0_14643,0_14643
102,46.891002,1.000000,acs,census,10385,8393,Tonya,Tonya,2,0.000582,...,2.045657,7214,7214,56,44,wild plum ct,wild plum ct,0,0_10892,0_10892
397,47.465291,1.000000,acs,census,10330,2167,Charlene,Charlene,2,0.000388,...,2.045657,7382,7382,47,43,jamison dr,jamison dr,0,0_16550,0_16550


In [92]:
nonlinks

Unnamed: 0,match_weight,match_probability,source_dataset_l,source_dataset_r,id_l,id_r,first_name_l,first_name_r,gamma_first_name,tf_first_name_l,...,bf_tf_adj_race_ethnicity,street_number_l,street_number_r,age_l,age_r,street_name_l,street_name_r,match_key,simulant_id_l,simulant_id_r
6065,12.117130,9.997749e-01,acs,census,10240,7574,Francis,Angela,0,0.001261,...,1.000000,1432,1432,48,48,isaac place,isaac place,1,0_8685,0_8684
4913,11.585942,9.996748e-01,acs,census,10385,8394,Tonya,Ryan,0,0.000582,...,1.000000,7214,7214,56,12,wild plum ct,wild plum ct,1,0_10892,0_10894
4923,11.146299,9.995590e-01,acs,census,10328,6177,Charles,Lorraine,0,0.007079,...,1.607127,1009,1009,91,85,northwest topeka boulevard,northwest topeka boulevard,1,0_10954,0_10955
4985,9.217284,9.983228e-01,acs,census,10311,8114,David,Alex,0,0.013576,...,1.000000,6634,6634,53,18,beachplum way,beachplum way,1,0_16384,0_16389
5092,9.217284,9.983228e-01,acs,census,10311,8113,David,Frederick,0,0.013576,...,1.000000,6634,6634,53,18,beachplum way,beachplum way,1,0_16384,0_16388
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29578,-31.079598,4.406651e-10,acs,census,10346,1760,Gerard,Aria,0,0.000291,...,1.000000,158,158,43,0,hamilton avenue,black mountain rd,7,0_15058,0_20223
29580,-31.079598,4.406651e-10,acs,census,10348,1760,Liam,Aria,0,0.001164,...,1.000000,158,158,12,0,hamilton avenue,black mountain rd,7,0_15062,0_20223
30074,-31.079598,4.406651e-10,acs,census,10236,6634,Edwin,Kendra,0,0.000776,...,1.000000,925,925,55,25,sawmill rd,tr 124,7,0_13280,0_4572
30076,-31.079598,4.406651e-10,acs,census,10240,857,Francis,Naira,0,0.001261,...,1.000000,1432,1432,48,2,isaac place,erdman rd,7,0_8685,0_2460


As we saw in the chart showing false positives and false negatives, our model does pretty well when we choose a very high match threshold. The small size of the ACS dataset reduces the chances for noise to make linking difficult. For additional challenges, try using a larger dataset like W2, or further increasing the noise levels.

We can also view the true links and nonlinks as waterfall charts to see how we ended up at our final match weights, and the ROC chart showing false positives and negatives using match thresholds.

In [93]:
records_to_view = 5
linker.visualisations.waterfall_chart(
    links.head(records_to_view).to_dict(orient="records")
)

In [94]:
records_to_view = 5
linker.visualisations.waterfall_chart(
    nonlinks.head(records_to_view).to_dict(orient="records")
)

In [95]:
linker.evaluation.accuracy_analysis_from_labels_column("simulant_id", output_type="roc")

Blocking time: 0.08 seconds
Predict time: 0.16 seconds


We may wish to evaluate the effects of using term frequencies for some columns, such as `address`, by looking at examples of the values of `tf_address` for both common and uncommon address values.
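The intuition behind a term frequency adjustment is that an exact match on a very common value is weaker evidence than an exact match on a rare one. One way to sketch this is to replace the level's `u` probability with the relative frequency of the matched value, so the adjusted weight becomes `log2(m / tf)`. The code below assumes this simplified form, uses a hypothetical `m`, and borrows a few street name counts from the ACS output above rather than full addresses.

```python
from collections import Counter
from math import log2


def term_frequencies(values):
    """Relative frequency of each non-missing value in a column."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def tf_adjusted_weight(m, term_frequency):
    """Exact-match weight when u is replaced by the value's term frequency."""
    return log2(m / term_frequency)


# A reduced column: one group quarters street plus a few ordinary streets.
street_names = ["west farwell avenue"] * 58 + ["grove street"] * 6 + ["ecton la"]
tf = term_frequencies(street_names)

m_exact = 0.9  # hypothetical m probability for the exact-match level
common_weight = tf_adjusted_weight(m_exact, tf["west farwell avenue"])
uncommon_weight = tf_adjusted_weight(m_exact, tf["ecton la"])
print(common_weight, uncommon_weight)
```

Two records that share the group quarters street gain almost no evidence from that agreement, while two records that share a street with a single resident gain a great deal.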

In [96]:
common_addr = "8203 west farwell avenue  Anytown WA 00000"
records_to_view = 5
linker.visualisations.waterfall_chart(
    links[links["address_l"] == common_addr]
    .head(records_to_view)
    .to_dict(orient="records")
)

In [97]:
uncommon_addr = "820 cameron road  Anytown WA 00000"
records_to_view = 5
linker.visualisations.waterfall_chart(
    links[links["address_l"] == uncommon_addr]
    .head(records_to_view)
    .to_dict(orient="records")
)