# Simulated PIK statistics

Here we inspect the accuracy and characteristics of the assigned PIKs
(Protected Identification Keys), using the ground truth that pseudopeople provides.

A ground-truth comparison like this is not possible with the real PVS.
Layne, Wagner, and Rothhaas did something similar, though: they redacted the SSN from real records,
sent the records through PVS without it, and then used the true SSN
as ground truth.
The health care records they used are probably quite different from a CUF,
but they found **very** high overall PIK accuracy (about 99.8%; see the calculation under "Definitions of accuracy" below).

In [1]:
import datetime

from vivarium_research_prl import distributed_compute, utils

In [2]:
print(datetime.datetime.now())

2024-02-12 11:56:14.327423


In [3]:
# DO NOT EDIT if this notebook is not called ground_truth_accuracy.ipynb!
# This notebook is designed to be run with papermill; this cell is tagged 'parameters'
data_to_use = 'small_sample'
simulated_data_output_dir = 'generate_simulated_data/output'
case_study_output_dir = 'output'

# The "compute engine" is what we use on the Python side
# for our case-study-specific operations,
# as opposed to the Splink engine
compute_engine = 'pandas'
# These only matter if using a distributed compute engine
compute_engine_num_jobs = 3
compute_engine_cpus_per_job = 2
compute_engine_memory_per_job = "5GB"
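
Because this cell is tagged `parameters`, papermill can override the values above at run time. The following is a minimal sketch of such a run, assuming papermill is installed. The helper names `usa_parameters` and `run_ground_truth_accuracy` and the output path are hypothetical; the parameter values are copied from the commented-out USA settings in the next cell, with the directory overrides left out.

```python
def usa_parameters():
    """Parameter overrides for a full USA run (values from the next cell)."""
    return {
        "data_to_use": "usa",
        "compute_engine": "dask",
        "compute_engine_num_jobs": 50,
        "compute_engine_memory_per_job": "120GB",
        "compute_engine_cpus_per_job": 2,
    }


def run_ground_truth_accuracy(output_path, parameters):
    """Execute this notebook with papermill, injecting `parameters`."""
    # Imported lazily so the parameter dict can be built without papermill.
    import papermill as pm

    pm.execute_notebook(
        "ground_truth_accuracy.ipynb",  # notebook name from the parameters cell
        output_path,                    # hypothetical destination path
        parameters=parameters,
    )


# Example call (hypothetical output path):
# run_ground_truth_accuracy("ground_truth_accuracy_usa.ipynb", usa_parameters())
```

Papermill writes the overrides into a new cell placed after the tagged one, so any parameter not passed keeps the default shown above.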

In [4]:
# Parameters for a USA run
# data_to_use = "usa"
# simulated_data_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/generate_simulated_data"
# case_study_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/person_linkage_case_study"

# compute_engine = 'dask'
# compute_engine_num_jobs = 50
# compute_engine_memory_per_job = "120GB"
# compute_engine_cpus_per_job = 2

In [5]:
case_study_output_dir = f'{case_study_output_dir}/{data_to_use}'
simulated_data_output_dir = f'{simulated_data_output_dir}/{data_to_use}'

In [6]:
import os
from pathlib import Path
os.environ["PATH"] = f"{Path('./slurm_within_singularity').resolve()}:{os.environ['PATH']}"

In [7]:
df_ops, pd = distributed_compute.start_compute_engine(
    compute_engine,
    num_jobs=compute_engine_num_jobs,
    cpus_per_job=compute_engine_cpus_per_job,
    memory_per_job=compute_engine_memory_per_job,
)

In [8]:
census_2030_piked = df_ops.read_parquet(f'{case_study_output_dir}/census_2030_piked.parquet')
confirmed_piks_with_ground_truth = df_ops.read_parquet(f'{case_study_output_dir}/confirmed_piks.parquet')

In [9]:
piked_proportion = df_ops.compute(census_2030_piked.pik.notnull().mean())
# Compare with 90.28% of input records PIKed in the 2010 CUF,
# as reported in Wagner and Layne, Table 2, p. 18 
print(f'{piked_proportion:.2%} of the input records were PIKed')

89.51% of the input records were PIKed


In [10]:
# Count how many Census rows share each PIK; rows that share a PIK are ones the model thinks are duplicates in Census
pik_sizes = df_ops.persist(df_ops.groupby_agg_small_groups(census_2030_piked, by='pik', agg_func=lambda x: x.size()))
df_ops.compute(pik_sizes.value_counts())

1    9804
2      34
Name: count, dtype: int64
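
The group-size count above can be pictured on a toy example. This is a pure-Python sketch of the same idea, not the `df_ops.groupby_agg_small_groups` implementation; the record IDs and PIKs are made up.

```python
from collections import Counter

# Hypothetical (record_id, pik) assignments; None means no PIK was assigned.
assignments = [
    ("census_0", 101),
    ("census_1", 102),
    ("census_2", 102),  # same PIK as census_1: treated as a duplicate person
    ("census_3", None),
    ("census_4", 103),
]

# Number of Census rows per PIK, ignoring rows without a PIK.
pik_sizes = Counter(pik for _, pik in assignments if pik is not None)

# Distribution of group sizes: how many PIKs were assigned to 1 row, 2 rows, ...
size_distribution = Counter(pik_sizes.values())

print(dict(size_distribution))
```

Any PIK with a group size above 1 marks a set of Census rows that the linkage treated as the same person.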

In [11]:
# Interesting: in pseudopeople, sometimes siblings are assigned the same (common) first name, making them almost identical.
# The only giveaway is their age and DOB.
# Presumably, this tends not to happen in real life.
duplicate_piks = pik_sizes.rename('pik_size').reset_index().pipe(lambda df: df[df.pik_size > 1])

df_ops.head(census_2030_piked.merge(duplicate_piks, on="pik")).sort_values('pik')

Unnamed: 0,record_id,household_id,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,housing_type,relationship_to_reference_person,sex,race_ethnicity,year,pik,pik_size
6,simulated_census_2030_5106,0_2876,Baker,A,Stevens,1.0,05/01/2006,115,kensington drive,,Anytown,WA,0,Household,Grandchild,Male,White,2030,101095,2
7,simulated_census_2030_5101,0_2876,Logan,L,Stevens,23.0,05/01/2006,115,kensington drive,,Anytown,WA,0,Household,Biological child,Male,White,2030,101095,2
0,simulated_census_2030_4236,0_3730,Stephanie,M,Segura,59.0,05/20/1985,1245,cs lane,,Anytown,WA,0,Household,Opposite-sex spouse,Female,Latino,2030,105159,2
1,simulated_census_2030_4247,0_3730,Austin,A,Segura,,05/20/1985,1245,cs lane,,Anytown,WA,0,Household,Stepchild,Male,Latino,2030,105159,2
2,simulated_census_2030_10655,0_10982,Kelly,L,Kersh,27.0,07/21/1995,33372,illinois st,,Anytown,WA,0,Household,Reference person,Female,White,2030,106942,2
3,simulated_census_2030_10657,0_10982,Christopher,B,Kersh,34.0,07/21/1995,33372,illinois st,,Anytown,WA,0,Household,Child-in-law,Male,Asian,2030,106942,2
8,simulated_census_2030_8075,0_3286,Clarissa,M,Fraile,20.0,08/21/1972,2880,russell street,,Anytown,WA,0,Household,Biological child,Female,Latino,2030,93816,2
9,simulated_census_2030_8073,0_3286,Lynette,M,Fraile,57.0,08/21/1972,2880,russell street,,Anytown,WA,0,Household,Reference person,Female,Latino,2030,93816,2
4,simulated_census_2030_3635,0_6290,Carrie,J,Will,57.0,02/01/1973,19517,37tyh strest,,Anytown,WA,0,Household,Reference person,Female,White,2030,93908,2
5,simulated_census_2030_3636,0_6290,Morris,J,Will,58.0,02/01/1973,19517,37th street,,Anytown,WA,0,Household,Opposite-sex spouse,Male,White,2030,93908,2


## Ground truth statistics

In [12]:
census_2030_ground_truth = df_ops.persist(
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_census_2030_ground_truth.parquet')
)

In [13]:
# In this version of pseudopeople, there are no actual duplicates in Census
# (no simulant appears in more than one Census record),
# which means all of the duplicates identified above are wrong.
assert len(census_2030_ground_truth) == len(df_ops.drop_duplicates(census_2030_ground_truth[['simulant_id']]))

In [14]:
reference_files_ground_truth = df_ops.persist(df_ops.concat([
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_geobase_reference_file_ground_truth.parquet').drop(columns=['n_unique_simulants']),
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_name_dob_reference_file_ground_truth.parquet').drop(columns=['n_unique_simulants']),
], ignore_index=True))

In [15]:
# However, there can be reference file records that correspond to multiple simulants,
# due to errors in the reference file construction by SSN
n_unique_simulants = df_ops.persist(df_ops.groupby_agg_small_groups(reference_files_ground_truth, by='record_id', agg_func=lambda x: x.simulant_id.nunique()).rename('n_unique_simulants').reset_index())
df_ops.compute(n_unique_simulants.n_unique_simulants.value_counts())

n_unique_simulants
1    51390
2     1270
3       40
Name: count, dtype: int64
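
As a minimal sketch of the distinct-count step above, the same calculation can be written in plain Python. The record and simulant IDs are hypothetical, and this does not use the `df_ops` helpers.

```python
from collections import Counter, defaultdict

# Hypothetical ground truth rows: (reference file record_id, simulant_id).
ground_truth_rows = [
    ("ref_0", "sim_a"),
    ("ref_1", "sim_b"),
    ("ref_1", "sim_c"),  # ref_1 mixes two simulants
    ("ref_2", "sim_d"),
    ("ref_2", "sim_d"),  # repeated pair, still one distinct simulant
]

# Collect the set of simulants behind each reference file record.
simulants_by_record = defaultdict(set)
for record_id, simulant_id in ground_truth_rows:
    simulants_by_record[record_id].add(simulant_id)

# Distinct simulants per record, then the distribution of those counts.
n_unique_simulants = {r: len(s) for r, s in simulants_by_record.items()}
distribution = Counter(n_unique_simulants.values())
```

Records with a count above 1 are the ones that make a linked PIK ambiguous later on.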

In [16]:
reference_files_ground_truth = df_ops.persist(reference_files_ground_truth.merge(
    n_unique_simulants,
    on='record_id',
    how='left',
))
reference_files_ground_truth.head(n=100)

Unnamed: 0,record_id,simulant_id,n_unique_simulants
0,simulated_geobase_reference_file_26168,0_730,1
1,simulated_geobase_reference_file_1,0_1366,1
2,simulated_geobase_reference_file_2,0_1366,1
3,simulated_geobase_reference_file_26970,0_1366,1
4,simulated_geobase_reference_file_26971,0_1366,1
...,...,...,...
95,simulated_geobase_reference_file_29556,0_16931,1
96,simulated_geobase_reference_file_437,0_11586,1
97,simulated_geobase_reference_file_28504,0_12845,1
98,simulated_geobase_reference_file_27884,0_2291,1


In [17]:
df_ops.head(reference_files_ground_truth[reference_files_ground_truth.n_unique_simulants == df_ops.compute(reference_files_ground_truth.n_unique_simulants.max())])

Unnamed: 0,record_id,simulant_id,n_unique_simulants
6847,simulated_geobase_reference_file_10101,0_2724,3
6848,simulated_geobase_reference_file_10101,0_2723,3
6849,simulated_geobase_reference_file_10101,0_22864,3
7960,simulated_geobase_reference_file_499,0_22665,3
7961,simulated_geobase_reference_file_499,0_18787,3
7962,simulated_geobase_reference_file_499,0_18785,3
8586,simulated_geobase_reference_file_11295,0_21546,3
8587,simulated_geobase_reference_file_11295,0_21547,3
8588,simulated_geobase_reference_file_11295,0_3928,3
8740,simulated_geobase_reference_file_13684,0_8145,3


In [18]:
census_2030_ground_truth = df_ops.persist(census_2030_ground_truth.merge(
    df_ops.drop_duplicates(reference_files_ground_truth[['simulant_id']]).assign(possible_to_pik=1),
    on='simulant_id',
    how='left',
).assign(possible_to_pik=lambda df: df.possible_to_pik.fillna(0)))
possible_to_pik_proportion = df_ops.compute(census_2030_ground_truth.possible_to_pik.mean())
print(
    f'{(1 - possible_to_pik_proportion):.2%} of the input records are '
    'impossible to PIK correctly, since they are not in any reference files'
)

0.45% of the input records are impossible to PIK correctly, since they are not in any reference files


In [19]:
print(
    f'Assigned PIKs to {(piked_proportion / possible_to_pik_proportion):.2%} of PIK-able records'
)

Assigned PIKs to 89.92% of PIK-able records
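
The division in the cell above works because both proportions share the same denominator. A small worked example, with hypothetical counts chosen only to make the arithmetic easy to follow:

```python
# Hypothetical counts; these are not the case study's numbers.
n_records = 1_000          # input Census records
n_not_in_reference = 10    # records whose simulant is in no reference file
n_piked = 900              # records that received a PIK

piked_proportion = n_piked / n_records
possible_to_pik_proportion = (n_records - n_not_in_reference) / n_records

# Dividing the two proportions cancels n_records, leaving n_piked / n_pikable.
piked_among_pikable = piked_proportion / possible_to_pik_proportion

print(f"{piked_among_pikable:.2%}")  # 90.91%
```

The numerator counts every assigned PIK, including any assigned to a record that is not in the reference files, so the ratio slightly overstates the share of PIK-able records that were PIKed if any such records received a PIK.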


In [20]:
reference_file = df_ops.concat([
    df_ops.read_parquet(
        f'{simulated_data_output_dir}/simulated_geobase_reference_file.parquet',
    ),
    df_ops.read_parquet(
        f'{simulated_data_output_dir}/simulated_name_dob_reference_file.parquet',
    ),
], ignore_index=True)

In [21]:
reference_file_piks = df_ops.persist(reference_file[['record_id', 'pik']])
reference_file_piks

Unnamed: 0,record_id,pik
0,simulated_geobase_reference_file_0,105906
1,simulated_geobase_reference_file_1,104653
2,simulated_geobase_reference_file_2,104653
3,simulated_geobase_reference_file_3,106223
4,simulated_geobase_reference_file_4,106223
...,...,...
52695,simulated_name_dob_reference_file_19870,108816
52696,simulated_name_dob_reference_file_19871,108817
52697,simulated_name_dob_reference_file_19872,108818
52698,simulated_name_dob_reference_file_19873,108819


In [22]:
assert len(reference_file_piks) == len(df_ops.drop_duplicates(reference_file_piks[['record_id']]))

In [23]:
pik_simulant_pairs = df_ops.persist(df_ops.drop_duplicates(reference_files_ground_truth.merge(reference_file_piks, on='record_id')[['pik', 'simulant_id']]))

In [24]:
# As with reference file records, a PIK can correspond to multiple simulants,
# due to errors in the reference file construction by SSN
n_unique_simulants = df_ops.persist(df_ops.groupby_agg_small_groups(pik_simulant_pairs, by='pik', agg_func=lambda x: x.simulant_id.nunique()).rename('n_unique_simulants').reset_index())
df_ops.compute(n_unique_simulants.n_unique_simulants.value_counts())

n_unique_simulants
1    17895
2     1034
3       50
Name: count, dtype: int64

In [25]:
pik_simulant_pairs = df_ops.persist(pik_simulant_pairs.merge(
    n_unique_simulants,
    on='pik',
    how='left',
))
pik_simulant_pairs

Unnamed: 0,pik,simulant_id,n_unique_simulants
0,107021,0_13874,2
1,105809,0_4544,2
2,107244,0_13421,2
3,89195,0_16434,2
4,107277,0_10060,2
...,...,...,...
20108,108804,0_23319,1
20109,108808,0_1328,1
20110,108810,0_23720,1
20111,108814,0_5904,1


In [26]:
df_ops.head(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants == df_ops.compute(pik_simulant_pairs.n_unique_simulants.max())])

Unnamed: 0,pik,simulant_id,n_unique_simulants
46,97161,0_15910,3
49,98180,0_11523,3
59,99589,0_18049,3
111,104948,0_4700,3
121,106421,0_20899,3
124,106950,0_4702,3
125,107144,0_21544,3
182,94658,0_2724,3
183,94658,0_22864,3
206,92980,0_8607,3


## Definitions of accuracy

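Because a PIK can correspond to multiple simulants, we consider three definitions of a correct PIK assignment:

1. (most strict) Assigning a PIK that corresponds to multiple simulants is always incorrect
2. Assigning a PIK that corresponds to multiple simulants is neither correct nor incorrect (such assignments are excluded from the denominator)
3. (most lenient) Assigning a PIK that corresponds to multiple simulants is correct, as long as at least one of those simulants matches the truth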
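
The three definitions can be sketched on a toy example. This is a pure-Python illustration with made-up IDs, not the notebook's `df_ops` code; the function name `accuracies` is hypothetical.

```python
# Hypothetical toy data; none of these IDs come from the case study.
# Simulants that each PIK corresponds to in the reference files.
simulants_by_pik = {
    "pik_1": {"sim_a"},
    "pik_2": {"sim_b", "sim_c"},  # ambiguous PIK: two simulants share it
    "pik_3": {"sim_d"},
    "pik_4": {"sim_e"},
}

# (assigned PIK, true simulant) for each PIKed Census record.
assignments = [
    ("pik_1", "sim_a"),  # single-simulant PIK, correct
    ("pik_2", "sim_b"),  # multi-simulant PIK that contains the truth
    ("pik_2", "sim_x"),  # multi-simulant PIK that misses the truth
    ("pik_3", "sim_y"),  # single-simulant PIK, incorrect
    ("pik_4", "sim_e"),  # single-simulant PIK, correct
]


def accuracies(assignments, simulants_by_pik):
    """Return accuracy under definitions 1, 2, and 3."""
    single = [(p, s) for p, s in assignments if len(simulants_by_pik[p]) == 1]
    single_correct = sum(s in simulants_by_pik[p] for p, s in single)
    any_correct = sum(s in simulants_by_pik[p] for p, s in assignments)
    return (
        single_correct / len(assignments),  # 1: multi-simulant PIKs count as wrong
        single_correct / len(single),       # 2: multi-simulant PIKs are excluded
        any_correct / len(assignments),     # 3: any matching simulant counts
    )


definition_1, definition_2, definition_3 = accuracies(assignments, simulants_by_pik)
```

On this toy data the three definitions give 0.4, about 0.667, and 0.6. Definition 1 can never exceed definition 3, since both share a denominator and definition 3 counts every assignment that definition 1 counts.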

In [27]:
# All modules, Medicare database, calculated from Layne, Wagner, and Rothhaas Table 1 (p. 15)
real_life_pvs_accuracy = 1 - (2_585 + 60_709 + 129_480 + 89_094) / (52_406_981 + 5_170_924 + 49_374_794 + 50_327_034)
f'{real_life_pvs_accuracy:.5%}'

'99.82079%'

### Definition 1

In [28]:
piks_assigned = df_ops.compute(census_2030_piked.pik.notnull().sum())
piks_assigned

9872

In [29]:
df_ops.head(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants > 1])

Unnamed: 0,pik,simulant_id,n_unique_simulants
0,107021,0_13874,2
1,105809,0_4544,2
2,107244,0_13421,2
3,89195,0_16434,2
4,107277,0_10060,2
5,89332,0_10627,2
6,89445,0_4798,2
7,89628,0_2545,2
8,108105,0_15774,2
9,89695,0_6387,2


In [30]:
single_sim_piks_correct = df_ops.compute(
    census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs, on='pik').merge(census_2030_ground_truth, on='record_id')
        .pipe(lambda df: (df.simulant_id_x == df.simulant_id_y) & (df.n_unique_simulants == 1))
        .sum()
)
single_sim_piks_correct

9042

In [31]:
# Overall accuracy, treating the PIK assignment process as a black box
(
    single_sim_piks_correct / piks_assigned
)

0.9159238249594813

In [32]:
assert len(confirmed_piks_with_ground_truth) == piks_assigned

In [33]:
df_ops.head(census_2030_ground_truth.rename(columns={'record_id': 'record_id_census_2030'}))

Unnamed: 0,record_id_census_2030,simulant_id,possible_to_pik
0,simulated_census_2030_0,0_923,1.0
1,simulated_census_2030_1,0_2348,1.0
2,simulated_census_2030_2,0_2641,1.0
3,simulated_census_2030_3,0_6176,1.0
4,simulated_census_2030_4,0_10251,1.0
5,simulated_census_2030_5,0_13047,1.0
6,simulated_census_2030_6,0_13861,1.0
7,simulated_census_2030_7,0_13972,1.0
8,simulated_census_2030_8,0_13973,1.0
9,simulated_census_2030_9,0_13974,1.0


In [34]:
# Looking at whether the exact *record* linked was from the same simulant
single_sim_record_links_correct = df_ops.compute(
    confirmed_piks_with_ground_truth
        .merge(
            census_2030_ground_truth.rename(columns={'record_id': 'record_id_raw_input_file'}),
            on='record_id_raw_input_file',
        )
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .pipe(lambda df: (df.simulant_id_x == df.simulant_id_y) & (df.n_unique_simulants == 1))
        .sum()
)
single_sim_record_links_correct

9235

In [35]:
(
    single_sim_record_links_correct / piks_assigned
)

0.9354740680713128

### Definition 2

In [36]:
single_sim_pik_pairs = pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants == 1][['pik', 'simulant_id']]
single_sim_piks_assigned = len(census_2030_piked[['record_id', 'pik']].merge(single_sim_pik_pairs, on='pik'))
single_sim_piks_assigned

9075

In [37]:
# Overall accuracy, treating the PIK assignment process as a black box
(
    single_sim_piks_correct / single_sim_piks_assigned
)

0.9963636363636363

In [38]:
# Looking at whether the exact *record* linked was from the same simulant
single_sim_record_links_assigned = df_ops.compute(
    (confirmed_piks_with_ground_truth
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .n_unique_simulants == 1).sum()
)
single_sim_record_links_assigned

9268

In [39]:
(
    single_sim_record_links_correct / single_sim_record_links_assigned
)

0.9964393612429866

### Definition 3

In [40]:
pik_simulant_pairs

Unnamed: 0,pik,simulant_id,n_unique_simulants
0,107021,0_13874,2
1,105809,0_4544,2
2,107244,0_13421,2
3,89195,0_16434,2
4,107277,0_10060,2
...,...,...,...
20108,108804,0_23319,1
20109,108808,0_1328,1
20110,108810,0_23720,1
20111,108814,0_5904,1


In [41]:
piks_at_least_partially_correct = df_ops.persist(
    census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs, on='pik').merge(census_2030_ground_truth, on='record_id')
        .pipe(df_ops.drop_duplicates)
        .assign(correct=lambda df: df.simulant_id_x == df.simulant_id_y)
        .pipe(df_ops.groupby_agg_small_groups, by=["record_id", "pik"], agg_func=lambda x: x.correct.any())
        .reset_index()
)
piks_at_least_partially_correct

Unnamed: 0,record_id,pik,correct
0,simulated_census_2030_0,89484,True
1,simulated_census_2030_1,98736,True
2,simulated_census_2030_10,94481,True
3,simulated_census_2030_100,100835,True
4,simulated_census_2030_1000,93179,True
...,...,...,...
9867,simulated_census_2030_9995,100981,True
9868,simulated_census_2030_9996,95280,True
9869,simulated_census_2030_9997,101556,True
9870,simulated_census_2030_9998,98532,True


In [42]:
# Overall accuracy, treating the PIK assignment process as a black box
piks_correct_proportion = (df_ops.compute(piks_at_least_partially_correct.correct.sum()) / piks_assigned)
piks_correct_proportion

0.9964546191247974

In [43]:
print(f'{piks_correct_proportion:.5%} of the PIKs assigned were correct; compare with {real_life_pvs_accuracy:.5%} in real life')

99.64546% of the PIKs assigned were correct; compare with 99.82079% in real life


In [44]:
# Looking at whether the exact *record* linked was from the same simulant
sim_record_links_at_least_partially_correct = df_ops.persist(
    confirmed_piks_with_ground_truth
        .merge(
            census_2030_ground_truth.rename(columns={'record_id': 'record_id_raw_input_file'}),
            on='record_id_raw_input_file',
        )
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .assign(correct=lambda df: df.simulant_id_x == df.simulant_id_y)
        .pipe(df_ops.groupby_agg_small_groups, by=["record_id_raw_input_file", "record_id_reference_file", "pik", "module_name", "pass_name"], agg_func=lambda x: x.correct.any())
        .reset_index()
)
sim_record_links_at_least_partially_correct

Unnamed: 0,record_id_raw_input_file,record_id_reference_file,pik,module_name,pass_name,correct
0,simulated_census_2030_0,simulated_geobase_reference_file_951,89484,geosearch,geokey,True
1,simulated_census_2030_1,simulated_geobase_reference_file_17348,98736,geosearch,geokey,True
2,simulated_census_2030_10,simulated_geobase_reference_file_9789,94481,geosearch,geokey,True
3,simulated_census_2030_100,simulated_geobase_reference_file_21248,100835,geosearch,some name and DOB information,True
4,simulated_census_2030_1000,simulated_geobase_reference_file_7496,93179,geosearch,geokey,True
...,...,...,...,...,...,...
9867,simulated_census_2030_9995,simulated_geobase_reference_file_21528,100981,hhcompsearch,year of birth,True
9868,simulated_census_2030_9996,simulated_geobase_reference_file_11208,95280,geosearch,house number and street name Soundex,True
9869,simulated_census_2030_9997,simulated_geobase_reference_file_22666,101556,geosearch,geokey,True
9870,simulated_census_2030_9998,simulated_geobase_reference_file_17011,98532,geosearch,geokey,True


In [45]:
len(sim_record_links_at_least_partially_correct)

9872

In [46]:
len(sim_record_links_at_least_partially_correct[['record_id_raw_input_file', 'record_id_reference_file']].drop_duplicates())

9872

In [47]:
(
    df_ops.compute(sim_record_links_at_least_partially_correct.correct.sum()) / piks_assigned
)

0.9964546191247974

In [48]:
assert df_ops.compute((df_ops.groupby_agg_small_groups(confirmed_piks_with_ground_truth, by='record_id_raw_input_file', agg_func=lambda x: x.record_id_reference_file.nunique()) <= 1).all())

In [49]:
# Using definition 3 -- at the PIK level
piks_at_least_partially_correct = df_ops.persist(
    piks_at_least_partially_correct
        .rename(columns={'record_id': 'record_id_raw_input_file'})
        .merge(confirmed_piks_with_ground_truth[['record_id_raw_input_file', 'module_name', 'pass_name']], on='record_id_raw_input_file')
)
piks_at_least_partially_correct

Unnamed: 0,record_id_raw_input_file,pik,correct,module_name,pass_name
0,simulated_census_2030_0,89484,True,geosearch,geokey
1,simulated_census_2030_1,98736,True,geosearch,geokey
2,simulated_census_2030_10,94481,True,geosearch,geokey
3,simulated_census_2030_100,100835,True,geosearch,some name and DOB information
4,simulated_census_2030_1000,93179,True,geosearch,geokey
...,...,...,...,...,...
9867,simulated_census_2030_9995,100981,True,hhcompsearch,year of birth
9868,simulated_census_2030_9996,95280,True,geosearch,house number and street name Soundex
9869,simulated_census_2030_9997,101556,True,geosearch,geokey
9870,simulated_census_2030_9998,98532,True,geosearch,geokey


In [50]:
# Accuracy by module. With the sample data, GeoSearch is the *least* accurate module here,
# the opposite of Layne et al., who found GeoSearch was much *more* accurate.
df_ops.compute(piks_at_least_partially_correct.groupby("module_name").correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,mean,size
module_name,Unnamed: 1_level_1,Unnamed: 2_level_1
geosearch,0.996231,9287
dobsearch,1.0,161
hhcompsearch,1.0,38
namesearch,1.0,386


In [51]:
# Accuracy by pass -- could be used to tune pass-specific cutoffs, but
# this might not be too informative while we are still using the sample data.
df_ops.compute(piks_at_least_partially_correct.groupby(["module_name", "pass_name"]).correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,size
module_name,pass_name,Unnamed: 2_level_1,Unnamed: 3_level_1
geosearch,geokey,0.995099,6734
geosearch,house number and street name Soundex,0.996639,595
dobsearch,reverse Soundex of name,1.0,31
dobsearch,first two characters of first name and year of...,1.0,130
geosearch,some name and DOB information,1.0,1958
hhcompsearch,initials,1.0,28
hhcompsearch,year of birth,1.0,10
namesearch,DOB and NYSIIS of name,1.0,207
namesearch,DOB and initials,1.0,128
namesearch,birthday and first two characters of name,1.0,47


In [52]:
# Using definition 3 -- at the link level
df_ops.compute(sim_record_links_at_least_partially_correct.groupby("module_name").correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,mean,size
module_name,Unnamed: 1_level_1,Unnamed: 2_level_1
geosearch,0.996231,9287
dobsearch,1.0,161
hhcompsearch,1.0,38
namesearch,1.0,386


In [53]:
df_ops.compute(sim_record_links_at_least_partially_correct.groupby(["module_name", "pass_name"]).correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,size
module_name,pass_name,Unnamed: 2_level_1,Unnamed: 3_level_1
geosearch,geokey,0.995099,6734
geosearch,house number and street name Soundex,0.996639,595
dobsearch,reverse Soundex of name,1.0,31
dobsearch,first two characters of first name and year of...,1.0,130
geosearch,some name and DOB information,1.0,1958
hhcompsearch,initials,1.0,28
hhcompsearch,year of birth,1.0,10
namesearch,DOB and NYSIIS of name,1.0,207
namesearch,DOB and initials,1.0,128
namesearch,birthday and first two characters of name,1.0,47


In [54]:
df_ops.compute(sim_record_links_at_least_partially_correct[~sim_record_links_at_least_partially_correct.correct].groupby(["module_name", "pass_name"]).size()).sort_values()

module_name  pass_name                           
geosearch    house number and street name Soundex     2
             geokey                                  33
dtype: int64

### Incorrect and missed PIKs

In [55]:
incorrectly_linked_pairs = df_ops.drop_duplicates(
    sim_record_links_at_least_partially_correct[~sim_record_links_at_least_partially_correct.correct]
        [["record_id_raw_input_file", "record_id_reference_file"]]
)
incorrectly_linked_pairs

Unnamed: 0,record_id_raw_input_file,record_id_reference_file
267,simulated_census_2030_10264,simulated_geobase_reference_file_14945
671,simulated_census_2030_10655,simulated_geobase_reference_file_30175
839,simulated_census_2030_10816,simulated_geobase_reference_file_30491
1202,simulated_census_2030_1244,simulated_geobase_reference_file_26563
1772,simulated_census_2030_1860,simulated_geobase_reference_file_4106
1782,simulated_census_2030_187,simulated_geobase_reference_file_14416
1887,simulated_census_2030_1969,simulated_geobase_reference_file_29773
1914,simulated_census_2030_1994,simulated_geobase_reference_file_2235
2340,simulated_census_2030_2460,simulated_geobase_reference_file_11775
3449,simulated_census_2030_3636,simulated_geobase_reference_file_8776


In [56]:
len(incorrectly_linked_pairs)

35

In [57]:
comparison_cols = [
    "first_name",
    "middle_name",
    "last_name",
    "date_of_birth",
    "street_number",
    "street_name",
    "unit_number",
    "city",
    "state",
]

incorrect_links = (
    incorrectly_linked_pairs
        .merge(
            census_2030_piked
                .rename(columns={"record_id": "record_id_raw_input_file", "middle_initial": "middle_name"})
                [["record_id_raw_input_file"] + comparison_cols],
            on="record_id_raw_input_file",
            how="left",
        )
        .merge(
            reference_file
                .rename(columns={"record_id": "record_id_reference_file"})
                .rename(columns=lambda c: c.replace('mailing_address_', ''))
                [["record_id_reference_file"] + comparison_cols],
            on="record_id_reference_file",
            how="left",
            suffixes=("_census", "_reference_file"),
        )
)

def flatten(xss):
    """Flatten an iterable of iterables into a single list."""
    return [x for xs in xss for x in xs]

df_ops.head(incorrect_links[flatten([(f'{c}_census', f'{c}_reference_file') for c in comparison_cols])])

Unnamed: 0,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
0,Alise,Jaclyn,R,Karri,Burris,Burris,05/06/1990,19900506,57.0,57.0,s state college blvd,S STATE COLLEGE BLVD,,,Anytown,ANYTOWN,WA,WA
1,Kelly,Husband,L,Brian,Kersh,Kersh,07/21/1995,19950721,33372.0,33372.0,illinois st,ILLINOIS ST,,,Anytown,ANYTOWN,WA,WA
2,Amara,King,L,Louis,Minutella,Minutella,06/08/2027,20270608,,,sw 178th st,SW 178TH ST,,,Anytown,ANYTOWN,WA,WA
3,Rhyan,Cruz,E,Lucas,Gay,Gay,03/24/2020,20200324,5008.0,5008.0,clare ave,CLARE AVE,,,Anytown,ANYTOWN,WA,WA
4,Rebecca,Edward,A,Richard,Singletary,Singletary,05/05/1958,19580505,1686.0,1686.0,mukilteo blvd,MUKILTEO BLVD,,,Anytown,ANYTOWN,WA,WA
5,Remington,Domonique,F,Jacquelyn,Nkrumah,Nkrumah,12/11/1988,19881211,8173.0,8173.0,meridian ave nth,MERIDIAN AVE NTH,,,Anytown,ANYTOWN,WA,WA
6,Sophia,Ella,N,Mia,Solis Rivera,Solis Rivera,02/28/2026,20260228,59.0,59.0,nw 19 st,NW 19 ST,,,Anytown,ANYTOWN,WA,WA
7,Gary,Elaine,M,Alma,Dohoney,Dohoney,11/24/1950,19501124,84.0,84.0,atherton st,ATHERTON ST,,,Anytown,ANYTOWN,WA,WA
8,Brenda,Brarley,S,Christopher,Kelly,Kelly,10/19/1981,19811019,425.0,425.0,state highway 46,STATE HIGHWAY 46,,,Anytown,ANYTOWN,WA,WA
9,Morris,,J,Joy,Will,Will,02/01/1973,19730201,19517.0,19517.0,37th street,37TH STREET,,,Anytown,ANYTOWN,WA,WA


In [58]:
reference_files_ground_truth

Unnamed: 0,record_id,simulant_id,n_unique_simulants
0,simulated_geobase_reference_file_26168,0_730,1
1,simulated_geobase_reference_file_1,0_1366,1
2,simulated_geobase_reference_file_2,0_1366,1
3,simulated_geobase_reference_file_26970,0_1366,1
4,simulated_geobase_reference_file_26971,0_1366,1
...,...,...,...
54045,simulated_name_dob_reference_file_19858,0_23319,1
54046,simulated_name_dob_reference_file_19862,0_1328,1
54047,simulated_name_dob_reference_file_19864,0_23720,1
54048,simulated_name_dob_reference_file_19868,0_5904,1


In [59]:
missed_links = df_ops.persist(
    census_2030_piked[census_2030_piked.pik.isnull()].drop(columns=["pik"])
        .rename(columns={"middle_initial": "middle_name"})
        .merge(census_2030_ground_truth, on="record_id")
        .merge(
            reference_file.rename(columns=lambda c: c.replace('mailing_address_', ''))
                .merge(reference_files_ground_truth[reference_files_ground_truth.n_unique_simulants == 1], on="record_id"),
            on="simulant_id",
            suffixes=("_census", "_reference_file"),
        )
)

In [60]:
df_ops.compute(census_2030_piked.pik.isnull().sum())

1157

In [61]:
len(missed_links)

3160

In [62]:
simulants_missed = df_ops.head(missed_links[['simulant_id']], n=100).simulant_id.unique()
simulants_missed

array(['0_3742', '0_5271', '0_10612', '0_15332', '0_23169', '0_10229',
       '0_14071', '0_14033', '0_3219', '0_3304', '0_13133', '0_2567',
       '0_6870', '0_23113', '0_239', '0_7880', '0_14333', '0_15824',
       '0_23183', '0_7681', '0_11902', '0_19554', '0_22962', '0_10138',
       '0_19677', '0_17305', '0_7464', '0_7063', '0_7592', '0_1080',
       '0_21046', '0_16021'], dtype=object)

In [63]:
for simulant in simulants_missed[0:15]:
    print(simulant)
    display(df_ops.head(missed_links[missed_links.simulant_id == simulant][['simulant_id'] + flatten([(f'{c}_census', f'{c}_reference_file') for c in comparison_cols])], n=100))

0_3742


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
0,0_3742,Tammy,Tammy,P,P,White,White,01/13/1960,,2777,2727.0,eugene st,EUGENE ST,,,Anytown,ANYTOWN,WA,WA
1,0_3742,Tammy,Tammy,P,P,White,White,01/13/1960,,2777,,eugene st,,,,Anytown,,WA,


0_5271


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
2,0_5271,Steven,Steven,R,Randy,Moreau,Moreau,11/23/1954,,34,34.0,bowen cir sw,BOWEN CIR SW,,,,ANYTOWN,WA,WA
3,0_5271,Steven,Steven,R,Randy,Moreau,Moreau,11/23/1954,,34,34.0,bowen cir sw,BOWEN CIR SW,,,,ANYTOWN,WA,
4,0_5271,Steven,Steven,R,Randy,Moreau,Moreau,11/23/1954,,34,34.0,bowen cir sw,BOWEN CIR SE,,,,ANYTOWN,WA,WA
5,0_5271,Steven,Steven,R,Randy,Moreau,Moreau,11/23/1954,,34,34.0,bowen cir sw,BOWEN CIR SW,,,,ANYTOWN,WA,WA
6,0_5271,Steven,Steven,R,Randy,Moreau,Moreau,11/23/1954,,34,,bowen cir sw,,,,,,WA,


0_10612


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
7,0_10612,Angela,Angela,S,Savannah,Asbury,Aabhry,06/12/1987,19870012,34,34.0,bowen cir sw,BOWEN CIR QW,,,Anytown,ANYTOWN,WA,WA
8,0_10612,Angela,Angela,S,Savannah,Asbury,Aabhry,06/12/1987,19870012,34,34.0,bowen cir sw,BOWEN CIR SW,,,Anytown,ANYTOWN,WA,WA
9,0_10612,Angela,Angela,S,Savannah,Asbury,Aabhry,06/12/1987,19870012,34,,bowen cir sw,,,,Anytown,,WA,


0_15332


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
10,0_15332,Tony,Anthony,R,Richard,Mesa,Mesa,01/02/1958,19580102,34,34.0,bowen cir sw,BOWEN CIR SW,,,Anytown,ANYTOWN,WA,ID
11,0_15332,Tony,Anthony,R,Richard,Mesa,Mesa,01/02/1958,19580102,34,37.0,bowen cir sw,BOWEN CIR SW,,,Anytown,ANYTOWN,WA,WA
12,0_15332,Tony,Anthony,R,Richard,Mesa,Mesa,01/02/1958,19580102,34,34.0,bowen cir sw,BOWEN CIR SW,,,Anytown,ANYTOWN,WA,WA
13,0_15332,Tony,Anthony,R,Richard,Mesa,Mesa,01/02/1958,19580102,34,,bowen cir sw,,,,Anytown,,WA,


0_23169


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
14,0_23169,Adriana,Adriana,M,Morgan,Hill,Hill,09/06/9028,20280906,34,34.0,bowen cir sw,BOWEN CIR SW,,,Anytown,ANYTOWN,WA,WA
15,0_23169,Adriana,Adriana,M,Morgan,Hill,Hill,09/06/9028,20280906,34,,bowen cir sw,,,,Anytown,,WA,


0_10229


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
16,0_10229,Barry,Barry,S,S,Fujii,Fujii,03/28/1973,,1231,1231.0,rivereide dg,RIVERSIDE DR,,,Anytown,,WA,WA
17,0_10229,Barry,Barry,S,S,Fujii,Fujii,03/28/1973,,1231,1231.0,rivereide dg,RIVERSIDE DR,,,Anytown,ANYTOWN,WA,WA
18,0_10229,Barry,Barry,S,S,Fujii,Fujii,03/28/1973,,1231,,rivereide dg,,,,Anytown,,WA,


0_14071


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
19,0_14071,Woizzbeth,Elizabeth,J,J,Ortega,Ortega,08/02/1994,,1312,1312.0,oak street,OAK STREET,,,Anytown,ANYTOWN,WA,WA
20,0_14071,Woizzbeth,Elizabeth,J,J,Ortega,Ortega,08/02/1994,,1312,,oak street,,,,Anytown,,WA,


0_14033


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
21,0_14033,Robert,Robert,C,Christopher,Creekmore,Creekmore,09/21/1964,19700419,1751,1751.0,bear claw,BEAR CLAW,,,Anytown,ANYDOWN,WA,WA
22,0_14033,Robert,Robert,C,Christopher,Creekmore,Creekmore,09/21/1964,19700419,1751,1751.0,bear claw,BEAR CLAW,,,Anytown,,WA,WA
23,0_14033,Robert,Robert,C,Christopher,Creekmore,Creekmore,09/21/1964,19700419,1751,,bear claw,,,,Anytown,,WA,


0_3219


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
24,0_3219,Christa,Christa,K,Kassidy,Connor,Connor,02/20/2408,20081727,2333,8203.0,westminster place,WEST FARWELL AVENUE,,,Anytown,,WA,WA
25,0_3219,Christa,Christa,K,Kassidy,Connor,Connor,02/20/2408,20081727,2333,8203.0,westminster place,WEST FARWELL AVENUE,,,Anytown,ANYTOWN,WA,WA
26,0_3219,Christa,Christa,K,Kassidy,Connor,Connor,02/20/2408,20081727,2333,2333.0,westminster place,WESTMINSTER PLACE,,,Anytown,ANYTOWN,WA,WA
27,0_3219,Christa,Christa,K,Kassidy,Connor,Connor,02/20/2408,20081727,2333,,westminster place,,,,Anytown,,WA,


0_3304


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
28,0_3304,Christopher,Christopher,B,Bryan,Cline,Cline,,19830904,2333,2333.0,wesgmigzted placf,WESTMINSTER PLACE,,,Anytown,ANYTOWN,WA,WA
29,0_3304,Christopher,Christopher,B,Bryan,Cline,Cline,,19830904,2333,,wesgmigzted placf,,,,Anytown,,WA,


0_13133


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
30,0_13133,,Ricky,A,Arlyn,Miller,Miller,08/22/1958,19580822,2333,2333.0,westminster place,WESTMINSTER PLACE,,,Anytown,ANYTOWN,WA,WA
31,0_13133,,Ricky,A,Arlyn,Miller,Miller,08/22/1958,19580822,2333,2333.0,westminster place,WESTMYNSDR PLASE,,,Anytown,ANYTOWN,WA,WA
32,0_13133,,Ricky,A,Arlyn,Miller,Miller,08/22/1958,19580822,2333,,westminster place,,,,Anytown,,WA,


0_2567


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
33,0_2567,Woman,Nicholas,N,Nicholas,Bignell,Bignelo,02/23/1989,19890223,,,petal way,PETAL WAY,,,4nvtown,ANYTOWN,WA,WA
34,0_2567,Woman,Nicholas,N,Nicholas,Bignell,Bignelo,02/23/1989,19890223,,,petal way,,,,4nvtown,,WA,


0_6870


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
35,0_6870,Kenneth,Kenneth,C,Charles,Harper,Harper,08/14/1974,19760929,10,10.0,bellis fair pkwy,BELLIS FAIR PKWY,,,Anytown,ANYTOWN,WA,WA
36,0_6870,Kenneth,Kenneth,C,Charles,Harper,Harper,08/14/1974,19760929,10,,bellis fair pkwy,,,,Anytown,,WA,


0_23113


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
37,0_23113,Amara,Amara,L,Leona,Matthews,Matthews,06/13/2028,19640723,3532,3532.0,mount sinai rd,MOUNT SINAI RD,,,Anytown,ANYTOWN,WA,WA
38,0_23113,Amara,Amara,L,Leona,Matthews,Matthews,06/13/2028,19640723,3532,,mount sinai rd,,,,Anytown,,WA,


0_239


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
39,0_239,Dominique,Dominyqu,N,Noah,Moore,Moore,06/19/9006,20030619,8203,8203.0,west farwell avenue,WEST FARWELL AVENUE,,,Anytown,ANHTOWJ,WA,WA
40,0_239,Dominique,Dominyqu,N,Noah,Moore,Moore,06/19/9006,20030619,8203,8203.0,west farwell avenue,WEST FARWELL AVENUE,,,Anytown,ANYTOWN,WA,AR
41,0_239,Dominique,Dominyqu,N,Noah,Moore,Moore,06/19/9006,20030619,8203,8203.0,west farwell avenue,WEST FARWELL AVENUE,,,Anytown,ANVTOWN,WA,WA
42,0_239,Dominique,Dominyqu,N,Noah,Moore,Moore,06/19/9006,20030619,8203,8203.0,west farwell avenue,WEST FARWELL AVENUE,,,Anytown,ANYTOWN,WA,WA
43,0_239,Dominique,Dominyqu,N,Noah,Moore,Moore,06/19/9006,20030619,8203,8203.0,west farwell avenue,WEST FARWELL AVENUE,,,Anytown,ANYTOWN,WA,WA
44,0_239,Dominique,Dominyqu,N,Noah,Moore,Moore,06/19/9006,20030619,8203,8203.0,west farwell avenue,WEST FARWELL AVENUE,,,Anytown,ANYTOWN,WA,WA
45,0_239,Dominique,Dominyqu,N,Noah,Moore,Moore,06/19/9006,20030619,8203,,west farwell avenue,,,,Anytown,,WA,
