# Simulated PIK statistics

Here we inspect the accuracy and characteristics of the PIKs assigned,
leveraging our knowledge of ground truth from pseudopeople.

A ground-truth comparison like this isn't possible with the real PVS, but
Layne, Wagner, and Rothhaas did something similar: they redacted the SSN from real records,
sent them through PVS without it, and then used the true SSN
as ground truth.
The health care records they used are probably quite different from a CUF,
but they found **very** good overall PIK accuracy (computed in the "Definitions of accuracy" section below).

In [1]:
import datetime

from vivarium_research_prl import distributed_compute, utils

In [2]:
print(datetime.datetime.now())

2024-02-12 11:58:39.901648


In [3]:
# DO NOT EDIT if this notebook is not called ground_truth_accuracy.ipynb!
# This notebook is designed to be run with papermill; this cell is tagged 'parameters'
data_to_use = 'small_sample'
simulated_data_output_dir = 'generate_simulated_data/output'
case_study_output_dir = 'output'

# The "compute engine" is what we use on the Python side
# for our case-study-specific operations,
# as opposed to the Splink engine
compute_engine = 'pandas'
# These only matter when using a distributed compute engine
compute_engine_num_jobs = 3
compute_engine_cpus_per_job = 2
compute_engine_memory_per_job = "5GB"

In [4]:
# Parameters
data_to_use = "usa"
simulated_data_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/generate_simulated_data"
case_study_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/person_linkage_case_study"
compute_engine = "dask"
compute_engine_num_jobs = 50
compute_engine_memory_per_job = "120GB"
compute_engine_cpus_per_job = 2


In [5]:
# Parameters for a USA run
# data_to_use = "usa"
# simulated_data_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/generate_simulated_data"
# case_study_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/person_linkage_case_study"

# compute_engine = 'dask'
# compute_engine_num_jobs = 50
# compute_engine_memory_per_job = "120GB"
# compute_engine_cpus_per_job = 2

In [6]:
case_study_output_dir = f'{case_study_output_dir}/{data_to_use}'
simulated_data_output_dir = f'{simulated_data_output_dir}/{data_to_use}'

In [7]:
import os
from pathlib import Path
os.environ["PATH"] = f"{Path('./slurm_within_singularity').resolve()}:{os.environ['PATH']}"

In [8]:
df_ops, pd = distributed_compute.start_compute_engine(
    compute_engine,
    num_jobs=compute_engine_num_jobs,
    cpus_per_job=compute_engine_cpus_per_job,
    memory_per_job=compute_engine_memory_per_job,
)

Jiggling the cluster


Connection method: Cluster object
Cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://10.158.111.17:8787/status
Scheduler comm: tcp://10.158.111.17:44657
Workers: 50
Total threads: 50
Total memory: 5.46 TiB
Started: 3 minutes ago
Each worker: 1 thread, 111.76 GiB memory


In [9]:
census_2030_piked = df_ops.read_parquet(f'{case_study_output_dir}/census_2030_piked.parquet')
confirmed_piks_with_ground_truth = df_ops.read_parquet(f'{case_study_output_dir}/confirmed_piks.parquet')

In [10]:
piked_proportion = df_ops.compute(census_2030_piked.pik.notnull().mean())
# Compare with 90.28% of input records PIKed in the 2010 CUF,
# as reported in Wagner and Layne, Table 2, p. 18 
print(f'{piked_proportion:.2%} of the input records were PIKed')

83.32% of the input records were PIKed


In [11]:
# Count Census rows per PIK: rows that share a PIK are ones the model considers duplicates in Census
pik_sizes = df_ops.persist(df_ops.groupby_agg_small_groups(census_2030_piked, by='pik', agg_func=lambda x: x.size()))
df_ops.compute(pik_sizes.value_counts())

1    284822890
2       171358
3          234
4            2
Name: count, dtype: int64
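The cluster-size tabulation above can be sketched on a toy frame. This is a minimal illustration, not the notebook's implementation: the record IDs and PIKs are made up, and plain pandas `groupby` stands in for `df_ops.groupby_agg_small_groups`.

```python
import pandas as pd

# Toy stand-in for census_2030_piked (hypothetical values).
toy_piked = pd.DataFrame({
    "record_id": ["r1", "r2", "r3", "r4", "r5"],
    "pik": ["p1", "p2", "p2", "p3", None],  # r5 was not PIKed
})

# Number of Census rows sharing each PIK; groupby drops rows with a null PIK.
toy_pik_sizes = toy_piked.groupby("pik").size()

# How many PIKs were assigned to exactly 1, 2, ... Census rows.
toy_size_counts = toy_pik_sizes.value_counts().sort_index()
print(toy_size_counts)
```

Any PIK with a size above 1 is a pair (or larger group) of Census rows that the linkage treats as the same person.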

In [12]:
# Interesting: in pseudopeople, sometimes siblings are assigned the same (common) first name, making them almost identical.
# The only giveaway is their age and DOB.
# Presumably, this tends not to happen in real life.
duplicate_piks = pik_sizes.rename('pik_size').reset_index().pipe(lambda df: df[df.pik_size > 1])

df_ops.head(census_2030_piked.merge(duplicate_piks, on="pik")).sort_values('pik')

Unnamed: 0,household_id,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,housing_type,relationship_to_reference_person,sex,race_ethnicity,year,record_id,pik,pik_size
4,3825_170338,Jason,W,Swager,45,11/13/1984,5331,windridge dr,,statesville,NC,28516.0,Standard,Child-in-law,Male,White,2030,simulated_census_2030_112_356676,258_521074,2
5,3825_170338,Javion,C,Swager,19,11/13/1984,5331,windridge dr,,statesville,NC,28516.0,Standard,Grandchild,Male,White,2030,simulated_census_2030_112_356674,258_521074,2
0,1356_349913,Gladys,T,Trlica,53,09/25/1975,256,glencoe st,,detroit,MI,49009.0,Standard,Reference person,Female,Black,2030,simulated_census_2030_17_733842,299_1339634,2
1,1356_349913,Christopher,T,Trlica,54,09/25/1975,256,glencoe st,,detroit,MI,,Standard,Opp-sex spouse,Male,White,2030,simulated_census_2030_17_733843,299_1339634,2
2,465_208992,Judy,B,Smith,75,03/11/1955,2181,e southport rd,,miami beach,FL,33612.0,Standard,Reference person,Female,Asian,2030,simulated_census_2030_140_438145,325_1010988,2
3,2054_90146,Judy,B,Smith,75,03/11/1955,7418,quincy place n wst,,southaven,MS,39194.0,Standard,Opp-sex spouse,Female,White,2030,simulated_census_2030_43_189723,325_1010988,2
6,9804_42708,James,M,Jones,1,07/09/2028,404,20 road,,cooper city,FL,33126.0,Standard,Biological child,,White,2030,simulated_census_2030_294_2019080,399_716729,2
7,9804_42708,Lucas,,Jones,1,07/09/2028,404,20 road,,cooper city,FL,33126.0,Standard,Biological child,Male,White,2030,simulated_census_2030_294_2019853,399_716729,2
8,1299_379876,Meredith,A,Lady,48,02/22/1982,3831,harris cors py,apt g 2,rattlesnake bead,TX,75217.0,Standard,Biological child,Female,Latino,2030,simulated_census_2030_15_795868,417_1979367,2
9,1299_379876,Liberty,A,Jones,11,02/22/1982,3831,harris cors py,apt g 2,rattlesnake bead,TX,75217.0,Standard,Grandchild,Female,Latino,2030,simulated_census_2030_15_795869,417_1979367,2


## Ground truth statistics

In [13]:
census_2030_ground_truth = df_ops.persist(
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_census_2030_ground_truth.parquet')
)

In [14]:
# In this version of pseudopeople, there are no actual duplicates in Census,
# which means all of the duplicates identified above are wrong.
assert len(census_2030_ground_truth) == len(df_ops.drop_duplicates(census_2030_ground_truth))

In [15]:
reference_files_ground_truth = df_ops.persist(df_ops.concat([
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_geobase_reference_file_ground_truth.parquet').drop(columns=['n_unique_simulants']),
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_name_dob_reference_file_ground_truth.parquet').drop(columns=['n_unique_simulants']),
], ignore_index=True))

In [16]:
# However, there can be reference file records that correspond to multiple simulants,
# due to errors in the reference file construction by SSN
n_unique_simulants = df_ops.persist(df_ops.groupby_agg_small_groups(reference_files_ground_truth, by='record_id', agg_func=lambda x: x.simulant_id.nunique()).rename('n_unique_simulants').reset_index())
df_ops.compute(n_unique_simulants.n_unique_simulants.value_counts())

n_unique_simulants
1    1273616714
2      99633649
3       1656365
4         92271
5          5229
6           297
7            13
8             1
9             1
Name: count, dtype: int64
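A minimal sketch of the per-record simulant count, assuming a toy ground-truth table with made-up IDs; plain pandas `groupby(...).nunique()` stands in for the `df_ops` helper used above.

```python
import pandas as pd

# Toy stand-in for reference_files_ground_truth (hypothetical values).
toy_ground_truth = pd.DataFrame({
    "record_id": ["ref_1", "ref_2", "ref_2", "ref_3", "ref_3", "ref_3"],
    "simulant_id": ["s1", "s2", "s3", "s4", "s4", "s5"],
})

# Distinct simulants contributing to each reference file record.
toy_n_unique = (
    toy_ground_truth.groupby("record_id")["simulant_id"]
    .nunique()
    .rename("n_unique_simulants")
    .reset_index()
)
print(toy_n_unique)
```

Here `ref_3` has three ground-truth rows but only two distinct simulants, which is why the count uses `nunique` rather than the group size.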

In [17]:
reference_files_ground_truth = df_ops.persist(reference_files_ground_truth.merge(
    n_unique_simulants,
    on='record_id',
    how='left',
))
reference_files_ground_truth.head(n=100)

Unnamed: 0,record_id,simulant_id,n_unique_simulants
0,simulated_geobase_reference_file_129_1847623,1482_703440,1
1,simulated_geobase_reference_file_129_1847917,5831_761663,1
2,simulated_geobase_reference_file_129_1848059,4941_993745,1
3,simulated_geobase_reference_file_129_1848093,821_1104237,1
4,simulated_geobase_reference_file_129_1848812,6487_136334,1
...,...,...,...
95,simulated_geobase_reference_file_129_1883132,40_1043771,1
96,simulated_geobase_reference_file_129_1883293,8425_565805,1
97,simulated_geobase_reference_file_129_1884248,6606_988274,1
98,simulated_geobase_reference_file_129_1884373,9723_324974,1


In [18]:
df_ops.head(reference_files_ground_truth[reference_files_ground_truth.n_unique_simulants == df_ops.compute(reference_files_ground_truth.n_unique_simulants.max())])

Unnamed: 0,record_id,simulant_id,n_unique_simulants
1915397,simulated_geobase_reference_file_124_1197262,6554_95393,9
1915398,simulated_geobase_reference_file_124_1197262,6554_95390,9
1915399,simulated_geobase_reference_file_124_1197262,6554_95392,9
1915400,simulated_geobase_reference_file_124_1197262,6554_95403,9
1915401,simulated_geobase_reference_file_124_1197262,6554_95397,9
1915402,simulated_geobase_reference_file_124_1197262,6554_95395,9
1915403,simulated_geobase_reference_file_124_1197262,6554_95396,9
1915404,simulated_geobase_reference_file_124_1197262,6554_95389,9
1915405,simulated_geobase_reference_file_124_1197262,6554_95406,9


In [19]:
census_2030_ground_truth = df_ops.persist(census_2030_ground_truth.merge(
    df_ops.drop_duplicates(reference_files_ground_truth[['simulant_id']]).assign(possible_to_pik=1),
    on='simulant_id',
    how='left',
).assign(possible_to_pik=lambda df: df.possible_to_pik.fillna(0)))
possible_to_pik_proportion = df_ops.compute(census_2030_ground_truth.possible_to_pik.mean())
print(
    f'{(1 - possible_to_pik_proportion):.2%} of the input records are '
    'impossible to PIK correctly, since they are not in any reference files'
)

0.43% of the input records are impossible to PIK correctly, since they are not in any reference files


In [20]:
print(
    f'Assigned PIKs to {(piked_proportion / possible_to_pik_proportion):.2%} of PIK-able records'
)

Assigned PIKs to 83.68% of PIK-able records
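The two cells above flag Census records whose simulant appears in any reference file, then rescale the PIKed proportion by the PIK-able proportion. With the rounded figures printed above, 83.32% / (100% - 0.43%) is roughly 83.68%. A minimal pandas sketch of the same steps, using made-up IDs and a hypothetical PIKed proportion:

```python
import pandas as pd

# Toy stand-ins for the Census and reference file ground truth (hypothetical values).
toy_census_truth = pd.DataFrame({
    "record_id": ["c1", "c2", "c3", "c4"],
    "simulant_id": ["s1", "s2", "s3", "s4"],
})
toy_reference_truth = pd.DataFrame({
    "record_id": ["ref_1", "ref_2", "ref_3"],
    "simulant_id": ["s1", "s1", "s3"],
})

# Left-merge a flag for simulants present in the reference files;
# Census records without a match get possible_to_pik = 0.
toy_flagged = toy_census_truth.merge(
    toy_reference_truth[["simulant_id"]].drop_duplicates().assign(possible_to_pik=1),
    on="simulant_id",
    how="left",
).assign(possible_to_pik=lambda df: df.possible_to_pik.fillna(0))

toy_possible = toy_flagged.possible_to_pik.mean()   # share of PIK-able records
toy_piked_proportion = 0.25                         # hypothetical share of records PIKed
toy_coverage = toy_piked_proportion / toy_possible  # share of PIK-able records PIKed
print(toy_possible, toy_coverage)
```

The ratio is only an approximation of coverage among PIK-able records if some PIKs were assigned to records that are not PIK-able.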


In [21]:
reference_file = df_ops.concat([
    df_ops.read_parquet(
        f'{simulated_data_output_dir}/simulated_geobase_reference_file.parquet',
    ),
    df_ops.read_parquet(
        f'{simulated_data_output_dir}/simulated_name_dob_reference_file.parquet',
    ),
], ignore_index=True)

Imbalanced dataframe: too_few=False, too_many=True, too_large=False
count    1.000000e+03
mean     3.207876e+08
std      1.413723e+08
min      1.787488e+08
25%      1.794831e+08
50%      3.187717e+08
75%      4.622279e+08
max      4.652384e+08
dtype: float64
Creating partitions of 6,416MB


In [22]:
reference_file_piks = df_ops.persist(reference_file[['record_id', 'pik']])
reference_file_piks

Unnamed: 0_level_0,record_id,pik
npartitions=53,Unnamed: 1_level_1,Unnamed: 2_level_1
,large_string[pyarrow],large_string[pyarrow]
,...,...
...,...,...
,...,...
,...,...


In [23]:
assert len(reference_file_piks) == len(df_ops.drop_duplicates(reference_file_piks[['record_id']]))

In [24]:
pik_simulant_pairs = df_ops.persist(df_ops.drop_duplicates(reference_files_ground_truth.merge(reference_file_piks, on='record_id')[['pik', 'simulant_id']]))

In [25]:
# However, there can be PIKs that correspond to multiple simulants,
# due to errors in the reference file construction by SSN
n_unique_simulants = df_ops.persist(df_ops.groupby_agg_small_groups(pik_simulant_pairs, by='pik', agg_func=lambda x: x.simulant_id.nunique()).rename('n_unique_simulants').reset_index())
df_ops.compute(n_unique_simulants.n_unique_simulants.value_counts())

n_unique_simulants
1      322477198
2       67417553
3        7975921
4         704055
5          53133
         ...    
324            1
114            1
148            1
94             1
144            1
Name: count, Length: 323, dtype: int64

In [26]:
pik_simulant_pairs = df_ops.persist(pik_simulant_pairs.merge(
    n_unique_simulants,
    on='pik',
    how='left',
))
pik_simulant_pairs

Unnamed: 0_level_0,pik,simulant_id,n_unique_simulants
npartitions=500,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,large_string[pyarrow],large_string[pyarrow],int64
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [27]:
df_ops.head(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants == df_ops.compute(pik_simulant_pairs.n_unique_simulants.max())])

Unnamed: 0,pik,simulant_id,n_unique_simulants
100347,384_372037,6817_872601,324
100348,384_372037,8641_806568,324
100349,384_372037,5300_822321,324
100350,384_372037,9772_893626,324
100351,384_372037,5159_620211,324
100352,384_372037,8425_779308,324
100353,384_372037,5398_1010841,324
100354,384_372037,93_808743,324
100355,384_372037,5901_863651,324
100356,384_372037,131_950519,324


## Definitions of accuracy

1. (most strict) Any assigned PIK that corresponds to multiple simulants counts as incorrect
2. Assigned PIKs that correspond to multiple simulants count as neither correct nor incorrect (they are excluded from the denominator)
3. (most lenient) An assigned PIK that corresponds to multiple simulants counts as correct, as long as at least one of those simulants matches the truth
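The three definitions can be compared side by side on a toy example. This is a sketch under stated assumptions: all IDs are made up, `true_simulant_id` is a hypothetical column name, and plain pandas replaces the `df_ops` helpers.

```python
import pandas as pd

# Hypothetical PIK -> simulant pairs; p2 and p4 each correspond to two simulants.
toy_pairs = pd.DataFrame({
    "pik": ["p1", "p2", "p2", "p3", "p4", "p4", "p5"],
    "simulant_id": ["s1", "s2", "s3", "s9", "s5", "s6", "s8"],
})
toy_pairs["n_unique_simulants"] = (
    toy_pairs.groupby("pik")["simulant_id"].transform("nunique")
)

# Hypothetical PIK assignments with the true simulant of each Census record.
toy_assigned = pd.DataFrame({
    "record_id": ["c1", "c2", "c3", "c4", "c5"],
    "pik": ["p1", "p2", "p3", "p4", "p5"],
    "true_simulant_id": ["s1", "s2", "s4", "s7", "s8"],
})

merged = toy_assigned.merge(toy_pairs, on="pik")
match = merged.simulant_id == merged.true_simulant_id
single = merged.n_unique_simulants == 1
n_assigned = len(toy_assigned)

# Definition 1: multi-simulant PIKs always count as incorrect.
def1 = (match & single).sum() / n_assigned
# Definition 2: multi-simulant PIKs are dropped from the denominator.
def2 = (match & single).sum() / merged[single].record_id.nunique()
# Definition 3: a multi-simulant PIK is correct if any of its simulants matches.
def3 = merged.assign(match=match).groupby("record_id")["match"].any().sum() / n_assigned

print(def1, def2, def3)
```

In this toy data, `c2` is what separates the definitions: its PIK covers two simulants, one of which is the truth, so it is wrong under Definition 1, excluded under Definition 2, and right under Definition 3.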

In [28]:
# All modules, Medicare database, calculated from Layne, Wagner, and Rothhaas Table 1 (p. 15)
real_life_pvs_accuracy = 1 - (2_585 + 60_709 + 129_480 + 89_094) / (52_406_981 + 5_170_924 + 49_374_794 + 50_327_034)
f'{real_life_pvs_accuracy:.5%}'

'99.82079%'
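As a worked check of the cell above, the sums can be written out explicitly. This assumes, as the `1 - ...` form of the expression suggests, that the four numerator terms are counts of incorrect links and the four denominator terms are the corresponding totals from the cited table.

$$
1 - \frac{2{,}585 + 60{,}709 + 129{,}480 + 89{,}094}{52{,}406{,}981 + 5{,}170{,}924 + 49{,}374{,}794 + 50{,}327{,}034}
= 1 - \frac{281{,}868}{157{,}279{,}733}
\approx 1 - 0.0017921
= 0.9982079
$$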

### Definition 1

In [29]:
piks_assigned = df_ops.compute(census_2030_piked.pik.notnull().sum())
piks_assigned

285166316

In [30]:
df_ops.head(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants > 1])

Unnamed: 0,pik,simulant_id,n_unique_simulants
2,316_181946,9247_275623,2
3,316_181946,3167_693193,2
9,316_1820962,9971_964124,2
10,316_1820962,5440_947557,2
11,316_1822717,8509_747183,2
12,316_1822717,99_825115,2
25,316_1829481,1609_337270,2
26,316_1829481,1609_337271,2
28,316_1830405,6800_690917,2
29,316_1830405,6800_969214,2


In [31]:
single_sim_piks_correct = df_ops.compute(
    census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs, on='pik').merge(census_2030_ground_truth, on='record_id')
        .pipe(lambda df: (df.simulant_id_x == df.simulant_id_y) & (df.n_unique_simulants == 1))
        .sum()
)
single_sim_piks_correct

228477295

In [32]:
# Overall accuracy, treating the linkage as a black box (only the final PIK is judged)
(
    single_sim_piks_correct / piks_assigned
)

0.8012071629105031

In [33]:
assert len(confirmed_piks_with_ground_truth) == piks_assigned

In [34]:
df_ops.head(census_2030_ground_truth.rename(columns={'record_id': 'record_id_census_2030'}))

Unnamed: 0,record_id_census_2030,simulant_id,possible_to_pik
0,simulated_census_2030_29_279,1614_321,1.0
1,simulated_census_2030_29_363,1614_419,1.0
2,simulated_census_2030_29_1015,1614_1184,1.0
3,simulated_census_2030_29_1234,1614_1439,1.0
4,simulated_census_2030_29_1389,1614_1616,1.0
5,simulated_census_2030_29_1509,1614_1755,1.0
6,simulated_census_2030_29_2762,1614_3245,1.0
7,simulated_census_2030_29_3479,1614_4103,1.0
8,simulated_census_2030_29_3490,1614_4114,1.0
9,simulated_census_2030_29_3691,1614_4357,1.0


In [35]:
# Looking at whether the exact *record* linked was from the same simulant
single_sim_record_links_correct = df_ops.compute(
    confirmed_piks_with_ground_truth
        .merge(
            census_2030_ground_truth.rename(columns={'record_id': 'record_id_raw_input_file'}),
            on='record_id_raw_input_file',
        )
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .pipe(lambda df: (df.simulant_id_x == df.simulant_id_y) & (df.n_unique_simulants == 1))
        .sum()
)
single_sim_record_links_correct

267374805

In [36]:
(
    single_sim_record_links_correct / piks_assigned
)

0.9376100541972846

### Definition 2

In [37]:
single_sim_piks_assigned = len(
    census_2030_piked[['record_id', 'pik']].merge(
        pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants == 1][['pik', 'simulant_id']],
        on='pik',
    )
)
single_sim_piks_assigned

228633754

In [38]:
# Overall accuracy, treating the linkage as a black box (only the final PIK is judged)
(
    single_sim_piks_correct / single_sim_piks_assigned
)

0.9993156784715174

In [39]:
# Looking at whether the exact *record* linked was from the same simulant
single_sim_record_links_assigned = df_ops.compute(
    (confirmed_piks_with_ground_truth
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .n_unique_simulants == 1).sum()
)
single_sim_record_links_assigned

267560471

In [40]:
(
    single_sim_record_links_correct / single_sim_record_links_assigned
)

0.9993060783631227

### Definition 3

In [41]:
pik_simulant_pairs

Unnamed: 0_level_0,pik,simulant_id,n_unique_simulants
npartitions=500,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,large_string[pyarrow],large_string[pyarrow],int64
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [42]:
piks_at_least_partially_correct = df_ops.persist(
    census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs, on='pik').merge(census_2030_ground_truth, on='record_id')
        .pipe(df_ops.drop_duplicates)
        .assign(correct=lambda df: df.simulant_id_x == df.simulant_id_y)
        .pipe(df_ops.groupby_agg_small_groups, by=["record_id", "pik"], agg_func=lambda x: x.correct.any())
        .reset_index()
)
piks_at_least_partially_correct

Unnamed: 0_level_0,record_id,pik,correct
npartitions=500,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,large_string[pyarrow],large_string[pyarrow],bool[pyarrow]
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [43]:
# Overall accuracy, treating it as a black box
piks_correct_proportion = (df_ops.compute(piks_at_least_partially_correct.correct.sum()) / piks_assigned)
piks_correct_proportion

0.9993226198566875

In [44]:
print(f'{piks_correct_proportion:.5%} of the PIKs assigned were correct; compare with {real_life_pvs_accuracy:.5%} in real life')

99.93226% of the PIKs assigned were correct; compare with 99.82079% in real life
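
To make the difference between the three definitions concrete, here is a small sketch on made-up data. It reflects one reading of the cells above: Definition 1 counts a PIK as correct only if it maps to a single simulant and that simulant matches, divided by all PIKs assigned; Definition 2 keeps that numerator but restricts the denominator to single-simulant PIKs; Definition 3 counts a PIK as correct if any of its simulants matches. All values below are hypothetical, and plain pandas stands in for `df_ops`.

```python
import pandas as pd

# Hypothetical ground truth: the simulant behind each census record
census_truth = pd.DataFrame({
    "record_id": ["c1", "c2", "c3", "c4"],
    "simulant_id": ["s1", "s2", "s3", "s4"],
})
# Hypothetical PIK assigned to each census record
census_piked = pd.DataFrame({
    "record_id": ["c1", "c2", "c3", "c4"],
    "pik": ["p1", "p2", "p3", "p3"],
})
# Simulants behind each PIK; p2 mixes two simulants
pik_simulants = pd.DataFrame({
    "pik": ["p1", "p2", "p2", "p3"],
    "simulant_id": ["s1", "s2", "s9", "s3"],
    "n_unique_simulants": [1, 2, 2, 1],
})

pairs = (
    census_piked.merge(pik_simulants, on="pik")
    .merge(census_truth, on="record_id", suffixes=("_pik", "_truth"))
)
pairs["match"] = pairs.simulant_id_pik == pairs.simulant_id_truth
pairs["single"] = pairs.n_unique_simulants == 1

# One row per census record: did any simulant match, and is the PIK single-simulant?
per_record = pairs.groupby("record_id").agg(
    any_match=("match", "any"),
    single=("single", "all"),
)
strict_correct = (per_record.any_match & per_record.single).sum()

definition_1 = strict_correct / len(census_piked)
definition_2 = strict_correct / per_record.single.sum()
definition_3 = per_record.any_match.sum() / len(census_piked)
print(definition_1, definition_2, definition_3)
```

In this toy case record `c2` is correct only under Definition 3, because its PIK also covers a second simulant, and it drops out of the Definition 2 denominator entirely.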


In [45]:
# Looking at whether the exact *record* linked was from the same simulant
sim_record_links_at_least_partially_correct = df_ops.persist(
    confirmed_piks_with_ground_truth
        .merge(
            census_2030_ground_truth.rename(columns={'record_id': 'record_id_raw_input_file'}),
            on='record_id_raw_input_file',
        )
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .assign(correct=lambda df: df.simulant_id_x == df.simulant_id_y)
        .pipe(df_ops.groupby_agg_small_groups, by=["record_id_raw_input_file", "record_id_reference_file", "pik", "module_name", "pass_name"], agg_func=lambda x: x.correct.any())
        .reset_index()
)
sim_record_links_at_least_partially_correct

Unnamed: 0_level_0,record_id_raw_input_file,record_id_reference_file,pik,module_name,pass_name,correct
npartitions=500,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,large_string[pyarrow],large_string[pyarrow],large_string[pyarrow],large_string[pyarrow],large_string[pyarrow],bool[pyarrow]
,...,...,...,...,...,...
...,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...


In [46]:
len(sim_record_links_at_least_partially_correct)

285166316

In [47]:
# Same length as above, so each (input record, reference record) pair appears only once
len(sim_record_links_at_least_partially_correct[['record_id_raw_input_file', 'record_id_reference_file']].drop_duplicates())

285166316

In [48]:
(
    df_ops.compute(sim_record_links_at_least_partially_correct.correct.sum()) / piks_assigned
)

0.9993195619920271

In [49]:
# Sanity check: each input record is linked to at most one distinct reference file record
assert df_ops.compute((df_ops.groupby_agg_small_groups(confirmed_piks_with_ground_truth, by='record_id_raw_input_file', agg_func=lambda x: x.record_id_reference_file.nunique()) <= 1).all())

In [50]:
# Using definition 3 -- at the PIK level
piks_at_least_partially_correct = df_ops.persist(
    piks_at_least_partially_correct
        .rename(columns={'record_id': 'record_id_raw_input_file'})
        .merge(confirmed_piks_with_ground_truth[['record_id_raw_input_file', 'module_name', 'pass_name']], on='record_id_raw_input_file')
)
piks_at_least_partially_correct

Unnamed: 0_level_0,record_id_raw_input_file,pik,correct,module_name,pass_name
npartitions=500,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
,large_string[pyarrow],large_string[pyarrow],bool[pyarrow],large_string[pyarrow],large_string[pyarrow]
,...,...,...,...,...
...,...,...,...,...,...
,...,...,...,...,...
,...,...,...,...,...


In [51]:
# Accuracy by module. With the sample data, this showed the opposite pattern
# relative to the results of Layne et al., who found GeoSearch was much *more* accurate.
# In this full-USA run, GeoSearch is more accurate than NameSearch, though both exceed 99.6%.
df_ops.compute(piks_at_least_partially_correct.groupby("module_name").correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,mean,size
module_name,Unnamed: 1_level_1,Unnamed: 2_level_1
dobsearch,0.0,284
namesearch,0.996635,13179350
geosearch,0.999452,270820589
hhcompsearch,0.999984,1166093


In [52]:
# Accuracy by pass -- could be used to tune pass-specific cutoffs, but
# this might not be too informative while we are still using the sample data.
df_ops.compute(piks_at_least_partially_correct.groupby(["module_name", "pass_name"]).correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,size
module_name,pass_name,Unnamed: 2_level_1,Unnamed: 3_level_1
dobsearch,initials name switch,0.0,284
geosearch,geokey name switch,0.908933,1153
namesearch,DOB and NYSIIS of name,0.996633,13172408
geosearch,geokey,0.999323,217062218
geosearch,some name and DOB information,0.999969,37204075
geosearch,house number and street name Soundex,0.999974,16553079
hhcompsearch,initials,0.999983,1036497
hhcompsearch,year of birth,0.999992,129596
namesearch,DOB and initials,1.0,6942
geosearch,house number and street name Soundex name switch,1.0,64


In [53]:
# Using definition 3 -- at the link level
df_ops.compute(sim_record_links_at_least_partially_correct.groupby("module_name").correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,mean,size
module_name,Unnamed: 1_level_1,Unnamed: 2_level_1
dobsearch,0.0,284
namesearch,0.996634,13179350
geosearch,0.999448,270820589
hhcompsearch,0.999983,1166093


In [54]:
df_ops.compute(sim_record_links_at_least_partially_correct.groupby(["module_name", "pass_name"]).correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,size
module_name,pass_name,Unnamed: 2_level_1,Unnamed: 3_level_1
dobsearch,initials name switch,0.0,284
geosearch,geokey name switch,0.908933,1153
namesearch,DOB and NYSIIS of name,0.996633,13172408
geosearch,geokey,0.99932,217062218
geosearch,some name and DOB information,0.999969,37204075
geosearch,house number and street name Soundex,0.999974,16553079
hhcompsearch,initials,0.999982,1036497
hhcompsearch,year of birth,0.999992,129596
namesearch,DOB and initials,1.0,6942
geosearch,house number and street name Soundex name switch,1.0,64


In [55]:
# Number of incorrect links by module and pass
df_ops.compute(sim_record_links_at_least_partially_correct[~sim_record_links_at_least_partially_correct.correct].groupby(["module_name", "pass_name"]).size()).sort_values()

module_name   pass_name                           
hhcompsearch  year of birth                                1
              initials                                    19
geosearch     geokey name switch                         105
dobsearch     initials name switch                       284
geosearch     house number and street name Soundex       428
              some name and DOB information             1140
namesearch    DOB and NYSIIS of name                   44356
geosearch     geokey                                  147705
dtype: int64
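
As a consistency check, the link-level accuracy of each pass can be recomputed from the incorrect-link counts and the group sizes, since mean = 1 - incorrect / size. The sketch below hard-codes numbers copied from the outputs above (it does not touch the real data) and checks that they agree to the six decimals displayed. Passes with a displayed mean of 1.0 have no incorrect links and are omitted.

```python
# (module, pass): (incorrect links, total links, displayed link-level mean),
# all copied by hand from the outputs above
by_pass = {
    ("hhcompsearch", "year of birth"): (1, 129596, 0.999992),
    ("hhcompsearch", "initials"): (19, 1036497, 0.999982),
    ("geosearch", "geokey name switch"): (105, 1153, 0.908933),
    ("dobsearch", "initials name switch"): (284, 284, 0.0),
    ("geosearch", "house number and street name Soundex"): (428, 16553079, 0.999974),
    ("geosearch", "some name and DOB information"): (1140, 37204075, 0.999969),
    ("namesearch", "DOB and NYSIIS of name"): (44356, 13172408, 0.996633),
    ("geosearch", "geokey"): (147705, 217062218, 0.99932),
}

# Recompute each mean and collect any pass where it disagrees with the display
recomputed = {
    key: round(1 - incorrect / size, 6)
    for key, (incorrect, size, _) in by_pass.items()
}
mismatches = [key for key, (_, _, shown) in by_pass.items() if recomputed[key] != shown]

total_incorrect = sum(incorrect for incorrect, _, _ in by_pass.values())
print(mismatches, total_incorrect)
```

In absolute terms most incorrect links come from the `geokey` pass, even though its accuracy is high, simply because it produces the bulk of all links.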

### Incorrect and missed PIKs

In [56]:
incorrectly_linked_pairs = df_ops.drop_duplicates(
    sim_record_links_at_least_partially_correct[~sim_record_links_at_least_partially_correct.correct]
        [["record_id_raw_input_file", "record_id_reference_file"]]
)
incorrectly_linked_pairs

Unnamed: 0_level_0,record_id_raw_input_file,record_id_reference_file
npartitions=500,Unnamed: 1_level_1,Unnamed: 2_level_1
,large_string[pyarrow],large_string[pyarrow]
,...,...
...,...,...
,...,...
,...,...


In [57]:
len(incorrectly_linked_pairs)

194038

In [58]:
comparison_cols = [
    "first_name",
    "middle_name",
    "last_name",
    "date_of_birth",
    "street_number",
    "street_name",
    "unit_number",
    "city",
    "state",
]

incorrect_links = (
    incorrectly_linked_pairs
        .merge(
            census_2030_piked
                .rename(columns={"record_id": "record_id_raw_input_file", "middle_initial": "middle_name"})
                [["record_id_raw_input_file"] + comparison_cols],
            on="record_id_raw_input_file",
            how="left",
        )
        .merge(
            reference_file
                .rename(columns={"record_id": "record_id_reference_file"})
                .rename(columns=lambda c: c.replace('mailing_address_', ''))
                [["record_id_reference_file"] + comparison_cols],
            on="record_id_reference_file",
            how="left",
            suffixes=("_census", "_reference_file"),
        )
)
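
# Flatten a list of tuples into a single flat list
def flatten(xss):
    return [x for xs in xss for x in xs]

# Show the census and reference file values side by side for each comparison column
df_ops.head(incorrect_links[flatten([(f'{c}_census', f'{c}_reference_file') for c in comparison_cols])])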

Unnamed: 0,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
0,Alyson,Lawrence,C,,Hartman,Hartman,02/02/1944,19440202,1265.0,1265.0,sugar meadow ct,SUGAR MEADOW CT,,,lawrence,LAWRENCE,MA,MA
1,Zyaire,Lindsey,E,Emmy,Jeffery,Jeffery,06/10/1987,19870610,123.0,123.0,gunbarrel rd,GUNBARREL RD,,,south-hanlon,SOUTH-HANLON,TX,TX
2,Gabriel,Gabriel,J,Joel,Jackson,Jackson,12/04/2018,20181204,687.0,,cassata court,,,,quilcene,,WA,
3,Shannon,Eve,M,Mackenzie,Reyes,Reyes,08/29/2026,20260829,14008.0,14008.0,curtis,CURTIS,,,cocoa,COCOA,FL,FL
4,Noah,Michaela,A,Ashleigh,Wisniewski,Wisniewski,09/07/1998,19980907,225.0,225.0,berryhill ln,BERRYHILL LN,,,aurora,AURORA,IL,IL
5,Axel,Hazel,,Alexis,Barrera-Dominguez,Barrera-Dominguez,09/28/2021,20210928,,,lola drive,LOLA DRIVE,,,florham park,FLORHAM PARK,NJ,NJ
6,Jackson,,N,Alexander,Noel,Noel,04/21/2027,20270421,1816.0,1816.0,daniel boone rd,DANIEL BOONE RD,,,memphis,MEMPHIS,TN,TN
7,Sarah,Avery,E,Elias,Roach,Roach,05/08/2019,20190508,8834.0,8834.0,maple hill rd,MAPLE HILL RD,,,new york,NEW YORK,NY,NY
8,Courtney,Courtney,A,Alexis,Brown,Brown,11/15/1988,19881115,707.0,,w alameda ave,,,,warwick,,,
9,Jessica,Jessica,K,Kelsey,Knox,Knox,03/11/2000,20000311,4045.0,,s nicklaus dr,,,,fort lauderdale,,FL,
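
The census and reference file values above use different conventions (lowercase vs. uppercase text, `MM/DD/YYYY` vs. `YYYYMMDD` dates), which makes it hard to see at a glance which fields actually disagree. A minimal sketch of a normalizing comparison follows. `normalize_date`, `normalize_text`, and `differing_fields` are hypothetical helpers, not part of `df_ops` or pseudopeople, and the date formats are inferred from the displayed rows.

```python
from datetime import datetime


def normalize_date(value, fmt):
    """Return an ISO date string, or None if the value is missing or unparseable."""
    if not value:
        return None
    try:
        return datetime.strptime(value, fmt).date().isoformat()
    except ValueError:
        return None


def normalize_text(value):
    """Uppercase and strip a text value; missing values become None."""
    return value.strip().upper() if value else None


def differing_fields(census, reference):
    """List the fields whose normalized values differ between the two records."""
    diffs = []
    for field in census:
        if field == "date_of_birth":
            left = normalize_date(census[field], "%m/%d/%Y")  # assumed census format
            right = normalize_date(reference[field], "%Y%m%d")  # assumed reference format
        else:
            left = normalize_text(census[field])
            right = normalize_text(reference[field])
        if left != right:
            diffs.append(field)
    return diffs


# A subset of the values from row 0 of the table above
census_row = {"first_name": "Alyson", "last_name": "Hartman",
              "date_of_birth": "02/02/1944", "street_name": "sugar meadow ct"}
reference_row = {"first_name": "Lawrence", "last_name": "Hartman",
                 "date_of_birth": "19440202", "street_name": "SUGAR MEADOW CT"}

diffs = differing_fields(census_row, reference_row)
print(diffs)
```

Among the four fields included here, only the first name differs for row 0, which is consistent with a link to another member of the same household.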


In [59]:
reference_files_ground_truth

Unnamed: 0_level_0,record_id,simulant_id,n_unique_simulants
npartitions=500,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,large_string[pyarrow],large_string[pyarrow],int64
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [60]:
missed_links = df_ops.persist(
    census_2030_piked[census_2030_piked.pik.isnull()].drop(columns=["pik"])
        .rename(columns={"middle_initial": "middle_name"})
        .merge(census_2030_ground_truth, on="record_id")
        .merge(
            reference_file
                .rename(columns=lambda c: c.replace('mailing_address_', ''))
                .merge(
                    reference_files_ground_truth[reference_files_ground_truth.n_unique_simulants == 1],
                    on="record_id",
                ),
            on="simulant_id",
            suffixes=("_census", "_reference_file"),
        )
)

In [61]:
df_ops.compute(census_2030_piked.pik.isnull().sum())

57096189

In [62]:
len(missed_links)

172290639
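
`missed_links` has roughly three times as many rows as there are census records without a PIK. That is expected from how it is built: the merge on `simulant_id` is one-to-many, so each unPIKed census record is repeated once per reference file record of the same simulant, and census records whose simulant has no single-simulant reference record drop out of the inner join. A toy sketch with made-up IDs, using plain pandas rather than `df_ops`:

```python
import pandas as pd

# Hypothetical unPIKed census records with their true simulant
unpiked = pd.DataFrame({
    "record_id": ["c1", "c2", "c3"],
    "simulant_id": ["s1", "s2", "s3"],
})
# Hypothetical reference file records: three for s1, one for s2, none for s3
reference = pd.DataFrame({
    "record_id": ["r1", "r2", "r3", "r4"],
    "simulant_id": ["s1", "s1", "s1", "s2"],
})

# One output row per (census record, reference record) pair sharing a simulant
missed = unpiked.merge(
    reference, on="simulant_id", suffixes=("_census", "_reference_file")
)
rows_per_census_record = missed.groupby("record_id_census").size()
print(len(missed), rows_per_census_record.to_dict())
```

So `len(missed_links)` counts candidate record pairs, not missed census records; counting distinct census `record_id` values in `missed_links` would give the number of unPIKed records that had at least one reference record available to link to.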

In [63]:
simulants_missed = df_ops.head(missed_links[['simulant_id']], n=100).simulant_id.unique()
simulants_missed

<ArrowExtensionArray>
[ '2311_717009',  '1933_165980', '1933_1096662',  '1534_760044',
 '1182_1126955',  '1154_282908', '3585_1118058', '1990_1173303',
   '3481_99949', '1667_1064939',  '2965_878474',  '2989_257561',
 '3298_1053751',  '2863_347199',  '3776_972058',  '3541_836512',
 '3541_1086727',  '3465_372954',   '323_286647',  '3380_442456',
  '2721_785995',  '5188_967272', '2599_1032904',  '2787_574160',
 '3607_1002618',  '5670_468489',   '4192_87419',   '496_635934',
  '3825_278063',  '4938_693012',  '4203_549302',  '4950_524699',
  '5892_861598',  '5892_962145']
Length: 34, dtype: large_string[pyarrow]

In [64]:
for simulant in simulants_missed[0:15]:
    print(simulant)
    display(df_ops.head(missed_links[missed_links.simulant_id == simulant][['simulant_id'] + flatten([(f'{c}_census', f'{c}_reference_file') for c in comparison_cols])], n=100))

2311_717009


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
0,2311_717009,Kvy,Ivy,E,Esther,Zepeda-Salas,Zepeda-Salas,12/04/2018,20181204,487,,mcgee st,,,,sulivan,,NY,
1,2311_717009,Kvy,Ivy,E,Esther,Zepeda-Salas,Zepeda-Salas,12/04/2018,20181204,487,487.0,mcgee st,MCGEE ST,,,sulivan,SULLIVAN,NY,NY
2,2311_717009,Kvy,Ivy,E,Esther,Zepeda-Salas,Zepeda-Salas,12/04/2018,20181204,487,158.0,mcgee st,BLYTHE ST,,,sulivan,SAN FRANCISCO,NY,CA


1933_165980


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
3,1933_165980,Jasmin,Jasmin,N,N,,Ta,08/16/1985,,1001,1001.0,mountain brook wy northwest,MOUGTAIN BROOK WY NORTHWEST,,,fresno,FRESNO,CA,CA
4,1933_165980,Jasmin,Jasmin,N,N,,Ta,08/16/1985,,1001,1001.0,mountain brook wy northwest,MOUNTAIN BROOK WY NORTHWEST,,,fresno,FRESNO,CA,CA
5,1933_165980,Jasmin,Jasmin,N,N,,Ta,08/16/1985,,1001,,mountain brook wy northwest,,,,fresno,,CA,


1933_1096662


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
6,1933_1096662,Zain,Zain,I,Itzae,Whute,Ehite,07/13/2024,20240713,761,,franklin av,,,,ardmore,,OK,
7,1933_1096662,Zain,Zain,I,Itzae,Whute,Ehite,07/13/2024,20240713,761,711.0,franklin av,FRANKLIN AV,,,ardmore,,OK,OK
8,1933_1096662,Zain,Zain,I,Itzae,Whute,Ehite,07/13/2024,20240713,761,711.0,franklin av,FRANKLIN AV,,,ardmore,ARDMORE,OK,OK


1534_760044


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
9,1534_760044,Friend,Eileen,E,Ethel,Schick,Schick,12/21/1945,15451221,121,121.0,n lewis ave,N LEWIS AVE,ap 44,,columbia,COLUMBIA,MO,MO
10,1534_760044,Friend,Eileen,E,Ethel,Schick,Schick,12/21/1945,15451221,121,785.0,n lewis ave,HICKMAN ROAD,ap 44,,columbia,METAIRIE,MO,LA
11,1534_760044,Friend,Eileen,E,Ethel,Schick,Schick,12/21/1945,15451221,121,121.0,n lewis ave,N LEWIS AVE,ap 44,AP 44,columbia,COLUMBIA,MO,MO
12,1534_760044,Friend,Eileen,E,Ethel,Schick,Schick,12/21/1945,15451221,121,,n lewis ave,,ap 44,,columbia,,MO,


1182_1126955


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
13,1182_1126955,David,David,J,Joseph,Secker,Secker,03/31/2026,20240311,2633,2633.0,hastings wy,HASTINGS WY,,,gwynn oalc,GWYNN OAK,MD,MD
14,1182_1126955,David,David,J,Joseph,Secker,Secker,03/31/2026,20240311,2633,,hastings wy,,,,gwynn oalc,,MD,


1154_282908


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
15,1154_282908,Michael,Mlichael,C,Colton,Hickey,Of The Home,10/25/1990,19901025,307,,s camino real,,,,chesapeake,,VA,
16,1154_282908,Michael,Mlichael,C,Colton,Hickey,Of The Home,10/25/1990,19901025,307,1404.0,s camino real,SE 322ND ST,,,chesapeake,DOWNERS GROVE,VA,IL
17,1154_282908,Michael,Mlichael,C,Colton,Hickey,Of The Home,10/25/1990,19901025,307,8839.0,s camino real,S HAMILTON RD,,,chesapeake,NOUNT ARLINGTON,VA,NJ
18,1154_282908,Michael,Mlichael,C,Colton,Hickey,Of The Home,10/25/1990,19901025,307,1404.0,s camino real,SE 322ND ST,,,chesapeake,DOWNERS GROVE,VA,IL
19,1154_282908,Michael,Mlichael,C,Colton,Hickey,Of The Home,10/25/1990,19901025,307,307.0,s camino real,S CAMINO REAL,,,chesapeake,CHESAPEAKE,VA,VA


3585_1118058


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
20,3585_1118058,Jorge,Jorge,J,J,Gosa,Gosa,07/16/2008,,10144,10144.0,sussex rd,SUSSEX RD,,,nez perce indian reservation,,ID,ID
21,3585_1118058,Jorge,Jorge,J,J,Gosa,Gosa,07/16/2008,,10144,1772.0,sussex rd,S DRAKE ST,,,nez perce indian reservation,LAWNTON,ID,PA
22,3585_1118058,Jorge,Jorge,J,J,Gosa,Gosa,07/16/2008,,10144,10144.0,sussex rd,,,,nez perce indian reservation,NEZ PERCE INDIAN RESERVATION,ID,ID
23,3585_1118058,Jorge,Jorge,J,J,Gosa,Gosa,07/16/2008,,10144,,sussex rd,,,,nez perce indian reservation,,ID,


1990_1173303


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
24,1990_1173303,Evelyn,Evelyn,P,Poppy,Fountain,Fountain,11/06/2028,20370813,,,perry brook court,,,,holton,,MI,
25,1990_1173303,Evelyn,Evelyn,P,Poppy,Fountain,Fountain,11/06/2028,20370813,,,perry brook court,,,,holton,,MI,


3481_99949


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
26,3481_99949,Naeem,Naeem,J,Jereme,Clayton,Clayton,08/13/2017,20111118,,818.0,us rte 33,67TH PLACE NORTH,,,west sta rosa,BISMARCK,CA,ND
27,3481_99949,Naeem,Naeem,J,Jereme,Clayton,Clayton,08/13/2017,20111118,,,us rte 33,US RTE 33,,,west sta rosa,WEST STA ROSA,CA,CA
28,3481_99949,Naeem,Naeem,J,Jereme,Clayton,Clayton,08/13/2017,20111118,,,us rte 33,,,,west sta rosa,,CA,


1667_1064939


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
29,1667_1064939,Laney,Laney,J,Jaylani,Raseleis,Racelis,09/04/2027,20220908,2729,,dujncan aven sq,,aptmnt 15,,s ramon,,CA,
30,1667_1064939,Laney,Laney,J,Jaylani,Raseleis,Racelis,09/04/2027,20220908,2729,8100.0,dujncan aven sq,SW 110TH PL,aptmnt 15,,s ramon,NEW BALTIMORE,CA,MI
31,1667_1064939,Laney,Laney,J,Jaylani,Raseleis,Racelis,09/04/2027,20220908,2729,2729.0,dujncan aven sq,DUNCAN AVEN SW,aptmnt 15,APTMNT 15,s ramon,S RAMON,CA,CA


2965_878474


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
32,2965_878474,Devin,Grandchild,R,Eduardo,Romero Gonzales,Romero Gonzales,12/15/2001,20011215,5011,5011.0,ashlyn ct,ASHLYN CT,,,soldotna,SOLDOTNA,AK,AK
33,2965_878474,Devin,Grandchild,R,Eduardo,Romero Gonzales,Romero Gonzales,12/15/2001,20011215,5011,,ashlyn ct,,,,soldotna,,AK,
34,2965_878474,Devin,Grandchild,R,Eduardo,Romero Gonzales,Romero Gonzales,12/15/2001,20011215,5011,5011.0,ashlyn ct,ASHLYN CT,,,soldotna,SOLDOTNA,AK,TN


2989_257561


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
35,2989_257561,,Nicole,L,Lesa,Gjibbons,Gibbonst,05/31/1984,19840531,,,saderly rd nw,WAVERLY RD NW,,,high point,HIGH POINT,NC,MT
36,2989_257561,,Nicole,L,Lesa,Gjibbons,Gibbonst,05/31/1984,19840531,,,saderly rd nw,,,,high point,,NC,
37,2989_257561,,Nicole,L,Lesa,Gjibbons,Gibbonst,05/31/1984,19840531,,,saderly rd nw,WAVERLY RD NW,,,high point,HIGH POINT,NC,NC


3298_1053751


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
38,3298_1053751,Jackie,Jackie,G,Gerald,Torres,Torres,,19410603,151,151.0,s louise ave,S LOUISE AVE,,,,ELMHURST,IL,IL
39,3298_1053751,Jackie,Jackie,G,Gerald,Torres,Torres,,19410603,151,,s louise ave,,,,,,IL,


2863_347199


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
40,2863_347199,Ann,Ann,H,M,Kyi,Kyi,05/01/1973,,598,598.0,triplelake dr,TRIPLELAKE DR,,,fresno,FRESNO,CA,CA
41,2863_347199,Ann,Ann,H,M,Kyi,Kyi,05/01/1973,,598,,triplelake dr,,,,fresno,,CA,


3776_972058


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
42,3776_972058,Tyler,Tyler,M,Maahco,Nagle,Nagle,05/22/1998,19980522,3381,3381.0,emerald grove avenue,EMERALD GROVE AVENUE,,,decatur,DECATUR,GA,
43,3776_972058,Tyler,Tyler,M,Maahco,Nagle,Nagle,05/22/1998,19980522,3381,,emerald grove avenue,,,,decatur,,GA,
44,3776_972058,Tyler,Tyler,M,Maahco,Nagle,Nagle,05/22/1998,19980522,3381,3381.0,emerald grove avenue,EMERALD GROVE AVENUE,,,decatur,DECATUR,GA,GA
