# Simulated PIK statistics

Here we inspect the accuracy and characteristics of the PIKs assigned,
leveraging our knowledge of ground truth from pseudopeople.

It wouldn't be possible to do the ground truth part with the real PVS, but
Layne, Wagner, and Rothhaas did something similar by redacting SSN from real records,
sending them through PVS without the SSN, and then using the true SSN
as ground truth.
The health care records they used are probably quite different from a CUF,
but they found a **very** good overall PIK accuracy (about 99.82%, computed in the "Definitions of accuracy" section below).

In [1]:
import datetime

from vivarium_research_prl import distributed_compute, utils

In [2]:
print(datetime.datetime.now())

2024-02-12 16:52:09.282716


In [3]:
# DO NOT EDIT if this notebook is not called ground_truth_accuracy.ipynb!
# This notebook is designed to be run with papermill; this cell is tagged 'parameters'
data_to_use = 'small_sample'
simulated_data_output_dir = 'generate_simulated_data/output'
case_study_output_dir = 'output'

# The "compute engine" is what we use on the Python side
# for our case-study-specific operations,
# as opposed to the Splink engine
compute_engine = 'pandas'
# These only matter if using a distributed compute engine
compute_engine_num_jobs = 3
compute_engine_cpus_per_job = 2
compute_engine_memory_per_job = "5GB"

In [4]:
# Parameters
data_to_use = "ri"
simulated_data_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/generate_simulated_data"
case_study_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/person_linkage_case_study"
compute_engine = "dask"
compute_engine_num_jobs = 20
compute_engine_memory_per_job = "30GB"
compute_engine_cpus_per_job = 2


In [5]:
# Parameters for a USA run
# data_to_use = "usa"
# simulated_data_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/generate_simulated_data"
# case_study_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/person_linkage_case_study"

# compute_engine = 'dask'
# compute_engine_num_jobs = 50
# compute_engine_memory_per_job = "120GB"
# compute_engine_cpus_per_job = 2

In [6]:
case_study_output_dir = f'{case_study_output_dir}/{data_to_use}'
simulated_data_output_dir = f'{simulated_data_output_dir}/{data_to_use}'

In [7]:
import os
from pathlib import Path
os.environ["PATH"] = f"{Path('./slurm_within_singularity').resolve()}:{os.environ['PATH']}"

In [8]:
df_ops, pd = distributed_compute.start_compute_engine(
    compute_engine,
    num_jobs=compute_engine_num_jobs,
    cpus_per_job=compute_engine_cpus_per_job,
    memory_per_job=compute_engine_memory_per_job,
)

Client: connection method: Cluster object; cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://10.158.111.17:8787/status
Scheduler: tcp://10.158.111.17:45491 (started 26 minutes ago)
Workers: 20; total threads: 20; total memory: 558.80 GiB
Each worker: 1 thread, 27.94 GiB memory, local directory under /tmp/zmbc_dask/dask-scratch-space/


In [9]:
census_2030_piked = df_ops.read_parquet(f'{case_study_output_dir}/census_2030_piked.parquet')
confirmed_piks_with_ground_truth = df_ops.read_parquet(f'{case_study_output_dir}/confirmed_piks.parquet')

Imbalanced dataframe: too_few=True, too_many=False, too_large=False
count    3.000000e+00
mean     8.964830e+07
std      1.098194e+05
min      8.952239e+07
25%      8.961028e+07
50%      8.969817e+07
75%      8.971125e+07
max      8.972434e+07
dtype: float64


Imbalanced dataframe: too_few=True, too_many=False, too_large=False
count    2.000000e+00
mean     9.232265e+07
std      9.044807e+06
min      8.592701e+07
25%      8.912483e+07
50%      9.232265e+07
75%      9.552047e+07
max      9.871830e+07
dtype: float64


In [10]:
piked_proportion = df_ops.compute(census_2030_piked.pik.notnull().mean())
# Compare with 90.28% of input records PIKed in the 2010 CUF,
# as reported in Wagner and Layne, Table 2, p. 18 
print(f'{piked_proportion:.2%} of the input records were PIKed')

85.85% of the input records were PIKed


In [11]:
# Multiple Census rows assigned the same PIK, indicating the model thinks they are duplicates in Census
pik_sizes = df_ops.persist(df_ops.groupby_agg_small_groups(census_2030_piked, by='pik', agg_func=lambda x: x.size()))
df_ops.compute(pik_sizes.value_counts())

1    941940
2       477
3         2
Name: count, dtype: int64
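
As a rough illustration of the cell above, the following sketch reproduces the same kind of count in plain pandas on toy data. It assumes that `df_ops.groupby_agg_small_groups` behaves like an ordinary `groupby` aggregation, which is not confirmed here; the record IDs and PIK values are made up.

```python
import pandas as pd

# Toy stand-in for census_2030_piked (hypothetical values).
census = pd.DataFrame({
    "record_id": ["c1", "c2", "c3", "c4", "c5"],
    "pik": ["p1", "p1", "p2", "p3", None],
})

# Number of census rows per PIK; groupby drops the un-PIKed (null) row.
pik_sizes = census.groupby("pik").size()

# How many PIKs were assigned to exactly 1, 2, ... census rows.
size_counts = pik_sizes.value_counts().sort_index()
print(size_counts)
```

Any PIK with a size above 1 corresponds to census rows that the model treated as duplicates of each other.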

In [12]:
# Interesting: in pseudopeople, sometimes siblings are assigned the same (common) first name, making them almost identical.
# The only giveaway is their age and DOB.
# Presumably, this tends not to happen in real life.
duplicate_piks = pik_sizes.rename('pik_size').reset_index().pipe(lambda df: df[df.pik_size > 1])

df_ops.head(census_2030_piked.merge(duplicate_piks, on="pik")).sort_values('pik')

Unnamed: 0,household_id,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,housing_type,relationship_to_reference_person,sex,race_ethnicity,year,record_id,pik,pik_size
6,8997_599780,,D,Mungarro,28,02/11/2029,178.0,squash hllw rd,,providence,RI,2915,Household,Reference person,Female,Latino,2030,simulated_census_2030_2_47916,12_8442,2
7,8997_599780,Jose,G,Mungarro,1,02/11/2029,,squash hllw rd,,providence,RI,2915,Household,Biological child,Male,Latino,2030,simulated_census_2030_2_49403,12_8442,2
0,8612_248111,Amanda,D,Linkerfeldd,51,10/11/1978,463.0,courvjlle ave,unit 1359,westerly,RI,2860,Household,Opposite-sex spouse,Female,Latino,2030,simulated_census_2030_2_18594,20_16326,2
1,8612_248111,Alana,S,Lingerfeldt,19,10/11/1978,463.0,courville ave,unit 1359,westerly,RI,2860,Household,Biological child,Female,Latino,2030,simulated_census_2030_2_18595,20_16326,2
8,9225_138217,Cody,J,Wilsoj,42,04/23/1987,40058.0,corey ave sw,,providence,RI,2840,Household,Opposite-sex spouse,Male,White,2030,simulated_census_2030_2_61837,51_4315,2
9,9225_138217,Jackson,J,Wilson,11,04/23/1987,40058.0,corey ave sw,,providence,RI,2840,Household,Biological child,Male,White,2030,simulated_census_2030_2_61838,51_4315,2
4,8877_33562,Cayden,,Elmer,9,01/20/2021,382.0,hill ave,,lincoln,,2920,Household,Biological child,Male,White,2030,simulated_census_2030_2_43040,70_27612,2
5,8877_33562,Yuna,A,Elmer,9,01/20/2021,382.0,hill ave,,lincoln,RI,2920,Household,Biological child,Female,White,2030,simulated_census_2030_2_43044,70_27612,2
2,8641_349693,Husband,U,Judkins,60,07/21/1967,409.0,red gum st,,coventry,RI,2857,Household,Reference person,Female,,2030,simulated_census_2030_2_24831,80_11577,2
3,8641_349693,Brian,T,Judkins,62,07/21/1967,409.0,tred gum st,,coventry,RI,2857,Household,Opposite-sex spouse,Male,White,2030,simulated_census_2030_2_24832,80_11577,2


## Ground truth statistics

In [13]:
census_2030_ground_truth = df_ops.persist(
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_census_2030_ground_truth.parquet')
)

Imbalanced dataframe: too_few=True, too_many=False, too_large=False
count    3.000000e+00
mean     2.367794e+07
std      1.118165e+07
min      1.076649e+07
25%      2.044768e+07
50%      3.012888e+07
75%      3.013367e+07
max      3.013846e+07
dtype: float64


In [14]:
# In this version of pseudopeople, there are no actual duplicates in Census,
# which means all of the duplicates identified above are wrong.
assert len(census_2030_ground_truth) == len(df_ops.drop_duplicates(census_2030_ground_truth))

Imbalanced dataframe: too_few=True, too_many=False, too_large=False
count    3.000000e+00
mean     2.084089e+07
std      9.848013e+06
min      9.469383e+06
25%      1.799519e+07
50%      2.652099e+07
75%      2.652664e+07
max      2.653229e+07
dtype: float64


In [15]:
reference_files_ground_truth = df_ops.persist(df_ops.concat([
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_geobase_reference_file_ground_truth.parquet').drop(columns=['n_unique_simulants']),
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_name_dob_reference_file_ground_truth.parquet').drop(columns=['n_unique_simulants']),
], ignore_index=True))

In [16]:
# However, there can be reference file records that correspond to multiple simulants,
# due to errors in the reference file construction by SSN
n_unique_simulants = df_ops.persist(df_ops.groupby_agg_small_groups(reference_files_ground_truth, by='record_id', agg_func=lambda x: x.simulant_id.nunique()).rename('n_unique_simulants').reset_index())
df_ops.compute(n_unique_simulants.n_unique_simulants.value_counts())

n_unique_simulants
1    3105541
2    1552628
3      55256
4       3012
5        156
6          9
7          1
Name: count, dtype: int64

In [17]:
reference_files_ground_truth = df_ops.persist(reference_files_ground_truth.merge(
    n_unique_simulants,
    on='record_id',
    how='left',
))
reference_files_ground_truth.head(n=100)

Unnamed: 0,record_id,simulant_id,n_unique_simulants
0,simulated_geobase_reference_file_27_14069,8192_1107427,2
1,simulated_geobase_reference_file_27_14069,7016_1077345,2
2,simulated_geobase_reference_file_27_16234,7927_1071261,2
3,simulated_geobase_reference_file_27_16234,3723_616629,2
4,simulated_geobase_reference_file_27_16877,558_1009816,2
...,...,...,...
95,simulated_geobase_reference_file_27_45078,9292_766079,1
96,simulated_geobase_reference_file_27_45085,6487_7497,1
97,simulated_geobase_reference_file_27_45634,5950_742843,2
98,simulated_geobase_reference_file_27_45634,9402_133838,2


In [18]:
df_ops.head(reference_files_ground_truth[reference_files_ground_truth.n_unique_simulants == df_ops.compute(reference_files_ground_truth.n_unique_simulants.max())])

Unnamed: 0,record_id,simulant_id,n_unique_simulants
44377,simulated_geobase_reference_file_37_46808,7125_711458,7
44378,simulated_geobase_reference_file_37_46808,7125_711462,7
44379,simulated_geobase_reference_file_37_46808,7125_711454,7
44380,simulated_geobase_reference_file_37_46808,6545_389063,7
44381,simulated_geobase_reference_file_37_46808,7125_711460,7
44382,simulated_geobase_reference_file_37_46808,7125_711463,7
44383,simulated_geobase_reference_file_37_46808,7125_711453,7


In [19]:
census_2030_ground_truth = df_ops.persist(census_2030_ground_truth.merge(
    df_ops.drop_duplicates(reference_files_ground_truth[['simulant_id']]).assign(possible_to_pik=1),
    on='simulant_id',
    how='left',
).assign(possible_to_pik=lambda df: df.possible_to_pik.fillna(0)))
possible_to_pik_proportion = df_ops.compute(census_2030_ground_truth.possible_to_pik.mean())
print(
    f'{(1 - possible_to_pik_proportion):.2%} of the input records are '
    'impossible to PIK correctly, since they are not in any reference files'
)

0.51% of the input records are impossible to PIK correctly, since they are not in any reference files


In [20]:
print(
    f'Assigned PIKs to {(piked_proportion / possible_to_pik_proportion):.2%} of PIK-able records'
)

Assigned PIKs to 86.30% of PIK-able records
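
The two cells above can be sketched on toy data as follows. The simulant IDs and the 60% PIK rate are hypothetical, and plain pandas stands in for `df_ops`. Note that the final ratio implicitly treats every assigned PIK as going to a PIK-able record, so it is best read as an approximation.

```python
import pandas as pd

# Toy ground truth (hypothetical): census simulants, and simulants that
# appear in at least one reference file.
census_truth = pd.DataFrame({"simulant_id": ["s1", "s2", "s3", "s4"]})
in_reference = pd.DataFrame({"simulant_id": ["s1", "s2", "s3"]})

# Left merge, then treat missing flags as "not PIK-able".
census_truth = census_truth.merge(
    in_reference.assign(possible_to_pik=1), on="simulant_id", how="left"
).assign(possible_to_pik=lambda df: df.possible_to_pik.fillna(0))

# Share of census records that could, in principle, be PIKed correctly.
possible_to_pik_proportion = census_truth.possible_to_pik.mean()

# Suppose 60% of all census records received a PIK (hypothetical).
piked_proportion = 0.6
piked_among_pikable = piked_proportion / possible_to_pik_proportion
print(f"Assigned PIKs to {piked_among_pikable:.2%} of PIK-able records")
```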


In [21]:
reference_file = df_ops.concat([
    df_ops.read_parquet(
        f'{simulated_data_output_dir}/simulated_geobase_reference_file.parquet',
    ),
    df_ops.read_parquet(
        f'{simulated_data_output_dir}/simulated_name_dob_reference_file.parquet',
    ),
], ignore_index=True)

In [22]:
reference_file_piks = df_ops.persist(reference_file[['record_id', 'pik']])
reference_file_piks

Unnamed: 0_level_0,record_id,pik
npartitions=188,Unnamed: 1_level_1,Unnamed: 2_level_1
,large_string[pyarrow],large_string[pyarrow]
,...,...
...,...,...
,...,...
,...,...


In [23]:
assert len(reference_file_piks) == len(df_ops.drop_duplicates(reference_file_piks[['record_id']]))

In [24]:
pik_simulant_pairs = df_ops.persist(df_ops.drop_duplicates(reference_files_ground_truth.merge(reference_file_piks, on='record_id')[['pik', 'simulant_id']]))

In [25]:
# However, there can be PIKs that correspond to multiple simulants,
# due to errors in the reference file construction by SSN
n_unique_simulants = df_ops.persist(df_ops.groupby_agg_small_groups(pik_simulant_pairs, by='pik', agg_func=lambda x: x.simulant_id.nunique()).rename('n_unique_simulants').reset_index())
df_ops.compute(n_unique_simulants.n_unique_simulants.value_counts())

n_unique_simulants
3     648740
2     622121
5     164197
1     126280
4      84569
7      32508
6      29063
8       8479
9       5805
10      2914
11      1237
12       883
13       404
14       382
16       196
15       184
17       105
18       104
19        75
20        40
22        35
21        33
25        17
23        15
24        14
26        11
27         8
28         6
30         3
29         2
32         2
36         2
34         1
Name: count, dtype: int64

In [26]:
pik_simulant_pairs = df_ops.persist(pik_simulant_pairs.merge(
    n_unique_simulants,
    on='pik',
    how='left',
))
pik_simulant_pairs

Unnamed: 0_level_0,pik,simulant_id,n_unique_simulants
npartitions=188,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,large_string[pyarrow],large_string[pyarrow],int64
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [27]:
df_ops.head(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants == df_ops.compute(pik_simulant_pairs.n_unique_simulants.max())])

Unnamed: 0,pik,simulant_id,n_unique_simulants
21650,19_9678,6539_306331,36
21651,19_9678,3167_24580,36
21652,19_9678,5703_612686,36
21653,19_9678,4672_664811,36
21654,19_9678,6654_869908,36
21655,19_9678,6790_78098,36
21656,19_9678,4260_726870,36
21657,19_9678,3304_605514,36
21658,19_9678,7359_424878,36
21659,19_9678,1935_606207,36


## Definitions of accuracy

1. (most strict) Assigning any PIK with multiple simulants is incorrect
2. Assigning a PIK with multiple simulants is neither incorrect nor correct (excluded from denominator)
3. (most lenient) Assigning a PIK with multiple simulants is correct, as long as at least one of those simulants matches the truth
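
To make the three definitions concrete, here is a minimal sketch on toy data. The table and its column `any_match` are hypothetical: it has one row per assigned PIK, with the number of simulants behind that PIK and whether any of them is the true simulant.

```python
import pandas as pd

# Toy data (hypothetical): one row per PIK assigned to a census record.
# n_unique_simulants is the number of distinct simulants behind the PIK;
# any_match says whether at least one of them is the true simulant.
assigned = pd.DataFrame({
    "record_id": ["r1", "r2", "r3", "r4"],
    "n_unique_simulants": [1, 1, 2, 3],
    "any_match": [True, False, True, True],
})

single = assigned.n_unique_simulants == 1

# Definition 1: a PIK with multiple simulants always counts as incorrect.
accuracy_1 = (single & assigned.any_match).sum() / len(assigned)

# Definition 2: PIKs with multiple simulants are dropped from the denominator.
accuracy_2 = (single & assigned.any_match).sum() / single.sum()

# Definition 3: a PIK is correct if any of its simulants matches the truth.
accuracy_3 = assigned.any_match.sum() / len(assigned)

print(accuracy_1, accuracy_2, accuracy_3)
```

Definitions 1 and 3 share a denominator, so definition 1 can never exceed definition 3. Definition 2 only looks at single-simulant PIKs, which is why it can come out much higher than the other two when most PIKs have multiple simulants.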

In [28]:
# All modules, Medicare database, calculated from Layne, Wagner, and Rothhaas Table 1 (p. 15)
real_life_pvs_accuracy = 1 - (2_585 + 60_709 + 129_480 + 89_094) / (52_406_981 + 5_170_924 + 49_374_794 + 50_327_034)
f'{real_life_pvs_accuracy:.5%}'

'99.82079%'

### Definition 1

In [29]:
piks_assigned = df_ops.compute(census_2030_piked.pik.notnull().sum())
piks_assigned

942900

In [30]:
df_ops.head(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants > 1])

Unnamed: 0,pik,simulant_id,n_unique_simulants
0,47_14391,5399_690029,2
1,47_14391,5072_473330,2
2,47_14570,5399_594583,2
3,47_14570,3254_712882,2
4,47_14607,4400_342761,5
5,47_14607,8817_402119,5
6,47_14607,8509_706510,5
7,47_14607,3298_740748,5
8,47_14607,5399_22101,5
11,47_15153,9292_733960,2


In [31]:
single_sim_piks_correct = df_ops.compute(
    census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs, on='pik').merge(census_2030_ground_truth, on='record_id')
        .pipe(lambda df: (df.simulant_id_x == df.simulant_id_y) & (df.n_unique_simulants == 1))
        .sum()
)
single_sim_piks_correct

64790

In [32]:
# Overall accuracy, treating it as a black box
(
    single_sim_piks_correct / piks_assigned
)

0.06871354332378832

In [33]:
assert len(confirmed_piks_with_ground_truth) == piks_assigned

In [34]:
df_ops.head(census_2030_ground_truth.rename(columns={'record_id': 'record_id_census_2030'}))

Unnamed: 0,record_id_census_2030,simulant_id,possible_to_pik
0,simulated_census_2030_2_29,8425_13861,1.0
1,simulated_census_2030_2_50,8425_25223,1.0
2,simulated_census_2030_2_92,8425_41180,1.0
3,simulated_census_2030_2_262,8425_113842,1.0
4,simulated_census_2030_2_426,8425_181938,1.0
5,simulated_census_2030_2_512,8425_215012,1.0
6,simulated_census_2030_2_551,8425_233178,1.0
7,simulated_census_2030_2_646,8425_273682,1.0
8,simulated_census_2030_2_653,8425_276622,1.0
9,simulated_census_2030_2_857,8425_362226,1.0


In [35]:
# Looking at whether the exact *record* linked was from the same simulant
single_sim_record_links_correct = df_ops.compute(
    confirmed_piks_with_ground_truth
        .merge(
            census_2030_ground_truth.rename(columns={'record_id': 'record_id_raw_input_file'}),
            on='record_id_raw_input_file',
        )
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .pipe(lambda df: (df.simulant_id_x == df.simulant_id_y) & (df.n_unique_simulants == 1))
        .sum()
)
single_sim_record_links_correct

82564

In [36]:
(
    single_sim_record_links_correct / piks_assigned
)

0.08756389861066921

### Definition 2

In [37]:
single_sim_piks_assigned = len(census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants == 1][['pik', 'simulant_id']]))
single_sim_piks_assigned

67014

In [38]:
# Overall accuracy, treating it as a black box
(
    single_sim_piks_correct / single_sim_piks_assigned
)

0.9668129047661682

In [39]:
# Looking at whether the exact *record* linked was from the same simulant
single_sim_record_links_assigned = df_ops.compute(
    (confirmed_piks_with_ground_truth
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .n_unique_simulants == 1).sum()
)
single_sim_record_links_assigned

440042

In [40]:
(
    single_sim_record_links_correct / single_sim_record_links_assigned
)

0.18762754464346584

### Definition 3

In [41]:
pik_simulant_pairs

Unnamed: 0_level_0,pik,simulant_id,n_unique_simulants
npartitions=188,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,large_string[pyarrow],large_string[pyarrow],int64
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [42]:
piks_at_least_partially_correct = df_ops.persist(
    census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs, on='pik').merge(census_2030_ground_truth, on='record_id')
        .pipe(df_ops.drop_duplicates)
        .assign(correct=lambda df: df.simulant_id_x == df.simulant_id_y)
        .pipe(df_ops.groupby_agg_small_groups, by=["record_id", "pik"], agg_func=lambda x: x.correct.any())
        .reset_index()
)
piks_at_least_partially_correct

Unnamed: 0_level_0,record_id,pik,correct
npartitions=188,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,large_string[pyarrow],large_string[pyarrow],bool[pyarrow]
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [43]:
# Overall accuracy, treating it as a black box
piks_correct_proportion = (df_ops.compute(piks_at_least_partially_correct.correct.sum()) / piks_assigned)
piks_correct_proportion

0.5675723830734967

In [44]:
print(f'{piks_correct_proportion:.5%} of the PIKs assigned were correct; compare with {real_life_pvs_accuracy:.5%} in real life')

56.75724% of the PIKs assigned were correct; compare with 99.82079% in real life


In [45]:
# Looking at whether the exact *record* linked was from the same simulant
sim_record_links_at_least_partially_correct = df_ops.persist(
    confirmed_piks_with_ground_truth
        .merge(
            census_2030_ground_truth.rename(columns={'record_id': 'record_id_raw_input_file'}),
            on='record_id_raw_input_file',
        )
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .assign(correct=lambda df: df.simulant_id_x == df.simulant_id_y)
        .pipe(df_ops.groupby_agg_small_groups, by=["record_id_raw_input_file", "record_id_reference_file", "pik", "module_name", "pass_name"], agg_func=lambda x: x.correct.any())
        .reset_index()
)
sim_record_links_at_least_partially_correct

Unnamed: 0_level_0,record_id_raw_input_file,record_id_reference_file,pik,module_name,pass_name,correct
npartitions=112,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,large_string[pyarrow],large_string[pyarrow],large_string[pyarrow],large_string[pyarrow],large_string[pyarrow],bool[pyarrow]
,...,...,...,...,...,...
...,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...


In [46]:
len(sim_record_links_at_least_partially_correct)

942900

In [47]:
len(sim_record_links_at_least_partially_correct[['record_id_raw_input_file', 'record_id_reference_file']].drop_duplicates())

942900

In [48]:
(
    df_ops.compute(sim_record_links_at_least_partially_correct.correct.sum()) / piks_assigned
)

0.202855021741436

In [49]:
assert df_ops.compute((df_ops.groupby_agg_small_groups(confirmed_piks_with_ground_truth, by='record_id_raw_input_file', agg_func=lambda x: x.record_id_reference_file.nunique()) <= 1).all())

In [50]:
# Using definition 3 -- at the PIK level
piks_at_least_partially_correct = df_ops.persist(
    piks_at_least_partially_correct
        .rename(columns={'record_id': 'record_id_raw_input_file'})
        .merge(confirmed_piks_with_ground_truth[['record_id_raw_input_file', 'module_name', 'pass_name']], on='record_id_raw_input_file')
)
piks_at_least_partially_correct

Unnamed: 0_level_0,record_id_raw_input_file,pik,correct,module_name,pass_name
npartitions=188,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
,large_string[pyarrow],large_string[pyarrow],bool[pyarrow],large_string[pyarrow],large_string[pyarrow]
,...,...,...,...,...
...,...,...,...,...,...
,...,...,...,...,...
,...,...,...,...,...


In [51]:
# Accuracy by module -- with the sample data, this showed the opposite pattern
# relative to the results of Layne et al., who found GeoSearch was much *more* accurate.
# In this run, GeoSearch is slightly more accurate than NameSearch and DOBSearch.
df_ops.compute(piks_at_least_partially_correct.groupby("module_name").correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,mean,size
module_name,Unnamed: 1_level_1,Unnamed: 2_level_1
namesearch,0.507764,42955
dobsearch,0.561717,3192
geosearch,0.570005,880505
hhcompsearch,0.595027,16248


In [52]:
# Accuracy by pass -- could be used to tune pass-specific cutoffs, but
# this might not be too informative while we are still using the sample data.
df_ops.compute(piks_at_least_partially_correct.groupby(["module_name", "pass_name"]).correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,size
module_name,pass_name,Unnamed: 2_level_1,Unnamed: 3_level_1
dobsearch,initials name switch,0.0,2
dobsearch,first three characters of name,0.0,1
geosearch,geokey name switch,0.5,8
namesearch,DOB and NYSIIS of name,0.500751,31271
namesearch,year of birth and first two characters of name,0.50885,678
hhcompsearch,year of birth,0.523145,3154
namesearch,DOB and initials,0.523253,9762
geosearch,house number and street name Soundex,0.559135,56278
geosearch,some name and DOB information,0.560421,150808
dobsearch,reverse Soundex of name,0.561102,1743


In [53]:
# Using definition 3 -- at the link level
df_ops.compute(sim_record_links_at_least_partially_correct.groupby("module_name").correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,mean,size
module_name,Unnamed: 1_level_1,Unnamed: 2_level_1
geosearch,0.190662,880505
hhcompsearch,0.191285,16248
namesearch,0.438878,42955
dobsearch,0.448935,3192


In [54]:
df_ops.compute(sim_record_links_at_least_partially_correct.groupby(["module_name", "pass_name"]).correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,size
module_name,pass_name,Unnamed: 2_level_1,Unnamed: 3_level_1
dobsearch,initials name switch,0.0,2
dobsearch,first three characters of name,0.0,1
geosearch,geokey name switch,0.125,8
hhcompsearch,year of birth,0.181674,3154
geosearch,house number and street name Soundex,0.188759,56278
geosearch,some name and DOB information,0.188975,150808
geosearch,geokey,0.1912,673411
hhcompsearch,initials,0.1936,13094
namesearch,DOB and initials,0.430854,9762
namesearch,birthday and first two characters of name,0.43328,1244


In [55]:
df_ops.compute(sim_record_links_at_least_partially_correct[~sim_record_links_at_least_partially_correct.correct].groupby(["module_name", "pass_name"]).size()).sort_values()

module_name   pass_name                                           
dobsearch     first three characters of name                               1
              initials name switch                                         2
geosearch     geokey name switch                                           7
namesearch    year of birth and first two characters of name             384
              birthday and first two characters of name                  705
dobsearch     first two characters of first name and year of birth       797
              reverse Soundex of name                                    959
hhcompsearch  year of birth                                             2581
namesearch    DOB and initials                                          5556
hhcompsearch  initials                                                 10559
namesearch    DOB and NYSIIS of name                                   17458
geosearch     house number and street name Soundex                     45655
         

### Incorrect and missed PIKs

In [56]:
incorrectly_linked_pairs = df_ops.drop_duplicates(
    sim_record_links_at_least_partially_correct[~sim_record_links_at_least_partially_correct.correct]
        [["record_id_raw_input_file", "record_id_reference_file"]]
)
incorrectly_linked_pairs

Unnamed: 0_level_0,record_id_raw_input_file,record_id_reference_file
npartitions=112,Unnamed: 1_level_1,Unnamed: 2_level_1
,large_string[pyarrow],large_string[pyarrow]
,...,...
...,...,...
,...,...
,...,...


In [57]:
len(incorrectly_linked_pairs)

751628

In [58]:
comparison_cols = [
    "first_name",
    "middle_name",
    "last_name",
    "date_of_birth",
    "street_number",
    "street_name",
    "unit_number",
    "city",
    "state",
]

incorrect_links = (
    incorrectly_linked_pairs
        .merge(
            census_2030_piked
                .rename(columns={"record_id": "record_id_raw_input_file", "middle_initial": "middle_name"})
                [["record_id_raw_input_file"] + comparison_cols],
            on="record_id_raw_input_file",
            how="left",
        )
        .merge(
            reference_file
                .rename(columns={"record_id": "record_id_reference_file"})
                .rename(columns=lambda c: c.replace('mailing_address_', ''))
                [["record_id_reference_file"] + comparison_cols],
            on="record_id_reference_file",
            how="left",
            suffixes=("_census", "_reference_file"),
        )
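)


def flatten(xss):
    """Flatten one level of nesting, e.g. [(a, b), (c, d)] -> [a, b, c, d]."""
    return [x for xs in xss for x in xs]


# Interleave the census and reference-file version of each column for side-by-side reading
side_by_side_cols = flatten([(f'{c}_census', f'{c}_reference_file') for c in comparison_cols])
df_ops.head(incorrect_links[side_by_side_cols])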

Unnamed: 0,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
0,Arman,Arman,N,Nicholas,Durham,Durham,11/12/2002,20021112,22428,,main street,,,,east greenwich,PROVIDENCE,,RI
1,Wendy,Wendy,C,Crystal,Chi,Chi,10/16/1977,19771016,622,622.0,deering st,DEERING ST,,,blk island,BLK ISLAND,RI,RI
2,Charlee,Charlee,G,Gracie,Baird,Baird,03/06/2028,20280306,207,,allegheny ave,,,,richmond,,RI,
3,Charlene,Charlene,A,Angela,Hood,Hood,07/04/1964,19640704,2683,2683.0,conover rd,CONOVER RD,,,woonsocket,WOONSOCKET,RI,RI
4,Jordan,Jordan,H,Jonathan,Shore,Shore,03/25/2012,20120325,1970,1970.0,south linwood avenue,SOUTH LINWOOD AVENUE,,,north providence,NORTH PROVIDENCE,RI,RI
5,Tzipora,Tzipora,M,Macie,Pierre,Pierre,01/20/2004,20040120,277,277.0,fairview av,FAIRVIEW AVE,,,warwick,WARWICK,RI,RI
6,Brantley,Brantley,M,Mylo,Deshay,Deshay,11/16/2013,20131116,768,768.0,dunvegan dr,DUNVEGAN DR,,,block island,BLOCK ISLAND,RI,RI
7,GBrebnan,Brennan,K,Kaysen,Papadopoulos,Papadopoulos,04/27/2017,20170427,2615,2615.0,rd n,RD N,,,providence,PROVIDENCE,RI,RI
8,Ayden,Ayden,E,Eli,Briggs,Briggs,10/18/2021,20211018,16425,,piney grove dr,,,,narragansett,,RI,
9,Nennjfer,Jennifer,A,Angela,Woodward,Woodward,01/07/1972,19720107,101,101.0,mcginnis ferry rd,MCGINNIS FERRY RD,,,east providence,EAST PROVIDENCE,RI,RI
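
The table above mixes formats between the two files: census dates look like `MM/DD/YYYY` while reference-file dates look like `YYYYMMDD`, reference-file street numbers show up as float-like strings (`622.0`), and the text columns differ in case. The following is a minimal sketch, not part of the case study code, of helpers that could make such pairs easier to compare by eye. The function names (`parse_dob`, `normalize_street_number`, `same_text`) are hypothetical, and the date formats are inferred from the rows shown here rather than from any documented schema.

```python
from datetime import datetime


def parse_dob(value, fmt):
    """Return a date, or None if the value is missing or not a valid date in fmt."""
    if not value:
        return None
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        # Noised values such as a month of 19 end up here
        return None


def normalize_street_number(value):
    """Drop a trailing '.0' left over from a float conversion; keep anything else as-is."""
    if not value:
        return None
    return value[:-2] if value.endswith(".0") else value


def same_text(a, b):
    """Case-insensitive equality; a missing value never counts as a match."""
    if not a or not b:
        return False
    return a.strip().casefold() == b.strip().casefold()


# Values copied from rows 1 and 5 of the table above
dob_agrees = parse_dob("10/16/1977", "%m/%d/%Y") == parse_dob("19771016", "%Y%m%d")
number_agrees = normalize_street_number("622") == normalize_street_number("622.0")
street_agrees = same_text("deering st", "DEERING ST")
abbreviation_agrees = same_text("fairview av", "FAIRVIEW AVE")
```

With these helpers, row 1 agrees on date of birth, street number, and street name, while row 5's `fairview av` versus `FAIRVIEW AVE` still counts as a disagreement because only case is normalized, not abbreviations.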


In [59]:
reference_files_ground_truth

Unnamed: 0_level_0,record_id,simulant_id,n_unique_simulants
npartitions=112,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,large_string[pyarrow],large_string[pyarrow],int64
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [60]:
missed_links = df_ops.persist(
    census_2030_piked[census_2030_piked.pik.isnull()].drop(columns=["pik"])
        .rename(columns={"middle_initial": "middle_name"})
        .merge(census_2030_ground_truth, on="record_id")
        .merge(
            reference_file
                .rename(columns=lambda c: c.replace('mailing_address_', ''))
                .merge(
                    reference_files_ground_truth[reference_files_ground_truth.n_unique_simulants == 1],
                    on="record_id",
                ),
            on="simulant_id",
            suffixes=("_census", "_reference_file"),
        )
)

In [61]:
df_ops.compute(census_2030_piked.pik.isnull().sum())

155364

In [62]:
len(missed_links)

271427
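
Note that `len(missed_links)` (271427) is larger than the number of census records without a PIK (155364). This follows from the merge on `simulant_id`: each un-PIKed census record contributes one row per reference-file record with the same `simulant_id` (restricted to reference records with `n_unique_simulants == 1`), and contributes no row at all if there is no such record. As a rough illustration, assuming an inner join and ignoring null keys, the row count can be reproduced from key counts alone. The function name and the toy keys below are made up for this sketch.

```python
from collections import Counter


def inner_join_row_count(left_keys, right_keys):
    """Number of rows an inner join on a single key column would produce."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key yields (occurrences on the left) * (occurrences on the right) rows;
    # keys missing from the right side yield zero rows.
    return sum(n * right_counts[key] for key, n in left_counts.items())


# Toy data: three un-PIKed census records, four reference-file records
census_keys = ["s1", "s2", "s3"]
reference_keys = ["s1", "s1", "s1", "s2"]

n_pairs = inner_join_row_count(census_keys, reference_keys)
n_census_with_candidates = sum(1 for key in census_keys if key in set(reference_keys))
```

Here three census records produce four pairs, and one of the three (`s3`) produces none, so the length of the merged table is neither an upper nor a lower bound on the number of distinct census records it covers.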

In [63]:
simulants_missed = df_ops.head(missed_links[['simulant_id']], n=100).simulant_id.unique()
simulants_missed

<ArrowExtensionArray>
[ '8509_290877',  '8516_542542', '9549_1114486',  '5439_527260',
  '5628_747537',  '5950_169461',  '6324_465458',  '6800_349159',
 '7264_1146002', '7384_1067881',  '7745_524163',  '1609_880974',
  '4621_921183',   '465_658731',   '6991_92179',   '7086_68503',
  '7384_978319',   '778_352858',  '1609_855014',   '1648_44615',
  '3465_579205',  '3481_504676',  '3607_964500',  '8997_534744',
  '5150_248497',  '5781_584065',  '6520_503535', '6539_1182918',
  '7985_937827', '1917_1115761',   '446_319868',   '446_993375',
  '8527_648373',  '8817_710744',   '974_477935',  '9840_563471',
  '5698_601179',  '6519_830648',   '656_675044',   '6606_46389',
  '6701_661155',  '7264_576200',  '7384_875348',   '2298_15984',
  '3793_941611']
Length: 45, dtype: large_string[pyarrow]

In [64]:
display_cols = ['simulant_id'] + flatten([(f'{c}_census', f'{c}_reference_file') for c in comparison_cols])
for simulant in simulants_missed[0:15]:
    print(simulant)
    display(df_ops.head(missed_links[missed_links.simulant_id == simulant][display_cols], n=100))

8509_290877


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
0,8509_290877,Aufrety,Maha,C,Monique,,Mitchell,07/04/2006,20050207,2631,702.0,w main st,RICHLAND,,,cumberland,NORTH KINGSTOWN,RI,RI
1,8509_290877,Aufrety,,C,Cassidy,,Maloney,07/04/2006,20060704,2631,,w main st,,,,cumberland,,RI,


8516_542542


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
2,8516_542542,Kimberly,Cooper,,Archie,Lee,Beck,12/29/1984,20210102,907,1616.0,crescent woods dr,ROCK CREEK VILLA DR,,,east providence,PAWTUCKET,RI,RI
3,8516_542542,Kimberly,Kimberly,,Teri,Lee,Lee,12/29/1984,19841229,907,,crescent woods dr,,,,east providence,,RI,
4,8516_542542,Kimberly,Millie,,Leah,Lee,Nash,12/29/1984,20271014,907,407.0,crescent woods dr,GREENBERRY DRIVE,,,east providence,COVENTRY,RI,RI


9549_1114486


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
5,9549_1114486,Ava,Brandon,J,Esteban,Montoya-Natividad,Young,19/05/2006,19801208,,,south transit road,,,,charlestown,,RI,


5439_527260


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
6,5439_527260,Bernard,Rickey,L,Wayne,Lacasse,Mccain,07/13/1962,19600116,4264,,main street,,,,e providence,,RI,


5628_747537


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
7,5628_747537,Stephanie,Timothy,T,Lonnie,Harrison,Searcy,01/20/0973,19661218,9601,,140th ave se,,,,burrillville,,RI,


5950_169461


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
8,5950_169461,Teagan,Jo,A,Dorthy,Head,Fedder,10/06/2005,19460625,7011,,homestead ln,,number a,,providence,,RI,


6324_465458


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
9,6324_465458,Matthew,Matthew,C,Christopher,Landa,,07/28/1961,19610728,,,heyef av,,,,burrillville,,RI,
10,6324_465458,Matthew,Athena,C,Regina,Landa,Willoughby,07/28/1961,19790615,,17.0,heyef av,NORFORK DR,,,burrillville,PAWTUCKET,RI,RI
11,6324_465458,Matthew,Derrick,C,Joshua,Landa,Clark,07/28/1961,19860806,,244.0,heyef av,ROSE LANE,,,burrillville,,RI,RI
12,6324_465458,Matthew,Derrick,C,Joshua,Landa,Clark,07/28/1961,19860806,,244.0,heyef av,ROSE LANE,,,burrillville,PORTSMOUTH,RI,RI


6800_349159


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
13,6800_349159,Isabella,Samir,A,Alan,Aldridge,Ly,01/16/2013,20241209,4332,,morning lt ter,,,,cranston,,RI,


7264_1146002


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
14,7264_1146002,D,Adam,J,James,Whitley,Whitley,04/19/2027,20278419,6900,,club cart cir,,,,cranston,,RI,
15,7264_1146002,D,Jessica,J,Sofia,Whitley,Corwin,04/19/2027,20050726,6900,3178.0,club cart cir,BRICK MILL RD,,,cranston,,RI,RI


7384_1067881


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
16,7384_1067881,Richard,Dichard,E,Eleazar,Peterson,Peterson,11/16/2022,28221146,407,,greenberry drive,,,,coventry,,RI,
17,7384_1067881,Richard,Dichard,E,Eleazar,Peterson,Peterson,11/16/2022,28221146,407,407.0,greenberry drive,GREENBERRY DRIVE,,,coventry,COVENTRY,RI,RI


7745_524163


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
18,7745_524163,Darius,Kimberly,E,Amy,Smith,Ceballos,08/05/1971,19730711,758,,w olive ave,,,,johnston,,RI,


1609_880974


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
19,1609_880974,Frank,Taylor,T,Natalie,Hayes,Miller,07/24/1963,20130227,2383,,tulane,,,,warwick,,RI,


4621_921183


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
20,4621_921183,Trent,Connor,M,Robert,GNelson,Michel,76/31/2007,20011205,215,,fm 1387,,,,richmond,,RI,


465_658731


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
21,465_658731,Carla,Carla,K,K,Shafer,Shafer,11/22/1973,,7277,,north saqe stree,,,,portsmouth,,RI,
22,465_658731,Carla,Robert,K,Micah,Shafer,Walton,11/22/1973,19902225.0,7277,6.0,north saqe stree,MARAVILLA AVE,,,portsmouth,PORTJMOTH,RI,RI


6991_92179


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
23,6991_92179,Ryan,Lucia,H,Jessica,Waite,Lucero,19/13/2000,20100131,691,,jonesboro road southeast,,,,cranston,,LA,
