# Simulated PIK statistics

Here we inspect the accuracy and characteristics of the Protected Identification Keys (PIKs) assigned,
leveraging our knowledge of ground truth from pseudopeople.

A ground-truth comparison like this isn't possible with the real Person Identification Validation System (PVS), but
Layne, Wagner, and Rothhaas did something similar by redacting SSNs from real records,
sending them through PVS without the SSN, and then using the true SSN
as ground truth.
The health care records they used are probably quite different from a Census Unedited File (CUF),
but they found **very** good overall PIK accuracy (see cell below).

In [1]:
# Query planning is now on by default, but it has some rough edges.
# See https://github.com/dask/dask/issues/10995 for general discussion
# and https://github.com/dask/dask-expr/issues/1060 for the particular
# issue I ran into.
import dask
dask.config.set({"dataframe.query-planning": False})

<dask.config.set at 0x7f16e43dac20>

In [2]:
import datetime, os, time

from vivarium_research_prl import distributed_compute, utils
from IPython.display import display

In [3]:
print(datetime.datetime.now())

2024-05-24 07:26:32.677296


In [4]:
# DO NOT EDIT if this notebook is not called ground_truth_accuracy.ipynb!
# This notebook is designed to be run with papermill; this cell is tagged 'parameters'
data_to_use = 'small_sample'
simulated_data_output_dir = 'output/generate_simulated_data'
case_study_output_dir = 'output'

# The "compute engine" is what we use on the Python side
# for our case-study-specific operations,
# as opposed to the Splink engine
compute_engine = 'pandas'
# These only matter if using a distributed compute engine
compute_engine_num_jobs = 3
compute_engine_cpus_per_job = 2
compute_engine_memory_per_job = "5GB"
queue = "long.q"
local_directory = f"/tmp/{os.environ['USER']}_{int(time.time())}_dask"

In [5]:
# Parameters
data_to_use = "usa"
simulated_data_output_dir = "/mnt/team/simulation_science/priv/users/zmbc/prl/person_linkage_case_study/generate_simulated_data/"
case_study_output_dir = "/mnt/team/simulation_science/priv/users/zmbc/prl/person_linkage_case_study/results/"
compute_engine = "dask"
compute_engine_num_jobs = 50
compute_engine_memory_per_job = "120GB"
compute_engine_cpus_per_job = 2
local_directory = "/mnt/team/simulation_science/priv/users/zmbc/prl/person_linkage_case_study_tmp/dask/"


In [6]:
if compute_engine == 'dask':
    utils.ensure_empty(local_directory)

In [7]:
case_study_output_dir = f'{case_study_output_dir}/{data_to_use}'
simulated_data_output_dir = f'{simulated_data_output_dir}/{data_to_use}'

In [8]:
df_ops, pd = distributed_compute.start_compute_engine(
    compute_engine,
    num_jobs=compute_engine_num_jobs,
    cpus_per_job=compute_engine_cpus_per_job,
    memory_per_job=compute_engine_memory_per_job,
    queue=queue,
    local_directory=local_directory,
)

Connection method: Cluster object, Cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://10.158.111.40:8787/status
Comm: tcp://10.158.111.40:41733
Workers: 50, Total threads: 50, Total memory: 5.46 TiB
Each of the 50 workers: Total threads: 1, Memory: 111.76 GiB, Local directory under /mnt/team/simulation_science/priv/users/zmbc/prl/person_linkage_case_study_tmp/dask/dask-scratch-space/


In [9]:
census_2030_piked = df_ops.read_parquet(f'{case_study_output_dir}/census_2030_piked.parquet')
confirmed_piks_with_ground_truth = df_ops.read_parquet(f'{case_study_output_dir}/confirmed_piks.parquet')

In [10]:
piked_proportion = df_ops.compute(census_2030_piked.pik.notnull().mean())
# Compare with 90.28% of input records PIKed in the 2010 CUF,
# as reported in Wagner and Layne, Table 2, p. 18 
print(f'{piked_proportion:.2%} of the input records were PIKed')

89.49% of the input records were PIKed


In [11]:
# Some PIKs are assigned to multiple Census rows, meaning the model treats those rows as duplicates of one person in Census
pik_sizes = df_ops.persist(df_ops.groupby_agg_small_groups(census_2030_piked, by='pik', agg_func=lambda x: x.size()))
df_ops.compute(pik_sizes.value_counts())

1    305481683
2       704594
3          914
4           13
Name: count, dtype: int64
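
The tally above can be reproduced on a toy scale. What follows is a minimal plain-Python sketch, not the `df_ops` implementation: the rows and PIK values are made up, and skipping un-PIKed rows is an assumption about how the group-by treats nulls. It counts rows per PIK, then counts how many PIKs have each group size.

```python
from collections import Counter

# Hypothetical (record_id, pik) rows; None marks a record that was not PIKed.
rows = [
    ("r1", "pik_a"),
    ("r2", "pik_a"),
    ("r3", "pik_b"),
    ("r4", None),
    ("r5", "pik_c"),
]

# Rows per PIK, skipping un-PIKed records (assumed to mirror a group-by that drops nulls).
pik_sizes = Counter(pik for _, pik in rows if pik is not None)

# Number of PIKs with each group size; a size above 1 means a suspected duplicate.
size_counts = Counter(pik_sizes.values())

for size, n_piks in sorted(size_counts.items()):
    print(size, n_piks)
```

In this toy data, `pik_a` is the only PIK shared by more than one row.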

In [12]:
# Interesting: in pseudopeople, sometimes siblings are assigned the same (common) first name, making them almost identical.
# The only giveaway is their age and DOB.
# Presumably, this tends not to happen in real life.
duplicate_piks = pik_sizes.rename('pik_size').reset_index().pipe(lambda df: df[df.pik_size > 1])

df_ops.head(census_2030_piked.merge(duplicate_piks, on="pik").sort_values('pik'))

Unnamed: 0,household_id,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,housing_type,relationship_to_reference_person,sex,race_ethnicity,year,record_id,pik,pik_size
2716,3632_492800,Amir,M,Lopez Ponce,2,0|/1G/2028,7323.0,s pleasant 2,,hillsboro,OR,97058,Household,Biological child,Male,Latino,2030,simulated_census_2030_125_986313,100_10000288,2
2731,3632_492800,Amanda,M,Lopez Ponce,26,01/16/2028,7323.0,s pleasant 2,,hillsboro,OR,97058,Household,Reference person,Female,Latino,2030,simulated_census_2030_125_776710,100_10000288,2
404,3978_12573,Sam,D,Thompson,57,09/21/1941,19.0,beryl street,,new york,NY,11236,Household,Other relative,Male,Black,2030,simulated_census_2030_137_26611,100_10000733,2
3853,3978_12573,Carl,D,Thompson,88,09/21/1941,19.0,beryl street,,new york,NY,11236,Household,Reference person,Male,Black,2030,simulated_census_2030_137_26608,100_10000733,2
1792,3465_41355,Samara,N,Cooley,12,09/19/2017,5014.0,242nd st sw,flat 5204,burbank,CA,94589,Household,Biological child,Female,Latino,2030,simulated_census_2030_111_86813,100_10001394,2
2455,3465_41355,Gretchen,N,Cooley,50,09/19/2017,5014.0,242nd st sw,flat 5204,burbank,CA,94589,Household,Opposite-sex spouse,Female,Latino,2030,simulated_census_2030_111_86810,100_10001394,2
1083,7817_307328,Dylan,E,Gaines,12,06/17/2017,,fox bnd ct,,grand blanc,MI,49657,Household,Sibling,Male,White,2030,simulated_census_2030_266_644676,100_10001446,2
2582,7817_535954,Dylan,E,Gaines,12,06/17/2017,615.0,sisson st,,fw,,46268,Household,Other relative,Male,White,2030,simulated_census_2030_266_1025393,100_10001446,2
2003,5548_619778,Richard,D,Martinez,75,04/19/1954,122.0,sea forest dr,,albany,NY,14845,Household,Reference person,Male,Latino,2030,simulated_census_2030_185_229434,100_10001642,2
3992,2965_369050,Richard,D,Martinez,75,04/19/1954,11561.0,highland avenue,,,OR,97333,Household,Reference person,Male,Latino,2030,simulated_census_2030_91_774182,100_10001642,2


## Ground truth statistics

In [13]:
census_2030_ground_truth = df_ops.persist(
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_census_2030_ground_truth.parquet')
)

In [14]:
# In this version of pseudopeople, there are no actual duplicates in Census,
# which means all of the duplicates identified above are wrong.
assert len(census_2030_ground_truth) == len(df_ops.drop_duplicates(census_2030_ground_truth))

In [15]:
reference_files_ground_truth = df_ops.persist(df_ops.concat([
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_geobase_reference_file_ground_truth.parquet').drop(columns=['n_unique_simulants']),
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_name_dob_reference_file_ground_truth.parquet').drop(columns=['n_unique_simulants']),
], ignore_index=True))

Imbalanced dataframe: too_few=False, too_many=True, too_large=False
count    8.720000e+02
mean     4.181848e+07
std      5.982166e+04
min      4.164159e+07
25%      4.177921e+07
50%      4.182007e+07
75%      4.185967e+07
max      4.199184e+07
dtype: float64
Creating partitions of 729MB


In [16]:
# However, there can be reference file records that correspond to multiple simulants,
# due to errors in the reference file construction by SSN
n_unique_simulants = df_ops.persist(df_ops.groupby_agg_small_groups(reference_files_ground_truth, by='record_id', agg_func=lambda x: x.simulant_id.nunique()).rename('n_unique_simulants').reset_index())
df_ops.compute(n_unique_simulants.n_unique_simulants.value_counts())

n_unique_simulants
1    1229004194
2      59018942
3       1771457
4        102373
5          6077
6           351
7            16
8             1
9             1
Name: count, dtype: int64
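
As a rough illustration of what this count measures, here is a small plain-Python sketch on made-up `(record_id, simulant_id)` pairs. It is not how `groupby_agg_small_groups` is implemented; the identifiers are hypothetical. Each record collects the set of distinct simulants that contributed to it, and the distribution of set sizes corresponds to the value counts above.

```python
from collections import Counter, defaultdict

# Hypothetical ground-truth pairs: which simulant(s) each reference record came from.
pairs = [
    ("ref_1", "sim_1"),
    ("ref_2", "sim_2"),
    ("ref_2", "sim_3"),
    ("ref_2", "sim_3"),  # a repeated pair is only counted once
    ("ref_3", "sim_4"),
]

# Collect the distinct simulants behind each record.
simulants_by_record = defaultdict(set)
for record_id, simulant_id in pairs:
    simulants_by_record[record_id].add(simulant_id)

# Distinct simulants per record, then how many records have each count.
n_unique_simulants = {rec: len(sims) for rec, sims in simulants_by_record.items()}
distribution = Counter(n_unique_simulants.values())
```

A record with a count above 1, such as `ref_2` here, mixes information from more than one simulant.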

In [17]:
reference_files_ground_truth = df_ops.persist(reference_files_ground_truth.merge(
    n_unique_simulants,
    on='record_id',
    how='left',
))
reference_files_ground_truth.head(n=100)

Unnamed: 0,record_id,simulant_id,n_unique_simulants
0,simulated_geobase_reference_file_0_1010897,1559_19394,1
1,simulated_geobase_reference_file_0_1021693,6203_718056,1
2,simulated_geobase_reference_file_0_1028097,6793_444873,1
3,simulated_geobase_reference_file_0_1047685,6554_270819,1
4,simulated_geobase_reference_file_0_1057134,9495_189946,2
...,...,...,...
95,simulated_geobase_reference_file_0_2452588,2808_1138496,1
96,simulated_geobase_reference_file_0_246654,5831_663211,1
97,simulated_geobase_reference_file_0_2568436,6090_393091,1
98,simulated_geobase_reference_file_0_258644,4922_953730,1


In [18]:
df_ops.head(reference_files_ground_truth[reference_files_ground_truth.n_unique_simulants == df_ops.compute(reference_files_ground_truth.n_unique_simulants.max())])

Unnamed: 0,record_id,simulant_id,n_unique_simulants
2185016,simulated_geobase_reference_file_102_1340703,6554_95394,9
2185017,simulated_geobase_reference_file_102_1340703,6554_95393,9
2185018,simulated_geobase_reference_file_102_1340703,6554_95397,9
2185019,simulated_geobase_reference_file_102_1340703,6554_95403,9
2185020,simulated_geobase_reference_file_102_1340703,6554_95389,9
2185021,simulated_geobase_reference_file_102_1340703,6554_95392,9
2185022,simulated_geobase_reference_file_102_1340703,6554_95396,9
2185023,simulated_geobase_reference_file_102_1340703,6554_95395,9
2185024,simulated_geobase_reference_file_102_1340703,6554_95406,9


In [19]:
census_2030_ground_truth = df_ops.persist(census_2030_ground_truth.merge(
    df_ops.drop_duplicates(reference_files_ground_truth[['simulant_id']]).assign(possible_to_pik=1),
    on='simulant_id',
    how='left',
).assign(possible_to_pik=lambda df: df.possible_to_pik.fillna(0)))
possible_to_pik_proportion = df_ops.compute(census_2030_ground_truth.possible_to_pik.mean())
print(
    f'{(1 - possible_to_pik_proportion):.2%} of the input records are '
    'impossible to PIK correctly, since they are not in any reference files'
)

0.43% of the input records are impossible to PIK correctly, since they are not in any reference files


In [20]:
print(
    f'Assigned PIKs to {(piked_proportion / possible_to_pik_proportion):.2%} of PIK-able records'
)

Assigned PIKs to 89.87% of PIK-able records
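
One caveat on the ratio above: dividing the PIKed proportion by the PIK-able proportion equals the share of PIK-able records that were PIKed only if no un-PIK-able record received a PIK. The sketch below uses made-up flags (not case study data) to show how the two figures can differ.

```python
# Hypothetical toy data, one entry per Census row.
in_reference_files = [True, True, True, True, False]  # is the row's simulant in any reference file?
received_pik = [True, True, False, False, True]       # did the linkage assign a PIK?

n_rows = len(in_reference_files)
possible_to_pik_proportion = sum(in_reference_files) / n_rows  # 4 of 5 rows
piked_proportion = sum(received_pik) / n_rows                  # 3 of 5 rows

# Ratio of the two proportions, as in the cell above.
ratio = piked_proportion / possible_to_pik_proportion

# Direct computation restricted to PIK-able rows.
piked_and_pikable = sum(p and ok for p, ok in zip(received_pik, in_reference_files))
direct = piked_and_pikable / sum(in_reference_files)
```

Here the ratio is about 0.75 while the direct figure is 0.5, because the last row received a PIK without being PIK-able. Whether any such rows exist in the case study output is not shown above.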


In [21]:
reference_file = df_ops.concat([
    df_ops.read_parquet(
        f'{simulated_data_output_dir}/simulated_geobase_reference_file.parquet',
    ),
    df_ops.read_parquet(
        f'{simulated_data_output_dir}/simulated_name_dob_reference_file.parquet',
    ),
], ignore_index=True)

In [22]:
reference_file_piks = df_ops.persist(reference_file[['record_id', 'pik']])
reference_file_piks

Unnamed: 0_level_0,record_id,pik
npartitions=358,Unnamed: 1_level_1,Unnamed: 2_level_1
,string,string
,...,...
...,...,...
,...,...
,...,...


In [23]:
assert len(reference_file_piks) == len(df_ops.drop_duplicates(reference_file_piks[['record_id']]))

In [24]:
pik_simulant_pairs = df_ops.persist(df_ops.drop_duplicates(reference_files_ground_truth.merge(reference_file_piks, on='record_id')[['pik', 'simulant_id']]))

In [25]:
# However, there can be PIKs that correspond to multiple simulants,
# due to errors in the reference file construction by SSN
n_unique_simulants = df_ops.persist(df_ops.groupby_agg_small_groups(pik_simulant_pairs, by='pik', agg_func=lambda x: x.simulant_id.nunique()).rename('n_unique_simulants').reset_index())
df_ops.compute(n_unique_simulants.n_unique_simulants.value_counts())

n_unique_simulants
1      353923759
2       41184067
3        3257802
4         225179
5          14963
         ...    
144            1
246            1
133            1
141            1
217            1
Name: count, Length: 335, dtype: int64
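
The per-PIK count above can be illustrated on a toy table. This is a minimal sketch in plain pandas, not the project's `df_ops.groupby_agg_small_groups` helper; the `toy_*` names and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical PIK/simulant pairs: PIK "p2" was built from records
# belonging to two different simulants, the others from one each.
toy_pairs = pd.DataFrame({
    "pik": ["p1", "p2", "p2", "p3"],
    "simulant_id": ["s1", "s2", "s3", "s4"],
})

# Number of distinct simulants behind each PIK
toy_n_unique = (
    toy_pairs.groupby("pik").simulant_id.nunique()
    .rename("n_unique_simulants")
    .reset_index()
)

# How many PIKs have 1 simulant, 2 simulants, ...
toy_counts = toy_n_unique.n_unique_simulants.value_counts()
print(toy_counts)
```

In the toy data, two PIKs map to a single simulant and one PIK maps to two, which is the same shape as the (much larger) distribution printed above.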

In [26]:
pik_simulant_pairs = df_ops.persist(pik_simulant_pairs.merge(
    n_unique_simulants,
    on='pik',
    how='left',
))
pik_simulant_pairs

Unnamed: 0_level_0,pik,simulant_id,n_unique_simulants
npartitions=740,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,string,string,int64
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [27]:
df_ops.head(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants == df_ops.compute(pik_simulant_pairs.n_unique_simulants.max())])

Unnamed: 0,pik,simulant_id,n_unique_simulants
552933,93_9726622,1655_508547,335
552934,93_9726622,7264_376391,335
552935,93_9726622,4123_481173,335
552936,93_9726622,6520_143952,335
552937,93_9726622,1282_490678,335
552938,93_9726622,7850_493685,335
552939,93_9726622,6144_432000,335
552940,93_9726622,3377_480936,335
552941,93_9726622,8997_280916,335
552942,93_9726622,1897_265557,335


## Definitions of accuracy

1. (most strict) Assigning any PIK that corresponds to multiple simulants counts as incorrect
2. Assigning a PIK that corresponds to multiple simulants counts as neither correct nor incorrect (it is excluded from the denominator)
3. (most lenient) Assigning a PIK that corresponds to multiple simulants counts as correct, as long as at least one of those simulants matches the truth
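
To make the three definitions concrete, here is a small sketch on made-up data, using plain pandas rather than the `df_ops` helpers. All `toy_*` names and values are hypothetical; the real calculations are in the cells below.

```python
import pandas as pd

# Hypothetical input records that were each assigned a PIK,
# along with the simulant each record truly belongs to.
toy_assigned = pd.DataFrame({
    "record_id": ["r1", "r2", "r3", "r4", "r5"],
    "pik": ["p1", "p2", "p3", "p3", "p4"],
    "true_simulant_id": ["s1", "s9", "s3", "s5", "s6"],
})
# Hypothetical PIK/simulant pairs; "p3" corresponds to two simulants.
toy_pik_pairs = pd.DataFrame({
    "pik": ["p1", "p2", "p3", "p3", "p4"],
    "simulant_id": ["s1", "s2", "s3", "s4", "s6"],
})

toy_n_unique = (
    toy_pik_pairs.groupby("pik").simulant_id.nunique()
    .rename("n_unique_simulants")
    .reset_index()
)
toy = toy_assigned.merge(toy_pik_pairs, on="pik").merge(toy_n_unique, on="pik")
toy["match"] = toy.true_simulant_id == toy.simulant_id

# One row per input record: did any simulant behind its PIK match the truth?
per_record = toy.groupby("record_id").agg(
    any_match=("match", "any"),
    n_unique_simulants=("n_unique_simulants", "first"),
)
single = per_record.n_unique_simulants == 1

# Definition 1: multi-simulant PIKs count as incorrect
toy_definition_1 = (per_record.any_match & single).sum() / len(per_record)
# Definition 2: multi-simulant PIKs are dropped from the denominator
toy_definition_2 = (per_record.any_match & single).sum() / single.sum()
# Definition 3: multi-simulant PIKs are correct if any simulant matches
toy_definition_3 = per_record.any_match.sum() / len(per_record)

print(toy_definition_1, toy_definition_2, toy_definition_3)
```

Here `r3` is the record that separates the definitions: its PIK covers two simulants, one of which is the true one, so it is wrong under Definition 1, excluded under Definition 2, and right under Definition 3.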

In [28]:
# All modules, Medicare database, calculated from Layne, Wagner, and Rothhaas Table 1 (p. 15)
real_life_pvs_accuracy = 1 - (2_585 + 60_709 + 129_480 + 89_094) / (52_406_981 + 5_170_924 + 49_374_794 + 50_327_034)
f'{real_life_pvs_accuracy:.5%}'

'99.82079%'

### Definition 1

In [29]:
piks_assigned = df_ops.compute(census_2030_piked.pik.notnull().sum())
piks_assigned

306893665

In [30]:
df_ops.head(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants > 1])

Unnamed: 0,pik,simulant_id,n_unique_simulants
4,100_10001131,1299_577845,2
5,100_10001131,7344_252780,2
12,100_10005841,2787_347774,2
13,100_10005841,2787_692084,2
14,100_10006089,8869_1057428,2
15,100_10006089,8869_714641,2
25,100_10013375,3984_1041763,2
26,100_10013375,3984_271827,2
77,100_10044433,3465_567846,2
78,100_10044433,8134_1070213,2


In [31]:
single_sim_piks_correct = df_ops.compute(
    census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs, on='pik').merge(census_2030_ground_truth, on='record_id')
        .pipe(lambda df: (df.simulant_id_x == df.simulant_id_y) & (df.n_unique_simulants == 1))
        .sum()
)
single_sim_piks_correct

270242236

In [32]:
# Overall accuracy, treating it as a black box
(
    single_sim_piks_correct / piks_assigned
)

0.8805728720402228

In [33]:
assert len(confirmed_piks_with_ground_truth) == piks_assigned

In [34]:
df_ops.head(census_2030_ground_truth.rename(columns={'record_id': 'record_id_census_2030'}))

Unnamed: 0,record_id_census_2030,simulant_id,possible_to_pik
0,simulated_census_2030_0_428,28_512,1.0
1,simulated_census_2030_0_1517,28_1781,1.0
2,simulated_census_2030_0_2703,28_3203,1.0
3,simulated_census_2030_0_3657,28_4338,1.0
4,simulated_census_2030_0_4661,28_5528,1.0
5,simulated_census_2030_0_4773,28_5659,1.0
6,simulated_census_2030_0_4791,28_5679,1.0
7,simulated_census_2030_0_5737,28_6800,1.0
8,simulated_census_2030_0_6401,28_7571,1.0
9,simulated_census_2030_0_6800,28_8045,1.0


In [35]:
# Looking at whether the exact *record* linked was from the same simulant
single_sim_record_links_correct = df_ops.compute(
    confirmed_piks_with_ground_truth
        .merge(
            census_2030_ground_truth.rename(columns={'record_id': 'record_id_raw_input_file'}),
            on='record_id_raw_input_file',
        )
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .pipe(lambda df: (df.simulant_id_x == df.simulant_id_y) & (df.n_unique_simulants == 1))
        .sum()
)
single_sim_record_links_correct

287197731

In [36]:
(
    single_sim_record_links_correct / piks_assigned
)

0.9358216338548403

### Definition 2

In [37]:
single_sim_piks_assigned = len(census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants == 1][['pik', 'simulant_id']], on='pik'))
single_sim_piks_assigned

270448733

In [38]:
# Overall accuracy, treating it as a black box
(
    single_sim_piks_correct / single_sim_piks_assigned
)

0.9992364652712202

In [39]:
# Looking at whether the exact *record* linked was from the same simulant
single_sim_record_links_assigned = df_ops.compute(
    (confirmed_piks_with_ground_truth
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .n_unique_simulants == 1).sum()
)
single_sim_record_links_assigned

287419292

In [40]:
(
    single_sim_record_links_correct / single_sim_record_links_assigned
)

0.9992291366440357
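
The gap between Definitions 1 and 2 comes from how often the assigned PIK corresponds to more than one simulant. The following sketch redoes that arithmetic from the counts printed above; it assumes every assigned PIK appears in `pik_simulant_pairs`, so that the records not counted in `single_sim_piks_assigned` are exactly those whose PIK has multiple simulants. The variable names are chosen for this sketch only.

```python
# Counts copied from the outputs of In [29], In [31], and In [37].
n_assigned = 306_893_665             # piks_assigned
n_single_sim_assigned = 270_448_733  # single_sim_piks_assigned
n_single_sim_correct = 270_242_236   # single_sim_piks_correct

# Share of assigned PIKs that correspond to multiple simulants
# (under the assumption stated above).
multi_sim_share = 1 - n_single_sim_assigned / n_assigned

accuracy_definition_1 = n_single_sim_correct / n_assigned
accuracy_definition_2 = n_single_sim_correct / n_single_sim_assigned

print(f"{multi_sim_share:.2%} of assigned PIKs correspond to multiple simulants")
print(f"Definition 1: {accuracy_definition_1:.4%}")
print(f"Definition 2: {accuracy_definition_2:.4%}")
```

Under that assumption, nearly all of the difference between the two definitions is due to multi-simulant PIKs, not to links between different simulants.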

### Definition 3

In [41]:
pik_simulant_pairs

Unnamed: 0_level_0,pik,simulant_id,n_unique_simulants
npartitions=740,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,string,string,int64
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [42]:
piks_at_least_partially_correct = df_ops.persist(
    census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs, on='pik').merge(census_2030_ground_truth, on='record_id')
        .pipe(df_ops.drop_duplicates)
        .assign(correct=lambda df: df.simulant_id_x == df.simulant_id_y)
        .pipe(df_ops.groupby_agg_small_groups, by=["record_id", "pik"], agg_func=lambda x: x.correct.any())
        .reset_index()
)
piks_at_least_partially_correct

Imbalanced dataframe: too_few=False, too_many=True, too_large=False
count    1.480000e+03
mean     2.654147e+07
std      4.395391e+07
min      0.000000e+00
25%      0.000000e+00
50%      1.454150e+04
75%      2.244691e+07
max      2.658089e+08
dtype: float64
Creating partitions of 786MB


Unnamed: 0_level_0,record_id,pik,correct
npartitions=54,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,string,string,bool[pyarrow]
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [43]:
# Overall accuracy, treating it as a black box
piks_correct_proportion = (df_ops.compute(piks_at_least_partially_correct.correct.sum()) / piks_assigned)
piks_correct_proportion

0.9992483976493943

In [44]:
print(f'{piks_correct_proportion:.5%} of the PIKs assigned were correct; compare with {real_life_pvs_accuracy:.5%} in real life')

99.92484% of the PIKs assigned were correct; compare with 99.82079% in real life
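
The same comparison can be expressed as error rates. This is a rough sketch using only numbers already shown in this notebook; it takes the simulated figure under Definition 3, and the real-life figure may not be defined in exactly the same way, so the ratio is indicative only.

```python
# Numerator and denominator copied from In [28]
# (Layne, Wagner, and Rothhaas Table 1).
real_incorrect = 2_585 + 60_709 + 129_480 + 89_094
real_total = 52_406_981 + 5_170_924 + 49_374_794 + 50_327_034
real_error_rate = real_incorrect / real_total

# piks_correct_proportion copied from the output of In [43].
simulated_error_rate = 1 - 0.9992483976493943

error_ratio = real_error_rate / simulated_error_rate
print(
    f"real-life error rate: {real_error_rate:.5%}, "
    f"simulated error rate: {simulated_error_rate:.5%}, "
    f"ratio: {error_ratio:.2f}"
)
```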


In [45]:
# Looking at whether the exact *record* linked was from the same simulant
sim_record_links_at_least_partially_correct = df_ops.persist(
    confirmed_piks_with_ground_truth
        .merge(
            census_2030_ground_truth.rename(columns={'record_id': 'record_id_raw_input_file'}),
            on='record_id_raw_input_file',
        )
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .assign(correct=lambda df: df.simulant_id_x == df.simulant_id_y)
        .pipe(df_ops.groupby_agg_small_groups, by=["record_id_raw_input_file", "record_id_reference_file", "pik", "module_name", "pass_name"], agg_func=lambda x: x.correct.any())
        .reset_index()
)
sim_record_links_at_least_partially_correct

Unnamed: 0_level_0,record_id_raw_input_file,record_id_reference_file,pik,module_name,pass_name,correct
npartitions=849,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,string,string,string,string,string,bool[pyarrow]
,...,...,...,...,...,...
...,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...


In [46]:
len(sim_record_links_at_least_partially_correct)

306893665

In [47]:
len(df_ops.drop_duplicates(sim_record_links_at_least_partially_correct[['record_id_raw_input_file', 'record_id_reference_file']]))

Imbalanced dataframe: too_few=False, too_many=True, too_large=False
count    1.698000e+03
mean     1.641192e+07
std      2.845312e+07
min      0.000000e+00
25%      0.000000e+00
50%      1.113800e+04
75%      1.022008e+07
max      1.561296e+08
dtype: float64
Creating partitions of 557MB


306893665

In [48]:
(
    df_ops.compute(sim_record_links_at_least_partially_correct.correct.sum()) / piks_assigned
)

0.9992451685178969

In [49]:
# Each input record should be linked to at most one reference file record
assert df_ops.compute((df_ops.groupby_agg_small_groups(confirmed_piks_with_ground_truth, by='record_id_raw_input_file', agg_func=lambda x: x.record_id_reference_file.nunique()) <= 1).all())

In [50]:
# Using definition 3 -- at the PIK level
piks_at_least_partially_correct = df_ops.persist(
    piks_at_least_partially_correct
        .rename(columns={'record_id': 'record_id_raw_input_file'})
        .merge(confirmed_piks_with_ground_truth[['record_id_raw_input_file', 'module_name', 'pass_name']], on='record_id_raw_input_file')
)
piks_at_least_partially_correct

Unnamed: 0_level_0,record_id_raw_input_file,pik,correct,module_name,pass_name
npartitions=849,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
,string,string,bool[pyarrow],string,string
,...,...,...,...,...
...,...,...,...,...,...
,...,...,...,...,...
,...,...,...,...,...


In [51]:
# Accuracy by module -- note that, with the sample data, this shows the opposite pattern
# relative to the results of Layne et al., who found GeoSearch was much *more* accurate
df_ops.compute(piks_at_least_partially_correct.groupby("module_name").correct.agg(["mean", "size"]).sort_values("mean"))

module_name,mean,size
dobsearch,0.995207,157934
namesearch,0.997353,18703967
geosearch,0.999372,287235940
hhcompsearch,0.999913,795824


In [52]:
# Accuracy by pass -- could be used to tune pass-specific cutoffs, but
# this might not be too informative while we are still using the sample data.
df_ops.compute(piks_at_least_partially_correct.groupby(["module_name", "pass_name"]).correct.agg(["mean", "size"]).sort_values("mean"))

module_name,pass_name,mean,size
dobsearch,initials name switch,0.995207,157934
namesearch,DOB and NYSIIS of name,0.997353,18701962
geosearch,geokey name switch,0.998792,899891
namesearch,DOB and initials,0.999002,2005
geosearch,geokey,0.999266,242653619
hhcompsearch,year of birth,0.999865,400996
geosearch,house number and street name Soundex name switch,0.999957,46424
hhcompsearch,initials,0.999962,394828
geosearch,some name and DOB information,0.999974,30018460
geosearch,house number and street name Soundex,0.999975,13617546


In [53]:
# Using definition 3 -- at the link level
df_ops.compute(sim_record_links_at_least_partially_correct.groupby("module_name").correct.agg(["mean", "size"]).sort_values("mean"))

module_name,mean,size
dobsearch,0.995207,157934
namesearch,0.997353,18703967
geosearch,0.999369,287235940
hhcompsearch,0.999912,795824


In [54]:
df_ops.compute(sim_record_links_at_least_partially_correct.groupby(["module_name", "pass_name"]).correct.agg(["mean", "size"]).sort_values("mean"))

module_name,pass_name,mean,size
dobsearch,initials name switch,0.995207,157934
namesearch,DOB and NYSIIS of name,0.997353,18701962
geosearch,geokey name switch,0.998789,899891
namesearch,DOB and initials,0.999002,2005
geosearch,geokey,0.999262,242653619
hhcompsearch,year of birth,0.999865,400996
geosearch,house number and street name Soundex name switch,0.999957,46424
hhcompsearch,initials,0.999959,394828
geosearch,some name and DOB information,0.999974,30018460
geosearch,house number and street name Soundex,0.999974,13617546


In [55]:
df_ops.compute(sim_record_links_at_least_partially_correct[~sim_record_links_at_least_partially_correct.correct].groupby(["module_name", "pass_name"]).size()).sort_values()

module_name   pass_name                                       
namesearch    DOB and initials                                         2
geosearch     house number and street name Soundex name switch         2
hhcompsearch  initials                                                16
              year of birth                                           54
geosearch     house number and street name Soundex                   352
dobsearch     initials name switch                                   757
geosearch     some name and DOB information                          788
              geokey name switch                                    1090
namesearch    DOB and NYSIIS of name                               49509
geosearch     geokey                                              179083
dtype: int64

### Incorrect and missed PIKs

In [56]:
incorrectly_linked_pairs = df_ops.persist(df_ops.drop_duplicates(
    sim_record_links_at_least_partially_correct[~sim_record_links_at_least_partially_correct.correct]
        [["record_id_raw_input_file", "record_id_reference_file"]]
))
incorrectly_linked_pairs

Imbalanced dataframe: too_few=False, too_many=True, too_large=False
count     1698.000000
mean     12469.694935
std      12490.898424
min          0.000000
25%          0.000000
50%       9535.500000
75%      24951.250000
max      28393.000000
dtype: float64
Creating partitions of 100MB


Unnamed: 0_level_0,record_id_raw_input_file,record_id_reference_file
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1
,string,string
,...,...


In [57]:
len(incorrectly_linked_pairs)

231653
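
As a consistency check, the per-pass incorrect counts from In [55] should add up to this number, and should reproduce the link-level accuracy from In [48]. The sketch below just redoes that arithmetic from the printed outputs; the variable names are for this sketch only.

```python
# Incorrect-link counts per (module, pass), copied from the output of In [55].
incorrect_by_pass = {
    ("namesearch", "DOB and initials"): 2,
    ("geosearch", "house number and street name Soundex name switch"): 2,
    ("hhcompsearch", "initials"): 16,
    ("hhcompsearch", "year of birth"): 54,
    ("geosearch", "house number and street name Soundex"): 352,
    ("dobsearch", "initials name switch"): 757,
    ("geosearch", "some name and DOB information"): 788,
    ("geosearch", "geokey name switch"): 1_090,
    ("namesearch", "DOB and NYSIIS of name"): 49_509,
    ("geosearch", "geokey"): 179_083,
}
n_links = 306_893_665  # output of In [46]

total_incorrect = sum(incorrect_by_pass.values())
link_level_accuracy = 1 - total_incorrect / n_links

# Share of all incorrect links coming from the two largest contributors
top_two_share = (
    incorrect_by_pass[("geosearch", "geokey")]
    + incorrect_by_pass[("namesearch", "DOB and NYSIIS of name")]
) / total_incorrect

print(total_incorrect, f"{link_level_accuracy:.7%}", f"{top_two_share:.1%}")
```

The two passes that produce the most links overall (geokey, and DOB and NYSIIS of name) also account for the large majority of the incorrect links.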

In [58]:
incorrect_links = df_ops.head(incorrectly_linked_pairs, n=100)
incorrect_links

Unnamed: 0,record_id_raw_input_file,record_id_reference_file
0,simulated_census_2030_0_1000428,simulated_geobase_reference_file_179_1002396
1,simulated_census_2030_0_1000771,simulated_geobase_reference_file_183_1468335
2,simulated_census_2030_0_1000862,simulated_geobase_reference_file_147_3743026
3,simulated_census_2030_0_1001175,simulated_name_dob_reference_file_17_7088846
4,simulated_census_2030_0_1001701,simulated_geobase_reference_file_21_2470024
...,...,...
95,simulated_census_2030_0_151396,simulated_geobase_reference_file_151_74598
96,simulated_census_2030_0_151692,simulated_geobase_reference_file_108_794912
97,simulated_census_2030_0_15224,simulated_geobase_reference_file_197_4064041
98,simulated_census_2030_0_152813,simulated_name_dob_reference_file_17_4241690


In [59]:
%xdel incorrectly_linked_pairs

In [60]:
comparison_cols = [
    "first_name",
    "middle_name",
    "last_name",
    "date_of_birth",
    "street_number",
    "street_name",
    "unit_number",
    "city",
    "state",
]

incorrect_links_detail = (
    incorrect_links
        .merge(
            df_ops.compute(census_2030_piked[census_2030_piked.record_id.isin(incorrect_links.record_id_raw_input_file)])
                .rename(columns={"record_id": "record_id_raw_input_file", "middle_initial": "middle_name"})
                [["record_id_raw_input_file"] + comparison_cols],
            on="record_id_raw_input_file",
            how="left",
        )
        .merge(
            df_ops.compute(reference_file[reference_file.record_id.isin(incorrect_links.record_id_reference_file)])
                .rename(columns={"record_id": "record_id_reference_file"})
                .rename(columns=lambda c: c.replace('mailing_address_', ''))
                [["record_id_reference_file"] + comparison_cols],
            on="record_id_reference_file",
            how="left",
            suffixes=("_census", "_reference_file"),
        )
)
def flatten(xss):
    return [x for xs in xss for x in xs]

incorrect_links_detail[flatten([(f'{c}_census', f'{c}_reference_file') for c in comparison_cols])]

Unnamed: 0,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
0,Aurora,Dean,G,Gabriel,Harmon,Harmon,08/29/2028,20280829,5543,5543,highway 413,HIGHWAY 413,,,richmond,RICHMOND,VA,VA
1,Aurora,Luz,L,Linda,Devi,Devi,11/14/2022,20221114,2140,2140,hwy 88 w,HWY 88 W,,,winter park,WINTER PARK,FL,FL
2,,Benjamin,R,David,Gatian,Gatian,07/25/2024,20240725,606,606,lincoln woods drive,LINCOLN WOODS DRIVE,,,norfolk,NORFOLK,VA,VA
3,Mila,Mila,G,Gia,Garcia,Garcia,10/07/2028,20281007,503,,n villa st,,,,sanford,,ME,
4,Alaia,Martin,A,Alexander,Reitz,Reitz,10/12/2028,20281012,6100,6100,burdette drive,BURDETTE DRIVE,,,chicago,CHICAGO,IL,IL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Zachary,Michele,J,Julie,Goldsby,Goldsby,07/27/1966,19660727,17823-17835,17823-17835,polo pk dr,POLO PK DR,,,east peoria,EAST PEORIA,IL,IL
96,Kash,Carter,M,Man Of,Waters,Waters,02/24/2009,20090224,4243,4243,appian way w,APPIAN WAY W,,,ridgefield,RIDGEFIELD,NJ,NJ
97,Grant,Chad,B,,Nijjar,Nijjar,08/01/1943,19430801,1561,1561,hannalei pl,HANNALEI PL,,,martinez,MARTINEZ,CA,CA
98,Jackson,Jackson,B,Bodhi,Willis,Willis,12/04/2018,20181204,620,,picketts cove,,fl # 1 apt # 1010,,pine valley,,TX,


In [61]:
missed_links = df_ops.persist(
    census_2030_piked[census_2030_piked.pik.isnull()][["record_id"]]
        .merge(census_2030_ground_truth, on="record_id")
        .merge(reference_files_ground_truth[reference_files_ground_truth.n_unique_simulants == 1], on="simulant_id", suffixes=("_census", "_reference_file"))
)

In [62]:
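# Number of (census record, reference file record) pairs, not the number of un-PIKed census records
len(missed_links)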

107488068

In [63]:
simulants_missed = df_ops.head(missed_links[['simulant_id']], n=100).simulant_id.unique()
simulants_missed

<ArrowStringArray>
[  '352_205030', '3454_1055553',  '1282_803576', '1482_1106062',
 '1648_1145213',  '8612_598382',  '9696_241917',   '440_818214',
 '5831_1163158',  '9643_779883', '2298_1157048',  '682_1102032',
  '9804_323231',  '4649_635246', '9911_1133450',   '278_343671',
  '7938_532559',  '6144_683864',  '9901_575637',   '6760_71186',
  '3142_100881',   '107_178364', '4950_1032017',  '1182_579036',
  '8869_544928', '9840_1053660',  '9740_349524',    '93_180423',
 '3607_1151651', '8425_1183995']
Length: 30, dtype: string

In [64]:
missed_pairs = df_ops.compute(missed_links[missed_links.simulant_id.isin(list(simulants_missed))])
missed_pairs

Unnamed: 0,record_id_census,simulant_id,possible_to_pik,record_id_reference_file,n_unique_simulants
0,simulated_census_2030_11_172699,352_205030,1.0,simulated_geobase_reference_file_35_3252330,1
1,simulated_census_2030_11_172699,352_205030,1.0,simulated_name_dob_reference_file_41_1554120,1
2,simulated_census_2030_11_172699,352_205030,1.0,simulated_geobase_reference_file_35_3252331,1
3,simulated_census_2030_110_892085,3454_1055553,1.0,simulated_geobase_reference_file_55_2082491,1
4,simulated_census_2030_110_892085,3454_1055553,1.0,simulated_name_dob_reference_file_26_3222629,1
...,...,...,...,...,...
96,simulated_census_2030_2_152033,93_180423,1.0,simulated_name_dob_reference_file_10_4494978,1
97,simulated_census_2030_122_979888,3607_1151651,1.0,simulated_geobase_reference_file_86_2343454,1
98,simulated_census_2030_122_979888,3607_1151651,1.0,simulated_name_dob_reference_file_29_5413790,1
99,simulated_census_2030_281_1010739,8425_1183995,1.0,simulated_name_dob_reference_file_42_1276576,1


In [65]:
%xdel missed_links

In [66]:
missed_links_detail = (
    missed_pairs
        .merge(
            df_ops.compute(census_2030_piked[census_2030_piked.record_id.isin(list(missed_pairs.record_id_census))])
                .rename(columns={"record_id": "record_id_census", "middle_initial": "middle_name"}),
            on="record_id_census",
        )
        .merge(
            df_ops.compute(reference_file[reference_file.record_id.isin(missed_pairs.record_id_reference_file)])
                .rename(columns=lambda c: c.replace('mailing_address_', ''))
                .rename(columns={"record_id": "record_id_reference_file"}),
            on="record_id_reference_file",
            suffixes=("_census", "_reference_file"),
        )
)

In [67]:
for simulant in simulants_missed:
    print(simulant)
    display(missed_links_detail[missed_links_detail.simulant_id == simulant][['simulant_id'] + flatten([(f'{c}_census', f'{c}_reference_file') for c in comparison_cols])])

352_205030


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
0,352_205030,Shlomie,,D,Desean,Smallwood,Smallwood,22/10/1995,19951022,14308,14308.0,colby dr,,,,watertown,WATERTOWN,CT,CT
1,352_205030,Shlomie,,D,Desean,Smallwood,Smallwood,22/10/1995,19951022,14308,,colby dr,,,,watertown,,CT,
2,352_205030,Shlomie,,D,Desean,Smallwood,Smallwood,22/10/1995,19951022,14308,14308.0,colby dr,COLBY DR,,,watertown,WATERTOWN,CT,CT


3454_1055553


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
3,3454_1055553,Knowledge,Knowledge,B,Mrs,Bach,Bach,03/18/2022,20261803,,,logans run,LOGANS RUN,,,rhome,RHOME,TX,TX
4,3454_1055553,Knowledge,Knowledge,B,Mrs,Bach,Bach,03/18/2022,20261803,,,logans run,,,,rhome,,TX,


1282_803576


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
5,1282_803576,Jeffrey,Jeffrey,T,Thomas,Wilson,Wilson,10/19/1990,19901019,2119,11690.0,pine ridge rd,NE TOMAHAWK ID DR,,,belleville,SAN BERNARDINO,IL,CA
6,1282_803576,Jeffrey,Jeffrey,T,Thomas,Wilson,Wilson,10/19/1990,19901019,2119,2119.0,pine ridge rd,PINE RIDGE RD,,,belleville,BELLEVILLE,IL,
7,1282_803576,Jeffrey,Jeffrey,T,Thomas,Wilson,Wilson,10/19/1990,19901019,2119,,pine ridge rd,,,,belleville,,IL,
8,1282_803576,Jeffrey,Jeffrey,T,Thomas,Wilson,Wilson,10/19/1990,19901019,2119,6594.0,pine ridge rd,POOL ST,,,belleville,TC,IL,TX
9,1282_803576,Jeffrey,Jeffrey,T,Thomas,Wilson,Wilson,10/19/1990,19901019,2119,2119.0,pine ridge rd,PINE RIDGE RD,,,belleville,BELLEVILLE,IL,IL
10,1282_803576,Jeffrey,Jeffrey,T,Thomas,Wilson,Wilson,10/19/1990,19901019,2119,,pine ridge rd,PINE RIDGE RD,,,belleville,BELLEVILLE,IL,IL


1482_1106062


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
11,1482_1106062,Moriah,Moriah,Z,Zhuri,Butler,Butler,12/29/2024,,680,,whitehall road,,,,philadelphia,,PA,
12,1482_1106062,Moriah,Moriah,Z,Zhuri,Butler,Butler,12/29/2024,,680,,whitehall road,WHITEHALL ROAD,,,philadelphia,PHILADELPHIA,PA,PA
13,1482_1106062,Moriah,Moriah,Z,Zhuri,Butler,Butler,12/29/2024,,680,680.0,whitehall road,WHITEHALL ROAD,,,philadelphia,PHILADELPHIA,PA,PA


1648_1145213


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
14,1648_1145213,Winston,Winston,L,Logan,Gilpin,Gilpin,04/16/2027,20290402,1629,385.0,hargis creek trl,LINDA VSTA R,,,lynwood,WASHINGTON,IL,DC
15,1648_1145213,Winston,Winston,L,Logan,Gilpin,Gilpin,04/16/2027,20290402,1629,,hargis creek trl,,,,lynwood,,IL,
16,1648_1145213,Winston,Winston,L,Logan,Gilpin,Gilpin,04/16/2027,20290402,1629,1629.0,hargis creek trl,HARGIS CREEK TRL,,,lynwood,LYNWOOD,IL,IL


8612_598382


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
17,8612_598382,Susanne,Susanne,A,Amanda,Tuberville,Tuberville,12/18/1999,19850505,55,55.0,,BRYAN LN,,,wayne,WAYNE,NJ,
18,8612_598382,Susanne,Susanne,A,Amanda,Tuberville,Tuberville,12/18/1999,19850505,55,55.0,,BRYAN LN,,,wayne,WAYNE,NJ,NJ
19,8612_598382,Susanne,Susanne,A,Amanda,Tuberville,Tuberville,12/18/1999,19850505,55,3931.0,,MORSE OAKS CIR,,,wayne,WEST VINCENT,NJ,PA
20,8612_598382,Susanne,Susanne,A,Amanda,Tuberville,Tuberville,12/18/1999,19850505,55,,,,,,wayne,,NJ,
21,8612_598382,Susanne,Susanne,A,Amanda,Tuberville,Tuberville,12/18/1999,19850505,55,,,BRYAN LN,,,wayne,WAYNE,NJ,NJ


9696_241917


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
22,9696_241917,Abigail,Abigail,C,Cynthia,Stull,Stull,11/09/6005,20051109,2812,2812.0,rainbow ht road,RAINBOW HT ROAD,,,floyds knobs,FLOVDS KNOBS,IN,IN
23,9696_241917,Abigail,Abigail,C,Cynthia,Stull,Stull,11/09/6005,20051109,2812,2812.0,rainbow ht road,RAINBOW HT ROAD,,,floyds knobs,FLOYDS KNOBS,IN,DC
24,9696_241917,Abigail,Abigail,C,Cynthia,Stull,Stull,11/09/6005,20051109,2812,2812.0,rainbow ht road,RAINBOW HT ROAD,,,floyds knobs,FLOYDS KNOBS,IN,IN
25,9696_241917,Abigail,Abigail,C,Cynthia,Stull,Stull,11/09/6005,20051109,2812,,rainbow ht road,,,,floyds knobs,,IN,
26,9696_241917,Abigail,Abigail,C,Cynthia,Stull,Stull,11/09/6005,20051109,2812,2812.0,rainbow ht road,RA:NBOW HT ROAD,,,floyds knobs,FLOYDS KNOBS,IN,IN


440_818214


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
27,440_818214,Sara,Sara,M,M,Declined,Ingram,04/14/1976,,543,,grandview rd,,,,topeka,,KS,
28,440_818214,Sara,Sara,M,M,Declined,Ingr,04/14/1976,,543,,grandview rd,,,,topeka,,KS,
29,440_818214,Sara,Sara,M,M,Declined,Ingram,04/14/1976,,543,543.0,grandview rd,GRANDVIEW RD,,,topeka,TOPEKA,KS,KS
30,440_818214,Sara,Sara,M,M,Declined,Ingr,04/14/1976,,543,543.0,grandview rd,GRANDVIEW RD,,,topeka,TOPEKA,KS,KS


5831_1163158


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
31,5831_1163158,Aylin,Aylin,A,N,Pellecer-Ramirez,Pellecer-Ramirez,04/24/2028,19870117,1557,,second st,,,,pacoima,,CA,
32,5831_1163158,Aylin,Aylin,A,N,Pellecer-Ramirez,Pellecer-Ramirez,04/24/2028,19870117,1557,1557.0,second st,SECOND ST,,,pacoima,PACOIMA,CA,CA


9643_779883


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
33,9643_779883,James,James,D,D,Bautista,Baut,02/19/1959,,81,,cr 6022b,,,,elizabeth,,NJ,
34,9643_779883,James,James,D,D,Bautista,Baut,02/19/1959,,81,81.0,cr 6022b,CR 6022B,,,elizabeth,ELIZABETH,NJ,NJ


2298_1157048


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
35,2298_1157048,Elias,Elias,M,M,Larson,Larson,08/16/2003,,223,,farmington place,,,,saint louis,,MO,
36,2298_1157048,Elias,Elias,M,M,Larson,Larson,08/16/2003,,223,223.0,farmington place,FARMINGTON PLACE,,,saint louis,SAINT LOUIS,MO,MO
37,2298_1157048,Elias,Elias,M,M,Larson,Larson,08/16/2003,,223,223.0,farmington place,FARMINGGON PLACE,,,saint louis,SAINT LOUIS,MO,MO


682_1102032


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
38,682_1102032,Nicholas,Nicholas,M,Milo,Holden,Holden,,20241106,1150,1150.0,dwns rd,,,,taylor,TAYLOR,MI,MI
39,682_1102032,Nicholas,Nicholas,M,Milo,Holden,Holden,,20241106,1150,1414.0,dwns rd,EISENHOWER AVN,,,taylor,MOBILE,MI,AL
40,682_1102032,Nicholas,Nicholas,M,Milo,Holden,Holden,,20241106,1150,,dwns rd,,,,taylor,,MI,


9804_323231


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
41,9804_323231,A,Olivia,Aboud,Amy,Olivia,Aboud,11/30/2017,20171130,940,940.0,colby street,COLBY STREET,,,montgomery,MONTGOMERY,AL,AL
42,9804_323231,A,Olivia,Aboud,Amy,Olivia,Aboud,11/30/2017,20171130,940,,colby street,,,,montgomery,,AL,


4649_635246


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
43,4649_635246,Connie,Connie,A,Anna,Kurniawan,Kurniawan,04/10/1976,19761004,1992,,jerald ave,,,,st louis,,MO,
44,4649_635246,Connie,Connie,A,Anna,Kurniawan,Kurniawan,04/10/1976,19761004,1992,1992.0,jerald ave,JERALD AVE,,,st louis,ST LOUIS,MO,MO


9911_1133450


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
45,9911_1133450,Horacio,Horacio,T,T,Varghese,Varghese,09/04/1974,,134,,s wasatch dr,,,,chatham,,NJ,
46,9911_1133450,Horacio,Horacio,T,T,Varghese,Varg,09/04/1974,,134,134.0,s wasatch dr,S WASATCH DR,,,chatham,CHATHAM,NJ,NJ
47,9911_1133450,Horacio,Horacio,T,T,Varghese,Varghese,09/04/1974,,134,134.0,s wasatch dr,S WASATCH DR,,,chatham,CHATHAM,NJ,NJ
48,9911_1133450,Horacio,Horacio,T,T,Varghese,Varg,09/04/1974,,134,,s wasatch dr,,,,chatham,,NJ,


278_343671


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
49,278_343671,Robin,Robin,K,Kaela,Whall,Whall,04/15/1891,19910415,189,189.0,northeast faloma road,NORTHEAST FALOMA ROAD,,,lincoln,LINCOLN,NE,NE
50,278_343671,Robin,Robin,K,Kaela,Whall,Whall,04/15/1891,19910415,189,4469.0,northeast faloma road,W BUTLER ST,,,lincoln,MEMPHIS,NE,TN
51,278_343671,Robin,Robin,K,Kaela,Whall,Whall,04/15/1891,19910415,189,,northeast faloma road,,,,lincoln,,NE,
52,278_343671,Robin,Robin,K,Kaela,Whall,Whall,04/15/1891,19910415,189,4469.0,northeast faloma road,W BUTLER ST,,,lincoln,MEMPHIS,NE,TN


7938_532559


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
53,7938_532559,J,Lillian,Lillian,Jessica,Alfaro,Alfaro,10/21/2003,20031021,980,980.0,e movie ranch rd,E MOVIE RANCH RD,,,murfreesboro,MURFREESBORO,TN,TN
54,7938_532559,J,Lillian,Lillian,Jessica,Alfaro,Alfaro,10/21/2003,20031021,980,,e movie ranch rd,,,,murfreesboro,,TN,
55,7938_532559,J,Lillian,Lillian,Jessica,Alfaro,Alfaro,10/21/2003,20031021,980,9418.0,e movie ranch rd,PAYETTE DR,,,murfreesboro,HYDETOWN BORO,TN,PA


6144_683864


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
56,6144_683864,Charles,Charles,C,Cruz,Ortiz,Ortiz,,20150811,5516,,provincial way,,,,east brainerd,,TN,
57,6144_683864,Charles,Charles,C,Cruz,Ortiz,Ortiz,,20150811,5516,7195.0,provincial way,SW ALGER AVE,,APRT # 51 E,east brainerd,CAPE CORAL,TN,FL
58,6144_683864,Charles,Charles,C,Cruz,Ortiz,Ortiz,,20150811,5516,7195.0,provincial way,SW ALGER AVE,,APRT # 51 E,east brainerd,CAPE CORAL,TN,FL


9901_575637


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
59,9901_575637,Simon,Simon,M,Matthew,Perillo,Perillo,04/01/1962,20160907,609,,ivy wy,,,,mesa,,AZ,
60,9901_575637,Simon,Simon,M,Matthew,Perillo,Perillo,04/01/1962,20160907,609,609.0,ivy wy,IVY WY,,,mesa,MESA,AZ,AZ
61,9901_575637,Simon,Simon,M,Matthew,Perillo,Perillo,04/01/1962,20160907,609,609.0,ivy wy,IVY WY,,,mesa,MESA,AZ,CT


6760_71186


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
62,6760_71186,Heather,Heather,T,T,Pachco,Pach,04/17/1992,,1020,,park bluff way,,,,bakersfield,,CA,
63,6760_71186,Heather,Heather,T,T,Pachco,Pacheco,04/17/1992,,1020,,park bluff way,,,,bakersfield,,CA,
64,6760_71186,Heather,Heather,T,T,Pachco,Pacheco,04/17/1992,,1020,1020,park bluff way,PARK BLUFF WAY,,,bakersfield,BAKERSFIELD,CA,CA
65,6760_71186,Heather,Heather,T,T,Pachco,Pach,04/17/1992,,1020,10Z0,park bluff way,PARK BLUFF WAY,,,bakersfield,BAKERSFIELD,CA,CA
66,6760_71186,Heather,Heather,T,T,Pachco,Girl,04/17/1992,,1020,10Z0,park bluff way,PARK BLUFF WAY,,,bakersfield,BAKERSFIELD,CA,CA
67,6760_71186,Heather,Heather,T,T,Pachco,Pacheco,04/17/1992,,1020,10Z0,park bluff way,PARK BLUFF WAY,,,bakersfield,BAKERSFIELD,CA,CA
68,6760_71186,Heather,Heather,T,T,Pachco,Girl,04/17/1992,,1020,1020,park bluff way,PARK BLUFF WAY,,,bakersfield,BAKERSFIELD,CA,CA
69,6760_71186,Heather,Heather,T,T,Pachco,Pach,04/17/1992,,1020,1020,park bluff way,PARK BLUFF WAY,,,bakersfield,BAKERSFIELD,CA,CA
70,6760_71186,Heather,Heather,T,T,Pachco,Girl,04/17/1992,,1020,,park bluff way,,,,bakersfield,,CA,


3142_100881


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
71,3142_100881,Molly,Molly,C,C,Norton,Nort,02/10/1978,,2214,,bocaw place,,,,san diego,,CA,
72,3142_100881,Molly,Molly,C,C,Norton,Nort,02/10/1978,,2214,2214.0,bocaw place,BOCAW PLACE,,,san diego,SAN DIEGO,CA,CA


107_178364


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
73,107_178364,Brigette,Brigette,S,Savanna,Ito,Ito,05/03/2014,50120503,6,,frayne dr,,,,san jose,GLENDALE,CA,CA
74,107_178364,Brigette,Brigette,S,Savanna,Ito,Ito,05/03/2014,50120503,6,,frayne dr,,,,san jose,,CA,


4950_1032017


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
75,4950_1032017,Madison,Madison,C,C,Nagarajan,Naga,08/10/1989,,215,18.0,heatherfield east dr,E HAMPTON AVE,,,croghan,SPRING CREEK HOUSING,NY,NV
76,4950_1032017,Madison,Madison,C,C,Nagarajan,Naga,08/10/1989,,215,,heatherfield east dr,,,,croghan,,NY,


1182_579036


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
77,1182_579036,Mirian,,L,Leah,Stewart,Stewart,07/07/2002,20020707,105,4010.0,sweetwater rd,S BOLD FORBES BLVD,,,rochester,DEMING,NY,WA
78,1182_579036,Mirian,,L,Leah,Stewart,Stewart,07/07/2002,20020707,105,,sweetwater rd,,,,rochester,,NY,


8869_544928


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
79,8869_544928,Shawn,Shawn,G,G,,Mcclaskey,11/26/1984,,2525,2525.0,mccall bri rd,MCCALL BRI RD,,,e st louis,E ST LOUIS,IL,IL
80,8869_544928,Shawn,Shawn,G,G,,Mccl,11/26/1984,,2525,2525.0,mccall bri rd,MCCALL BRI RD,,,e st louis,E ST LOUIS,IL,IL
81,8869_544928,Shawn,Shawn,G,G,,Mccl,11/26/1984,,2525,2525.0,mccall bri rd,MCCALL BRI RD,,,e st louis,E ST LOUIS,IL,IL
82,8869_544928,Shawn,Shawn,G,G,,Mcclaskey,11/26/1984,,2525,2525.0,mccall bri rd,MCCALL BRI RD,,,e st louis,E ST LOUIS,IL,IL
83,8869_544928,Shawn,Shawn,G,G,,Mcclaskey,11/26/1984,,2525,,mccall bri rd,,,,e st louis,,IL,
84,8869_544928,Shawn,Shawn,G,G,,Mccl,11/26/1984,,2525,,mccall bri rd,,,,e st louis,,IL,


9840_1053660


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
85,9840_1053660,Rosalva,Rosalva,D,D,Daniel,Daniel,11/17/2000,,,,hillside rd,,,,duarte,,CA,
86,9840_1053660,Rosalva,Rosalva,D,D,Daniel,Daniel,11/17/2000,,,,hillside rd,HILLSIDE RD,,,duarte,DUARTE,CA,CA
87,9840_1053660,Rosalva,Rosalva,D,D,Daniel,Daniel,11/17/2000,,,2532.0,hillside rd,N ARMOUR ST,,,duarte,CHESTERFIELD,CA,MO
88,9840_1053660,Rosalva,Rosalva,D,D,Daniel,Dani,11/17/2000,,,2532.0,hillside rd,N ARMOUR ST,,,duarte,CHESTERFIELD,CA,MO
89,9840_1053660,Rosalva,Rosalva,D,D,Daniel,Dani,11/17/2000,,,,hillside rd,HILLSIDE RD,,,duarte,DUARTE,CA,CA
90,9840_1053660,Rosalva,Rosalva,D,D,Daniel,Daniel,11/17/2000,,,,hillside rd,HILLSIDE RD,,,duarte,DUARTE,CA,CA
91,9840_1053660,Rosalva,Rosalva,D,D,Daniel,Dani,11/17/2000,,,,hillside rd,,,,duarte,,CA,
92,9840_1053660,Rosalva,Rosalva,D,D,Daniel,Daniel,11/17/2000,,,,hillside rd,,,,duarte,,CA,


9740_349524


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
93,9740_349524,Jacob,Lady Of House,C,Calvin,Zak,Zak,10/35/1984,19841031,43524,,krouse ct,,,,plano,,TX,
94,9740_349524,Jacob,Lady Of House,C,Calvin,Zak,Zak,10/35/1984,19841031,43524,43524.0,krouse ct,KROUSE CT,,,plano,PLANO,TX,TX


93_180423


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
95,93_180423,Quanesha,Quanesha,Z,Zoe,Lloyd,Lbyd,10/05/1991,19911005,2201,2201.0,wildwood drive,WILDWOOD DRIVE,,,wbarton,WHARTON,TX,TX
96,93_180423,Quanesha,Quanesha,Z,Zoe,Lloyd,Lbyd,10/05/1991,19911005,2201,,wildwood drive,,,,wbarton,,TX,


3607_1151651


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
97,3607_1151651,G,Owen,Natarajan,Giovanni,Owen,Natarajan,08/03/2027,20270803,2575,2575.0,co rd 16,CO RD 16,,,pukalani,PUKALANI,HI,HI
98,3607_1151651,G,Owen,Natarajan,Giovanni,Owen,Natarajan,08/03/2027,20270803,2575,,co rd 16,,,,pukalani,,HI,


8425_1183995


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
99,8425_1183995,Harper,Harper,H,Harper,Decotis,Decotis,053/08/2017,20170308,1065,,hillside av,,,,norman,,OK,
100,8425_1183995,Harper,Harper,H,Harper,Decotis,Decotis,053/08/2017,20170308,1065,5065.0,hillside av,HILLSIDE AV,,,norman,NORMAN,OK,OK
