# Simulated PIK statistics

Here we inspect the accuracy and characteristics of the PIKs assigned,
leveraging our knowledge of ground truth from pseudopeople.

A ground-truth comparison like this isn't possible with the real PVS, but
Layne, Wagner, and Rothhaas did something similar: they redacted SSN from real records,
sent them through PVS without the SSN, and then used the true SSN
as ground truth.
The health care records they used are probably quite different from a CUF,
but they found **very** high overall PIK accuracy (see cell below).
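
As a rough illustration of the ground-truth idea, here is a minimal sketch on toy data. The column names `pik` and `true_id` and the records themselves are hypothetical, not taken from the case study outputs; the point is only that once the true identity is known, the PIK rate and the accuracy among PIKed records are simple proportions.

```python
import pandas as pd

# Hypothetical toy records: 'pik' is the assigned identifier (None if the
# record was not PIKed) and 'true_id' is the ground-truth identifier.
records = pd.DataFrame({
    "record_id": ["r1", "r2", "r3", "r4", "r5"],
    "pik": ["a", "b", None, "d", "x"],
    "true_id": ["a", "b", "c", "d", "e"],
})

# Share of records that received any PIK at all
pik_rate = records["pik"].notnull().mean()

# Among PIKed records, share whose PIK matches the ground truth
piked = records[records["pik"].notnull()]
pik_accuracy = (piked["pik"] == piked["true_id"]).mean()

print(f"{pik_rate:.2%} PIKed, {pik_accuracy:.2%} of those correct")
```

In the redaction approach described above, the true SSN would play the role of `true_id`.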

In [1]:
# Query planning is now on by default, but it has some rough edges.
# See https://github.com/dask/dask/issues/10995 for general discussion
# and https://github.com/dask/dask-expr/issues/1060 for the particular
# issue I ran into.
import dask
dask.config.set({"dataframe.query-planning": False})

<dask.config.set at 0x7f3024392d70>

In [2]:
import datetime, os

from vivarium_research_prl import distributed_compute, utils
from IPython.display import display

In [3]:
print(datetime.datetime.now())

2024-05-13 11:02:49.856938


In [4]:
# DO NOT EDIT if this notebook is not called ground_truth_accuracy.ipynb!
# This notebook is designed to be run with papermill; this cell is tagged 'parameters'
data_to_use = 'small_sample'
simulated_data_output_dir = 'output/generate_simulated_data'
case_study_output_dir = 'output'

# The "compute engine" is what we use on the Python side
# for our case-study-specific operations,
# as opposed to the Splink engine
compute_engine = 'pandas'
# These only matter if using a distributed compute engine
compute_engine_num_jobs = 3
compute_engine_cpus_per_job = 2
compute_engine_memory_per_job = "5GB"
queue = "long.q"
local_directory = f"/tmp/{os.environ['USER']}_dask"

In [5]:
# Parameters
data_to_use = "usa"
simulated_data_output_dir = "/mnt/team/simulation_science/priv/users/zmbc/prl/person_linkage_case_study/generate_simulated_data/"
case_study_output_dir = "/mnt/team/simulation_science/priv/users/zmbc/prl/person_linkage_case_study/person_linkage_case_study/"
compute_engine = "dask"
compute_engine_num_jobs = 50
compute_engine_memory_per_job = "40GB"
compute_engine_cpus_per_job = 2
local_directory = "/mnt/team/simulation_science/priv/users/zmbc/prl/person_linkage_case_study_tmp/dask_local_run/dask_local"


In [6]:
# Parameters for a USA run
# data_to_use = "usa"
# simulated_data_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/generate_simulated_data"
# case_study_output_dir = "/ihme/scratch/users/zmbc/person_linkage_case_study/person_linkage_case_study"

# compute_engine = 'dask'
# compute_engine_num_jobs = 50
# compute_engine_memory_per_job = "120GB"
# compute_engine_cpus_per_job = 2

In [7]:
case_study_output_dir = f'{case_study_output_dir}/{data_to_use}'
simulated_data_output_dir = f'{simulated_data_output_dir}/{data_to_use}'

In [8]:
df_ops, pd = distributed_compute.start_compute_engine(
    compute_engine,
    num_jobs=compute_engine_num_jobs,
    cpus_per_job=compute_engine_cpus_per_job,
    memory_per_job=compute_engine_memory_per_job,
    queue=queue,
    local_directory=local_directory,
)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 45801 instead


Connection method: Cluster object
Cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://10.158.111.9:45801/status
Scheduler comm: tcp://10.158.111.9:41223
Started: Just now
Workers: 50 (each with 1 thread and 37.25 GiB memory)
Total threads: 50
Total memory: 1.82 TiB
Worker local directories: /mnt/team/simulation_science/priv/users/zmbc/prl/person_linkage_case_study_tmp/dask_local_run/dask_local/dask-scratch-space/worker-*

In [9]:
census_2030_piked = df_ops.read_parquet(f'{case_study_output_dir}/census_2030_piked.parquet')
confirmed_piks_with_ground_truth = df_ops.read_parquet(f'{case_study_output_dir}/confirmed_piks.parquet')

In [10]:
piked_proportion = df_ops.compute(census_2030_piked.pik.notnull().mean())
# Compare with 90.28% of input records PIKed in the 2010 CUF,
# as reported in Wagner and Layne, Table 2, p. 18 
print(f'{piked_proportion:.2%} of the input records were PIKed')

88.62% of the input records were PIKed


In [11]:
# Count Census rows per PIK. Multiple rows assigned the same PIK means the model thinks they are duplicates in Census
pik_sizes = df_ops.persist(df_ops.groupby_agg_small_groups(census_2030_piked, by='pik', agg_func=lambda x: x.size()))
df_ops.compute(pik_sizes.value_counts())

1    302545510
2       684065
3          668
4            4
Name: count, dtype: int64

In [12]:
# Interesting: in pseudopeople, sometimes siblings are assigned the same (common) first name, making them almost identical.
# The only giveaway is their age and DOB.
# Presumably, this tends not to happen in real life.
duplicate_piks = pik_sizes.rename('pik_size').reset_index().pipe(lambda df: df[df.pik_size > 1])

df_ops.head(census_2030_piked.merge(duplicate_piks, on="pik").sort_values('pik'))

Unnamed: 0,household_id,first_name,middle_initial,last_name,age,date_of_birth,street_number,street_name,unit_number,city,state,zipcode,housing_type,relationship_to_reference_person,sex,race_ethnicity,year,record_id,pik,pik_size
2009,93_411256,Austin,L,Carroll,11,12/04/2018,3701,albemarle dri,,pleasant grove,AL,35654,Household,Other relative,Male,White,2030,simulated_census_2030_2_1025398,0_1000309,2
2705,93_59123,Austin,L,Carroll,11,12/04/2018,1252,w 72nd ave,,rochester,NY,10075,Household,Reference person,Male,White,2030,simulated_census_2030_2_124106,0_1000309,2
1413,6800_184521,Kinley,M,Murray,16,03/11/2014,12801,park plza,,st. peters,MO,63115,Household,Biological child,Female,White,2030,simulated_census_2030_233_388109,0_10005,2
3004,6800_487914,Kinley,M,Murray,16,03/11/2014,671,new hope rd,,englewood,CO,80916,Household,Other relative,Female,White,2030,simulated_census_2030_233_1027806,0_10005,2
1431,6790_298700,David,J,Barrick,68,10/24/1961,345,iefferson ave,,,MO,65441,Household,Opposite-sex spouse,Male,White,2030,simulated_census_2030_231_627034,0_1000919,2
3685,6790_298700,James,J,Barrick,36,10/24/1961,345,jefferson ave,,jefferson city,MO,65441,Household,Adopted child,Male,White,2030,simulated_census_2030_231_627035,0_1000919,2
3753,8527_376719,Jacob,J,Tapley,17,04/20/2012,1810,a st southeast,,fairborn,OH,45417,Household,Other nonrelative,Male,White,2030,simulated_census_2030_285_789162,0_1001605,2
3957,8527_571898,Jacob,J,Tapley,17,04/20/2012,314,hicks hill road,,florissant,MO,64068,Household,Other relative,Male,White,2030,simulated_census_2030_285_1025935,0_1001605,2
1000,1362_3,Emily,S,Elwell,21,12/17/2008,505,e gn st,,elk point,SD,57029,College,Noninstitutionalized group quarters population,Female,Black,2030,simulated_census_2030_46_47324,0_1001652,2
3748,1362_22687,Emily,S,Elwell,21,12/17/2008,127,tri st,,addison,IL,60555,Household,Other relative,Female,Black,2030,simulated_census_2030_46_1025428,0_1001652,2
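
The duplicate-PIK check in the two cells above can be sketched in plain pandas on toy data. This is only an illustration under assumptions: the toy frame is invented, and plain `groupby(...).size()` stands in for the `df_ops.groupby_agg_small_groups` helper, whose internals are not shown here.

```python
import pandas as pd

# Hypothetical toy PIKed file: PIK 'a' is shared by two rows, 'c' by three.
toy = pd.DataFrame({
    "record_id": ["r1", "r2", "r3", "r4", "r5", "r6"],
    "pik": ["a", "a", "b", "c", "c", "c"],
})

# Number of rows assigned to each PIK
pik_sizes = toy.groupby("pik").size()

# How many PIKs have 1 row, 2 rows, 3 rows, ...
size_distribution = pik_sizes.value_counts().sort_index()

# Keep only PIKs with more than one row, then pull the matching records
duplicate_piks = pik_sizes[pik_sizes > 1].rename("pik_size").reset_index()
flagged = toy.merge(duplicate_piks, on="pik").sort_values("pik")

print(size_distribution)
print(flagged)
```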


## Ground truth statistics

In [13]:
census_2030_ground_truth = df_ops.persist(
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_census_2030_ground_truth.parquet')
)

In [14]:
# In this version of pseudopeople, there are no actual duplicates in Census,
# which means all of the duplicates identified above are wrong.
assert len(census_2030_ground_truth) == len(df_ops.drop_duplicates(census_2030_ground_truth))

In [15]:
reference_files_ground_truth = df_ops.persist(df_ops.concat([
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_geobase_reference_file_ground_truth.parquet').drop(columns=['n_unique_simulants']),
    df_ops.read_parquet(f'{simulated_data_output_dir}/simulated_name_dob_reference_file_ground_truth.parquet').drop(columns=['n_unique_simulants']),
], ignore_index=True))

Imbalanced dataframe: too_few=False, too_many=True, too_large=False
count    8.720000e+02
mean     1.215106e+08
std      4.555555e+07
min      7.575983e+07
25%      7.598293e+07
50%      1.214460e+08
75%      1.670462e+08
max      1.673512e+08
dtype: float64
Creating partitions of 2,119MB


In [16]:
# However, a reference file record can correspond to multiple simulants,
# due to errors in how the reference files are constructed by SSN
n_unique_simulants = df_ops.persist(df_ops.groupby_agg_small_groups(reference_files_ground_truth, by='record_id', agg_func=lambda x: x.simulant_id.nunique()).rename('n_unique_simulants').reset_index())
df_ops.compute(n_unique_simulants.n_unique_simulants.value_counts())

n_unique_simulants
1    1229041214
2      59021232
3       1771663
4        102952
5          6057
6           355
7            28
9             1
Name: count, dtype: int64
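
For intuition, the per-record count of distinct simulants can be reproduced on toy data with plain pandas. The frame below is hypothetical, and `groupby(...).nunique()` is used here in place of the `df_ops` helper; it is a sketch of the same calculation, not the notebook's actual implementation.

```python
import pandas as pd

# Hypothetical ground truth: one row per (reference record, contributing simulant).
toy_ref = pd.DataFrame({
    "record_id": ["ref1", "ref2", "ref2", "ref3", "ref3", "ref3"],
    "simulant_id": ["s1", "s2", "s3", "s4", "s4", "s5"],
})

# Distinct simulants behind each reference file record
n_unique = (
    toy_ref.groupby("record_id")["simulant_id"]
    .nunique()
    .rename("n_unique_simulants")
    .reset_index()
)

# Attach the count back onto every ground-truth row
annotated = toy_ref.merge(n_unique, on="record_id", how="left")

print(n_unique["n_unique_simulants"].value_counts().sort_index())
```

Note that repeated rows for the same simulant (`s4` above) do not inflate the count, since only distinct `simulant_id` values are counted.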

In [17]:
reference_files_ground_truth = df_ops.persist(reference_files_ground_truth.merge(
    n_unique_simulants,
    on='record_id',
    how='left',
))
reference_files_ground_truth.head(n=100)

Unnamed: 0,record_id,simulant_id,n_unique_simulants
0,simulated_geobase_reference_file_0_1007416,3202_98373,1
1,simulated_geobase_reference_file_0_1009259,2416_1130696,1
2,simulated_geobase_reference_file_0_101754,6539_783854,1
3,simulated_geobase_reference_file_0_1030795,9272_24332,1
4,simulated_geobase_reference_file_0_1059381,5698_1116162,1
...,...,...,...
95,simulated_geobase_reference_file_0_486819,1753_69067,1
96,simulated_geobase_reference_file_0_499302,5949_192171,2
97,simulated_geobase_reference_file_0_499302,4344_791440,2
98,simulated_geobase_reference_file_0_531999,2047_838586,1


In [18]:
df_ops.head(reference_files_ground_truth[reference_files_ground_truth.n_unique_simulants == df_ops.compute(reference_files_ground_truth.n_unique_simulants.max())])

Unnamed: 0,record_id,simulant_id,n_unique_simulants
12778090,simulated_geobase_reference_file_369_1409656,6554_95393,9
12778091,simulated_geobase_reference_file_369_1409656,6554_95397,9
12778092,simulated_geobase_reference_file_369_1409656,6554_95392,9
12778093,simulated_geobase_reference_file_369_1409656,6554_95390,9
12778094,simulated_geobase_reference_file_369_1409656,6554_95396,9
12778095,simulated_geobase_reference_file_369_1409656,6554_95389,9
12778096,simulated_geobase_reference_file_369_1409656,6554_95403,9
12778097,simulated_geobase_reference_file_369_1409656,6554_95406,9
12778098,simulated_geobase_reference_file_369_1409656,6554_95395,9


In [19]:
census_2030_ground_truth = df_ops.persist(census_2030_ground_truth.merge(
    df_ops.drop_duplicates(reference_files_ground_truth[['simulant_id']]).assign(possible_to_pik=1),
    on='simulant_id',
    how='left',
).assign(possible_to_pik=lambda df: df.possible_to_pik.fillna(0)))
possible_to_pik_proportion = df_ops.compute(census_2030_ground_truth.possible_to_pik.mean())
print(
    f'{(1 - possible_to_pik_proportion):.2%} of the input records are '
    'impossible to PIK correctly, since they are not in any reference files'
)

0.43% of the input records are impossible to PIK correctly, since they are not in any reference files
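
The `possible_to_pik` flag above comes from a left merge against the de-duplicated simulants in the reference files, with unmatched rows filled with 0. A minimal pandas sketch of the same pattern, using made-up IDs rather than the case study data:

```python
import pandas as pd

# Hypothetical inputs: three census records, three reference file records
toy_census = pd.DataFrame({
    "record_id": ["c1", "c2", "c3"],
    "simulant_id": ["s1", "s2", "s3"],
})
toy_reference = pd.DataFrame({
    "record_id": ["ref_1", "ref_2", "ref_3"],
    "simulant_id": ["s1", "s1", "s2"],
})

toy_census = toy_census.merge(
    # De-duplicate first so the left merge cannot multiply census rows
    toy_reference[["simulant_id"]].drop_duplicates().assign(possible_to_pik=1),
    on="simulant_id",
    how="left",
).assign(possible_to_pik=lambda df: df.possible_to_pik.fillna(0))

# The mean of the 0/1 flag is the share of records that could be PIKed correctly
toy_possible_proportion = toy_census.possible_to_pik.mean()
print(f"{1 - toy_possible_proportion:.2%} of the toy records are impossible to PIK")
```

Simulant `s3` appears in no reference file, so its census record gets a flag of 0 and lowers the mean.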


In [20]:
print(
    f'Assigned PIKs to {(piked_proportion / possible_to_pik_proportion):.2%} of PIK-able records'
)

Assigned PIKs to 89.00% of PIK-able records


In [21]:
reference_file = df_ops.concat([
    df_ops.read_parquet(
        f'{simulated_data_output_dir}/simulated_geobase_reference_file.parquet',
    ),
    df_ops.read_parquet(
        f'{simulated_data_output_dir}/simulated_name_dob_reference_file.parquet',
    ),
], ignore_index=True)

Imbalanced dataframe: too_few=False, too_many=True, too_large=False
count    7.700000e+02
mean     3.880555e+08
std      1.573591e+08
min      2.302881e+08
25%      2.308000e+08
50%      3.859436e+08
75%      5.455854e+08
max      5.482611e+08
dtype: float64
Creating partitions of 5,976MB


In [22]:
reference_file_piks = df_ops.persist(reference_file[['record_id', 'pik']])
reference_file_piks

Unnamed: 0_level_0,record_id,pik
npartitions=54,Unnamed: 1_level_1,Unnamed: 2_level_1
,string,string
,...,...
...,...,...
,...,...
,...,...


In [23]:
assert len(reference_file_piks) == len(df_ops.drop_duplicates(reference_file_piks[['record_id']]))

In [24]:
pik_simulant_pairs = df_ops.persist(df_ops.drop_duplicates(reference_files_ground_truth.merge(reference_file_piks, on='record_id')[['pik', 'simulant_id']]))

In [25]:
# As with reference file records, there can be PIKs that correspond to multiple simulants,
# due to errors in the reference file construction by SSN
n_unique_simulants = df_ops.persist(df_ops.groupby_agg_small_groups(pik_simulant_pairs, by='pik', agg_func=lambda x: x.simulant_id.nunique()).rename('n_unique_simulants').reset_index())
df_ops.compute(n_unique_simulants.n_unique_simulants.value_counts())

n_unique_simulants
1      353924439
2       41181290
3        3259303
4         225868
5          14701
         ...    
247            1
216            1
205            1
55             1
157            1
Name: count, Length: 334, dtype: int64

In [26]:
pik_simulant_pairs = df_ops.persist(pik_simulant_pairs.merge(
    n_unique_simulants,
    on='pik',
    how='left',
))
pik_simulant_pairs

Unnamed: 0_level_0,pik,simulant_id,n_unique_simulants
npartitions=108,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,string,string,int64
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [27]:
df_ops.head(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants == df_ops.compute(pik_simulant_pairs.n_unique_simulants.max())])

Unnamed: 0,pik,simulant_id,n_unique_simulants
3155742,44_705631,9526_1151086,334
3155743,44_705631,2787_14227,334
3155744,44_705631,3374_167624,334
3155745,44_705631,4950_1033583,334
3155746,44_705631,3713_1038605,334
3155747,44_705631,4203_64253,334
3155748,44_705631,3984_1095858,334
3155749,44_705631,8221_85308,334
3155750,44_705631,9723_1146759,334
3155751,44_705631,315_103299,334


## Definitions of accuracy

1. (most strict) Assigning a PIK that corresponds to multiple simulants always counts as incorrect
2. Assigning a PIK that corresponds to multiple simulants counts as neither correct nor incorrect (such assignments are excluded from the denominator)
3. (most lenient) Assigning a PIK that corresponds to multiple simulants counts as correct, as long as at least one of those simulants matches the truth
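
As a minimal sketch of how the three definitions differ, the toy example below scores four hypothetical PIK assignments with plain pandas. The column names (`true_simulant`, `pik_simulants`) and the data are invented for illustration and are not part of the case study's data model.

```python
import pandas as pd

# Hypothetical toy data: each row is one input record that was assigned a PIK.
# `pik_simulants` is the set of simulants the assigned PIK corresponds to.
assigned = pd.DataFrame({
    "record_id": ["r1", "r2", "r3", "r4"],
    "true_simulant": ["a", "b", "c", "d"],
    "pik_simulants": [{"a"}, {"b", "x"}, {"y"}, {"z", "w"}],
})

single = assigned.pik_simulants.map(len) == 1
contains_truth = assigned.apply(
    lambda row: row.true_simulant in row.pik_simulants, axis=1
)

# Definition 1: any multi-simulant PIK counts as incorrect
definition_1 = (single & contains_truth).sum() / len(assigned)
# Definition 2: multi-simulant PIKs are dropped from the denominator
definition_2 = (single & contains_truth).sum() / single.sum()
# Definition 3: a multi-simulant PIK counts as correct if it contains the truth
definition_3 = contains_truth.sum() / len(assigned)

print(definition_1, definition_2, definition_3)
```

Here only `r1` is correct under definition 1, `r2` and `r4` drop out of the denominator under definition 2, and `r2` becomes correct under definition 3.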

In [28]:
# All modules, Medicare database, calculated from Layne, Wagner, and Rothhaas Table 1 (p. 15)
real_life_pvs_accuracy = 1 - (2_585 + 60_709 + 129_480 + 89_094) / (52_406_981 + 5_170_924 + 49_374_794 + 50_327_034)
f'{real_life_pvs_accuracy:.5%}'

'99.82079%'

### Definition 1

In [29]:
piks_assigned = df_ops.compute(census_2030_piked.pik.notnull().sum())
piks_assigned

303915660

In [30]:
df_ops.head(pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants > 1])

Unnamed: 0,pik,simulant_id,n_unique_simulants
4,0_1000242,3528_680562,2
5,0_1000242,3528_234421,2
13,0_1001101,1219_1068536,2
14,0_1001101,1219_24209,2
18,0_1001592,9776_900062,2
19,0_1001592,7107_889225,2
28,0_1002408,5860_796821,2
29,0_1002408,5969_787903,2
30,0_1002504,1232_155899,2
31,0_1002504,1232_44116,2


In [31]:
single_sim_piks_correct = df_ops.compute(
    census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs, on='pik').merge(census_2030_ground_truth, on='record_id')
        .pipe(lambda df: (df.simulant_id_x == df.simulant_id_y) & (df.n_unique_simulants == 1))
        .sum()
)
single_sim_piks_correct

267617405

In [32]:
# Overall accuracy at the PIK level, treating the linkage process as a black box
(
    single_sim_piks_correct / piks_assigned
)

0.8805647099593354

In [33]:
assert len(confirmed_piks_with_ground_truth) == piks_assigned

In [34]:
df_ops.head(census_2030_ground_truth.rename(columns={'record_id': 'record_id_census_2030'}))

Unnamed: 0,record_id_census_2030,simulant_id,possible_to_pik
0,simulated_census_2030_0_37,28_44,1.0
1,simulated_census_2030_0_311,28_365,1.0
2,simulated_census_2030_0_314,28_370,1.0
3,simulated_census_2030_0_349,28_414,1.0
4,simulated_census_2030_0_680,28_804,1.0
5,simulated_census_2030_0_738,28_867,1.0
6,simulated_census_2030_0_1129,28_1312,1.0
7,simulated_census_2030_0_1197,28_1388,1.0
8,simulated_census_2030_0_1626,28_1892,1.0
9,simulated_census_2030_0_1642,28_1912,1.0


In [35]:
# Looking at whether the exact *record* linked was from the same simulant
single_sim_record_links_correct = df_ops.compute(
    confirmed_piks_with_ground_truth
        .merge(
            census_2030_ground_truth.rename(columns={'record_id': 'record_id_raw_input_file'}),
            on='record_id_raw_input_file',
        )
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .pipe(lambda df: (df.simulant_id_x == df.simulant_id_y) & (df.n_unique_simulants == 1))
        .sum()
)
single_sim_record_links_correct

284389384

In [36]:
(
    single_sim_record_links_correct / piks_assigned
)

0.9357510040779077

### Definition 2

In [37]:
single_sim_piks_assigned = len(
    census_2030_piked[['record_id', 'pik']].merge(
        pik_simulant_pairs[pik_simulant_pairs.n_unique_simulants == 1][['pik', 'simulant_id']],
        on='pik',  # the only shared column; stated explicitly
    )
)
single_sim_piks_assigned

267814885

In [38]:
# Overall accuracy at the PIK level, treating the linkage process as a black box
(
    single_sim_piks_correct / single_sim_piks_assigned
)

0.9992626250030875

In [39]:
# Looking at whether the exact *record* linked was from the same simulant
single_sim_record_links_assigned = df_ops.compute(
    (confirmed_piks_with_ground_truth
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .n_unique_simulants == 1).sum()
)
single_sim_record_links_assigned

284600979

In [40]:
(
    single_sim_record_links_correct / single_sim_record_links_assigned
)

0.9992565204773944

### Definition 3

In [41]:
pik_simulant_pairs

Unnamed: 0_level_0,pik,simulant_id,n_unique_simulants
npartitions=108,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,string,string,int64
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [42]:
piks_at_least_partially_correct = df_ops.persist(
    census_2030_piked[['record_id', 'pik']].merge(pik_simulant_pairs, on='pik').merge(census_2030_ground_truth, on='record_id')
        .pipe(df_ops.drop_duplicates)
        .assign(correct=lambda df: df.simulant_id_x == df.simulant_id_y)
        .pipe(df_ops.groupby_agg_small_groups, by=["record_id", "pik"], agg_func=lambda x: x.correct.any())
        .reset_index()
)
piks_at_least_partially_correct

Unnamed: 0_level_0,record_id,pik,correct
npartitions=668,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
,string,string,bool[pyarrow]
,...,...,...
...,...,...,...
,...,...,...
,...,...,...


In [43]:
# Overall accuracy at the PIK level, treating the linkage process as a black box
piks_correct_proportion = (df_ops.compute(piks_at_least_partially_correct.correct.sum()) / piks_assigned)
piks_correct_proportion

0.9992742065347998

In [44]:
print(f'{piks_correct_proportion:.5%} of the PIKs assigned were correct; compare with {real_life_pvs_accuracy:.5%} in real life')

99.92742% of the PIKs assigned were correct; compare with 99.82079% in real life


In [45]:
# Looking at whether the exact *record* linked was from the same simulant
sim_record_links_at_least_partially_correct = df_ops.persist(
    confirmed_piks_with_ground_truth
        .merge(
            census_2030_ground_truth.rename(columns={'record_id': 'record_id_raw_input_file'}),
            on='record_id_raw_input_file',
        )
        .merge(
            reference_files_ground_truth.rename(columns={'record_id': 'record_id_reference_file'}),
            on='record_id_reference_file',
        )
        .assign(correct=lambda df: df.simulant_id_x == df.simulant_id_y)
        .pipe(df_ops.groupby_agg_small_groups, by=["record_id_raw_input_file", "record_id_reference_file", "pik", "module_name", "pass_name"], agg_func=lambda x: x.correct.any())
        .reset_index()
)
sim_record_links_at_least_partially_correct

Unnamed: 0_level_0,record_id_raw_input_file,record_id_reference_file,pik,module_name,pass_name,correct
npartitions=334,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,string,string,string,string,string,bool[pyarrow]
,...,...,...,...,...,...
...,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...


In [46]:
len(sim_record_links_at_least_partially_correct)

303915660

In [47]:
len(df_ops.drop_duplicates(sim_record_links_at_least_partially_correct[['record_id_raw_input_file', 'record_id_reference_file']]))

303915660

In [48]:
(
    df_ops.compute(sim_record_links_at_least_partially_correct.correct.sum()) / piks_assigned
)

0.999271090538737

In [49]:
assert df_ops.compute((df_ops.groupby_agg_small_groups(confirmed_piks_with_ground_truth, by='record_id_raw_input_file', agg_func=lambda x: x.record_id_reference_file.nunique()) <= 1).all())

In [50]:
# Using definition 3 -- at the PIK level
piks_at_least_partially_correct = df_ops.persist(
    piks_at_least_partially_correct
        .rename(columns={'record_id': 'record_id_raw_input_file'})
        .merge(confirmed_piks_with_ground_truth[['record_id_raw_input_file', 'module_name', 'pass_name']], on='record_id_raw_input_file')
)
piks_at_least_partially_correct

Unnamed: 0_level_0,record_id_raw_input_file,pik,correct,module_name,pass_name
npartitions=668,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
,string,string,bool[pyarrow],string,string
,...,...,...,...,...
...,...,...,...,...,...
,...,...,...,...,...
,...,...,...,...,...


In [51]:
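# Accuracy by module -- note that with the sample data this showed the opposite pattern
# relative to the results of Layne et al., who found GeoSearch was much *more* accurate.
# In the run shown here (data_to_use = "usa"), GeoSearch is more accurate than
# NameSearch and DOBSearch, though slightly less accurate than HHCompSearch.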
df_ops.compute(piks_at_least_partially_correct.groupby("module_name").correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,mean,size
module_name,Unnamed: 1_level_1,Unnamed: 2_level_1
dobsearch,0.996523,154732
namesearch,0.997646,18325243
geosearch,0.999379,284705539
hhcompsearch,0.999899,730146


In [52]:
# Accuracy by pass -- could be used to tune pass-specific cutoffs, but
# this might not be too informative while we are still using the sample data.
df_ops.compute(piks_at_least_partially_correct.groupby(["module_name", "pass_name"]).correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,size
module_name,pass_name,Unnamed: 2_level_1,Unnamed: 3_level_1
dobsearch,initials name switch,0.996523,154732
namesearch,DOB and NYSIIS of name,0.997645,18323262
namesearch,DOB and initials,0.99899,1981
geosearch,geokey name switch,0.999058,876667
geosearch,geokey,0.999272,240966488
hhcompsearch,year of birth,0.999738,213626
hhcompsearch,initials,0.999965,516520
geosearch,house number and street name Soundex,0.999983,13077940
geosearch,some name and DOB information,0.999992,29740242
geosearch,house number and street name Soundex name switch,1.0,44202


In [53]:
# Using definition 3 -- at the link level
df_ops.compute(sim_record_links_at_least_partially_correct.groupby("module_name").correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,mean,size
module_name,Unnamed: 1_level_1,Unnamed: 2_level_1
dobsearch,0.996523,154732
namesearch,0.997646,18325243
geosearch,0.999376,284705539
hhcompsearch,0.999899,730146


In [54]:
df_ops.compute(sim_record_links_at_least_partially_correct.groupby(["module_name", "pass_name"]).correct.agg(["mean", "size"]).sort_values("mean"))

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,size
module_name,pass_name,Unnamed: 2_level_1,Unnamed: 3_level_1
dobsearch,initials name switch,0.996523,154732
namesearch,DOB and NYSIIS of name,0.997645,18323262
namesearch,DOB and initials,0.99899,1981
geosearch,geokey name switch,0.99905,876667
geosearch,geokey,0.999268,240966488
hhcompsearch,year of birth,0.999738,213626
hhcompsearch,initials,0.999965,516520
geosearch,house number and street name Soundex,0.999983,13077940
geosearch,some name and DOB information,0.999992,29740242
geosearch,house number and street name Soundex name switch,1.0,44202


In [55]:
df_ops.compute(sim_record_links_at_least_partially_correct[~sim_record_links_at_least_partially_correct.correct].groupby(["module_name", "pass_name"]).size()).sort_values()

module_name   pass_name                           
namesearch    DOB and initials                             2
hhcompsearch  initials                                    18
              year of birth                               56
geosearch     house number and street name Soundex       218
              some name and DOB information              247
dobsearch     initials name switch                       538
geosearch     geokey name switch                         833
namesearch    DOB and NYSIIS of name                   43144
geosearch     geokey                                  176471
dtype: int64

### Incorrect and missed PIKs

In [56]:
incorrectly_linked_pairs = df_ops.persist(df_ops.drop_duplicates(
    sim_record_links_at_least_partially_correct[~sim_record_links_at_least_partially_correct.correct]
        [["record_id_raw_input_file", "record_id_reference_file"]]
))
incorrectly_linked_pairs

Unnamed: 0_level_0,record_id_raw_input_file,record_id_reference_file
npartitions=668,Unnamed: 1_level_1,Unnamed: 2_level_1
,string,string
,...,...
...,...,...
,...,...
,...,...


In [57]:
len(incorrectly_linked_pairs)

221527

In [58]:
incorrect_links = df_ops.head(incorrectly_linked_pairs, n=100)
incorrect_links

Unnamed: 0,record_id_raw_input_file,record_id_reference_file
0,simulated_census_2030_0_1000425,simulated_geobase_reference_file_284_1703768
1,simulated_census_2030_0_1000941,simulated_geobase_reference_file_25_467554
2,simulated_census_2030_0_1001091,simulated_geobase_reference_file_200_917055
3,simulated_census_2030_0_1001617,simulated_geobase_reference_file_191_1983837
4,simulated_census_2030_0_1001689,simulated_geobase_reference_file_247_460345
...,...,...
95,simulated_census_2030_0_185313,simulated_geobase_reference_file_156_1056267
96,simulated_census_2030_0_185604,simulated_geobase_reference_file_343_716434
97,simulated_census_2030_0_187817,simulated_geobase_reference_file_163_1501087
98,simulated_census_2030_0_18943,simulated_geobase_reference_file_16_1773322


In [None]:
%xdel incorrectly_linked_pairs

In [61]:
comparison_cols = [
    "first_name",
    "middle_name",
    "last_name",
    "date_of_birth",
    "street_number",
    "street_name",
    "unit_number",
    "city",
    "state",
]

incorrect_links_detail = (
    incorrect_links
        .merge(
            census_2030_piked[census_2030_piked.record_id.isin(incorrect_links.record_id_raw_input_file)]
                .compute()
                .rename(columns={"record_id": "record_id_raw_input_file", "middle_initial": "middle_name"})
                [["record_id_raw_input_file"] + comparison_cols],
            on="record_id_raw_input_file",
            how="left",
        )
        .merge(
            reference_file[reference_file.record_id.isin(incorrect_links.record_id_reference_file)]
                .compute()
                .rename(columns={"record_id": "record_id_reference_file"})
                .rename(columns=lambda c: c.replace('mailing_address_', ''))
                [["record_id_reference_file"] + comparison_cols],
            on="record_id_reference_file",
            how="left",
            suffixes=("_census", "_reference_file"),
        )
)

def flatten(xss):
    # Flatten one level of nesting, e.g. [(a, b), (c, d)] -> [a, b, c, d]
    return [x for xs in xss for x in xs]

incorrect_links_detail[flatten([(f'{c}_census', f'{c}_reference_file') for c in comparison_cols])]

Unnamed: 0,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
0,Penny,,M,Solomon,Ezell,Ezell,09/15/2028,20280915,9639,9639,givhans rd,GIVHANS RD,,,parkville,PARKVILLE,MD,MD
1,Audrey,Jade,G,,Bell,Bell,03/06/2011,20110306,109,109,hubbell street,HUBBELL STREET,,,lorton,LORTON,VA,VA
2,,Amelia,O,Abby,Wilson,Wilson,10/23/2028,20281023,946,946,jessica lane,JESSICA LANE,,,monkton,MONKTON,MD,MD
3,Royalty,Camilla,E,Evelyn,Giordano,Giordano,10/09/2028,20281009,9136,9136,pontiac lake rd,PONTIAC LAKE RD,,,alameda,ALAMEDA,CA,CA
4,Alaia,Martin,A,Alexander,Reitz,Reitz,10/12/2028,20281012,6100,6100,burdette drive,BURDETTE DRIVE,,,chicago,CHICAGO,IL,IL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Kristin,Charles,E,,Bewick,Bewick,11/15/1963,19631115,20961,20961,gebron dr,GEBRON DR,,,salt lake city,SALT LAKE CITY,UT,UT
96,Brooklyn,David,E,Eric,Stange,Stange,02/19/2019,20190219,6539,6539,oahu isl,OAHU ISL,,,litl falls,LITL FALLS,NJ,NJ
97,Michael,Michelle,M,Marcia,Mcmahan,Mcmahan,04/06/1963,19630406,1114,1114,s lions spg wy,S LIONS SPG WY,,,austin,AUSTIN,TX,TX
98,,Elliott,C,Paul,Pacheco,Pacheco,11/29/2016,20161129,500,500,philadelphia way,PHILADELPHIA WAY,,,benton,BENTON,NY,NY
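
The two sides of the table above use different conventions: census values are lowercase with `MM/DD/YYYY` dates, while reference file values are uppercase with `YYYYMMDD` dates. One way to scan for the columns that actually disagree is to normalize both sides before comparing. The sketch below does this for three columns on two rows copied from the output above. The helper `normalize_dob` and the `agreement` frame are hypothetical names, not part of the case study code.

```python
import pandas as pd

# Two rows mimicking the layout above (values taken from rows 1 and 97)
toy_detail = pd.DataFrame({
    "first_name_census": ["Audrey", "Michael"],
    "first_name_reference_file": ["Jade", "Michelle"],
    "date_of_birth_census": ["03/06/2011", "04/06/1963"],
    "date_of_birth_reference_file": ["20110306", "19630406"],
    "street_name_census": ["hubbell street", "s lions spg wy"],
    "street_name_reference_file": ["HUBBELL STREET", "S LIONS SPG WY"],
})

def normalize_dob(mm_dd_yyyy):
    # Reorder MM/DD/YYYY into YYYYMMDD as plain strings, so noisy values
    # that are not valid dates pass through unchanged instead of raising
    return mm_dd_yyyy.str.replace(r"^(\d{2})/(\d{2})/(\d{4})$", r"\3\1\2", regex=True)

# One boolean column per compared field: True where the two sides agree
agreement = pd.DataFrame({
    "first_name": toy_detail.first_name_census.str.lower()
        == toy_detail.first_name_reference_file.str.lower(),
    "date_of_birth": normalize_dob(toy_detail.date_of_birth_census)
        == toy_detail.date_of_birth_reference_file,
    "street_name": toy_detail.street_name_census.str.lower()
        == toy_detail.street_name_reference_file.str.lower(),
})
print(agreement)
```

In both toy rows the address and date of birth agree while the first name differs, which matches the pattern visible in many of the incorrect links listed above.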


In [63]:
missed_links = df_ops.persist(
    census_2030_piked[census_2030_piked.pik.isnull()][["record_id"]]
        .merge(census_2030_ground_truth, on="record_id")
        .merge(reference_files_ground_truth[reference_files_ground_truth.n_unique_simulants == 1], on="simulant_id", suffixes=("_census", "_reference_file"))
)

In [65]:
len(missed_links)

116621923

In [66]:
simulants_missed = df_ops.head(missed_links[['simulant_id']], n=100).simulant_id.unique()
simulants_missed

<ArrowStringArray>
[  '28_365249',   '28_623546',   '40_181957',   '40_188899',   '93_545215',
    '99_95719',   '99_880072',   '99_939969',  '103_100559', '107_1172955',
  '131_996214', '278_1021948', '315_1154921',  '323_191653',  '323_912938',
  '352_526740',   '446_78851', '465_1023090',   '478_82913',  '478_194997',
  '478_230323', '496_1177062',  '539_822093',  '539_953168',  '558_341841',
  '656_554266',  '656_776297',   '682_73548', '682_1103329',   '771_12558']
Length: 30, dtype: string

In [None]:
missed_pairs = missed_links[missed_links.simulant_id.isin(list(simulants_missed))].compute()
missed_pairs

Unnamed: 0,record_id_census,simulant_id,possible_to_pik,record_id_reference_file,n_unique_simulants
0,simulated_census_2030_0_308282,28_365249,1.0,simulated_geobase_reference_file_380_744864,1
1,simulated_census_2030_0_308282,28_365249,1.0,simulated_name_dob_reference_file_115_755841,1
2,simulated_census_2030_0_308282,28_365249,1.0,simulated_geobase_reference_file_380_744865,1
3,simulated_census_2030_0_526134,28_623546,1.0,simulated_geobase_reference_file_119_1624848,1
4,simulated_census_2030_0_526134,28_623546,1.0,simulated_geobase_reference_file_119_1624849,1
...,...,...,...,...,...
95,simulated_census_2030_23_61933,682_73548,1.0,simulated_name_dob_reference_file_258_774653,1
96,simulated_census_2030_23_934916,682_1103329,1.0,simulated_geobase_reference_file_139_2223854,1
97,simulated_census_2030_23_934916,682_1103329,1.0,simulated_name_dob_reference_file_366_114787,1
98,simulated_census_2030_25_10574,771_12558,1.0,simulated_geobase_reference_file_232_595644,1


In [None]:
%xdel missed_links

In [75]:
missed_links_detail = (
    missed_pairs
        .merge(census_2030_piked[census_2030_piked.record_id.isin(list(missed_pairs.record_id_census))].compute().rename(columns={"record_id": "record_id_census", "middle_initial": "middle_name"}), on="record_id_census")
        .merge(reference_file[reference_file.record_id.isin(missed_pairs.record_id_reference_file)].compute().rename(columns=lambda c: c.replace('mailing_address_', '')).rename(columns={"record_id": "record_id_reference_file"}), on="record_id_reference_file", suffixes=("_census", "_reference_file"))
)

In [76]:
for simulant in simulants_missed:
    print(simulant)
    display(missed_links_detail[missed_links_detail.simulant_id == simulant][['simulant_id'] + flatten([(f'{c}_census', f'{c}_reference_file') for c in comparison_cols])])

28_365249


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
0,28_365249,Louis,Louis,R,Russell,Ansley,Ansley,10/07/2006,19681123,220,220.0,4th ave nw,4TH AVE NW,,,visalia,,KY,CA
1,28_365249,Louis,Louis,R,Russell,Ansley,Ansley,10/07/2006,19681123,220,,4th ave nw,,,,visalia,,KY,
2,28_365249,Louis,Louis,R,Russell,Ansley,Ansley,10/07/2006,19681123,220,220.0,4th ave nw,4TH AVE NW,,,visalia,VISALIA,KY,CA


28_623546


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
3,28_623546,Scott,Sceeter,J,Joseph,Justman,Justman,09/22/1975,19750922,6341,3.0,,CLEMENT RD,,,chandler,GRANDVIEW,AZ,MO
4,28_623546,Scott,Sceeter,J,Joseph,Justman,Justman,09/22/1975,19750922,6341,6341.0,,HUBBERT STREET,,,chandler,CHANDLER,AZ,AZ
5,28_623546,Scott,Sceeter,J,Joseph,Justman,Justman,09/22/1975,19750922,6341,,,,,,chandler,,AZ,


40_181957


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
6,40_181957,Doreen,Doreen,K,Kimberly,Miranda,Miranda,06/26/1982,19630203,447,,pease drive,,,,albuquerque,,NM,
7,40_181957,Doreen,Doreen,K,Kimberly,Miranda,Miranda,06/26/1982,19630203,447,447.0,pease drive,PEASE DRIVE,,,albuquerque,ALBUQUERQUE,NM,CO
8,40_181957,Doreen,Doreen,K,Kimberly,Miranda,Miranda,06/26/1982,19630203,447,447.0,pease drive,PEASE DRIVE,,,albuquerque,ALBUQUERQUE,NM,NM


40_188899


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
9,40_188899,Andrew,Davis,D,Andrew,Davis,Daniel,02/21/2002,20020221,503,503.0,s windsor blvd,S WINDSOR BLVD,,,chicago,CHICAGO,IL,IL
10,40_188899,Andrew,Davis,D,Andrew,Davis,Daniel,02/21/2002,20020221,503,4404.0,s windsor blvd,MIZZENMAST ROAD,,APT 223R,chicago,JUNCTION CITY,IL,KS
11,40_188899,Andrew,Davis,D,Andrew,Davis,Daniel,02/21/2002,20020221,503,,s windsor blvd,,,,chicago,,IL,


93_545215


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
12,93_545215,Steven,Steven,A,Aaron,Tam,Tam,02/05/1978,19411011,9460,9460.0,mendelssohn avenue north,MENDELSSOHN AVENUE NORTH,,,round rock,ROUND ROCK,TX,TX
13,93_545215,Steven,Steven,A,Aaron,Tam,Tam,02/05/1978,19411011,9460,,mendelssohn avenue north,,,,round rock,AUSTIN,TX,TX
14,93_545215,Steven,Steven,A,Aaron,Tam,Tam,02/05/1978,19411011,9460,,mendelssohn avenue north,,,,round rock,,TX,


99_95719


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
15,99_95719,Thomas,Thomas,E,Edward,Jilek,Jilek,04/10/1968,19960506,729,729.0,s 1130 w,S 1130 W,,,borger,BORGER,TX,TX
16,99_95719,Thomas,Thomas,E,Edward,Jilek,Jilek,04/10/1968,19960506,729,729.0,s 1130 w,S 1130 W,,,borger,BORGER,TX,TX
17,99_95719,Thomas,Thomas,E,Edward,Jilek,Jilek,04/10/1968,19960506,729,,s 1130 w,,,,borger,,TX,
18,99_95719,Thomas,Thomas,E,Edward,Jilek,Jilek,04/10/1968,19960506,729,729.0,s 1130 w,,,,borger,BORGER,TX,TX
19,99_95719,Thomas,Thomas,E,Edward,Jilek,Jilek,04/10/1968,19960506,729,729.0,s 1130 w,S 1130 W,,,borger,,TX,TX


99_880072


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
20,99_880072,Octavia,Octavia,T,Tracy,Rodrigues,Rodrigues,07/18/|963,19630718,1805,,morr:s farm rd,,,,oak park,,IL,
21,99_880072,Octavia,Octavia,T,Tracy,Rodrigues,Rodrigues,07/18/|963,19630718,1805,1805.0,morr:s farm rd,MORRIS FARM RD,,,oak park,OAK PARK,IL,IL


99_939969


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
22,99_939969,Cynthia,Cynthia,M,Michelle,Weber,Weber,04/05/1969,19690504,1|918,,gillis st,,,,pittsburgh,,PA,
23,99_939969,Cynthia,Cynthia,M,Michelle,Weber,Weber,04/05/1969,19690504,1|918,11918.0,gillis st,GILLJS ST,,,pittsburgh,PITTSBURGH,PA,PA
24,99_939969,Cynthia,Cynthia,M,Michelle,Weber,Weber,04/05/1969,19690504,1|918,11918.0,gillis st,GILLIS ST,,,pittsburgh,PITTSBUR9H,PA,PA
25,99_939969,Cynthia,Cynthia,M,Michelle,Weber,Weber,04/05/1969,19690504,1|918,11918.0,gillis st,GILLIS ST,,,pittsburgh,PITTSBURGH,PA,PA


103_100559


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
26,103_100559,Diamond,Diamond,A,Aria,Johnson,Johnson,09/14/2005,20050914,2134,2134.0,briar forest dr,BRIAR FOREST DR,,,los angeles,LOS ANGELES,CA,MI
27,103_100559,Diamond,Diamond,A,Aria,Johnson,Johnson,09/14/2005,20050914,2134,,briar forest dr,,,,los angeles,,CA,
28,103_100559,Diamond,Diamond,A,Aria,Johnson,Johnson,09/14/2005,20050914,2134,2134.0,briar forest dr,BRIAR FOREST DR,,,los angeles,LOS ANGELES,CA,CA
29,103_100559,Diamond,Diamond,A,Aria,Johnson,Johnson,09/14/2005,20050914,2134,2134.0,briar forest dr,BRIAR FOREST DR,,,los angeles,LEOS ANKELES,CA,CA


107_1172955


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
30,107_1172955,Jacqueline,Jacqueline,K,Kathryn,Rivera,Rivera,,19510418,1258,,magnolia dr,,,,chicago,,IL,
31,107_1172955,Jacqueline,Jacqueline,K,Kathryn,Rivera,Rivera,,19510418,1258,,magnolia dr,,,,chicago,,IL,


131_996214


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
32,131_996214,Brandon,Brandon,Gallacher,Howard,H,Gallacher,01/13/1987,19870113,13280,,sth 70,,,,new port richey,,FL,
33,131_996214,Brandon,Brandon,Gallacher,Howard,H,Gallacher,01/13/1987,19870113,13280,13280.0,sth 70,STH 70,,,new port richey,NEW PORT RICHEY,FL,FL
34,131_996214,Brandon,Brandon,Gallacher,Howard,H,Gallacher,01/13/1987,19870113,13280,740.0,sth 70,CHRISTENSEN AVENUE,,,new port richey,CLARKSVILLE,FL,AR
35,131_996214,Brandon,Brandon,Gallacher,Howard,H,Gallacher,01/13/1987,19870113,13280,13280.0,sth 70,STH 70,,,new port richey,NEW PORT RICHEY,FL,FL


278_1021948


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
36,278_1021948,Katelynn,Miss,E,Elizabeth,Tom,Tom,03/27/1491,19910327,3879,3879.0,main street,MAIN STREET,,,allen,ALLEN,TX,TX
37,278_1021948,Katelynn,Miss,E,Elizabeth,Tom,Tom,03/27/1491,19910327,3879,965.0,main street,E MAIN ST,,,allen,SPARKS,TX,
38,278_1021948,Katelynn,Miss,E,Elizabeth,Tom,Tom,03/27/1491,19910327,3879,3879.0,main street,MAIN STREET,,,allen,ALLEN,TX,TX
39,278_1021948,Katelynn,Miss,E,Elizabeth,Tom,Tom,03/27/1491,19910327,3879,965.0,main street,E MAIN ST,,,allen,SPARKS,TX,NV
40,278_1021948,Katelynn,Miss,E,Elizabeth,Tom,Tom,03/27/1491,19910327,3879,3879.0,main street,MAIN STREET,,,allen,ALLEN,TX,NM
41,278_1021948,Katelynn,Miss,E,Elizabeth,Tom,Tom,03/27/1491,19910327,3879,,main street,,,,allen,,TX,
42,278_1021948,Katelynn,Miss,E,Elizabeth,Tom,Tom,03/27/1491,19910327,3879,965.0,main street,E MAIN ST,,,allen,APWRKS,TX,NV


315_1154921


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
43,315_1154921,Mark,Mark,C,C,Krumm,Krum,02/22/1979,,9627,,,,,,omaha,,NE,
44,315_1154921,Mark,Mark,C,C,Krumm,Krumm,02/22/1979,,9627,9627.0,,RYAN DR,,,omaha,OMAHA,NE,NE
45,315_1154921,Mark,Mark,C,C,Krumm,Krum,02/22/1979,,9627,9627.0,,RYAN DR,,,omaha,OMAHA,NE,NE
46,315_1154921,Mark,Mark,C,C,Krumm,Krumm,02/22/1979,,9627,,,,,,omaha,,NE,


323_191653


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
47,323_191653,Chad,Chad,C,Charles,Escalante Gomez,Escalante Gomez,05/01/1985,19880825,,,ne lawrence ave,,,,cherry hill township,,NJ,
48,323_191653,Chad,Chad,C,Charles,Escalante Gomez,Escalante Gomez,05/01/1985,19880825,,,ne lawrence ave,NE LAWRENCE AVE,,,cherry hill township,CHERRY HILL TOWNSHIP,NJ,CT
49,323_191653,Chad,Chad,C,Charles,Escalante Gomez,Escalante Gomez,05/01/1985,19880825,,,ne lawrence ave,NE LAWRENCE AVE,,,cherry hill township,CHERRY HILL TOWNSHIP,NJ,NJ


323_912938


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
50,323_912938,Henry,Henry,I,Israel,Dang,Dang,31/08/2009,20090831,,85.0,burr oak boulevard,MONTERREY DR,,,lillington,ALBUQUERQUE,NC,NM
51,323_912938,Henry,Henry,I,Israel,Dang,Dang,31/08/2009,20090831,,,burr oak boulevard,BURR OAK BOULEVARD,,,lillington,LILLINGTON,NC,NC
52,323_912938,Henry,Henry,I,Israel,Dang,Dang,31/08/2009,20090831,,,burr oak boulevard,,,,lillington,,NC,


352_526740


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
53,352_526740,Shelly,Rachel,R,Remi,Kloss,Kloss,11/15/2018,20181115,2150,,lansdale road,,,,bellflower,,CA,
54,352_526740,Shelly,Rachel,R,Remi,Kloss,Kloss,11/15/2018,20181115,2150,,lansdale road,,,,bellflower,COACHELLA,CA,CA
55,352_526740,Shelly,Rachel,R,Remi,Kloss,Kloss,11/15/2018,20181115,2150,,lansdale road,SE CATES CIR,,,bellflower,PHILADELPHIA,CA,PA


446_78851


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
56,446_78851,Zachary,Zachary,,David,D Mckinney,Mckinney,01/30/1994,19940130,197,,82nd drive,,,,summerfield,,NC,
57,446_78851,Zachary,Zachary,,David,D Mckinney,Mckinney,01/30/1994,19940130,197,197.0,82nd drive,82ND DRIVE,,,summerfield,SUMMERFIELD,NC,NC


465_1023090


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
58,465_1023090,Kalani,Scarlett,S,Kalani,Gokhale,De La Casa,04/28/2020,20200428,503,503.0,sterling cir,STERLING CIR,,,laurel,LAUREL,MD,AZ
59,465_1023090,Kalani,Scarlett,S,Kalani,Gokhale,De La Casa,04/28/2020,20200428,503,503.0,sterling cir,STERLING CIR,,,laurel,LAUREL,MD,MD
60,465_1023090,Kalani,Scarlett,S,Kalani,Gokhale,De La Casa,04/28/2020,20200428,503,,sterling cir,,,,laurel,,MD,
61,465_1023090,Kalani,Scarlett,S,Kalani,Gokhale,De La Casa,04/28/2020,20200428,503,943.0,sterling cir,47TH ST,,,laurel,CLEARWATER,MD,FL


478_82913


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
62,478_82913,Lois,Lois,L,Luann,Roberts,Refuse,09/29/1961,19610929,8329,,northwwst ashcreek ln,,,,ferriday,,LA,
63,478_82913,Lois,Lois,L,Luann,Roberts,Refuse,09/29/1961,19610929,8329,8329.0,northwwst ashcreek ln,NORTHWEST ASHCREEK LN,,,ferriday,FERRIDAY,LA,LA


478_194997


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
64,478_194997,Lyndsey,,S,Sara,Kupstas,Kupstas,27/03/2006,20060327,2425,2425.0,lakeside dr,LAKESIDE DR,,,coopersville,COOPERSVILLE,MI,MI
65,478_194997,Lyndsey,,S,Sara,Kupstas,Kupstas,27/03/2006,20060327,2425,7176.0,lakeside dr,HICKORYNUT CIR,,,coopersville,HITCHCOCK,MI,TX
66,478_194997,Lyndsey,,S,Sara,Kupstas,Kupstas,27/03/2006,20060327,2425,,lakeside dr,HICKORYNUT CIR,,,coopersville,HITCHCOCK,MI,TX
67,478_194997,Lyndsey,,S,Sara,Kupstas,Kupstas,27/03/2006,20060327,2425,,lakeside dr,,,,coopersville,,MI,
68,478_194997,Lyndsey,,S,Sara,Kupstas,Kupstas,27/03/2006,20060327,2425,7176.0,lakeside dr,HICKORYNUT CIR,,,coopersville,HITCHCOCK,MI,TX


478_230323


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
69,478_230323,Jenna,Jenna,D,Domonique,Harrison,Harrison,02/21/2013,19860214,15,,81 street,,,,sterling ht,,MI,
70,478_230323,Jenna,Jenna,D,Domonique,Harrison,Harrison,02/21/2013,19860214,15,,81 street,,,,sterling ht,TAMPA,MI,FL
71,478_230323,Jenna,Jenna,D,Domonique,Harrison,Harrison,02/21/2013,19860214,15,15.0,81 street,81 STREET,,,sterling ht,STERLING HT,MI,MI
72,478_230323,Jenna,Jenna,D,Domonique,Harrison,Harrison,02/21/2013,19860214,15,,81 street,,,,sterling ht,LAKE CITY,MI,IA


496_1177062


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
73,496_1177062,Leon,Leon,J,Jordan,Person,Williams,03/01/2029,20290301,323,,nash ave,NASH AVE,,,greer,GREER,SC,SC
74,496_1177062,Leon,Leon,J,Jordan,Person,Williams,03/01/2029,20290301,323,,nash ave,,,,greer,,SC,


539_822093


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
75,539_822093,Nicholas,Nicky,K,Kenneth,Martin,Martin,03/05/1985,19850307,3073,1802.0,wild timber rd,KAREN CT,,,montgomery,NEMO,TX,TX
76,539_822093,Nicholas,Nicky,K,Kenneth,Martin,Martin,03/05/1985,19850307,3073,3073.0,wild timber rd,WILD TIMBER RD,,,montgomery,MONTGOMERY,TX,TX
77,539_822093,Nicholas,Nicky,K,Kenneth,Martin,Martin,03/05/1985,19850307,3073,7550.0,wild timber rd,CANAAN RD,,,montgomery,GREENFIELD,TX,CA
78,539_822093,Nicholas,Nicky,K,Kenneth,Martin,Martin,03/05/1985,19850307,3073,,wild timber rd,,,,montgomery,,TX,


539_953168


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
79,539_953168,Robert,Robert,Q,Anthony,,Davis,07/07/1984,19840707,809,809.0,old coach rd,,,,huntsville,HUNTSVILLE,AL,AL
80,539_953168,Robert,Robert,Q,Anthony,,Davis,07/07/1984,19840707,809,,old coach rd,,,,huntsville,,AL,


558_341841


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
81,558_341841,Elaine,Elaine,M,Melissa,Hulme,Hulme,01/17/1978,20190303,305,,w 15th st,TALBOT STREET,,,saco,UNION VALLEY,ME,TX
82,558_341841,Elaine,Elaine,M,Melissa,Hulme,Hulme,01/17/1978,20190303,305,305.0,w 15th st,W 15TH ST,,,saco,SACO,ME,ME
83,558_341841,Elaine,Elaine,M,Melissa,Hulme,Hulme,01/17/1978,20190303,305,,w 15th st,TALBOT STREET,,,saco,UNION VALLEY,ME,TX
84,558_341841,Elaine,Elaine,M,Melissa,Hulme,Hulme,01/17/1978,20190303,305,,w 15th st,,,,saco,,ME,
85,558_341841,Elaine,Elaine,M,Melissa,Hulme,Hulme,01/17/1978,20190303,305,,w 15th st,W 15TH ST,,,saco,SACO,ME,ME


656_554266


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
86,656_554266,Jennifer,Jennifer,J,Jessica,Holland,Holland,11/15/1985,19851115,5750,957.0,s cntrl ave,LINDEN AVE,,,southfield,JONESBORO,MI,GA
87,656_554266,Jennifer,Jennifer,J,Jessica,Holland,Holland,11/15/1985,19851115,5750,,s cntrl ave,S CNTRL AVE,,,southfield,SOUTHFIELD,MI,MI
88,656_554266,Jennifer,Jennifer,J,Jessica,Holland,Holland,11/15/1985,19851115,5750,5750.0,s cntrl ave,S CNTRL AVE,,,southfield,SOUTHFIELD,MI,MI
89,656_554266,Jennifer,Jennifer,J,Jessica,Holland,Holland,11/15/1985,19851115,5750,951.0,s cntrl ave,LINDEN AVE,,,southfield,JONESBORO,MI,GA
90,656_554266,Jennifer,Jennifer,J,Jessica,Holland,Holland,11/15/1985,19851115,5750,,s cntrl ave,,,,southfield,,MI,


656_776297


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
91,656_776297,Todd,Todd,B,Keller,Keller,Bradley,09/25/1966,19660925,15161,15161.0,woodmere ln,WOODMERE LN,,,city of cottage grove,CITY OF COTTAGE GROVE,MN,MN
92,656_776297,Todd,Todd,B,Keller,Keller,Bradley,09/25/1966,19660925,15161,,woodmere ln,,,,city of cottage grove,,MN,
93,656_776297,Todd,Todd,B,Keller,Keller,Bradley,09/25/1966,19660925,15161,15161.0,woodmere ln,,,,city of cottage grove,CITY OF COTTAGE GROVE,MN,MN


682_73548


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
94,682_73548,Eric,Eric,T,Thomas,Villegas,Villegas,12/09/1998,19690309,65,65.0,parkwood drive,PARKWOOD DRIVE,,,frederick,FREDERICK,CO,CO
95,682_73548,Eric,Eric,T,Thomas,Villegas,Villegas,12/09/1998,19690309,65,,parkwood drive,,,,frederick,,CO,


682_1103329


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
96,682_1103329,Rachel,Rachel,T,T,Rodriguez,Rodriguez,11/03/1993,,441,5.0,belrose rd,ROYAL PALM BEACH BLVD,,,s diego,BETHLEHEM,CA,NY
97,682_1103329,Rachel,Rachel,T,T,Rodriguez,Rodriguez,11/03/1993,,441,,belrose rd,,,,s diego,,CA,


771_12558


Unnamed: 0,simulant_id,first_name_census,first_name_reference_file,middle_name_census,middle_name_reference_file,last_name_census,last_name_reference_file,date_of_birth_census,date_of_birth_reference_file,street_number_census,street_number_reference_file,street_name_census,street_name_reference_file,unit_number_census,unit_number_reference_file,city_census,city_reference_file,state_census,state_reference_file
98,771_12558,Sxott,Scott,T,Tyrone,Norris,Norris,04/30/1974,19740430,2215,2215.0,mountainside d,MOUNTAINSIDE D,,,amherst,AMHERST,NY,NY
99,771_12558,Sxott,Scott,T,Tyrone,Norris,Norris,04/30/1974,19740430,2215,,mountainside d,,,,amherst,,NY,
