# Generate pseudopeople simulated datasets

The first step is to generate the pseudopeople data that is used both directly in the case study and to create the reference files.
Because this is a compute-intensive operation that can only be distributed through the engines handled below (Modin or Dask), this notebook
does only this step and then saves the results.

In [1]:
import warnings
import pseudopeople as psp
import numpy as np
import os, shutil
import logging
from pathlib import Path

# Importing pandas for access, regardless of whether we are using it as the compute engine
import pandas

In [2]:
%load_ext autoreload
%autoreload 1

In [3]:
from vivarium_research_prl import distributed_compute, utils

In [4]:
warnings.simplefilter(action="ignore", category=FutureWarning)

In [5]:
# DO NOT EDIT if this notebook is not called generate_pseudopeople_simulated_datasets.ipynb!
# This notebook is designed to be run with papermill; this cell is tagged 'parameters'
# When you run this, save it to another filename.
data_to_use = "small_sample"
output_dir = "../output/generate_simulated_data"
compute_engine = "pandas"
# These only matter if distributing
num_jobs = 5
cpus_per_job = 2
threads_per_job = 1
memory_per_job = "10GB"
very_noisy = True
pseudopeople_seed = 0
local_directory = f"/tmp/{os.environ['USER']}_dask"

In [6]:
# Parameters
data_to_use = "ri"
output_dir = (
    "/ihme/scratch/users/zmbc/person_linkage_case_study/generate_simulated_data/"
)
very_noisy = False
compute_engine = "dask"
num_jobs = 100
memory_per_job = "30GB"
cpus_per_job = 5
pseudopeople_seed = 1
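
papermill injects the second parameters cell after the tagged defaults, so later assignments win and any name that is not overridden keeps its default. A minimal sketch of the effective values for this run, using plain dictionaries (the names `defaults`, `injected`, and `effective` are illustrative only, and `local_directory` is left out because it depends on the environment):

```python
# Defaults from the cell tagged 'parameters' (local_directory omitted).
defaults = {
    "data_to_use": "small_sample",
    "output_dir": "../output/generate_simulated_data",
    "compute_engine": "pandas",
    "num_jobs": 5,
    "cpus_per_job": 2,
    "threads_per_job": 1,
    "memory_per_job": "10GB",
    "very_noisy": True,
    "pseudopeople_seed": 0,
}

# Values from the injected parameters cell for this run.
injected = {
    "data_to_use": "ri",
    "output_dir": "/ihme/scratch/users/zmbc/person_linkage_case_study/generate_simulated_data/",
    "very_noisy": False,
    "compute_engine": "dask",
    "num_jobs": 100,
    "memory_per_job": "30GB",
    "cpus_per_job": 5,
    "pseudopeople_seed": 1,
}

# Later assignments override earlier ones, as when the two cells run in order.
effective = {**defaults, **injected}

# Names that keep their default value in this run.
not_overridden = sorted(set(defaults) - set(injected))
print(not_overridden)  # ['threads_per_job']
```

In this run only `threads_per_job` keeps its default of 1, which matches the one thread per worker reported when the cluster starts.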

In [7]:
utils.ensure_empty(local_directory)

In [8]:
output_dir = str(Path(output_dir) / data_to_use / "pseudopeople_simulated_datasets")
utils.ensure_empty(output_dir)
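
`utils.ensure_empty` comes from `vivarium_research_prl` and its implementation is not shown here. The sketch below is only a guess at the behavior its name suggests, assuming it removes any existing directory and recreates it empty; `ensure_empty_sketch` is a hypothetical stand-in, not the library function.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_empty_sketch(path):
    """Hypothetical stand-in: leave `path` as an existing, empty directory."""
    path = Path(path)
    if path.exists():
        # Remove stale contents left over from a previous run.
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return path


# Demonstrate on a throwaway directory rather than a real output location.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "ri" / "pseudopeople_simulated_datasets"
    target.mkdir(parents=True)
    (target / "stale.parquet").write_text("old")

    ensure_empty_sketch(target)
    remaining = list(target.iterdir())

print(remaining)  # []
```

A helper like this is destructive by design, so it should only ever be pointed at scratch or output directories that this notebook owns.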

In [9]:
df_ops, pd = distributed_compute.start_compute_engine(
    compute_engine,
    num_jobs=num_jobs,
    cpus_per_job=cpus_per_job,
    threads_per_job=threads_per_job,
    memory_per_job=memory_per_job,
    local_directory=local_directory,
)

Connection method: Cluster object
Cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://10.158.111.9:8787/status
Scheduler comm: tcp://10.158.111.9:32793
Started: Just now
Workers: 100
Total threads: 100
Total memory: 2.73 TiB
Per worker: 1 thread, 27.94 GiB memory, local directory under /tmp/zmbc_dask/dask-scratch-space/
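
The memory figures in the cluster output follow from the parameters: each job requests `memory_per_job = "30GB"`, while the output is reported in binary units. A quick check of the arithmetic, assuming "GB" is read as 10^9 bytes:

```python
GB = 10**9   # decimal gigabyte, as in memory_per_job = "30GB" (assumed interpretation)
GiB = 2**30  # binary gibibyte, the unit used in the cluster output
TiB = 2**40  # binary tebibyte

num_jobs = 100
memory_per_job_bytes = 30 * GB

per_worker_gib = memory_per_job_bytes / GiB
total_tib = num_jobs * memory_per_job_bytes / TiB

print(f"{per_worker_gib:.2f} GiB per worker")  # 27.94 GiB per worker
print(f"{total_tib:.2f} TiB in total")         # 2.73 TiB in total
```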


In [10]:
! date

Wed 15 May 2024 10:15:24 PM PDT


In [11]:
psp.__version__

'1.0.1.dev20+g07b85e8'

## Load simulated data

In [12]:
if data_to_use == "small_sample":
    pseudopeople_input_dir = None
elif data_to_use == "ri":
    pseudopeople_input_dir = "/mnt/team/simulation_science/pub/models/vivarium_census_prl_synth_pop/results/release_02_yellow/full_data/united_states_of_america/2023_08_21_16_35_27/final_results/2023_08_31_15_58_01/states/pseudopeople_simulated_population_rhode_island_2_0_0/"
elif data_to_use == "usa":
    pseudopeople_input_dir = "/mnt/team/simulation_science/pub/models/vivarium_census_prl_synth_pop/results/release_02_yellow/full_data/united_states_of_america/2023_08_21_16_35_27/final_results/2023_08_31_15_58_01/pseudopeople_simulated_population_usa_2_0_0/"
else:
    raise ValueError(f"Unknown data_to_use: {data_to_use!r}")

In [13]:
psp_kwargs = {
    "source": pseudopeople_input_dir,
    "seed": pseudopeople_seed,
}
if compute_engine.startswith("modin"):
    psp_kwargs["engine"] = "modin"
if compute_engine.startswith("dask"):
    psp_kwargs["engine"] = "dask"
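
The engine selection above can be restated as a small function, which makes it explicit that the `engine` key is simply absent when pandas is the compute engine. This is a sketch only: `build_psp_kwargs` is a hypothetical helper, and the engine string `"dask_local"` is an invented example of a name that merely starts with `"dask"`.

```python
def build_psp_kwargs(compute_engine, source, seed):
    """Mirror the cell above: add an 'engine' key only for Modin or Dask."""
    kwargs = {"source": source, "seed": seed}
    for engine in ("modin", "dask"):
        # startswith lets variants of an engine name map to the same engine.
        if compute_engine.startswith(engine):
            kwargs["engine"] = engine
    return kwargs


pandas_kwargs = build_psp_kwargs("pandas", None, 0)
dask_kwargs = build_psp_kwargs("dask", "/path/to/simulated_population", 1)
variant_kwargs = build_psp_kwargs("dask_local", None, 1)  # invented engine name

print(pandas_kwargs)   # {'source': None, 'seed': 0}
print(dask_kwargs)     # {'source': '/path/to/simulated_population', 'seed': 1, 'engine': 'dask'}
print(variant_kwargs)  # {'source': None, 'seed': 1, 'engine': 'dask'}
```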

### Noise configuration

To make the linkage task more challenging, we can significantly increase the amount of noise
relative to the pseudopeople defaults. The increase applies only when `very_noisy` is `True`; otherwise the defaults are kept.

In [14]:
default_configuration = psp.get_config()

In [15]:
# Helper functions for changing the default configuration according to a pattern
def column_noise_value(dataset, column, noise_type, default_value):
    if very_noisy and dataset in (
        "decennial_census",
        "taxes_w2_and_1099",
        "social_security",
    ):
        if noise_type == "make_typos":
            if column == "middle_initial":
                # 5% of middle initials (which are all a single token anyway) are wrong.
                return {"cell_probability": 0.05, "token_probability": 1}
            elif column in ("first_name", "last_name", "street_name"):
                # 10% of these text columns were entered carelessly, at a rate of 1 error
                # per 10 characters.
                # The pseudopeople default is 1% careless.
                return {"cell_probability": 0.1, "token_probability": 0.1}
        elif noise_type == "write_wrong_digits" and (
            dataset != "social_security" or column != "ssn"
        ):
            # 10% of number columns were written carelessly, at a rate of 1 error
            # per 10 characters.
            # The pseudopeople default is 1% careless.
            # Note that this is applied on top of (the default lower levels of) typos,
            # since typos also apply to numeric characters.
            # We never introduce error on the SSN in the SSA dataset
            return {"cell_probability": 0.1, "token_probability": 0.1}

    return default_value


def row_noise_value(dataset, noise_type, default_value):
    return default_value
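
To make the two-level parameters concrete: per the comments above, `cell_probability` is the share of cells entered carelessly and `token_probability` is the per-character error rate within such a cell. The following is a rough back-of-the-envelope sketch, not pseudopeople's implementation; the function name and the 10-character example are illustrative assumptions, and it ignores interactions between noise types.

```python
def expected_errors(cell_probability, token_probability, cell_length):
    """Rough expected error counts for one cell of a text column.

    Illustrative only: treats each character as one token and considers
    a single noise type in isolation.
    """
    # Expected errors in a cell that was selected for noising
    per_noised_cell = token_probability * cell_length
    # Expected errors in an arbitrary cell, noised or not
    per_cell_overall = cell_probability * per_noised_cell
    return per_noised_cell, per_cell_overall


# Hypothetical 10-character first name under the very_noisy typo settings
noisy = expected_errors(cell_probability=0.1, token_probability=0.1, cell_length=10)
# The same name under the default make_typos settings for first_name
default = expected_errors(cell_probability=0.01, token_probability=0.1, cell_length=10)
print(noisy, default)
```

Under these assumptions, a carelessly entered cell has about one error either way; what changes with `very_noisy` is how many cells are careless, so the overall error rate per cell rises roughly tenfold.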

In [16]:
custom_configuration = {
    dataset: {
        noise_category: (
            (
                {
                    column: {
                        noise_type: column_noise_value(
                            dataset, column, noise_type, noise_type_config
                        )
                        for noise_type, noise_type_config in column_config.items()
                    }
                    for column, column_config in noise_category_config.items()
                }
                if noise_category == "column_noise"
                else {
                    noise_type: row_noise_value(dataset, noise_type, noise_type_config)
                    for noise_type, noise_type_config in noise_category_config.items()
                }
            )
        )
        for noise_category, noise_category_config in dataset_config.items()
    }
    for dataset, dataset_config in default_configuration.items()
}
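
The nested comprehension above rebuilds the whole configuration, passing every leaf through a helper that either overrides it or returns it untouched. Here is a minimal sketch of the same pattern on a toy dictionary; `toy_default` and `toy_column_override` are hypothetical stand-ins, not the real output of `psp.get_config()` or the helpers defined above.

```python
# Toy stand-in for the default configuration; the real one has many more entries
toy_default = {
    "decennial_census": {
        "row_noise": {"omit_row": {"row_probability": 0.0}},
        "column_noise": {
            "first_name": {
                "make_typos": {"cell_probability": 0.01, "token_probability": 0.1},
                "leave_blank": {"cell_probability": 0.01},
            },
        },
    },
}


def toy_column_override(dataset, column, noise_type, default_value):
    # Hypothetical rule: only typos in first_name are made noisier
    if column == "first_name" and noise_type == "make_typos":
        return {"cell_probability": 0.1, "token_probability": 0.1}
    return default_value


toy_custom = {
    dataset: {
        category: (
            {
                column: {
                    noise_type: toy_column_override(dataset, column, noise_type, value)
                    for noise_type, value in column_config.items()
                }
                for column, column_config in category_config.items()
            }
            if category == "column_noise"
            # Row noise is copied through unchanged in this sketch
            else dict(category_config)
        )
        for category, category_config in dataset_config.items()
    }
    for dataset, dataset_config in toy_default.items()
}

changed = toy_custom["decennial_census"]["column_noise"]["first_name"]
print(changed["make_typos"])   # overridden by the rule
print(changed["leave_blank"])  # same as the default
```

Because the comprehension builds new dictionaries at each level, the default configuration itself is left unmodified.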

In [17]:
psp_kwargs["config"] = custom_configuration

### Simulated 1040 tax filings

We assume that the last 5 years of taxes would be available and used in the construction of the reference files -- see section about reference files below.

Note that these are retrieved by *tax* year, so the 2029 taxes would be available in early 2030
(around when our hypothetical case study is taking place).

In [18]:
tax_years = list(range(2025, 2030))
tax_years

[2025, 2026, 2027, 2028, 2029]
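
The range above follows from the timing described in the text: filings for a tax year arrive early in the following year, so a case study set in early 2030 sees tax years up to and including 2029. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def available_tax_years(case_study_year, num_years=5):
    """Tax years whose filings would be on hand early in case_study_year."""
    # The most recent complete tax year is the one before the case study year
    return list(range(case_study_year - num_years, case_study_year))


years = available_tax_years(2030)
print(years)
```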

In [19]:
psp_kwargs

{'source': '/mnt/team/simulation_science/pub/models/vivarium_census_prl_synth_pop/results/release_02_yellow/full_data/united_states_of_america/2023_08_21_16_35_27/final_results/2023_08_31_15_58_01/states/pseudopeople_simulated_population_rhode_island_2_0_0/',
 'seed': 1,
 'engine': 'dask',
 'config': {'decennial_census': {'row_noise': {'do_not_respond': {'row_probability': 0.0145},
    'omit_row': {'row_probability': 0.0},
    'duplicate_with_guardian': {'row_probability_in_households_under_18': 0.02,
     'row_probability_in_college_group_quarters_under_24': 0.05}},
   'column_noise': {'first_name': {'leave_blank': {'cell_probability': 0.01},
     'use_nickname': {'cell_probability': 0.01},
     'use_fake_name': {'cell_probability': 0.01},
     'make_phonetic_errors': {'cell_probability': 0.01,
      'token_probability': 0.1},
     'make_ocr_errors': {'cell_probability': 0.01, 'token_probability': 0.1},
     'make_typos': {'cell_probability': 0.01, 'token_probability': 0.1}},
    'mid

In [20]:
%%time

for year in tax_years:
    print(year)
    df = psp.generate_taxes_1040(
        year=year,
        **psp_kwargs,
    )
    output_path = str(Path(output_dir) / f"simulated_taxes_1040_{year}.parquet")
    utils.remove_path(output_path)
    df.to_parquet(output_path)

2025
2024-05-15 22:15:25.369 | pseudopeople.configuration.validator:validate_noise_level_proportions:335 - The configured 'use_nickname' noise level for column_noise 'dependent_4_first_name' is 0.01, which is higher than the maximum possible value based on the provided data for 'taxes_1040'. Noising as many rows as possible.

2026
2024-05-15 22:18:42.625 | pseudopeople.configuration.validator:validate_noise_level_proportions:335 - The configured 'use_nickname' noise level for column_noise 'dependent_4_first_name' is 0.01, which is higher than the maximum possible value based on the provided data for 'taxes_1040'. Noising as many rows as possible.

2027
2024-05-15 22:21:05.196 | pseudopeople.configuration.validator:validate_noise_level_proportions:335 - The configured 'use_nickname' noise level for column_noise 'dependent_4_first_name' is 0.01, which is higher than the maximum possible value based on the provided data for 'taxes_1040'. Noising as many rows as possible.

2024-05-15 22:21:05.198 | pseudopeople.configuration.validator:validate_noise_level_proportions:335 - The configured 'copy_from_household_member' noise level for column_noise 'dependent_4_ssn' is 0.01, which is higher than the maximum possible value based on the provided data for 'taxes_1040'. Noising as many rows as possible.

2028
2024-05-15 22:23:46.618 | pseudopeople.configuration.validator:validate_noise_level_proportions:335 - The configured 'use_nickname' noise level for column_noise 'dependent_4_first_name' is 0.01, which is higher than the maximum possible value based on the provided data for 'taxes_1040'. Noising as many rows as possible.

2024-05-15 22:23:46.620 | pseudopeople.configuration.validator:validate_noise_level_proportions:335 - The configured 'copy_from_household_member' noise level for column_noise 'dependent_4_ssn' is 0.01, which is higher than the maximum possible value based on the provided data for 'taxes_1040'. Noising as many rows as possible.

2029

2024-05-15 22:26:28.037 | pseudopeople.configuration.validator:validate_noise_level_proportions:335 - The configured 'use_nickname' noise level for column_noise 'dependent_3_first_name' is 0.01, which is higher than the maximum possible value based on the provided data for 'taxes_1040'. Noising as many rows as possible.

2024-05-15 22:26:28.038 | pseudopeople.configuration.validator:validate_noise_level_proportions:335 - The configured 'use_nickname' noise level for column_noise 'dependent_4_first_name' is 0.01, which is higher than the maximum possible value based on the provided data for 'taxes_1040'. Noising as many rows as possible.

2024-05-15 22:26:28.039 | pseudopeople.configuration.validator:validate_noise_level_proportions:335 - The configured 'copy_from_household_member' noise level for column_noise 'dependent_4_ssn' is 0.01, which is higher than the maximum possible value based on the provided data for 'taxes_1040'. Noising as many rows as possible.


CPU times: user 1min 47s, sys: 10.8 s, total: 1min 58s
Wall time: 13min 40s
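
Each iteration clears the target path before writing. The notebook doesn't say why; one plausible reason (an assumption) is that a distributed engine such as Dask writes a Parquet dataset as a directory of part files, so leftovers from an earlier run could otherwise linger alongside the new output. The implementation of `utils.remove_path` isn't shown here. A minimal stand-in with the behavior this notebook appears to rely on might look like the following, where `remove_path_sketch` is a hypothetical name:

```python
import shutil
import tempfile
from pathlib import Path


def remove_path_sketch(path):
    """Delete a file or a whole directory tree; do nothing if it is absent.

    Hypothetical stand-in, not the actual utils.remove_path implementation.
    """
    path = Path(path)
    if path.is_symlink() or path.is_file():
        path.unlink()
    elif path.is_dir():
        shutil.rmtree(path)


# Demonstration on a throwaway directory shaped like a partitioned Parquet dataset
scratch = Path(tempfile.mkdtemp())
target = scratch / "simulated_taxes_1040_2025.parquet"
target.mkdir()
(target / "part.0.parquet").write_bytes(b"stale")

remove_path_sketch(target)
remove_path_sketch(target)  # second call finds nothing and does nothing
target_exists_after = target.exists()

# Clean up the scratch directory itself
shutil.rmtree(scratch)
```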


### Simulated W2/1099 tax filings

We assume that the last 5 years of taxes would be available and used in the construction of the reference files.

Note that these are retrieved by *tax* year, so the 2029 taxes would be available in early 2030
(around when our hypothetical case study is taking place).

In [21]:
%%time

for year in tax_years:
    print(year)
    df = psp.generate_taxes_w2_and_1099(
        year=year,
        **psp_kwargs,
    )
    output_path = str(Path(output_dir) / f"simulated_taxes_w2_and_1099_{year}.parquet")
    utils.remove_path(output_path)
    df.to_parquet(output_path)

2025


2026


2027


2028


2029


CPU times: user 1min 32s, sys: 10.7 s, total: 1min 42s
Wall time: 11min 25s


### Simulated 2030 Census Unedited File (CUF)

For now, we gloss over the data schema for addresses.
We don't know how addresses would be formatted in the CUF (and it's hard to guess, because
address is not part of the Census form), but it likely would have some of these fields
(street number, street name, etc.) combined.

While PVS input files do not in general have names split into first, middle, and last,
I am guessing the CUF **would** have first name, middle initial, last name (which is how pseudopeople
generates it), because that [matches the Census questionnaire](https://www2.census.gov/programs-surveys/decennial/2020/technical-documentation/questionnaires-and-instructions/questionnaires/2020-informational-questionnaire-english_DI-Q1.pdf).

In [22]:
%%time

simulated_census_2030 = psp.generate_decennial_census(
    year=2030,
    **psp_kwargs,
)
census_output_path = str(Path(output_dir) / "simulated_census_2030.parquet")
utils.remove_path(census_output_path)
simulated_census_2030.to_parquet(census_output_path)

CPU times: user 13.5 s, sys: 1.51 s, total: 15 s
Wall time: 1min 21s


### Simulated SSA Numident

Wagner and Layne, p.4:

> The reference files are derived from the Social Security Administration
> (SSA) Numerical Identification file (SSA Numident). The Numident contains all
> transactions recorded against one Social Security Number (SSN)...

Based on the [SSA Numident through 2007 which is publicly available from NARA](https://aad.archives.gov/aad/series-description.jsp?s=5057),
we know there are three kinds of transactions: SSN applications, deaths, and benefit claims.
SSN holders may change their information (e.g. changing name or sex) by submitting another application,
which generates an additional application transaction.
(The policies about this are found [on the SSA website](https://secure.ssa.gov/poms.nsf/lnx/0110212200).)

The paper ["Likely Transgender Individuals in U.S. Federal Administrative Records and the 2010 Census" by Benjamin Cerf Harris](https://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/carra-wp-2015-03.pdf)
includes some helpful statistics (Table 2).
The average person in the SSA Numident has 2.2 transactions. That paper calls them "claims" but defines them
the same way we define "transaction": "Any time an SSN is created or information associated with an existing SSN is changed,
that event is registered as a claim."

pseudopeople does not currently include correction, name change, or benefits claim transactions.
It only includes SSN creation and death of the SSN holder.

I assume there would be some delay in getting the Numident, so that when the 2030 Census is processed,
only SSA transactions through the end of 2029 would be available.
Note that with pseudopeople's current design it is only possible to set this cutoff at the end of a calendar year.
The NORC report says that "the Census NUMIDENT is recreated each year, to reflect
Social Security transaction records through **March** of each year" (p. 105),
though it isn't clear when in the year the Census Numident is actually re-created.
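
To illustrate what the choice of cutoff changes, here is a minimal sketch that filters a handful of made-up transaction records by two candidate cutoffs: the end of 2029 (what this notebook uses) and the end of March 2030 (one reading of the NORC description). The records, field names, and helper are all hypothetical; pseudopeople applies its own year-based cutoff internally via the `year` argument.

```python
from datetime import date

# Hypothetical transaction records; field names are illustrative only
transactions = [
    {"ssn_holder": "A", "event_type": "creation", "event_date": date(1990, 6, 1)},
    {"ssn_holder": "B", "event_type": "creation", "event_date": date(2029, 12, 30)},
    {"ssn_holder": "A", "event_type": "death", "event_date": date(2030, 2, 14)},
]


def transactions_through(records, cutoff):
    """Keep only the records dated on or before the cutoff date."""
    return [record for record in records if record["event_date"] <= cutoff]


# Calendar-year cutoff, as used in this notebook
through_2029 = transactions_through(transactions, date(2029, 12, 31))
# March cutoff, which the current pseudopeople design cannot express
through_march_2030 = transactions_through(transactions, date(2030, 3, 31))

print(len(through_2029), len(through_march_2030))
```

In this toy example, the death recorded in February 2030 is visible only under the March cutoff, which is the kind of record the year-end approximation leaves out.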

In [23]:
%%time

simulated_ssa_numident = psp.generate_social_security(
    year=2029,
    **psp_kwargs,
)

CPU times: user 7.58 s, sys: 946 ms, total: 8.53 s
Wall time: 56 s


In [24]:
%%time

numident_output_path = str(Path(output_dir) / "simulated_ssa_numident.parquet")
utils.remove_path(numident_output_path)
simulated_ssa_numident.to_parquet(numident_output_path)

CPU times: user 12.8 s, sys: 1.77 s, total: 14.6 s
Wall time: 2min 3s


In [25]:
! date

Wed 15 May 2024 10:44:52 PM PDT
