# Generate pseudopeople simulated datasets

The first step is to generate the pseudopeople data that is used both directly in the case study and to create the reference files.
Because this is an intensive operation that can currently be distributed only with Modin, this notebook performs only this step and then
saves the results.

In [1]:
import warnings
import pseudopeople as psp
import numpy as np
import os
import logging
from pathlib import Path
# Importing pandas for access, regardless of whether we are using it as the compute engine
import pandas

In [2]:
warnings.simplefilter(action='ignore', category=FutureWarning)

In [3]:
# DO NOT EDIT if this notebook is not called generate_pseudopeople_simulated_datasets.ipynb!
# This notebook is designed to be run with papermill; this cell is tagged 'parameters'
# When you run this, save it to another filename.
data_to_use = 'small_sample'
output_dir = 'output'
compute_engine = 'pandas'
# Only matters if distributing
num_jobs = 100
cpus_per_job = 5
memory_per_job = "30GB"
very_noisy = True
pseudopeople_seed = 0

In [4]:
# Parameters
data_to_use = "usa"
output_dir = "/ihme/scratch/users/zmbc/pvs_like_case_study/generate_simulated_data/"
compute_engine = "modin_dask_distributed"
num_jobs = 100
memory_per_job = "30GB"
cpus_per_job = 5
pseudopeople_seed = 1


In [5]:
output_dir = str(Path(output_dir) / data_to_use / "pseudopeople_simulated_datasets")
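
As a quick check of the path logic above, the sketch below reproduces the join for the default parameters. The function name `simulated_datasets_dir` is hypothetical, and `PurePosixPath` is used only so that the result looks the same on any operating system; the notebook itself uses `Path`.

```python
from pathlib import PurePosixPath


def simulated_datasets_dir(output_dir, data_to_use):
    """Mirror the notebook's join: <output_dir>/<data_to_use>/pseudopeople_simulated_datasets."""
    return str(PurePosixPath(output_dir) / data_to_use / "pseudopeople_simulated_datasets")


# Default (non-papermill) parameters from the 'parameters' cell
default_dir = simulated_datasets_dir("output", "small_sample")
print(default_dir)

# A trailing slash on output_dir is normalized away by pathlib
# ("/tmp/out/" is a made-up directory for illustration)
with_trailing_slash = simulated_datasets_dir("/tmp/out/", "usa")
print(with_trailing_slash)
```

Every dataset written later in the notebook is placed under this directory.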

In [6]:
! date

Mon Nov 20 13:19:38 PST 2023


In [7]:
psp.__version__

'0.8.3.dev6+g31db93a'

In [8]:
if compute_engine == 'pandas':
    import pandas as pd
elif compute_engine.startswith('modin'):
    if compute_engine.startswith('modin_dask_'):
        import modin.config as modin_cfg
        modin_cfg.Engine.put("dask") # Use dask instead of ray (which is the default)

        import dask

        if compute_engine == 'modin_dask_distributed':
            from dask_jobqueue import SLURMCluster

            cluster = SLURMCluster(
                queue='all.q',
                account="proj_simscience",
                # If you give dask workers more than one core, they will use it to
                # run more tasks at once.
                # To have more than one thread per worker but use them all for
                # multi-threading code in one task
                # at a time, you have to set cores=1, processes=1 and job_cpu > 1.
                cores=1,
                processes=1,
                memory=memory_per_job,
                walltime="1-00:00:00",
                # Dask distributed looks at OS-reported memory to decide whether a worker is running out.
                # If the memory allocator is not returning the memory to the OS promptly (even when holding onto it
                # is smart), it will lead Dask to make bad decisions.
                # By default, pyarrow uses jemalloc, but I could not get that to release memory quickly.
                # Even this doesn't seem to be completely working, but in combination with small-ish partitions
                # it seems to do okay -- unmanaged memory does seem to shrink from time to time, which it wasn't
                # previously doing.
                job_script_prologue="export ARROW_DEFAULT_MEMORY_POOL=system\nexport MALLOC_TRIM_THRESHOLD_=0",
                job_cpu=cpus_per_job,
            )

            cluster.scale(n=num_jobs)
            # Supposedly, this will start new jobs if the existing
            # ones fail for some reason.
            # https://stackoverflow.com/a/61295019
            cluster.adapt(minimum_jobs=num_jobs, maximum_jobs=num_jobs)

            from distributed import Client
            client = Client(cluster)
        else:
            from distributed import Client
            cpus_available = int(os.environ['SLURM_CPUS_ON_NODE'])
            client = Client(n_workers=int(cpus_available / 2), threads_per_worker=2)

        # Why is this necessary?!
        # For some reason, if I don't set NPartitions, it seems to default to 0?!
        num_row_groups = 1 if data_to_use == 'small_sample' else 334
        modin_cfg.NPartitions.put(min(num_jobs * 3, num_row_groups))
        # Ensure no column-axis partitions; they would need to be joined up
        # right away anyway by our row-wise noising
        modin_cfg.MinPartitionSize.put(1_000)

        display(client)
    elif compute_engine == 'modin_ray':
        # Haven't worked on distributing this across multiple nodes
        import ray
        ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}}, num_cpus=int(os.environ['SLURM_CPUS_ON_NODE']))
    else:
        # Use serial Python backend (good for debugging errors)
        import modin.config as modin_cfg
        modin_cfg.IsDebug.put(True)

    import modin.pandas as pd

    # https://modin.readthedocs.io/en/stable/usage_guide/advanced_usage/progress_bar.html
    from modin.config import ProgressBar
    ProgressBar.enable()

Connection method: Cluster object
Cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://10.158.99.29:8787/status

Dashboard: http://10.158.99.29:8787/status
Workers: 0
Total threads: 0
Total memory: 0 B

Comm: tcp://10.158.99.29:33597
Workers: 0
Dashboard: http://10.158.99.29:8787/status
Total threads: 0
Started: Just now
Total memory: 0 B

## Load simulated data

In [9]:
assert data_to_use in ('small_sample', 'usa')
pseudopeople_input_dir = None if data_to_use == 'small_sample' else '/mnt/team/simulation_science/priv/engineering/vivarium_census_prl_synth_pop/results/release_02_yellow/full_data/united_states_of_america/2023_07_28_08_33_09/final_results/2023_08_16_09_58_54/pseudopeople_input_data_usa/'

In [10]:
psp_kwargs = {
    'source': pseudopeople_input_dir,
    'seed': pseudopeople_seed,
}
if 'modin' in compute_engine:
    psp_kwargs['engine'] = 'modin'

### Noise configuration

To give ourselves more of a challenge, we significantly increase the amount of noise
relative to the pseudopeople defaults.

In [11]:
default_configuration = psp.get_config()

In [12]:
# Helper functions for changing the default configuration according to a pattern
def column_noise_value(dataset, column, noise_type, default_value):
    if very_noisy and dataset in ('decennial_census', 'taxes_w2_and_1099', 'social_security'):
        if noise_type == "make_typos":
            if column == "middle_initial":
                # 5% of middle initials (which are all a single token anyway) are wrong.
                return {"cell_probability": 0.05, "token_probability": 1}
            elif column in ("first_name", "last_name", "street_name"):
                # 10% of these text columns were entered carelessly, at a rate of 1 error
                # per 10 characters.
                # The pseudopeople default is 1% careless.
                return {"cell_probability": 0.1, "token_probability": 0.1}
        elif noise_type == "write_wrong_digits" and (dataset != "social_security" or column != "ssn"):
            # 10% of number columns were written carelessly, at a rate of 1 error
            # per 10 characters.
            # The pseudopeople default is 1% careless.
            # Note that this is applied on top of (the default lower levels of) typos,
            # since typos also apply to numeric characters.
            # We never introduce error on the SSN in the SSA dataset
            return {"cell_probability": 0.1, "token_probability": 0.1}

    return default_value


def row_noise_value(dataset, noise_type, default_value):
    return default_value

In [13]:
custom_configuration = {
    dataset: {
        noise_category: (
            ({
                column: {
                    noise_type: column_noise_value(dataset, column, noise_type, noise_type_config)
                    for noise_type, noise_type_config in column_config.items()
                }
                for column, column_config in noise_category_config.items()
            }
            if noise_category == "column_noise" else
            {
                noise_type: row_noise_value(dataset, noise_type, noise_type_config)
                for noise_type, noise_type_config in noise_category_config.items()
            })
        )
        for noise_category, noise_category_config in dataset_config.items()
    }
    for dataset, dataset_config in default_configuration.items()
}

In [14]:
psp_kwargs['config'] = custom_configuration
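
To illustrate what the nested comprehension above does, here is a minimal sketch on a toy configuration. It does not call pseudopeople: `toy_default_configuration` is a made-up stand-in for `psp.get_config()`, the row-noise entry names are illustrative, and `toy_column_noise_value` is a simplified version of the helper that covers only two of its cases.

```python
# Toy stand-in for psp.get_config(); all names and values here are illustrative.
toy_default_configuration = {
    "decennial_census": {
        "row_noise": {"omit_row": {"row_probability": 0.01}},
        "column_noise": {
            "first_name": {"make_typos": {"cell_probability": 0.01, "token_probability": 0.1}},
            "age": {"write_wrong_digits": {"cell_probability": 0.01, "token_probability": 0.1}},
        },
    },
    "taxes_1040": {
        "column_noise": {
            "first_name": {"make_typos": {"cell_probability": 0.01, "token_probability": 0.1}},
        },
    },
}

NOISY_DATASETS = ("decennial_census", "taxes_w2_and_1099", "social_security")
NOISY_TEXT_COLUMNS = ("first_name", "last_name", "street_name")


def toy_column_noise_value(dataset, column, noise_type, default_value):
    """Simplified helper: raise typo and wrong-digit noise in selected datasets."""
    if dataset in NOISY_DATASETS:
        if noise_type == "make_typos" and column in NOISY_TEXT_COLUMNS:
            return {"cell_probability": 0.1, "token_probability": 0.1}
        if noise_type == "write_wrong_digits" and (dataset != "social_security" or column != "ssn"):
            return {"cell_probability": 0.1, "token_probability": 0.1}
    return default_value


def customize(default_configuration):
    """Walk dataset -> noise category -> column -> noise type, overriding column noise."""
    custom = {}
    for dataset, dataset_config in default_configuration.items():
        custom[dataset] = {}
        for noise_category, category_config in dataset_config.items():
            if noise_category == "column_noise":
                custom[dataset][noise_category] = {
                    column: {
                        noise_type: toy_column_noise_value(dataset, column, noise_type, value)
                        for noise_type, value in column_config.items()
                    }
                    for column, column_config in category_config.items()
                }
            else:
                # Row noise is passed through unchanged, as in row_noise_value
                custom[dataset][noise_category] = dict(category_config)
    return custom


toy_custom_configuration = customize(toy_default_configuration)
```

In this toy run, the census `first_name` typo probability is raised to 0.1, while the same column in `taxes_1040` keeps its default because that dataset is not in the noisy list.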

### Simulated SSA Numident

Wagner and Layne, p.4:

> The reference files are derived from the Social Security Administration
    (SSA) Numerical Identification file (SSA Numident). The Numident contains all
    transactions recorded against one Social Security Number (SSN)...

Based on the [SSA Numident through 2007 which is publicly available from NARA](https://aad.archives.gov/aad/series-description.jsp?s=5057),
we know there are three kinds of transactions: SSN applications, deaths, and claiming benefits.
SSN holders may change their information (e.g. changing name or sex) by submitting another application,
which generates an additional application transaction.
(The policies about this are found [on the SSA website](https://secure.ssa.gov/poms.nsf/lnx/0110212200).)

The paper ["Likely Transgender Individuals in U.S. Federal Administrative Records and the 2010 Census" by Benjamin Cerf Harris](https://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/carra-wp-2015-03.pdf)
includes some helpful statistics (Table 2).
The average person in the SSA Numident has 2.2 transactions (called "claims" in that paper, but with the same definition
as our term "transaction": "Any time an SSN is created or information associated with an existing SSN is changed, that event is registered
as a claim.").

pseudopeople does not currently include correction, name change, or benefits claim transactions.
It only includes SSN creation and death of the SSN holder.

I assume there would be some delay in obtaining the Numident, so that when the 2030 Census
is being processed, only SSA transactions through the end of 2029 would be available.
Note that with pseudopeople's current design it is only possible to set this cutoff at the end of a calendar year.
The NORC report says that "the Census NUMIDENT is recreated each year, to reflect
Social Security transaction records through **March** of each year" (p. 105),
though it isn't clear when in the year the Census Numident is actually re-created.
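
The effect of the cutoff choice can be sketched with a few made-up records. The transaction list and the function name `transactions_through` are hypothetical, and the March 2030 cutoff is only one possible reading of the NORC description; the sketch just shows that a March cutoff would admit early-2030 transactions that an end-of-2029 cutoff excludes.

```python
from datetime import date

# Hypothetical (transaction type, event date) records for illustration only.
transactions = [
    ("creation", date(1990, 5, 17)),
    ("death", date(2029, 11, 2)),
    ("creation", date(2030, 1, 15)),
    ("creation", date(2030, 3, 20)),
]


def transactions_through(records, cutoff):
    """Return the records whose event date is on or before `cutoff`."""
    return [record for record in records if record[1] <= cutoff]


# Calendar-year cutoff, as pseudopeople's `year` argument allows
end_of_2029 = transactions_through(transactions, date(2029, 12, 31))
# Assumed alternative: transactions through March of the census year
march_2030 = transactions_through(transactions, date(2030, 3, 31))
print(len(end_of_2029), len(march_2030))
```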

In [15]:
%%time

simulated_ssa_numident = psp.generate_social_security(
    year=2029,
    **psp_kwargs,
)

CPU times: user 14.9 s, sys: 6.04 s, total: 20.9 s
Wall time: 4min 17s


In [16]:
%%time

simulated_ssa_numident.to_parquet(str(Path(output_dir) / "simulated_ssa_numident.parquet"))

This may cause some slowdown.
Consider scattering data ahead of time and using futures.

CPU times: user 59.4 s, sys: 30 s, total: 1min 29s
Wall time: 19min 5s


### Simulated 1040 tax filings

We assume that the last 5 years of taxes would be available and used in the construction of the reference files -- see section about reference files below.

Note that these are retrieved by *tax* year, so the 2029 taxes would be available in early 2030
(around when our hypothetical case study is taking place).

In [17]:
tax_years = list(range(2025, 2030))
tax_years

[2025, 2026, 2027, 2028, 2029]

In [18]:
%%time

for year in tax_years:
    print(year)
    psp.generate_taxes_1040(
        year=year,
        **psp_kwargs,
    ).to_parquet(str(Path(output_dir) / f"simulated_taxes_1040_{year}.parquet"))

2025


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


2026


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


2027


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


2028


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


2029


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


CPU times: user 15min 5s, sys: 1min 49s, total: 16min 54s
Wall time: 54min 31s


### Simulated W2/1099 tax filings

We assume that the last 5 years of taxes would be available and used in the construction of the reference files.

Note that these are retrieved by *tax* year, so the 2029 taxes would be available in early 2030
(around when our hypothetical case study is taking place).

In [19]:
%%time

for year in tax_years:
    print(year)
    psp.generate_taxes_w2_and_1099(
        year=year,
        **psp_kwargs,
    ).to_parquet(str(Path(output_dir) / f"simulated_taxes_w2_and_1099_{year}.parquet"))

2025


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


2026


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


2027


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


2028


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


2029


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


CPU times: user 10min 18s, sys: 1min 27s, total: 11min 46s
Wall time: 29min 24s


### Simulated 2030 Census Unedited File (CUF)

For now, we gloss over the data schema for addresses.
We don't know how addresses would be formatted in the CUF (and it's hard to guess, because
address is not part of the Census form), but it would likely have some of these fields
(street number, street name, etc.) combined.

While PVS input files do not in general have names split into first, middle, and last,
I am guessing the CUF **would** have first name, middle initial, last name (which is how pseudopeople
generates it), because that [matches the Census questionnaire](https://www2.census.gov/programs-surveys/decennial/2020/technical-documentation/questionnaires-and-instructions/questionnaires/2020-informational-questionnaire-english_DI-Q1.pdf).
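
As an illustration of what combining address fields might look like, here is a minimal sketch. The real CUF format is unknown, so the function `combine_address_fields`, its parameter names, and the space-separated layout are all assumptions rather than anything this notebook does.

```python
def combine_address_fields(street_number, street_name, unit_number=None):
    """Join the non-empty address parts into a single space-separated line."""
    parts = [street_number, street_name, unit_number]
    cleaned = [str(part).strip() for part in parts if part is not None]
    return " ".join(part for part in cleaned if part)


with_unit = combine_address_fields("123", "Main St", "Apt 4")
without_unit = combine_address_fields("123", "Main St")
print(with_unit)
print(without_unit)
```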

In [20]:
%%time

simulated_census_2030 = psp.generate_decennial_census(
    year=2030,
    **psp_kwargs,
)
simulated_census_2030.to_parquet(str(Path(output_dir) / "simulated_census_2030.parquet"))

This may cause some slowdown.
Consider scattering data ahead of time and using futures.


CPU times: user 1min 43s, sys: 15.8 s, total: 1min 59s
Wall time: 4min 36s


In [21]:
! date

Mon Nov 20 15:11:44 PST 2023
