# Generate pseudopeople simulated datasets

The first step is to generate the pseudopeople data that will be used both directly in the case study and to create the reference files.
Because this is an intensive operation that can currently only be distributed with Modin, this notebook performs only this step and then
saves the results.

In [1]:
import warnings
import pseudopeople as psp
import numpy as np
import os, shutil
import logging
from pathlib import Path
# Importing pandas for access, regardless of whether we are using it as the compute engine
import pandas

In [2]:
%load_ext autoreload
%autoreload 1

In [3]:
%aimport vivarium_research_prl
from vivarium_research_prl import distributed_compute, utils

In [4]:
warnings.simplefilter(action='ignore', category=FutureWarning)

In [5]:
# DO NOT EDIT if this notebook is not called generate_pseudopeople_simulated_datasets.ipynb!
# This notebook is designed to be run with papermill; this cell is tagged 'parameters'
# When you run this, save it to another filename.
data_to_use = 'small_sample'
output_dir = 'output'
compute_engine = 'pandas'
# These only matter if distributing
num_jobs = 5
cpus_per_job = 2
memory_per_job = "10GB"
very_noisy = True
pseudopeople_seed = 0

In [6]:
output_dir = str(Path(output_dir) / data_to_use / "pseudopeople_simulated_datasets")

In [7]:
df_ops, pd = distributed_compute.start_compute_engine(
    compute_engine,
    num_jobs=num_jobs,
    cpus_per_job=cpus_per_job,
    memory_per_job=memory_per_job,
)

In [8]:
! date

Thu 08 Feb 2024 08:49:15 AM PST


In [9]:
psp.__version__

'0.8.3.dev6+g31db93a'

## Load simulated data

In [10]:
if data_to_use == 'small_sample':
    pseudopeople_input_dir = None
elif data_to_use == 'ri':
    pseudopeople_input_dir = '/mnt/team/simulation_science/priv/engineering/vivarium_census_prl_synth_pop/results/release_02_yellow/full_data/united_states_of_america/2023_07_28_08_33_09/final_results/2023_08_16_09_58_54/states/pseudopeople_input_data_rhode_island/'
elif data_to_use == 'usa':
    pseudopeople_input_dir = '/mnt/team/simulation_science/priv/engineering/vivarium_census_prl_synth_pop/results/release_02_yellow/full_data/united_states_of_america/2023_07_28_08_33_09/final_results/2023_08_16_09_58_54/pseudopeople_input_data_usa/'
else:
    raise ValueError(f"Unknown data_to_use: {data_to_use!r}")

In [11]:
psp_kwargs = {
    'source': pseudopeople_input_dir,
    'seed': pseudopeople_seed,
}
if 'modin' in compute_engine:
    psp_kwargs['engine'] = 'modin'

### Noise configuration

In order to give ourselves more of a challenge, we significantly increase the amount of noise
from the pseudopeople defaults.

In [12]:
default_configuration = psp.get_config()

In [13]:
# Helper functions for changing the default configuration according to a pattern
def column_noise_value(dataset, column, noise_type, default_value):
    if very_noisy and dataset in ('decennial_census', 'taxes_w2_and_1099', 'social_security'):
        if noise_type == "make_typos":
            if column == "middle_initial":
                # 5% of middle initials (which are all a single token anyway) are wrong.
                return {"cell_probability": 0.05, "token_probability": 1}
            elif column in ("first_name", "last_name", "street_name"):
                # 10% of these text columns were entered carelessly, at a rate of 1 error
                # per 10 characters.
                # The pseudopeople default is 1% careless.
                return {"cell_probability": 0.1, "token_probability": 0.1}
        elif noise_type == "write_wrong_digits" and (dataset != "social_security" or column != "ssn"):
            # 10% of number columns were written carelessly, at a rate of 1 error
            # per 10 characters.
            # The pseudopeople default is 1% careless.
            # Note that this is applied on top of (the default lower levels of) typos,
            # since typos also apply to numeric characters.
            # We never introduce error on the SSN in the SSA dataset
            return {"cell_probability": 0.1, "token_probability": 0.1}

    return default_value


def row_noise_value(dataset, noise_type, default_value):
    return default_value
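
To make the two parameters returned by the helper concrete, here is a minimal sketch of what they imply for a single value. It assumes, as the comments in the helper describe, that `cell_probability` is the chance that a cell is selected for noising and that `token_probability` is an independent per-character chance of error within a selected cell. The function name `prob_cell_has_error` and the example lengths are hypothetical, and pseudopeople's actual noising may differ in its details.

```python
def prob_cell_has_error(cell_probability, token_probability, num_tokens):
    """Probability that a value with num_tokens characters gets at least one error.

    Assumption: a cell is selected with cell_probability, and each character of
    a selected cell is then independently corrupted with token_probability.
    """
    prob_some_token_hit = 1 - (1 - token_probability) ** num_tokens
    return cell_probability * prob_some_token_hit


# A six-letter last name under the "very noisy" settings,
# compared with a single-character middle initial.
noisy_name = prob_cell_has_error(0.1, 0.1, num_tokens=6)
middle_initial = prob_cell_has_error(0.05, 1, num_tokens=1)
print(f"six-letter name: {noisy_name:.4f}")
print(f"middle initial:  {middle_initial:.4f}")
```

Under these assumptions, a selected cell is not guaranteed to end up with an error unless `token_probability` is 1, which is why the middle initial setting uses a token probability of 1.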

In [14]:
custom_configuration = {
    dataset: {
        noise_category: (
            ({
                column: {
                    noise_type: column_noise_value(dataset, column, noise_type, noise_type_config)
                    for noise_type, noise_type_config in column_config.items()
                }
                for column, column_config in noise_category_config.items()
            }
            if noise_category == "column_noise" else
            {
                noise_type: row_noise_value(dataset, noise_type, noise_type_config)
                for noise_type, noise_type_config in noise_category_config.items()
            })
        )
        for noise_category, noise_category_config in dataset_config.items()
    }
    for dataset, dataset_config in default_configuration.items()
}

In [15]:
psp_kwargs['config'] = custom_configuration
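
The nested comprehension can be hard to read, so the following sketch applies the same pattern to a tiny stand-in configuration. The dictionary `toy_default_configuration`, its probability values, and the simplified rule in `toy_column_noise_value` are all hypothetical and only mirror the nesting the comprehension relies on (dataset, then noise category, then column and noise type); they are not the real pseudopeople defaults.

```python
# Hypothetical, heavily trimmed stand-in for the structure of the default config.
# The probabilities are illustrative only.
toy_default_configuration = {
    "decennial_census": {
        "row_noise": {"omit_row": {"row_probability": 0.01}},
        "column_noise": {
            "first_name": {
                "make_typos": {"cell_probability": 0.01, "token_probability": 0.1},
                "leave_blank": {"cell_probability": 0.01},
            },
        },
    },
}


def toy_column_noise_value(dataset, column, noise_type, default_value):
    # Simplified override rule: only typos in first_name are made noisier.
    if noise_type == "make_typos" and column == "first_name":
        return {"cell_probability": 0.1, "token_probability": 0.1}
    return default_value


toy_custom_configuration = {
    dataset: {
        noise_category: (
            {
                column: {
                    noise_type: toy_column_noise_value(dataset, column, noise_type, value)
                    for noise_type, value in column_config.items()
                }
                for column, column_config in category_config.items()
            }
            if noise_category == "column_noise"
            else dict(category_config)  # row noise is copied unchanged
        )
        for noise_category, category_config in dataset_config.items()
    }
    for dataset, dataset_config in toy_default_configuration.items()
}

print(toy_custom_configuration["decennial_census"]["column_noise"]["first_name"])
```

Only the entry matched by the override rule changes; every other entry is carried over from the defaults, so the result has the same shape as the input.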

In [16]:
if data_to_use == 'ri':
    # TODO: Test whether this works on Modin.
    simulants_ever_observed = df_ops.empty_dataframe(columns=['simulant_id'], dtype=str)

### Simulated 1040 tax filings

We assume that the last 5 years of taxes would be available and used in the construction of the reference files -- see section about reference files below.

Note that these are retrieved by *tax* year, so the 2029 taxes would be available in early 2030
(around when our hypothetical case study is taking place).

In [17]:
tax_years = list(range(2025, 2030))
tax_years

[2025, 2026, 2027, 2028, 2029]

In [18]:
%%time

for year in tax_years:
    print(year)
    df = psp.generate_taxes_1040(
        year=year,
        **psp_kwargs,
    )
    if data_to_use == 'ri':
        simulants_ever_observed = df_ops.persist(df_ops.drop_duplicates(df_ops.concat([
            simulants_ever_observed,
            df[['simulant_id']]
        ], ignore_index=True)))
    utils.remove_path(str(Path(output_dir) / f"simulated_taxes_1040_{year}.parquet"))
    df.to_parquet(str(Path(output_dir) / f"simulated_taxes_1040_{year}.parquet"))

2025
2026
2027
2028
2029

CPU times: user 4.86 s, sys: 409 ms, total: 5.27 s
Wall time: 4.19 s








### Simulated W2/1099 tax filings

We assume that the last 5 years of taxes would be available and used in the construction of the reference files.

Note that these are retrieved by *tax* year, so the 2029 taxes would be available in early 2030
(around when our hypothetical case study is taking place).

In [19]:
%%time

for year in tax_years:
    print(year)
    df = psp.generate_taxes_w2_and_1099(
        year=year,
        **psp_kwargs,
    )
    if data_to_use == 'ri':
        simulants_ever_observed = df_ops.persist(df_ops.drop_duplicates(df_ops.concat([
            simulants_ever_observed,
            df[['simulant_id']]
        ], ignore_index=True)))
    utils.remove_path(str(Path(output_dir) / f"simulated_taxes_w2_and_1099_{year}.parquet"))
    df.to_parquet(str(Path(output_dir) / f"simulated_taxes_w2_and_1099_{year}.parquet"))

2025
2026
2027
2028
2029

CPU times: user 3.43 s, sys: 356 ms, total: 3.78 s
Wall time: 3.03 s


                                                                 



### Simulated 2030 Census Unedited File (CUF)

For now, we gloss over the data schema for addresses.
We don't know how addresses would be formatted in the CUF (and it's hard to guess, because
address is not part of the Census form), but it likely would have some of these fields
(street number, street name, etc) combined.

While PVS input files do not in general have names split into first, middle, and last,
I am guessing the CUF **would** have first name, middle initial, last name (which is how pseudopeople
generates it), because that [matches the Census questionnaire](https://www2.census.gov/programs-surveys/decennial/2020/technical-documentation/questionnaires-and-instructions/questionnaires/2020-informational-questionnaire-english_DI-Q1.pdf).

In [20]:
%%time

simulated_census_2030 = psp.generate_decennial_census(
    year=2030,
    **psp_kwargs,
)
if data_to_use == 'ri':
    simulants_ever_observed = df_ops.persist(df_ops.drop_duplicates(df_ops.concat([
        simulants_ever_observed,
        simulated_census_2030[['simulant_id']]
    ], ignore_index=True)))
utils.remove_path(str(Path(output_dir) / "simulated_census_2030.parquet"))
simulated_census_2030.to_parquet(str(Path(output_dir) / "simulated_census_2030.parquet"))

                                                                 

CPU times: user 559 ms, sys: 19.9 ms, total: 579 ms
Wall time: 533 ms


### Simulated SSA Numident

Wagner and Layne, p.4:

> The reference files are derived from the Social Security Administration
    (SSA) Numerical Identification file (SSA Numident). The Numident contains all
    transactions recorded against one Social Security Number (SSN)...

Based on the [SSA Numident through 2007 which is publicly available from NARA](https://aad.archives.gov/aad/series-description.jsp?s=5057),
we know there are three kinds of transactions: SSN applications, deaths, and claiming benefits.
SSN holders may change their information (e.g. changing name or sex) by submitting another application,
which generates an additional application transaction.
(The policies about this are found [on the SSA website](https://secure.ssa.gov/poms.nsf/lnx/0110212200).)

The paper ["Likely Transgender Individuals in U.S. Federal Administrative Records and the 2010 Census" by Benjamin Cerf Harris](https://www.census.gov/content/dam/Census/library/working-papers/2015/adrm/carra-wp-2015-03.pdf)
includes some helpful statistics (Table 2).
The average person in the SSA Numident has 2.2 transactions (called "claims" in that paper, but with the same definition
as our term "transaction": "Any time an SSN is created or information associated with an existing SSN is changed, that event is registered
as a claim.").

pseudopeople does not currently include correction, name change, or benefits claim transactions.
It only includes SSN creation and death of the SSN holder.

I assume there would be some delay in obtaining the Numident, so at processing time
for the 2030 Census, only SSA transactions through the end of 2029 would be available.
Note that with pseudopeople's current design, this cutoff can only be set at the end of a calendar year.
The NORC report says that "the Census NUMIDENT is recreated each year, to reflect
Social Security transaction records through **March** of each year" (p. 105),
though it isn't clear when in the year the Census Numident is actually re-created.

In [21]:
%%time

simulated_ssa_numident = psp.generate_social_security(
    year=2029,
    **psp_kwargs,
)

                                                                 

CPU times: user 399 ms, sys: 41.2 ms, total: 440 ms
Wall time: 412 ms




In [22]:
%%time

# This isn't a file that would really exist, but we just want a medium-sized SSA-like file.
# Subset SSA to only the simulants that ever filed taxes or were observed in the Census in
# Rhode Island.
if data_to_use == 'ri':
    simulated_ssa_numident = (
        simulated_ssa_numident
            .merge(simulants_ever_observed.assign(ever_observed_dummy=1), how="left", on="simulant_id")
            .pipe(lambda df: df[df.ever_observed_dummy == 1])
            .drop(columns=["ever_observed_dummy"])
    )

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.77 µs
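
The accumulate-then-subset pattern used for the Rhode Island data can be sketched on toy data with plain pandas. The frames `taxes`, `census`, and `numident` below are hypothetical stand-ins, and the `isin` filter is shown only as an alternative that gives the same rows here; the notebook itself uses the merge with a dummy column, and the `isin` form has not been checked against Modin.

```python
import pandas as pd

# Hypothetical toy data standing in for the real pseudopeople outputs
taxes = pd.DataFrame({"simulant_id": ["b", "d", "d"]})
census = pd.DataFrame({"simulant_id": ["d", "z"]})
numident = pd.DataFrame(
    {"simulant_id": ["a", "b", "c", "d"], "ssn": ["001", "002", "003", "004"]}
)

# Accumulate the distinct simulant IDs seen in any dataset.
# Deduplicating matters: duplicate IDs would multiply rows in the merge below.
observed = pd.concat(
    [df[["simulant_id"]] for df in (taxes, census)], ignore_index=True
).drop_duplicates()

# Subset via a left merge on a dummy column, as in the cell above
via_merge = (
    numident
    .merge(observed.assign(ever_observed_dummy=1), how="left", on="simulant_id")
    .pipe(lambda df: df[df.ever_observed_dummy == 1])
    .drop(columns=["ever_observed_dummy"])
)

# Alternative: filter with a membership test
via_isin = numident[numident["simulant_id"].isin(observed["simulant_id"])]

print(via_merge)
print(via_isin)
```

Both approaches keep only the Numident rows whose `simulant_id` was observed at least once, and neither adds rows for observed simulants that are missing from the Numident.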


In [23]:
%%time

utils.remove_path(str(Path(output_dir) / "simulated_ssa_numident.parquet"))
simulated_ssa_numident.to_parquet(str(Path(output_dir) / "simulated_ssa_numident.parquet"))

CPU times: user 53.6 ms, sys: 4.27 ms, total: 57.9 ms
Wall time: 66.6 ms


In [24]:
! date

Thu 08 Feb 2024 08:49:24 AM PST
